7 Replies Latest reply on May 13, 2014 10:39 AM by Justin Bertram

    Repititive Fail Over

    Yair Ogen Expert

      I've successfully managed to see fail over work in replicated servers. However, I can't seem to make the following use case work (using the core API):

       

      clients run against Primary. Secondary running in backup mode. Clients create service locator with both ip's in the map.

       

      I kill primary. Backup starts working. Clients re-connet to it. all is well.

       

      I restart Primary. Clients continue working with Backup.

       

      I kill backup. Expected that clients re-connect to primary. However, they don't and I get this error:

       

      Caused by: HornetQNotConnectedException[errorType=NOT_CONNECTED message=HQ119007: Cannot connect to server(s). Tried with all available servers.]
        at org.hornetq.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:901)
      

       

      Any Ideas?

        • 1. Re: Repititive Fail Over
          Andy Taylor Master

          your live server must be incorrectly configured, make sure it announces itself correctly and that it announces the correct connector to use.

          • 2. Re: Repititive Fail Over
            Yair Ogen Expert

            Which files can I attach so you can help me out?

             

            Also - following up on https://community.jboss.org/thread/234727, could it be I just need to add the "max-saved-replicated-journal-size" parameter? Should that help?

            • 3. Re: Repititive Fail Over
              Andy Taylor Master

              check in the server logs that the backup gets announced on live restart, check in your configuration that the cluster connection is announcing the correct connector. if bot these are correct then it may be env. you could try debugging your client and seeing what topology gets set.

              • 4. Re: Re: Repititive Fail Over
                Yair Ogen Expert

                live log:


                07:13:06,890 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.122:5445
                07:13:06,903 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.122:5455
                07:13:06,921 INFO  [org.hornetq.core.server] HQ221007: Server is now live
                07:13:06,921 INFO  [org.hornetq.core.server] HQ221001: HornetQ Server version 2.5.0.SNAPSHOT (Wild Hornet, 124) [705b9028-d9df-11e3-851c-8d2fc19a3b2f]
                07:13:22,739 INFO  [org.hornetq.core.server] HQ221025: Replication: sending JournalFileImpl: (hornetq-data-1.hq id = 148, recordID = 148) (size=10,485,760) to backup. AIOSequentialFile:/opt/hornetq/installed/hornetq-2.4.0.Final/bin/../data/journal/hornetq-data-1.hq
                07:13:22,943 INFO  [org.hornetq.core.server] HQ221025: Replication: sending JournalFileImpl: (hornetq-bindings-50.bindings id = 1, recordID = 1) (size=1,048,576) to backup. NIOSequentialFile ../data/bindings/hornetq-bindings-50.bindings
                07:13:22,947 INFO  [org.hornetq.core.server] HQ221025: Replication: sending JournalFileImpl: (hornetq-bindings-59.bindings id = 2, recordID = 2) (size=1,048,576) to backup. NIOSequentialFile ../data/bindings/hornetq-bindings-59.bindings
                07:13:22,968 INFO  [org.hornetq.core.server] HQ221025: Replication: sending JournalFileImpl: (hornetq-bindings-2.bindings id = 56, recordID = 56) (size=1,048,576) to backup. NIOSequentialFile ../data/bindings/hornetq-bindings-2.bindings
                

                 

                backup log:

                07:13:22,111 INFO  [org.hornetq.core.server] HQ221109: HornetQ Backup Server version 2.5.0.SNAPSHOT (Wild Hornet, 124) [null] started, waiting live to fail before it gets active
                07:13:23,933 INFO  [org.hornetq.core.server] HQ221024: Backup server HornetQServerImpl::serverUUID=705b9028-d9df-11e3-851c-8d2fc19a3b2f is synchronized with live-server.
                07:13:26,923 INFO  [org.hornetq.core.server] HQ221031: backup announced
                

                 

                killing live yields this in the backup:

                 

                07:15:41,216 WARN  [org.hornetq.core.client] HQ212037: Connection failure has been detected: HQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
                07:15:41,215 WARN  [org.hornetq.core.client] HQ212037: Connection failure has been detected: HQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
                07:15:41,251 WARN  [org.hornetq.core.client] HQ212037: Connection failure has been detected: HQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
                07:15:41,292 WARN  [org.hornetq.core.client] HQ212004: Failed to connect to server.
                07:15:41,289 INFO  [org.hornetq.core.server] HQ221037: HornetQServerImpl::serverUUID=705b9028-d9df-11e3-851c-8d2fc19a3b2f to become 'live'
                07:15:41,274 WARNING [io.netty.channel.AbstractChannel] Can't invoke task later as EventLoop rejected it: java.util.concurrent.RejectedExecutionException: event executor terminated
                        at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:703) [netty.jar:4.0.13.Final]
                        at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:296) [netty.jar:4.0.13.Final]
                        at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:688) [netty.jar:4.0.13.Final]
                        at io.netty.channel.AbstractChannel$AbstractUnsafe.invokeLater(AbstractChannel.java:727) [netty.jar:4.0.13.Final]
                        at io.netty.channel.AbstractChannel$AbstractUnsafe.deregister(AbstractChannel.java:593) [netty.jar:4.0.13.Final]
                        at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:565) [netty.jar:4.0.13.Final]
                        at io.netty.channel.DefaultChannelPipeline$HeadHandler.close(DefaultChannelPipeline.java:1018) [netty.jar:4.0.13.Final]
                        at io.netty.channel.DefaultChannelHandlerContext.invokeClose(DefaultChannelHandlerContext.java:561) [netty.jar:4.0.13.Final]
                        at io.netty.channel.DefaultChannelHandlerContext.close(DefaultChannelHandlerContext.java:546) [netty.jar:4.0.13.Final]
                        at io.netty.channel.ChannelDuplexHandler.close(ChannelDuplexHandler.java:73) [netty.jar:4.0.13.Final]
                        at io.netty.channel.DefaultChannelHandlerContext.invokeClose(DefaultChannelHandlerContext.java:561) [netty.jar:4.0.13.Final]
                        at io.netty.channel.DefaultChannelHandlerContext.close(DefaultChannelHandlerContext.java:546) [netty.jar:4.0.13.Final]
                        at io.netty.channel.DefaultChannelHandlerContext.close(DefaultChannelHandlerContext.java:424) [netty.jar:4.0.13.Final]
                        at io.netty.channel.DefaultChannelPipeline.close(DefaultChannelPipeline.java:826) [netty.jar:4.0.13.Final]
                        at io.netty.channel.AbstractChannel.close(AbstractChannel.java:178) [netty.jar:4.0.13.Final]
                        at io.netty.channel.ChannelFutureListener$2.operationComplete(ChannelFutureListener.java:56) [netty.jar:4.0.13.Final]
                        at io.netty.channel.ChannelFutureListener$2.operationComplete(ChannelFutureListener.java:52) [netty.jar:4.0.13.Final]
                        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:628) [netty.jar:4.0.13.Final]
                        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:593) [netty.jar:4.0.13.Final]
                        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:550) [netty.jar:4.0.13.Final]
                        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:414) [netty.jar:4.0.13.Final]
                        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:251) [netty.jar:4.0.13.Final]
                        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:502) [netty.jar:4.0.13.Final]
                        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452) [netty.jar:4.0.13.Final]
                        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346) [netty.jar:4.0.13.Final]
                        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101) [netty.jar:4.0.13.Final]
                        at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_05]
                
                
                07:15:42,190 INFO  [org.hornetq.core.server] HQ221003: trying to deploy queue jms.queue.DLQ
                07:15:42,257 INFO  [org.hornetq.core.server] HQ221003: trying to deploy queue jms.queue.ExpiryQueue
                07:15:42,444 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.123:5445
                07:15:42,451 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.123:5455
                
                

                 

                 

                restarting live.

                 

                backup log - many more lines like this:

                 

                07:16:36,746 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:37,477 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:41,746 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:42,477 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:46,749 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:47,477 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:51,749 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:52,478 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:56,749 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:57,478 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:17:01,750 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                

                 

                live log:

                 

                07:16:36,720 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.122:5445
                07:16:36,733 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.122:5455
                07:16:36,752 INFO  [org.hornetq.core.server] HQ221007: Server is now live
                07:16:36,753 INFO  [org.hornetq.core.server] HQ221001: HornetQ Server version 2.5.0.SNAPSHOT (Wild Hornet, 124) [705b9028-d9df-11e3-851c-8d2fc19a3b2f]
                07:16:41,746 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:42,478 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:46,748 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                07:16:47,478 WARN  [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
                

                 

                NOTE: I try deleting the data folder in both nodes. Same behavior. looks like after replication once it doesn't know it should be back as backup. Also, the "<backup>true</backup>" flag is only set on the backup configuration and not on the live configuration node. is that ok?

                • 6. Re: Re: Repititive Fail Over
                  Yair Ogen Expert

                  Just to verify I understand.

                   

                  If I set "<allow-failback>true</allow-failback>" and "<check-for-live-server>true</check-for-live-server>" when primary goes back up it will detect the other live server using it's id and the newly started primary will kill the backup.

                   

                  If I only set the "<check-for-live-server>true</check-for-live-server>" flag then the original primary server will detect there is an existing live server and it will take ownership and become a new backup instead.

                   

                  Is that correct?

                  • 7. Re: Repititive Fail Over
                    Justin Bertram Master

                    Is that correct?

                    Yes.