7 Replies Latest reply on May 13, 2014 10:39 AM by jbertram

Repititive Fail Over

yairogen May 13, 2014 4:33 AM

I've successfully managed to see fail over work in replicated servers. However, I can't seem to make the following use case work (using the core API):

clients run against Primary. Secondary running in backup mode. Clients create service locator with both ip's in the map.

I kill primary. Backup starts working. Clients re-connet to it. all is well.

I restart Primary. Clients continue working with Backup.

I kill backup. Expected that clients re-connect to primary. However, they don't and I get this error:

Caused by: HornetQNotConnectedException[errorType=NOT_CONNECTED message=HQ119007: Cannot connect to server(s). Tried with all available servers.]
  at org.hornetq.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:901)

Any Ideas?

1. Re: Repititive Fail Over

ataylor May 13, 2014 5:16 AM (in response to yairogen)

your live server must be incorrectly configured, make sure it announces itself correctly and that it announces the correct connector to use.
Actions
2. Re: Repititive Fail Over

yairogen May 13, 2014 6:20 AM (in response to ataylor)

Which files can I attach so you can help me out?

Also - following up on https://community.jboss.org/thread/234727, could it be I just need to add the "max-saved-replicated-journal-size" parameter? Should that help?
Actions
3. Re: Repititive Fail Over

ataylor May 13, 2014 6:25 AM (in response to yairogen)

check in the server logs that the backup gets announced on live restart, check in your configuration that the cluster connection is announcing the correct connector. if bot these are correct then it may be env. you could try debugging your client and seeing what topology gets set.
Actions

4. Re: Re: Repititive Fail Over

yairogen May 13, 2014 7:08 AM (in response to ataylor)

live log:

07:13:06,890 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.122:5445
07:13:06,903 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.122:5455
07:13:06,921 INFO  [org.hornetq.core.server] HQ221007: Server is now live
07:13:06,921 INFO  [org.hornetq.core.server] HQ221001: HornetQ Server version 2.5.0.SNAPSHOT (Wild Hornet, 124) [705b9028-d9df-11e3-851c-8d2fc19a3b2f]
07:13:22,739 INFO  [org.hornetq.core.server] HQ221025: Replication: sending JournalFileImpl: (hornetq-data-1.hq id = 148, recordID = 148) (size=10,485,760) to backup. AIOSequentialFile:/opt/hornetq/installed/hornetq-2.4.0.Final/bin/../data/journal/hornetq-data-1.hq
07:13:22,943 INFO  [org.hornetq.core.server] HQ221025: Replication: sending JournalFileImpl: (hornetq-bindings-50.bindings id = 1, recordID = 1) (size=1,048,576) to backup. NIOSequentialFile ../data/bindings/hornetq-bindings-50.bindings
07:13:22,947 INFO  [org.hornetq.core.server] HQ221025: Replication: sending JournalFileImpl: (hornetq-bindings-59.bindings id = 2, recordID = 2) (size=1,048,576) to backup. NIOSequentialFile ../data/bindings/hornetq-bindings-59.bindings
07:13:22,968 INFO  [org.hornetq.core.server] HQ221025: Replication: sending JournalFileImpl: (hornetq-bindings-2.bindings id = 56, recordID = 56) (size=1,048,576) to backup. NIOSequentialFile ../data/bindings/hornetq-bindings-2.bindings

backup log:

07:13:22,111 INFO  [org.hornetq.core.server] HQ221109: HornetQ Backup Server version 2.5.0.SNAPSHOT (Wild Hornet, 124) [null] started, waiting live to fail before it gets active
07:13:23,933 INFO  [org.hornetq.core.server] HQ221024: Backup server HornetQServerImpl::serverUUID=705b9028-d9df-11e3-851c-8d2fc19a3b2f is synchronized with live-server.
07:13:26,923 INFO  [org.hornetq.core.server] HQ221031: backup announced

killing live yields this in the backup:

07:15:41,216 WARN  [org.hornetq.core.client] HQ212037: Connection failure has been detected: HQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
07:15:41,215 WARN  [org.hornetq.core.client] HQ212037: Connection failure has been detected: HQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
07:15:41,251 WARN  [org.hornetq.core.client] HQ212037: Connection failure has been detected: HQ119015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
07:15:41,292 WARN  [org.hornetq.core.client] HQ212004: Failed to connect to server.
07:15:41,289 INFO  [org.hornetq.core.server] HQ221037: HornetQServerImpl::serverUUID=705b9028-d9df-11e3-851c-8d2fc19a3b2f to become 'live'
07:15:41,274 WARNING [io.netty.channel.AbstractChannel] Can't invoke task later as EventLoop rejected it: java.util.concurrent.RejectedExecutionException: event executor terminated
        at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:703) [netty.jar:4.0.13.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:296) [netty.jar:4.0.13.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:688) [netty.jar:4.0.13.Final]
        at io.netty.channel.AbstractChannel$AbstractUnsafe.invokeLater(AbstractChannel.java:727) [netty.jar:4.0.13.Final]
        at io.netty.channel.AbstractChannel$AbstractUnsafe.deregister(AbstractChannel.java:593) [netty.jar:4.0.13.Final]
        at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:565) [netty.jar:4.0.13.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadHandler.close(DefaultChannelPipeline.java:1018) [netty.jar:4.0.13.Final]
        at io.netty.channel.DefaultChannelHandlerContext.invokeClose(DefaultChannelHandlerContext.java:561) [netty.jar:4.0.13.Final]
        at io.netty.channel.DefaultChannelHandlerContext.close(DefaultChannelHandlerContext.java:546) [netty.jar:4.0.13.Final]
        at io.netty.channel.ChannelDuplexHandler.close(ChannelDuplexHandler.java:73) [netty.jar:4.0.13.Final]
        at io.netty.channel.DefaultChannelHandlerContext.invokeClose(DefaultChannelHandlerContext.java:561) [netty.jar:4.0.13.Final]
        at io.netty.channel.DefaultChannelHandlerContext.close(DefaultChannelHandlerContext.java:546) [netty.jar:4.0.13.Final]
        at io.netty.channel.DefaultChannelHandlerContext.close(DefaultChannelHandlerContext.java:424) [netty.jar:4.0.13.Final]
        at io.netty.channel.DefaultChannelPipeline.close(DefaultChannelPipeline.java:826) [netty.jar:4.0.13.Final]
        at io.netty.channel.AbstractChannel.close(AbstractChannel.java:178) [netty.jar:4.0.13.Final]
        at io.netty.channel.ChannelFutureListener$2.operationComplete(ChannelFutureListener.java:56) [netty.jar:4.0.13.Final]
        at io.netty.channel.ChannelFutureListener$2.operationComplete(ChannelFutureListener.java:52) [netty.jar:4.0.13.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:628) [netty.jar:4.0.13.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:593) [netty.jar:4.0.13.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:550) [netty.jar:4.0.13.Final]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:414) [netty.jar:4.0.13.Final]
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:251) [netty.jar:4.0.13.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:502) [netty.jar:4.0.13.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452) [netty.jar:4.0.13.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346) [netty.jar:4.0.13.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101) [netty.jar:4.0.13.Final]
        at java.lang.Thread.run(Thread.java:722) [rt.jar:1.7.0_05]


07:15:42,190 INFO  [org.hornetq.core.server] HQ221003: trying to deploy queue jms.queue.DLQ
07:15:42,257 INFO  [org.hornetq.core.server] HQ221003: trying to deploy queue jms.queue.ExpiryQueue
07:15:42,444 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.123:5445
07:15:42,451 INFO  [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.123:5455

restarting live.

backup log - many more lines like this:

07:16:36,746 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:37,477 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:41,746 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:42,477 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:46,749 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:47,477 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:51,749 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:52,478 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:56,749 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:57,478 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:17:01,750 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f

live log:

07:16:36,720 INFO [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.122:5445
07:16:36,733 INFO [org.hornetq.core.server] HQ221020: Started Netty Acceptor version 4.0.13.Final 10.45.37.122:5455
07:16:36,752 INFO [org.hornetq.core.server] HQ221007: Server is now live
07:16:36,753 INFO [org.hornetq.core.server] HQ221001: HornetQ Server version 2.5.0.SNAPSHOT (Wild Hornet, 124) [705b9028-d9df-11e3-851c-8d2fc19a3b2f]
07:16:41,746 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:42,478 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:46,748 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f
07:16:47,478 WARN [org.hornetq.core.client] HQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=705b9028-d9df-11e3-851c-8d2fc19a3b2f

NOTE: I try deleting the data folder in both nodes. Same behavior. looks like after replication once it doesn't know it should be back as backup. Also, the "<backup>true</backup>" flag is only set on the backup configuration and not on the live configuration node. is that ok?

5. Re: Re: Repititive Fail Over

ataylor May 13, 2014 7:14 AM (in response to yairogen)

HornetQ User Manual
Actions
6. Re: Re: Repititive Fail Over

yairogen May 13, 2014 7:59 AM (in response to ataylor)

Just to verify I understand.

If I set "<allow-failback>true</allow-failback>" and "<check-for-live-server>true</check-for-live-server>" when primary goes back up it will detect the other live server using it's id and the newly started primary will kill the backup.

If I only set the "<check-for-live-server>true</check-for-live-server>" flag then the original primary server will detect there is an existing live server and it will take ownership and become a new backup instead.

Is that correct?
Actions
7. Re: Repititive Fail Over

jbertram May 13, 2014 10:39 AM (in response to yairogen)

Is that correct?

Yes.
Actions

Go to original post