Cluster synchronisation problem after nodes are restarted

thoemic Dec 14, 2015 10:10 AM

Hi,

I have a cluster with 8 nodes on WildFly 8.1.0.Final. All the nodes run in standalone mode. After I restart a couple of nodes, some of them have trouble communicating.

The problem can be reproduced as follows:

1. Start all the nodes. Cluster synchronisation works.
2. Kill node 3 and node 5.
3. Start node 3.
4. Start node 5.
5. Send a message to the topic on node 5.
   --> Node 3 does not receive the message, but all the other nodes do.
6. Send a message to the topic on node 3.
   --> All the nodes receive the message.
7. Send a message to the topic on each of the other nodes.
   --> All the nodes receive the message.

Inspecting the log of node 3 reveals the following entry, which looks suspicious to me:

org.hornetq.core.server] (Thread-15 (HornetQ-client-global-threads--577881222)) HQ222139: MessageFlowRecordImpl [nodeID=8872394f-a1ce-11e5-8297-23a69032a478, connector=TransportConfiguration(name=netty, factory=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory) ?port=5446&host=172-17-0-125&use-nio=false, queueName=sf.my-cluster.8872394f-a1ce-11e5-8297-23a69032a478, queue=QueueImpl[name=sf.my-cluster.8872394f-a1ce-11e5-8297-23a69032a478, postOffice=PostOfficeImpl [server=HornetQServerImpl::serverUUID=4d8329e7-a26b-11e5-b9e9-690d2e81ad4e]]@2c2e499a, isClosed=false, firstReset=true]::Remote queue binding jms.topic.BsiCrmClusterSyncTopic7c6ce5fd-a26b-11e5-b28e-e329775e0980 has already been bound in the post office. Most likely cause for this is you have a loop in your cluster due to cluster max-hops being too large or you have multiple cluster connections to the same nodes using overlapping addresses

This log entry can only be found in the log of node 3.
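
For reference, the warning points at two settings in the messaging subsystem of standalone.xml: the max-hops value, and whether more than one cluster-connection element covers the same address. The fragment below is only a sketch of what that element typically looks like, not a copy of the attached file. The names my-cluster and netty are taken from the log entry above; the address jms, the discovery group name dg-group1 and the max-hops value of 1 (which I believe is the HornetQ default) are assumptions.

```xml
<!-- Sketch only: values marked "assumed" are not taken from the attached standalone.xml -->
<cluster-connections>
    <!-- Cluster name as it appears in the queue name sf.my-cluster.* in the log entry -->
    <cluster-connection name="my-cluster">
        <!-- assumed: only addresses starting with "jms" are clustered -->
        <address>jms</address>
        <!-- Connector name as it appears in the log entry -->
        <connector-ref>netty</connector-ref>
        <!-- assumed: 1 means messages are forwarded to directly connected nodes only -->
        <max-hops>1</max-hops>
        <!-- assumed: hypothetical discovery group name -->
        <discovery-group-ref discovery-group-name="dg-group1"/>
    </cluster-connection>
    <!-- A second cluster-connection with an overlapping address would be the other cause named in the warning -->
</cluster-connections>
```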

Does anyone know how to solve this issue? My standalone.xml is attached.

Best,

Thomas

standalone.xml.zip 4.1 KB