Cluster messages not redistributed after node hard kill
parmstrong Nov 9, 2010 6:24 PM

We have a two-node cluster running HornetQ integrated with JBoss 5. We bring the two nodes up and they discover each other fine. I bring a client up and start sending messages to a queue that is clustered across the nodes. Everything works well for a while: the messages get load balanced across both nodes.

I can then get the cluster into a strange state in a couple of different ways, but the easiest is to go to one of the server nodes and kill the server process. The surviving server appears to detect that the other node went down and adjusts its cluster view, but from then on every other message I send to the server doesn't get processed. It looks like the message is placed on the bridge queue that was connecting the two nodes. The consumer count on the bridge queue is one, but the other server is dead, so the message is never picked up by anything.

What I would like is for the node that is still running to grab those messages and process them when the other node dies. In this case I can restart the killed node and it starts picking the messages up when it comes back. There are other situations, though, where the same sort of thing happens but even after restarting the one server it never starts taking messages off the bridge again. In that case I have to restart the entire cluster. So, two questions:
1. How do I get it to handle a node being hard killed, so that the live cluster node starts processing the messages meant for the dead node?
2. Have you seen a case where one of the nodes in a cluster gets into a state where, even after a restart, it can't rejoin the cluster in a way that lets it start processing messages again? If so, what is there to do about that?
I attached my hornetq-jms.xml for reference. Log from the surviving node (10.4.16.63):
16:09:22,620 INFO [PerseusPartition] New cluster view for partition PerseusPartition (id: 1, delta: 1) : [10.4.16.63:1099, 10.4.16.64:1099]
16:09:22,625 INFO [PerseusPartition] I am (10.4.16.63:1099) received membershipChanged event:
16:09:22,625 INFO [PerseusPartition] Dead members: 0 ([])
16:09:22,625 INFO [PerseusPartition] New Members : 1 ([10.4.16.64:1099])
16:09:22,625 INFO [PerseusPartition] All Members : 2 ([10.4.16.63:1099, 10.4.16.64:1099])
16:09:22,912 INFO [RPCManagerImpl] Received new cluster view: [10.4.16.63:37679|1] [10.4.16.63:37679, 10.4.16.64:39101]
16:09:34,348 INFO [BridgeImpl] Connecting bridge sf.my-cluster.52417568-e2df-11df-b017-000c29922be7 to its destination
16:09:34,563 INFO [BridgeImpl] Bridge sf.my-cluster.52417568-e2df-11df-b017-000c29922be7 is connected to its destination
16:10:10,307 INFO [RPCManagerImpl] Received new cluster view: [10.4.16.63:49982|1] [10.4.16.63:49982, 10.4.16.64:57256]
16:10:31,535 INFO [PerseusPartition] Suspected member: 10.4.16.64:39101
16:10:31,586 INFO [RPCManagerImpl] Received new cluster view: [10.4.16.63:37679|2] [10.4.16.63:37679]
16:10:31,589 INFO [PerseusPartition] New cluster view for partition PerseusPartition (id: 2, delta: -1) : [10.4.16.63:1099]
16:10:31,590 INFO [PerseusPartition] I am (10.4.16.63:1099) received membershipChanged event:
16:10:31,590 INFO [PerseusPartition] Dead members: 1 ([10.4.16.64:1099])
16:10:31,590 INFO [PerseusPartition] New Members : 0 ([])
16:10:31,590 INFO [PerseusPartition] All Members : 1 ([10.4.16.63:1099])
16:10:31,684 INFO [RPCManagerImpl] Received new cluster view: [10.4.16.63:49982|2] [10.4.16.63:49982]
16:11:26,281 WARN [InterceptorsFactory] EJBTHREE-1246: Do not use InterceptorsFactory with a ManagedObjectAdvisor, InterceptorRegistry should be used via the bean container
16:11:26,281 WARN [InterceptorsFactory] EJBTHREE-1246: Do not use InterceptorsFactory with a ManagedObjectAdvisor, InterceptorRegistry should be used via the bean container
16:11:36,174 WARN [RemotingConnectionImpl] Connection failure has been detected: Did not receive ping from /10.4.16.64:60438. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. The connection will now be closed. [code=3]
16:11:36,175 WARN [ServerSessionImpl] Client connection failed, clearing up resources for session 6a17c798-ec56-11df-bb46-000c29922be7
16:11:36,175 WARN [ServerSessionImpl] Cleared up resources for session 6a17c798-ec56-11df-bb46-000c29922be7
16:11:36,176 WARN [ServerSessionPacketHandler] Client connection failed, clearing up resources for session 6a17c798-ec56-11df-bb46-000c29922be7
16:11:36,176 WARN [ServerSessionPacketHandler] Cleared up resources for session 6a17c798-ec56-11df-bb46-000c29922be7
16:12:05,769 ERROR [ServerThread] WorkerThread#0[10.4.11.211:64297] exception occurred during first invocation
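
To make the "every other message" symptom concrete, here is a toy model of what I think is happening. It is only a sketch of my guess, not HornetQ code: the class name, fields, and the strict round-robin between a local consumer and the bridge are all assumptions made up for illustration, and the real load-balancing logic may differ.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model only (hypothetical names); this is not how HornetQ is implemented. */
public class BridgeStallSketch {

    /** Messages consumed by the local consumer on the surviving node. */
    final List<String> processedLocally = new ArrayList<>();
    /** Messages parked on the bridge queue, waiting to be forwarded. */
    final Deque<String> bridgeQueue = new ArrayDeque<>();
    /** Messages the remote node has actually consumed. */
    final List<String> processedRemotely = new ArrayList<>();
    /** Whether the remote node is currently reachable. */
    boolean remoteAlive = true;

    private int next = 0;

    /** Assumed behaviour: alternate between the local consumer and the bridge. */
    void send(String msg) {
        if (next++ % 2 == 0) {
            processedLocally.add(msg);
        } else {
            bridgeQueue.add(msg);
            drainBridge();
        }
    }

    /** Parked messages move on only while the remote node is reachable. */
    void drainBridge() {
        while (remoteAlive && !bridgeQueue.isEmpty()) {
            processedRemotely.add(bridgeQueue.poll());
        }
    }

    public static void main(String[] args) {
        BridgeStallSketch node = new BridgeStallSketch();
        node.remoteAlive = false;                 // simulate the hard kill
        for (int i = 1; i <= 6; i++) {
            node.send("m" + i);
        }
        System.out.println("processed locally: " + node.processedLocally);
        System.out.println("stuck on bridge:   " + node.bridgeQueue);

        node.remoteAlive = true;                  // simulate restarting the node
        node.drainBridge();
        System.out.println("remote processed after restart: " + node.processedRemotely);
    }
}
```

In this model, with the remote node down, m1, m3 and m5 are processed locally while m2, m4 and m6 sit on the bridge until the node comes back, which matches what I see in the first scenario. It does not explain the second scenario, where a restart does not drain the bridge.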
Attachment: hornetq-jms.xml (1.9 KB)