4 Replies Latest reply on Jan 30, 2013 2:44 AM by swapnath

Problem with GMS joining nodes when master node was killed

felixreuthlinger May 31, 2010 6:37 AM

Hi,

I have a problem starting up my JBoss 5.1.0.GA cluster again. I had one node running, which was selected as master for GMS. Then I started two other nodes, which joined the cluster at first. But then I had to bring down all nodes again because of deployment errors. The node that had been selected as master didn't shut down gracefully, so I had to kill the Java process.

After I did that, every single node seems to be trying to connect to the former master node at the former port:

Log from the master node:

2010-05-31 12:23:31,445 WARN  [org.jgroups.protocols.pbcast.GMS] join(141.82.59.224:54939) sent to 141.82.59.224:32789 timed out (after 3000 ms), retrying
2010-05-31 12:23:36,445 WARN  [org.jgroups.protocols.pbcast.GMS] join(141.82.59.224:54939) sent to 141.82.59.224:32789 timed out (after 3000 ms), retrying

Log from a random other node:

2010-05-31 12:07:49,581 WARN  [org.jgroups.protocols.pbcast.GMS] join(141.82.59.211:40331) sent to 141.82.59.224:32789 timed out (after 3000 ms), retrying
2010-05-31 12:07:54,582 WARN  [org.jgroups.protocols.pbcast.GMS] join(141.82.59.211:40331) sent to 141.82.59.224:32789 timed out (after 3000 ms), retrying

I tried everything I could think of to get rid of the information, apparently held somewhere, that makes every node try to connect to the former master node on the old port. At this point I can't imagine where that information is held or what to do to reset this hanging retry loop on each node.

I've been looking around for a similar problem, but everything I found seems to have a different root cause, and I don't know much about what to change in the JGroups config.

Any suggestions?

Regards,

Felix

1. Re: Problem with GMS joining nodes when master node was killed

praveen.kumar Jun 11, 2010 4:03 AM (in response to felixreuthlinger)

Hi Felix,

This problem may be due to multiple NICs/IP addresses. Edit the following two files:

.../deploy/cluster-service.xml
.../deploy/tc5-cluster.sar/META-INF/jboss-service.xml



In bind_addr, specify the IP address that you want to bind to for multicasting.
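
For illustration only, here is a minimal sketch of what such a bind_addr setting might look like on the UDP transport element of a JGroups stack. The address 192.168.0.10 is a placeholder, and the multicast address and port are example values that do not come from this thread. All other attributes of your existing element should stay as they are. In JBoss 5.x the stack definition may live in a different file than the two listed above, so check where your server profile actually defines the UDP element.

```xml
<!-- Hypothetical fragment: 192.168.0.10, 228.1.2.3 and 45566 are placeholder
     values. Only bind_addr is the point; keep your other attributes as-is. -->
<UDP bind_addr="192.168.0.10"
     mcast_addr="228.1.2.3"
     mcast_port="45566"/>
```

On a host with several NICs, each node would set bind_addr to its own address on the interface that should carry the cluster traffic.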

Check whether that works.

Cheers,
Praveen Kumar
2. Re: Problem with GMS joining nodes when master node was killed

payne51558 Oct 21, 2010 5:03 PM (in response to felixreuthlinger)

Did you ever find a resolution to this? I am experiencing the same issue with 15 nodes on JBoss 5.1 GA. The only workaround I have found is to change the multicast address, shutting down the nodes one at a time and updating each with the new multicast address. This only gets me around the issue temporarily: it seems to come back after a few days, or if I manually take a node down for maintenance.

2010-10-21 15:34:45,206 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] (main) Initializing partition DefaultPartition
2010-10-21 15:34:45,277 INFO [STDOUT] (JBoss System Threads(1)-3)
---------------------------------------------------------
GMS: address is 10.230.3.15:54390 (cluster=DefaultPartition)
---------------------------------------------------------
2010-10-21 15:34:45,385 INFO [org.jboss.cache.jmx.PlatformMBeanServerRegistration] (main) JBossCache MBeans were successfully registered to the platform mbean server.
2010-10-21 15:34:45,449 INFO [STDOUT] (main)
---------------------------------------------------------
GMS: address is 10.230.3.15:54390 (cluster=DefaultPartition-HAPartitionCache)
---------------------------------------------------------
2010-10-21 15:34:45,501 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] (JBoss System Threads(1)-3) Number of cluster members: 15
2010-10-21 15:34:45,504 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] (JBoss System Threads(1)-3) Other members: 14
2010-10-21 15:34:48,464 WARN [org.jgroups.protocols.pbcast.GMS] (main) join(10.230.3.15:54390) sent to 10.230.3.16:42616 timed out (after 3000 ms), retrying
2010-10-21 15:34:51,467 WARN [org.jgroups.protocols.pbcast.GMS] (main) join(10.230.3.15:54390) sent to 10.230.3.16:42616 timed out (after 3000 ms), retrying
2010-10-21 15:34:54,470 WARN [org.jgroups.protocols.pbcast.GMS] (main) join(10.230.3.15:54390) sent to 10.230.3.16:42616 timed out (after 3000 ms), retrying

I am also seeing this on the 10.230.3.16 node, which 10.230.3.15 is trying to connect to:
2010-10-21 17:02:14,146 WARN [org.jgroups.protocols.pbcast.GMS] (ViewHandler,DefaultPartition-HAPartitionCache,10.230.3.16:42616) GMS flush by coordinator at 10.230.3.16:42616 failed

Appreciate your response!

Thanks

Cody

3. Re: Problem with GMS joining nodes when master node was killed

felixreuthlinger Oct 22, 2010 3:10 AM (in response to payne51558)

Hey Praveen, hey Cody,

@Praveen Kumar: first of all, I did not want to bind the address to a specific IP, and I did not try that approach.

@Cody: as for your question, no, I did not solve this problem. When it happened to me, I had to clean and restart the nodes more than once until somehow the information got lost and the nodes could join a well-formed cluster again. But I can't remember exactly how I got it running again.

Cheers
Felix
4. Re: Problem with GMS joining nodes when master node was killed

swapnath Jan 30, 2013 2:44 AM (in response to felixreuthlinger)

Hi Felix,

Did you find a solution for this? I have a similar problem.