4 Replies Latest reply on Jan 30, 2013 2:44 AM by swapnath

    Problem with GMS joining nodes when master node was killed

    felixreuthlinger

      Hi,

       

      I have a problem starting up my JBoss 5.1.0.GA cluster again. I had one node running, which was selected as master for GMS. Then I started two other nodes, which joined the cluster at first. But then I had to bring down all nodes again because of deployment errors. The node that had been selected as master didn't shut down gracefully, so I had to kill the Java process.

       

      After I did that, every single node seems to be trying to connect to the former master node at its former port:

       

      Log from the master node:

      2010-05-31 12:23:31,445 WARN  [org.jgroups.protocols.pbcast.GMS] join(141.82.59.224:54939) sent to 141.82.59.224:32789 timed out (after 3000 ms), retrying
      2010-05-31 12:23:36,445 WARN  [org.jgroups.protocols.pbcast.GMS] join(141.82.59.224:54939) sent to 141.82.59.224:32789 timed out (after 3000 ms), retrying

       

      Log from a random other node:

      2010-05-31 12:07:49,581 WARN  [org.jgroups.protocols.pbcast.GMS] join(141.82.59.211:40331) sent to 141.82.59.224:32789 timed out (after 3000 ms), retrying
      2010-05-31 12:07:54,582 WARN  [org.jgroups.protocols.pbcast.GMS] join(141.82.59.211:40331) sent to 141.82.59.224:32789 timed out (after 3000 ms), retrying
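
      To check that every node really is targeting the same stale address, the WARN lines can be tallied by join target. This is only a minimal sketch, assuming the log format shown above; the class name GmsJoinTargets and the method countTargets are made up for illustration and are not part of JBoss or JGroups.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JBoss or JGroups.
public class GmsJoinTargets {

    // Matches the GMS warning format seen in the logs:
    //   join(<joiner>) sent to <target> timed out
    private static final Pattern JOIN_TIMEOUT =
            Pattern.compile("join\\(([^)]+)\\) sent to (\\S+) timed out");

    /** Counts how many timed-out join requests were sent to each target address. */
    public static Map<String, Integer> countTargets(List<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = JOIN_TIMEOUT.matcher(line);
            if (m.find()) {
                // group(2) is the address the join request was sent to
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "2010-05-31 12:23:31,445 WARN  [org.jgroups.protocols.pbcast.GMS] "
                        + "join(141.82.59.224:54939) sent to 141.82.59.224:32789 "
                        + "timed out (after 3000 ms), retrying",
                "2010-05-31 12:07:49,581 WARN  [org.jgroups.protocols.pbcast.GMS] "
                        + "join(141.82.59.211:40331) sent to 141.82.59.224:32789 "
                        + "timed out (after 3000 ms), retrying");
        // Both joiners target the same address, so the map has a single entry.
        System.out.println(countTargets(lines));
    }
}
```

      If the tally shows a single target across all nodes, as it does for the two excerpts above, that supports the suspicion that they all still treat the killed node as coordinator.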

       

      I have tried everything I can think of to get rid of the stale information, apparently held somewhere, that makes every node try to connect to the former master node at the old port. At this point I can't figure out where that information is held or how to reset this retry loop on each node.

       

      I've been looking around for a similar problem, but everything I found seems to have a different root cause, and I don't know much about what to change in the JGroups config.

       

      Any suggestions?

       

      Regards,

      Felix

        • 1. Re: Problem with GMS joining nodes when master node was killed
          praveen.kumar

          Hi Felix,

           

            This problem may be due to multiple NICs/IP addresses. Edit the following two files:

           

          .../deploy/cluster-service.xml

          .../deploy/tc5-cluster.sar/META-INF/jboss-service.xml

           


          <!--
          The default UDP stack:
          - If you have a multihomed machine, set the UDP protocol's bind_addr attribute to the
          appropriate NIC IP address, e.g bind_addr="192.168.0.2".
          - On Windows machines, because of the media sense feature being broken with multicast
          (even after disabling media sense) set the UDP protocol's loopback attribute to true
          -->

           

          Here, in bind_addr, you specify the IP address that you want to bind to for multicasting.

           

          Check whether that works.
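
          To see which addresses are candidates for bind_addr on a multihomed machine, the interfaces can be listed with the standard java.net.NetworkInterface API. This is just a rough sketch; the class name ListBindCandidates is made up for illustration, and which address is the right one still depends on your network.

```java
import java.io.UncheckedIOException;
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of JBoss or JGroups.
public class ListBindCandidates {

    /** Lists IPv4 addresses of interfaces that are up and not loopback. */
    public static List<String> candidates() {
        List<String> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return result; // no interfaces reported on this machine
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                if (!nic.isUp() || nic.isLoopback()) {
                    continue; // skip interfaces that cannot carry cluster traffic
                }
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    if (addr instanceof Inet4Address) {
                        result.add(nic.getName() + " -> " + addr.getHostAddress()
                                + " (multicast: " + nic.supportsMulticast() + ")");
                    }
                }
            }
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // More than one line of output suggests a multihomed machine.
        candidates().forEach(System.out::println);
    }
}
```

          If this prints more than one address, the node may be binding to a different interface than the other members expect, which is the situation the comment above describes.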

           

          Cheers,

          Praveen Kumar

          • 2. Re: Problem with GMS joining nodes when master node was killed
            payne51558

            Did you ever find a resolution to this? I am experiencing the same issue with 15 nodes on JBoss 5.1 GA, and the only workaround I have found is to change the multicast address, then shut down the nodes one at a time and update each with the new multicast address. This only gets me around the issue temporarily: it seems to come back after a few days, or whenever I manually take a node down for maintenance.

             

             

            2010-10-21 15:34:45,206 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] (main) Initializing partition DefaultPartition
            2010-10-21 15:34:45,277 INFO  [STDOUT] (JBoss System Threads(1)-3)
            ---------------------------------------------------------
            GMS: address is 10.230.3.15:54390 (cluster=DefaultPartition)
            ---------------------------------------------------------
            2010-10-21 15:34:45,385 INFO  [org.jboss.cache.jmx.PlatformMBeanServerRegistration] (main) JBossCache MBeans were successfully registered to the platform mbean server.
            2010-10-21 15:34:45,449 INFO  [STDOUT] (main)
            ---------------------------------------------------------
            GMS: address is 10.230.3.15:54390 (cluster=DefaultPartition-HAPartitionCache)
            ---------------------------------------------------------
            2010-10-21 15:34:45,501 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] (JBoss System Threads(1)-3) Number of cluster members: 15
            2010-10-21 15:34:45,504 INFO  [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] (JBoss System Threads(1)-3) Other members: 14
            2010-10-21 15:34:48,464 WARN  [org.jgroups.protocols.pbcast.GMS] (main) join(10.230.3.15:54390) sent to 10.230.3.16:42616 timed out (after 3000 ms), retrying
            2010-10-21 15:34:51,467 WARN  [org.jgroups.protocols.pbcast.GMS] (main) join(10.230.3.15:54390) sent to 10.230.3.16:42616 timed out (after 3000 ms), retrying
            2010-10-21 15:34:54,470 WARN  [org.jgroups.protocols.pbcast.GMS] (main) join(10.230.3.15:54390) sent to 10.230.3.16:42616 timed out (after 3000 ms), retrying

             

            Also seeing this on the 10.230.3.16 node that 10.230.3.15 is trying to connect to:

            2010-10-21 17:02:14,146 WARN  [org.jgroups.protocols.pbcast.GMS] (ViewHandler,DefaultPartition-HAPartitionCache,10.230.3.16:42616) GMS flush by coordinator at 10.230.3.16:42616 failed

             

             

            Appreciate your response!

             

            Thanks

             

            Cody

             

            • 3. Re: Problem with GMS joining nodes when master node was killed
              felixreuthlinger

              Hey Praveen, hey Cody,

               

              First, @ Praveen Kumar: I did not want to bind the address to a specific IP, so I did not try that approach.

               

              And to answer your question, @ Cody: no, I did not solve this problem. When it happened to me, I had to clean and restart the nodes more than once until the stale information was somehow lost and the nodes could join a well-formed cluster again. But I can't remember exactly how I got it running again.

               

              Cheers

              Felix

              • 4. Re: Problem with GMS joining nodes when master node was killed
                swapnath

                Hi Felix,

                 

                  Did you find a solution for this? I have a similar problem.