5 Replies Latest reply on Dec 6, 2011 5:08 AM by wdfink

    Cluster doesn't work properly (HAPartition, JGroups)

    evgeniy.khist

      Hi!

      I'm working on project that requires clustering.

      Application runs on JBoss AS 5.1.0.

      After some time nodes become inactive (stops processing requests). This problem appears randomly, but periodically.

      Last time cluster suddenly crashes (3 of 4 nodes stopped serving requests) after 4 days of stable work.

       

      Log:

       

      org.jboss.messaging.core.impl.postoffice.MessagingPostOffice cannot find node ID for address

      ...

       

      org.jgroups.protocols.pbcast.GMS GMS flush by coordinator at failed

      ...

       

      2011-11-30 11:58:30,718 WARN  [CloserThread] org.jgroups.protocols.pbcast.GMS join(10.*.*.177:55489) sent to 10.*.*.168:36319 timed out (after 3000 ms), retrying

      2011-11-30 11:58:33,721 WARN  [CloserThread] org.jgroups.protocols.pbcast.GMS join(10.*.*.177:55489) sent to 10.*.*.168:36319 timed out (after 3000 ms), retrying

      ...

       

      2011-11-30 19:15:20,705 INFO  org.jboss.ha.framework.interfaces.HAPartition.mcps Suspected member: 10.*.*.177:55612

      2011-11-30 19:15:20,841 WARN  org.jgroups.protocols.pbcast.NAKACK 10.*.*.168:7901] discarded message from non-member 10.*.*.177:7902, my view is [10.*.0.177:7901|10]

      [10.*.*.177:7901, 10.*.*.168:7900, 10.*.*.168:7901, 10.*.*.168:7902, 10.*.*.177:7900]

      2011-11-30 19:15:23,140 INFO  org.jboss.ha.framework.interfaces.HAPartition.mcps New cluster view for partition mcps: 4 ([10.*.*.177:1999, 10.*.*.177:1199, 10.*.*.16

      8:1099, 10.*.*.168:1199, 10.*.*.168:1999] delta: -1)

      2011-11-30 19:15:23,143 WARN  org.jgroups.protocols.pbcast.NAKACK 10.*.*.168:49865] discarded message from non-member 10.*.*.177:55612, my view is [10.*.*.177:46275|

      4] [10.*.*.177:46275, 10.*.*.177:60615, 10.*.*.168:49865, 10.*.*.168:38167, 10.*.*.168:39465]

      2011-11-30 19:15:23,153 INFO org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mcps I am (10.*.*.168:1099) received membershipChanged event:

      2011-11-30 19:15:23,153 INFO org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mcps Dead members: 1 ([10.*.*.177:1099])

      2011-11-30 19:15:23,153 INFO org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mcps New Members : 0 ([])

      2011-11-30 19:15:23,153 INFO org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mcps All Members : 5 ([10.*.*.177:1999, 10.*.*.177:1199, 10.*.*.168:1099,

      1. 10.*.*.168:1199, 10.*.*.168:1999])

      2011-11-30 19:15:23,302 WARN  org.jgroups.protocols.pbcast.NAKACK 10.*.*.168:49865] discarded message from non-member 10.*.*.177:55612, my view is [10.*.*.177:46275|

      4] [10.*.*.177:46275, 10.*.*.177:60615, 10.*.*.168:49865, 10.*.*.168:38167, 10.*.*.168:39465]

      2011-11-30 19:15:23,302 WARN  org.jgroups.protocols.pbcast.NAKACK 10.*.*.168:49865] discarded message from non-member 10.44.0.177:55612, my view is [10.*.*.177:46275|

      ...

       

      Any help will be appreciated.

        • 1. Re: Cluster doesn't work properly (HAPartition, JGroups)
          wdfink

          As the cluster work for 4 days it looks like correct configured.

           

          The problems you might have

          - network problems (drop messages)

          - JVM GC problems

            do you have a GC logfile? check whether the GabageCollector will stop the JVM

            Heap dumps?

            Does the JVM process still alive?

           

          So please describe the situation in more detail.

          Also are the nodes reachable if you use jmx console or jndi/rmi direct?

          • 2. Re: Cluster doesn't work properly (HAPartition, JGroups)
            evgeniy.khist

            1. netstat –su

             

            Server1

            Udp:

                344220341 packets received

                5439 packets to unknown port received.

                1488574 packet receive errors

                267572900 packets sent

             

            Server2

            Udp:

            254074898 packets received

                1285 packets to unknown port received.

                1042284 packet receive errors

            158695095 packets sent

             

             

            2. JVM process is still alive. ps -a | grep java prints 3 processes on each node.

            top command shows, that one node consumes ~400% CPU while two other only 60% and 10%. On the node, that consumes 400% CPU is deployed application as HASingleton, that makes a lot of communications with database and external systems vis web services.

             

            3. I've tested multicast with jgroups.jar org.jgroups.tests.McastSenderTest and org.jgroups.tests.McastReceiverTest and no errors were found.

             

            4. I found some wierd records in gc.log

             

            10296.021: [GC 4767017K->2512442K(7978496K), 0.1893870 secs]

            10302.282: [GC 4885306K->2534090K(7925440K), 0.1945140 secs]

            10304.710: [GC 3441249K->2533003K(7995392K), 0.2057490 secs]

            10304.916: [Full GC 2533003K->2093480K(7995392K), 1.8050030 secs]

            10306.733: [GC 2099115K->2094632K(7995392K), 0.1128950 secs]

            10306.846: [Full GC 2094632K->2087924K(7995392K), 2.0128000 secs]

            10308.890: [GC 2117032K->2089748K(7995392K), 0.1083420 secs]

            10308.998: [Full GC 2089748K->2087185K(7995392K), 1.6173830 secs]

            10310.624: [GC 2087696K->2087409K(7995392K), 0.1084920 secs]

            10310.732: [Full GC 2087409K->2087077K(7995392K), 1.6917490 secs]

            10312.424: [GC 2087077K->2087077K(7995392K), 0.1097680 secs]

            10312.534: [Full GC 2087077K->1729346K(7995392K), 5.7572290 secs]

            10322.869: [GC 4088642K->1903865K(7995392K), 0.1089180 secs]

            10326.940: [GC 4263161K->2063524K(7997632K), 0.1548990 secs]

            10330.858: [GC 4425060K->2252980K(7995392K), 0.1604920 secs]

            10340.703: [GC 4614516K->2374188K(7995392K), 0.1710790 secs]

            10345.796: [GC 4733484K->2389607K(7995392K), 0.0996860 secs]

             

            But later, as you can see, everything goes just fine.

            • 3. Re: Cluster doesn't work properly (HAPartition, JGroups)
              wdfink

              1% package lost seems to me that it might be a problem for your cluster

              The Full GC of 1...5 sec. is not dangerous, it depends to the load. You might optimize it later.

               

               

              Think you have to check whether 400% CPU is normal.

              And check the network at the time where your cluster gets broken or can't merge.

              • 4. Re: Cluster doesn't work properly (HAPartition, JGroups)
                evgeniy.khist

                Network administrators denies 1% package lost could be the source of problem. But anyway I think it is not normal too.

                 

                On node, consuming 400% CPU is deployed heavy module as HASingletone for high availability purposes. This module fails sometimes. So when it fails on node 1, node 2 become master-node and this module starts there.

                Maybe this isn't the best architecture, do you know better solution? And can it be the source of cluster problem?

                • 5. Re: Cluster doesn't work properly (HAPartition, JGroups)
                  wdfink

                  If you have such havy application it might cause the problem with the cluster.

                  Maybe you have to rethink your approach and try to find a clustered solution to have somthing like a loadbalancing and not an overloading of one instance.

                  But I suppose this will be not an issue for the community. Maybe you think about a Red Hat consultant and/or a support subscription to have an assistance.