5 Replies Latest reply on Dec 6, 2011 5:08 AM by wdfink

Cluster doesn't work properly (HAPartition, JGroups)

evgeniy.khist Dec 1, 2011 11:35 AM

Hi!

I'm working on project that requires clustering.

Application runs on JBoss AS 5.1.0.

After some time nodes become inactive (stops processing requests). This problem appears randomly, but periodically.

Last time cluster suddenly crashes (3 of 4 nodes stopped serving requests) after 4 days of stable work.

Log:

org.jboss.messaging.core.impl.postoffice.MessagingPostOffice cannot find node ID for address

...

org.jgroups.protocols.pbcast.GMS GMS flush by coordinator at failed

...

2011-11-30 11:58:30,718 WARN [CloserThread] org.jgroups.protocols.pbcast.GMS join(10.*.*.177:55489) sent to 10.*.*.168:36319 timed out (after 3000 ms), retrying

2011-11-30 11:58:33,721 WARN [CloserThread] org.jgroups.protocols.pbcast.GMS join(10.*.*.177:55489) sent to 10.*.*.168:36319 timed out (after 3000 ms), retrying

...

2011-11-30 19:15:20,705 INFO org.jboss.ha.framework.interfaces.HAPartition.mcps Suspected member: 10.*.*.177:55612

2011-11-30 19:15:20,841 WARN org.jgroups.protocols.pbcast.NAKACK 10.*.*.168:7901] discarded message from non-member 10.*.*.177:7902, my view is [10.*.0.177:7901|10]

[10.*.*.177:7901, 10.*.*.168:7900, 10.*.*.168:7901, 10.*.*.168:7902, 10.*.*.177:7900]

2011-11-30 19:15:23,140 INFO org.jboss.ha.framework.interfaces.HAPartition.mcps New cluster view for partition mcps: 4 ([10.*.*.177:1999, 10.*.*.177:1199, 10.*.*.16

8:1099, 10.*.*.168:1199, 10.*.*.168:1999] delta: -1)

2011-11-30 19:15:23,143 WARN org.jgroups.protocols.pbcast.NAKACK 10.*.*.168:49865] discarded message from non-member 10.*.*.177:55612, my view is [10.*.*.177:46275|

4] [10.*.*.177:46275, 10.*.*.177:60615, 10.*.*.168:49865, 10.*.*.168:38167, 10.*.*.168:39465]

2011-11-30 19:15:23,153 INFO org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mcps I am (10.*.*.168:1099) received membershipChanged event:

2011-11-30 19:15:23,153 INFO org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mcps Dead members: 1 ([10.*.*.177:1099])

2011-11-30 19:15:23,153 INFO org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mcps New Members : 0 ([])

2011-11-30 19:15:23,153 INFO org.jboss.ha.framework.server.DistributedReplicantManagerImpl.mcps All Members : 5 ([10.*.*.177:1999, 10.*.*.177:1199, 10.*.*.168:1099,

10.*.*.168:1199, 10.*.*.168:1999])

2011-11-30 19:15:23,302 WARN org.jgroups.protocols.pbcast.NAKACK 10.*.*.168:49865] discarded message from non-member 10.*.*.177:55612, my view is [10.*.*.177:46275|

4] [10.*.*.177:46275, 10.*.*.177:60615, 10.*.*.168:49865, 10.*.*.168:38167, 10.*.*.168:39465]

2011-11-30 19:15:23,302 WARN org.jgroups.protocols.pbcast.NAKACK 10.*.*.168:49865] discarded message from non-member 10.44.0.177:55612, my view is [10.*.*.177:46275|

...

Any help will be appreciated.

1. Re: Cluster doesn't work properly (HAPartition, JGroups)

wdfink Dec 2, 2011 2:24 AM (in response to evgeniy.khist)

As the cluster work for 4 days it looks like correct configured.

The problems you might have
- network problems (drop messages)
- JVM GC problems
do you have a GC logfile? check whether the GabageCollector will stop the JVM
Heap dumps?
Does the JVM process still alive?

So please describe the situation in more detail.
Also are the nodes reachable if you use jmx console or jndi/rmi direct?
Actions
2. Re: Cluster doesn't work properly (HAPartition, JGroups)

evgeniy.khist Dec 2, 2011 8:10 AM (in response to wdfink)

1. netstat –su

Server1
Udp:
    344220341 packets received
    5439 packets to unknown port received.
    1488574 packet receive errors
    267572900 packets sent

Server2
Udp:
254074898 packets received
    1285 packets to unknown port received.
    1042284 packet receive errors
158695095 packets sent

2. JVM process is still alive. ps -a | grep java prints 3 processes on each node.
top command shows, that one node consumes ~400% CPU while two other only 60% and 10%. On the node, that consumes 400% CPU is deployed application as HASingleton, that makes a lot of communications with database and external systems vis web services.

3. I've tested multicast with jgroups.jar org.jgroups.tests.McastSenderTest and org.jgroups.tests.McastReceiverTest and no errors were found.

4. I found some wierd records in gc.log

10296.021: [GC 4767017K->2512442K(7978496K), 0.1893870 secs]
10302.282: [GC 4885306K->2534090K(7925440K), 0.1945140 secs]
10304.710: [GC 3441249K->2533003K(7995392K), 0.2057490 secs]
10304.916: [Full GC 2533003K->2093480K(7995392K), 1.8050030 secs]
10306.733: [GC 2099115K->2094632K(7995392K), 0.1128950 secs]
10306.846: [Full GC 2094632K->2087924K(7995392K), 2.0128000 secs]
10308.890: [GC 2117032K->2089748K(7995392K), 0.1083420 secs]
10308.998: [Full GC 2089748K->2087185K(7995392K), 1.6173830 secs]
10310.624: [GC 2087696K->2087409K(7995392K), 0.1084920 secs]
10310.732: [Full GC 2087409K->2087077K(7995392K), 1.6917490 secs]
10312.424: [GC 2087077K->2087077K(7995392K), 0.1097680 secs]
10312.534: [Full GC 2087077K->1729346K(7995392K), 5.7572290 secs]
10322.869: [GC 4088642K->1903865K(7995392K), 0.1089180 secs]
10326.940: [GC 4263161K->2063524K(7997632K), 0.1548990 secs]
10330.858: [GC 4425060K->2252980K(7995392K), 0.1604920 secs]
10340.703: [GC 4614516K->2374188K(7995392K), 0.1710790 secs]
10345.796: [GC 4733484K->2389607K(7995392K), 0.0996860 secs]

But later, as you can see, everything goes just fine.
Actions
3. Re: Cluster doesn't work properly (HAPartition, JGroups)

wdfink Dec 3, 2011 4:48 AM (in response to evgeniy.khist)

1% package lost seems to me that it might be a problem for your cluster
The Full GC of 1...5 sec. is not dangerous, it depends to the load. You might optimize it later.

Think you have to check whether 400% CPU is normal.
And check the network at the time where your cluster gets broken or can't merge.
Actions
4. Re: Cluster doesn't work properly (HAPartition, JGroups)

evgeniy.khist Dec 3, 2011 7:07 AM (in response to wdfink)

Network administrators denies 1% package lost could be the source of problem. But anyway I think it is not normal too.

On node, consuming 400% CPU is deployed heavy module as HASingletone for high availability purposes. This module fails sometimes. So when it fails on node 1, node 2 become master-node and this module starts there.
Maybe this isn't the best architecture, do you know better solution? And can it be the source of cluster problem?
Actions
5. Re: Cluster doesn't work properly (HAPartition, JGroups)

wdfink Dec 6, 2011 5:08 AM (in response to evgeniy.khist)

If you have such havy application it might cause the problem with the cluster.
Maybe you have to rethink your approach and try to find a clustered solution to have somthing like a loadbalancing and not an overloading of one instance.
But I suppose this will be not an issue for the community. Maybe you think about a Red Hat consultant and/or a support subscription to have an assistance.
Actions

Go to original post