2 Replies Latest reply on Jan 20, 2014 8:31 AM by rhusar

jboss 6.0 cluster with two MasterNode (NPE in CoreGroupCommunicationService)

gcontini Jul 22, 2011 6:22 AM

I have a 2-node cluster running JBoss 6.0.

The two servers see each other fine; server1 (10.143.89.206) is the master node (cluster coordinator).

Now I pause server1 with:

{code}

pkill -STOP -f java

{code}

Server2 rightly detects the failure and becomes master:

{code}

2011-07-21 11:48:29,392 DEBUG [FD$Monitor] (Timer-2,<ADDR>) heartbeat missing from 10.143.89.206:1099 (number=4)

2011-07-21 11:48:33,393 DEBUG [FD$Monitor] (Timer-5,<ADDR>) sending are-you-alive msg to 10.143.89.206:1099 (own address=10.143.89.207:1099)

2011-07-21 11:48:33,393 DEBUG [FD$Monitor] (Timer-5,<ADDR>) [10.143.89.207:1099]: received no heartbeat ack from 10.143.89.206:1099 for 6 times (24000 milliseconds), suspecting it

2011-07-21 11:48:33,396 DEBUG [FD$BroadcastTask] (Timer-3,<ADDR>) broadcasting SUSPECT message [suspected_mbrs=[10.143.89.206:1099]] to group

....

2011-07-21 11:48:35,412 DEBUG [HASingletonController] (AsynchViewChangeHandler Thread) starting singleton, mSingleton=org.jboss.ha.singleton.HASingletonProfileManager@106051c1, mSingletonMBean=null

2011-07-21 11:48:35,412 DEBUG [HASingletonImpl] (AsynchViewChangeHandler Thread) startSingleton() : elected for master singleton node

{code}
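
For reference, the timing in the FD log lines above can be worked out as follows. This is only a minimal sketch: it assumes the per-heartbeat timeout is simply the reported total (24000 ms) divided by the reported number of misses (6). The class and method names are illustrative and are not part of the JGroups API.

```java
final class FdSuspicionDelay {

    // Total time a heartbeat-based detector waits before suspecting a
    // silent peer: one timeout per missed heartbeat ack.
    static long suspicionDelayMillis(long timeoutMillis, int missedAcks) {
        return timeoutMillis * missedAcks;
    }

    // Per-heartbeat timeout implied by a logged total and miss count.
    static long impliedTimeoutMillis(long totalMillis, int missedAcks) {
        return totalMillis / missedAcks;
    }

    public static void main(String[] args) {
        // Figures taken from the log line: "for 6 times (24000 milliseconds)".
        long timeout = impliedTimeoutMillis(24000, 6);
        System.out.println("implied timeout per heartbeat: " + timeout + " ms");
        System.out.println("suspicion delay: " + suspicionDelayMillis(timeout, 6) + " ms");
    }
}
```

The implied 4000 ms per heartbeat is consistent with the gap between the two FD log entries at 11:48:29,392 and 11:48:33,393.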

I can now see in the JMX console that server2 is the master node. Then I unpause server1:

{code}

pkill -CONT -f java

{code}

Server1 wakes up:

{code}

2011-07-21 11:48:55,682 WARN [FD] (OOB-11,null) I was suspected by 10.143.89.207:1099; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

2011-07-21 11:48:55,685 DEBUG [FLUSH] (Incoming-17,null) 10.143.89.206:1099: received START_FLUSH but I am not flush participant, not responding

...

{code}

Server2 decides it is the merge coordinator:

{code}

2011-07-21 11:49:00,231 DEBUG [Merger] (ViewHandler,clusterTest-HAPartition,10.143.89.207:1099) I (10.143.89.207:1099) will be the leader. Starting the merge task for [10.143.89.207:1099, 10.143.89.206:1099]

{code}

Server1 accepts this and installs the new view:

{code}

2011-07-21 11:48:59,361 INFO [org.jboss.ha.framework.server.ClusterPartition.clusterTest] CoreGroupCommunicationService (Incoming-8,null) New cluster view for partition clusterTest: 3 (org.jboss.ha.core.framework.server.CoreGroupCommunicationService$GroupView@4c6cb02a delta: 0, merge: true)

{code}

But now, if I go to the JMX console, I see that both nodes think they are the master (MasterNode=True).

What's wrong here?

In another, similar run with logging at TRACE level, I got this exception on server1:

{code}
2011-07-20 18:51:21,768 TRACE [CoreGroupCommunicationService$RpcHandler] (Incoming-6,null) Partition clusterTest rpc call threw exception: java.lang.NullPointerException
        at org.jboss.modcluster.ha.HAModClusterService$RpcHandler.clusterStatusComplete(HAModClusterService.java:887) [:1.1.0.Final]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [:1.6.0_24]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [:1.6.0_24]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [:1.6.0_24]
        at java.lang.reflect.Method.invoke(Method.java:597) [:1.6.0_24]
        at org.jgroups.blocks.MethodCall.invoke(MethodCall.java:351) [:2.12.1.Final]
        at org.jboss.ha.core.framework.server.CoreGroupCommunicationService$RpcHandler.handle(CoreGroupCommunicationService.java:1971) [:1.0.0.Final]
        at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:577) [:2.12.1.Final]
        at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:488) [:2.12.1.Final]
        at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:364) [:2.12.1.Final]
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:770) [:2.12.1.Final]
        at org.jboss.ha.core.jgroups.blocks.mux.DelegatingStateTransferUpHandler.up(DelegatingStateTransferUpHandler.java:63) [:1.0.0.Final]
        at org.jgroups.blocks.mux.MuxUpHandler.up(MuxUpHandler.java:99) [:2.12.1.Final]
        at org.jgroups.JChannel.up(JChannel.java:1484) [:2.12.1.Final]
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074) [:2.12.1.Final]
        at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:477) [:2.12.1.Final]
        at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:263) [:2.12.1.Final]
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:189) [:2.12.1.Final]
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:400) [:2.12.1.Final]
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) [:2.12.1.Final]
        at org.jgroups.protocols.pbcast.GMS.up(GMS.java:891) [:2.12.1.Final]
        at org.jgroups.protocols.VIEW_SYNC.up(VIEW_SYNC.java:170) [:2.12.1.Final]
        at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:246) [:2.12.1.Final]
        at org.jgroups.protocols.UNICAST.up(UNICAST.java:309) [:2.12.1.Final]
        at org.jgroups.protocols.pbcast.NAKACK.handleMessage(NAKACK.java:838) [:2.12.1.Final]
        at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:667) [:2.12.1.Final]
        at org.jgroups.protocols.BARRIER.up(BARRIER.java:119) [:2.12.1.Final]
        at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:133) [:2.12.1.Final]
        at org.jgroups.protocols.FD.up(FD.java:275) [:2.12.1.Final]
        at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:275) [:2.12.1.Final]
        at org.jgroups.protocols.MERGE2.up(MERGE2.java:209) [:2.12.1.Final]
        at org.jgroups.protocols.Discovery.up(Discovery.java:293) [:2.12.1.Final]
        at org.jgroups.protocols.PING.up(PING.java:69) [:2.12.1.Final]
        at org.jgroups.stack.Protocol.up(Protocol.java:413) [:2.12.1.Final]
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1109) [:2.12.1.Final]
        at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1665) [:2.12.1.Final]
        at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1647) [:2.12.1.Final]
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [:1.6.0_24]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [:1.6.0_24]
        at java.lang.Thread.run(Thread.java:662) [:1.6.0_24]
 
{code}

Maybe somebody is instantiating ModClusterServiceDRMEntry with mcmpServerStates=null? (BasicConstructorJoinPoint.dispatch?)

Thanks in advance.

Gabriele

server2.log.zip 4.5 KB
server1.log.zip 1.5 KB

1. Re: jboss 6.0 cluster with two MasterNode (NPE in CoreGroupCommunicationService)

stewart_g Jan 13, 2014 7:09 AM (in response to gcontini)

Hi Gabriele,

I was just about to post a similar question when I found yours. Did you ever receive a response from anyone, or did you find out how to fix it yourself?

Best Regards
Stewart
2. Re: jboss 6.0 cluster with two MasterNode (NPE in CoreGroupCommunicationService)

rhusar Jan 20, 2014 8:31 AM (in response to stewart_g)

A few things:

First of all, the test is wrong. I am not aware of any real situation that would be simulated by stopping the process. Remember that when you do that, the sockets remain "open" and the process is still there, but it won't process anything from the sockets. This also makes it harder for failure detection mechanisms (e.g. http://www.jgroups.org/javadoc/org/jgroups/protocols/FD_SOCK.html ) to work. Real scenarios would be more like pulling the cable for some time (network partition) or crashing the process.
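
To illustrate the difference, here is a rough sketch using plain java.net sockets on the loopback interface. It is not JGroups code, and the class, enum and method names are made up for the example. A listener that never calls accept() stands in for the stopped process: the connection is established by the OS and stays open, but nothing is ever read or answered. A listener that accepts and immediately closes stands in for a crashed process, whose sockets the OS closes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class FrozenPeerDemo {

    /** What a monitoring peer observes on its TCP connection. */
    enum Probe { CLOSED, SILENT, DATA }

    // Wait up to timeoutMillis for the peer to send something or close.
    static Probe probe(Socket socket, int timeoutMillis) throws IOException {
        socket.setSoTimeout(timeoutMillis);
        try {
            int b = socket.getInputStream().read();
            return b == -1 ? Probe.CLOSED : Probe.DATA;
        } catch (SocketTimeoutException e) {
            // Connection is still open, the peer just never answers.
            return Probe.SILENT;
        }
    }

    // Stand-in for a SIGSTOPped process: the listener exists, the OS
    // completes the TCP handshake, but the application never runs.
    static Probe probeFrozenPeer() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            return probe(client, 500);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for a crashed process: its end of the connection is closed.
    static Probe probeCrashedPeer() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            server.accept().close();
            return probe(client, 500);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("frozen peer:  " + probeFrozenPeer());
        System.out.println("crashed peer: " + probeCrashedPeer());
    }
}
```

A detector that relies on the connection being closed only gets a signal in the second case; in the first case it sees nothing, so detection falls back to heartbeat timeouts.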

Nevertheless, I remember there was a similar issue, I think in CoreGroupCommunicationService, that would lead to two singleton masters (I am sure you can find it and possibly backport the fix to AS 6). It is already fixed in EAP 5.

However, use of HAModClusterService is discouraged, and it has been discontinued since AS 7.

HTH,
Rado