2 Replies Latest reply on Jan 20, 2014 8:31 AM by rhusar

    JBoss 6.0 cluster with two MasterNodes (NPE in CoreGroupCommunicationService)

    gcontini

      I have a 2-node cluster running JBoss 6.0.

      The two servers see each other fine; server1 (10.143.89.206) is the master node (cluster coordinator) and server2 is 10.143.89.207.

       

      Now I pause server1 with:

      {code}

      pkill -STOP -f java

      {code}

       

      Server2 rightfully detects the failure and becomes master:

      {code}

      2011-07-21 11:48:29,392 DEBUG [FD$Monitor] (Timer-2,<ADDR>) heartbeat missing from 10.143.89.206:1099 (number=4)

      2011-07-21 11:48:33,393 DEBUG [FD$Monitor] (Timer-5,<ADDR>) sending are-you-alive msg to 10.143.89.206:1099 (own address=10.143.89.207:1099)

      2011-07-21 11:48:33,393 DEBUG [FD$Monitor] (Timer-5,<ADDR>) [10.143.89.207:1099]: received no heartbeat ack from 10.143.89.206:1099 for 6 times (24000 milliseconds), suspecting it

      2011-07-21 11:48:33,396 DEBUG [FD$BroadcastTask] (Timer-3,<ADDR>) broadcasting SUSPECT message [suspected_mbrs=[10.143.89.206:1099]] to group

      ....

      2011-07-21 11:48:35,412 DEBUG [HASingletonController] (AsynchViewChangeHandler Thread) starting singleton, mSingleton=org.jboss.ha.singleton.HASingletonProfileManager@106051c1, mSingletonMBean=null

      2011-07-21 11:48:35,412 DEBUG [HASingletonImpl] (AsynchViewChangeHandler Thread) startSingleton() : elected for master singleton node

      {code}
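
      The "6 times (24000 milliseconds)" in the log above is consistent with the JGroups FD protocol waiting `timeout` milliseconds per heartbeat and suspecting a member after `max_tries` misses. The sketch below only reproduces that arithmetic; the value 4000 ms is inferred from the log (24000 / 6), not copied from the actual configuration, and the class and method names are made up for illustration.

```java
/** Back-of-the-envelope timing for failure detection, using values inferred from the log. */
final class FdTiming {

    /**
     * Time until a silent member is suspected: one timeout period is allowed
     * per heartbeat, and the member is suspected after maxTries misses in a row.
     */
    static long suspicionDelayMillis(long timeoutMillis, int maxTries) {
        if (timeoutMillis <= 0 || maxTries <= 0) {
            throw new IllegalArgumentException("timeout and maxTries must be positive");
        }
        return timeoutMillis * maxTries;
    }

    public static void main(String[] args) {
        long timeoutMillis = 4000; // assumption: 24000 ms / 6 tries, as reported in the log
        int maxTries = 6;          // "for 6 times" in the log
        System.out.println(suspicionDelayMillis(timeoutMillis, maxTries) + " ms");
    }
}
```

      With these values server1 has to stay paused for roughly 24 seconds or more before server2 suspects it, which matches the timestamps above.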

       

      From the JMX console I can now see that server2 is the MasterNode. Next I unpause server1:

      {code}

      pkill -CONT -f java

      {code}

       

       

      Server1 wakes up:

      {code}

      2011-07-21 11:48:55,682 WARN  [FD] (OOB-11,null) I was suspected by 10.143.89.207:1099; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      2011-07-21 11:48:55,685 DEBUG [FLUSH] (Incoming-17,null) 10.143.89.206:1099: received START_FLUSH but I am not flush participant, not responding

      ...

      {code}

       

      Server2 decides it is the merge coordinator:

      {code}

      2011-07-21 11:49:00,231 DEBUG [Merger] (ViewHandler,clusterTest-HAPartition,10.143.89.207:1099) I (10.143.89.207:1099) will be the leader. Starting the merge task for [10.143.89.207:1099, 10.143.89.206:1099]

      {code}

       

      Server1 understands and installs the view:

      {code}

      2011-07-21 11:48:59,361 INFO  [org.jboss.ha.framework.server.ClusterPartition.clusterTest] CoreGroupCommunicationService (Incoming-8,null) New cluster view for partition clusterTest: 3 (org.jboss.ha.core.framework.server.CoreGroupCommunicationService$GroupView@4c6cb02a delta: 0, merge: true)

      {code}

       

      But now, if I go to the JMX console, I see that both nodes think they are the master (MasterNode=True)...

      What's wrong here?
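
      To make the expectation concrete, here is a minimal sketch of the invariant that should hold after the merge: once both nodes agree on the same view, exactly one of them should report itself as master. It assumes a simplified rule where the first member of the agreed view is the master. That rule, the class name and the member order are assumptions for illustration, not necessarily what HASingletonController actually does.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model: the first member of the agreed view is the master. */
final class MasterElectionSketch {

    /** Returns true if self would consider itself master for the given view. */
    static boolean isMaster(List<String> view, String self) {
        return !view.isEmpty() && view.get(0).equals(self);
    }

    /** Counts how many members of the view consider themselves master. */
    static int countMasters(List<String> view) {
        int masters = 0;
        for (String member : view) {
            if (isMaster(view, member)) {
                masters++;
            }
        }
        return masters;
    }

    public static void main(String[] args) {
        // Member order taken from the Merger log line; the real view order may differ.
        List<String> merged = Arrays.asList("10.143.89.207:1099", "10.143.89.206:1099");
        System.out.println("masters in merged view: " + countMasters(merged));
    }
}
```

      Under this model two masters can only appear if the nodes evaluate the rule against different views, for example if one node keeps acting on a view from before the merge.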

       

      In another, similar run with logging at TRACE level, I got this exception on server1:

       

      {code}

      2011-07-20 18:51:21,768 TRACE [CoreGroupCommunicationService$RpcHandler] (Incoming-6,null) Partition clusterTest rpc call threw exception: java.lang.NullPointerException

              at org.jboss.modcluster.ha.HAModClusterService$RpcHandler.clusterStatusComplete(HAModClusterService.java:887) [:1.1.0.Final]

              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [:1.6.0_24]

              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) [:1.6.0_24]

              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) [:1.6.0_24]

              at java.lang.reflect.Method.invoke(Method.java:597) [:1.6.0_24]

              at org.jgroups.blocks.MethodCall.invoke(MethodCall.java:351) [:2.12.1.Final]

              at org.jboss.ha.core.framework.server.CoreGroupCommunicationService$RpcHandler.handle(CoreGroupCommunicationService.java:1971) [:1.0.0.Final]

              at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:577) [:2.12.1.Final]

              at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:488) [:2.12.1.Final]

              at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:364) [:2.12.1.Final]

              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:770) [:2.12.1.Final]

              at org.jboss.ha.core.jgroups.blocks.mux.DelegatingStateTransferUpHandler.up(DelegatingStateTransferUpHandler.java:63) [:1.0.0.Final]

              at org.jgroups.blocks.mux.MuxUpHandler.up(MuxUpHandler.java:99) [:2.12.1.Final]

              at org.jgroups.JChannel.up(JChannel.java:1484) [:2.12.1.Final]

              at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1074) [:2.12.1.Final]

              at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:477) [:2.12.1.Final]

              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:263) [:2.12.1.Final]

              at org.jgroups.protocols.FRAG2.up(FRAG2.java:189) [:2.12.1.Final]

              at org.jgroups.protocols.FlowControl.up(FlowControl.java:400) [:2.12.1.Final]

              at org.jgroups.protocols.FlowControl.up(FlowControl.java:418) [:2.12.1.Final]

              at org.jgroups.protocols.pbcast.GMS.up(GMS.java:891) [:2.12.1.Final]

              at org.jgroups.protocols.VIEW_SYNC.up(VIEW_SYNC.java:170) [:2.12.1.Final]

              at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:246) [:2.12.1.Final]

              at org.jgroups.protocols.UNICAST.up(UNICAST.java:309) [:2.12.1.Final]

              at org.jgroups.protocols.pbcast.NAKACK.handleMessage(NAKACK.java:838) [:2.12.1.Final]

              at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:667) [:2.12.1.Final]

              at org.jgroups.protocols.BARRIER.up(BARRIER.java:119) [:2.12.1.Final]

              at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:133) [:2.12.1.Final]

              at org.jgroups.protocols.FD.up(FD.java:275) [:2.12.1.Final]

              at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:275) [:2.12.1.Final]

              at org.jgroups.protocols.MERGE2.up(MERGE2.java:209) [:2.12.1.Final]

              at org.jgroups.protocols.Discovery.up(Discovery.java:293) [:2.12.1.Final]

              at org.jgroups.protocols.PING.up(PING.java:69) [:2.12.1.Final]

              at org.jgroups.stack.Protocol.up(Protocol.java:413) [:2.12.1.Final]

              at org.jgroups.protocols.TP.passMessageUp(TP.java:1109) [:2.12.1.Final]

              at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1665) [:2.12.1.Final]

              at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1647) [:2.12.1.Final]

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) [:1.6.0_24]

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) [:1.6.0_24]

              at java.lang.Thread.run(Thread.java:662) [:1.6.0_24]

       

      {code}

       

      Maybe somebody is instantiating ModClusterServiceDRMEntry with mcmpServerStates=null? (BasicConstructorJoinPoint.dispatch?)
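
      If that guess is right, the failure mode would look roughly like the sketch below: an entry object is built with a null collection and a later call dereferences it. The class here is a hypothetical stand-in, not the real ModClusterServiceDRMEntry, and the field and method names are invented. It only shows how a defensive constructor would avoid that kind of NullPointerException.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical stand-in for an entry object; not the real ModClusterServiceDRMEntry. */
final class DrmEntrySketch {
    private final Set<String> serverStates;

    DrmEntrySketch(Set<String> serverStates) {
        // Defensive: treat a missing set as empty so later calls cannot hit a null field.
        this.serverStates = (serverStates == null)
                ? Collections.<String>emptySet()
                : serverStates;
    }

    /** Number of known server states; safe even when the entry was built with null. */
    int stateCount() {
        return serverStates.size();
    }

    public static void main(String[] args) {
        System.out.println(new DrmEntrySketch(null).stateCount());
    }
}
```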

       

      Thanks in advance.

      Gabriele