3 Replies Latest reply on Jan 25, 2011 5:11 PM by manik

    Merge Failure / Coordinator Selection

    shane_dev

      We are seeing a failure during a cluster merge that appears to be caused by how the coordinator of the merged view is selected.

       

      After looking at JGroupsTransport, I noticed that it assumes the coordinator of a merge view must be the coordinator of one of the subgroups being merged. However, our testing has shown that this is not always the case. Here are the logs of one such occasion.

       

      Node 04 Leaves

       

      JGroupsTransport: Received new cluster view: [01, 02, 03]

      DistributionManagerImpl: Detected a view change.  Member list changed from [04, 01, 02, 03] to [01, 02, 03]

      DistributionManagerImpl: This is a LEAVE event!  Node 04 has just left

      ... rehashing ...

       

      Node 04 Joins

       

      JGroupsTransport: Received new, MERGED cluster view: MergeView::[02, 01, 03, 04], subgroups=[[04], [01, 02, 03]]

       

      java.lang.NullPointerException

        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.needsToRejoin(JGroupsTransport.java:528)

        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.access$000(JGroupsTransport.java:88)

        at org.infinispan.remoting.transport.jgroups.JGroupsTransport$NotifyMerge.emitNotification(JGroupsTransport.java:465)

        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.viewAccepted(JGroupsTransport.java:508)

        ...

       

      As the logs show, 01 becomes the coordinator after node 04 leaves. Node 04 is then started again. I realize it should have joined the existing cluster, but for whatever reason it did not. The two clusters are then merged, with 01 as the coordinator of one subgroup and 04 as the coordinator of the other. The code expects one of these two nodes to be the new coordinator. However, 02 appears to have been selected instead (it is listed first in the merged view), and this results in the NullPointerException shown above.
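
      To make the situation concrete, here is a simplified sketch of the lookup I believe is failing. It is not the actual JGroupsTransport source: the class and method names are made up for illustration, and node addresses are plain strings rather than JGroups Address objects. It relies only on the JGroups convention that the first member of a view is its coordinator.

```java
import java.util.List;

// Hypothetical illustration only; not taken from the Infinispan code base.
final class MergeCoordinatorSketch {

    // JGroups convention: the coordinator is the first member of a view.
    static String coordinatorOf(List<String> members) {
        return members.isEmpty() ? null : members.get(0);
    }

    // The assumption in question: the merged view's coordinator also leads
    // one of the subgroups. Returns null when no subgroup is led by it,
    // which is the case a caller must be prepared to handle.
    static List<String> subgroupLedBy(String coord, List<List<String>> subgroups) {
        for (List<String> subgroup : subgroups) {
            if (coord != null && coord.equals(coordinatorOf(subgroup))) {
                return subgroup;
            }
        }
        return null;
    }

    // A more tolerant lookup: find the subgroup that merely contains the
    // new coordinator, regardless of its position in that subgroup.
    static List<String> subgroupContaining(String coord, List<List<String>> subgroups) {
        for (List<String> subgroup : subgroups) {
            if (subgroup.contains(coord)) {
                return subgroup;
            }
        }
        return null;
    }
}
```

      With the view from the logs above (members [02, 01, 03, 04], subgroups [[04], [01, 02, 03]]), subgroupLedBy finds nothing for 02, while subgroupContaining returns [01, 02, 03]. Whether a containment check is the right fix depends on what needsToRejoin actually needs from the subgroup, so treat this as a way to frame the question rather than a proposed patch.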

       

      My question is this: is it correct to assume that the new coordinator must be one of the subgroup coordinators?

       

      If so, how is it possible that a different node was selected as the coordinator?