3 Replies Latest reply on Jul 22, 2012 4:01 AM by tfromm

    JGroups FD fail?

    tfromm

      I have a cluster of 3 nodes. I stop one node with kill -STOP to simulate a stop-the-world GC pause.

      After some time, once the node has been dropped from the cluster, I resume it with kill -CONT. Unfortunately, the node does not update its member view.

       

      The neighbour node successfully detects the stopped node:

      DEBUG 11:01:49,232 [Timer-2,obelix-61184] FD                    sending are-you-alive msg to obelix-21216 (own address=obelix-61184)
      DEBUG 11:01:49,233 [Timer-2,obelix-61184] FD                    heartbeat missing from obelix-21216 (number=0)

      ...

      DEBUG 11:02:19,235 [Timer-3,obelix-61184] FD                    broadcasting SUSPECT message [suspected_mbrs=[obelix-21216]] to group

      ...

       

      OK so far: the stopped node is no longer part of the cluster.

       

      Now I resume the stopped node:

      DEBUG 11:02:57,697 [Timer-2,obelix-21216] UNICAST2              obelix-21216: removed expired connection for obelix-61184 (113275 ms old) from send_table
      DEBUG 11:02:57,697 [Timer-2,obelix-21216] UNICAST2              obelix-21216: removed expired connection for obelix-27398 (113293 ms old) from recv_table
      DEBUG 11:02:57,696 [Timer-5,obelix-21216] FD                    sending are-you-alive msg to obelix-27398 (own address=obelix-21216)
      DEBUG 11:02:57,698 [Timer-2,obelix-21216] UNICAST2              obelix-21216: removed expired connection for obelix-61184 (113290 ms old) from recv_table
      WARN 11:02:57,699 [OOB-5,obelix-21216] FD                    I was suspected by obelix-61184; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      2012-07-20 11:02:57,699 WARN  [FD] (OOB-5,obelix-21216) I was suspected by obelix-61184; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      WARN 11:02:57,700 [Incoming-2,obelix-21216] GMS                   obelix-21216: not member of view [obelix-27398|3]; discarding it

      2012-07-20 11:02:57,700 WARN  [GMS] (Incoming-2,obelix-21216) obelix-21216: not member of view [obelix-27398|3]; discarding it

      DEBUG 11:02:57,700 [OOB-14,obelix-21216] STABLE                obelix-21216: received digest from obelix-27398 (digest=obelix-61184: [0 (0)], obelix-27398: [5 (5)]) which does not match my own digest (obelix-61184: [0 (0)], obelix-21216: [0 (0)], obelix-27398: [4 (4)]): ignoring digest and re-initializing own digest
      DEBUG 11:02:57,704 [OOB-2,obelix-21216] STABLE                obelix-21216: received digest from obelix-61184 (digest=obelix-61184: [0 (0)], obelix-27398: [5 (5)]) which does not match my own digest (obelix-61184: [0 (0)], obelix-21216: [0 (0)], obelix-27398: [5 (5)]): ignoring digest and re-initializing own digest
      DEBUG 11:03:11,111 [OOB-7,obelix-21216] STABLE                obelix-21216: received digest from obelix-61184 (digest=obelix-61184: [0 (0)], obelix-27398: [5 (5)]) which does not match my own digest (obelix-61184: [0 (0)], obelix-21216: [0 (0)], obelix-27398: [5 (5)]): ignoring digest and re-initializing own digest

      ...

      DEBUG 11:04:12,702 [Timer-3,obelix-21216] FD                    sending are-you-alive msg to obelix-27398 (own address=obelix-21216)
      DEBUG 11:04:24,194 [OOB-17,obelix-21216] STABLE                obelix-21216: received digest from obelix-61184 (digest=obelix-61184: [0 (0)], obelix-27398: [5 (5)]) which does not match my own digest (obelix-61184: [0 (0)], obelix-21216: [0 (0)], obelix-27398: [5 (5)]): ignoring digest and re-initializing own digest

       

      ...

       

      The node keeps running, still thinking it is in a cluster with the other members. :-/

      My current JGroups configuration for this is attached. The test was run on 5.2.0 Alpha with the JGroups 3.1 it contains.

      Any ideas?

        • 1. Re: JGroups FD fail?
          vblagojevic

          Just to clarify: you are observing that the suspected member A, which was eventually excluded from the view of the other members, still believes it is a member of the cluster?

          • 2. Re: JGroups FD fail?
            tfromm

            That's correct. In the case above, node obelix-21216 still thinks it is part of a cluster with 3 members. This node receives no view-change events, and the JMX stats still list 3 members. (Once a node has left the cluster I don't want it merged back, so I have no MERGE protocol in my JGroups stack.)

            • 3. Re: JGroups FD fail?
              tfromm

              After a short chat with Vladimir on IRC, it turns out that MERGE is required at the moment. He mentioned that there may be a way to force a rejoin at the Infinispan level when a merge happens.
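
              For illustration, a minimal sketch of what adding a merge protocol to the stack might look like. The attached configuration is not reproduced here, so this is an assumption-laden excerpt: the neighbouring protocols (PING, FD) and the interval values are placeholders modelled on the stock JGroups 3.x sample configurations, not the settings actually used in this test.

```xml
<!-- Hypothetical excerpt of a JGroups 3.x stack; all other protocols are elided. -->

<!-- Discovery protocol (assumed; the real stack may use a different one). -->
<PING />

<!-- MERGE2 periodically checks whether separate subgroups (for example a node
     that was excluded while paused) exist and, if so, triggers a merge so that
     all members agree on one view again.
     The intervals are in milliseconds and are illustrative values only. -->
<MERGE2 min_interval="10000" max_interval="30000" />

<!-- Failure detection as seen in the logs above (attributes omitted on purpose,
     since the actual values are in the attached configuration). -->
<FD />
```

              With a merge protocol present, the resumed node should eventually receive a merged view instead of keeping its stale 3-member view; whether a rejoin can then be forced at the Infinispan level is the open point mentioned above.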

               

              Any thoughts?