JGroups FD fail?
tfromm Jul 20, 2012 5:24 AM
I've got 3 nodes in a cluster. I stop one node with kill -STOP to simulate a stop-the-world GC pause.
After some time, once the node has been dropped from the cluster, I resume it with kill -CONT. Unfortunately, the resumed node does not update its member view.
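For anyone who wants to repeat the stop/resume step, here is a minimal sketch in Python (POSIX only). `freeze_for` and `demo` are names made up for this illustration, and the child process in `demo` is just a sleeping stand-in, not a JGroups node. In a real test you would pass the PID of the node's JVM and a pause longer than the time FD needs to suspect it.

```python
import os
import signal
import subprocess
import sys
import time


def freeze_for(pid: int, seconds: float) -> None:
    """Suspend process `pid` with SIGSTOP, wait, then resume it with SIGCONT."""
    os.kill(pid, signal.SIGSTOP)      # same effect as `kill -STOP <pid>`
    try:
        time.sleep(seconds)           # the simulated stop-the-world pause
    finally:
        os.kill(pid, signal.SIGCONT)  # same effect as `kill -CONT <pid>`


def demo() -> bool:
    """Freeze a throwaway child briefly; return True if it is still running."""
    # Hypothetical stand-in for the node: a child that only sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        freeze_for(child.pid, 0.2)
        return child.poll() is None   # None means the child has not exited
    finally:
        child.kill()
        child.wait()
```

The `finally` in `freeze_for` makes sure the process is resumed even if the script is interrupted during the pause.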
The neighbouring node successfully detects the stopped node:
DEBUG 11:01:49,232 [Timer-2,obelix-61184] FD | sending are-you-alive msg to obelix-21216 (own address=obelix-61184) |
DEBUG 11:01:49,233 [Timer-2,obelix-61184] FD | heartbeat missing from obelix-21216 (number=0) |
...
DEBUG 11:02:19,235 [Timer-3,obelix-61184] FD | broadcasting SUSPECT message [suspected_mbrs=[obelix-21216]] to group |
...
OK so far: the stopped node is no longer part of the cluster.
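As a quick check on the timing, the two FD lines above (first missing heartbeat and the SUSPECT broadcast) are about 30 seconds apart. A small sketch that does the arithmetic on the log timestamps; `elapsed_seconds` is a helper name made up here, and whether the result corresponds to FD's `timeout` and `max_tries` depends on the attached configuration, which I am not quoting here.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two same-day log timestamps like '11:01:49,233'."""
    fmt = "%H:%M:%S,%f"  # %f reads the millisecond field after the comma
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


first_missed = "11:01:49,233"   # "heartbeat missing ... (number=0)"
suspect_sent = "11:02:19,235"   # "broadcasting SUSPECT message"
delay = elapsed_seconds(first_missed, suspect_sent)
print(f"{delay:.3f} s")         # prints "30.002 s"
```

So any simulated pause has to last longer than roughly 30 seconds with this setup before the node is suspected.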
Now I resume the stopped node:
DEBUG 11:02:57,697 [Timer-2,obelix-21216] UNICAST2 | obelix-21216: removed expired connection for obelix-61184 (113275 ms old) from send_table |
DEBUG 11:02:57,697 [Timer-2,obelix-21216] UNICAST2 | obelix-21216: removed expired connection for obelix-27398 (113293 ms old) from recv_table |
DEBUG 11:02:57,696 [Timer-5,obelix-21216] FD | sending are-you-alive msg to obelix-27398 (own address=obelix-21216) |
DEBUG 11:02:57,698 [Timer-2,obelix-21216] UNICAST2 | obelix-21216: removed expired connection for obelix-61184 (113290 ms old) from recv_table |
WARN 11:02:57,699 [OOB-5,obelix-21216] FD | I was suspected by obelix-61184; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK |
2012-07-20 11:02:57,699 WARN [FD] (OOB-5,obelix-21216) I was suspected by obelix-61184; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
WARN 11:02:57,700 [Incoming-2,obelix-21216] GMS | obelix-21216: not member of view [obelix-27398|3]; discarding it |
2012-07-20 11:02:57,700 WARN [GMS] (Incoming-2,obelix-21216) obelix-21216: not member of view [obelix-27398|3]; discarding it
DEBUG 11:02:57,700 [OOB-14,obelix-21216] STABLE | obelix-21216: received digest from obelix-27398 (digest=obelix-61184: [0 (0)], obelix-27398: [5 (5)]) which does not match my own digest (obelix-61184: [0 (0)], obelix-21216: [0 (0)], obelix-27398: [4 (4)]): ignoring digest and re-initializing own digest |
DEBUG 11:02:57,704 [OOB-2,obelix-21216] STABLE | obelix-21216: received digest from obelix-61184 (digest=obelix-61184: [0 (0)], obelix-27398: [5 (5)]) which does not match my own digest (obelix-61184: [0 (0)], obelix-21216: [0 (0)], obelix-27398: [5 (5)]): ignoring digest and re-initializing own digest |
DEBUG 11:03:11,111 [OOB-7,obelix-21216] STABLE | obelix-21216: received digest from obelix-61184 (digest=obelix-61184: [0 (0)], obelix-27398: [5 (5)]) which does not match my own digest (obelix-61184: [0 (0)], obelix-21216: [0 (0)], obelix-27398: [5 (5)]): ignoring digest and re-initializing own digest |
...
DEBUG 11:04:12,702 [Timer-3,obelix-21216] FD | sending are-you-alive msg to obelix-27398 (own address=obelix-21216) |
DEBUG 11:04:24,194 [OOB-17,obelix-21216] STABLE | obelix-21216: received digest from obelix-61184 (digest=obelix-61184: [0 (0)], obelix-27398: [5 (5)]) which does not match my own digest (obelix-61184: [0 (0)], obelix-21216: [0 (0)], obelix-27398: [5 (5)]): ignoring digest and re-initializing own digest |
...
The node keeps running and still thinks the other members are part of its view. :-/
My current JGroups configuration for this is attached. The test was run with 5.2.0 Alpha and the JGroups 3.1 it contains.
Any ideas?
Attachment: jgroups-udp.xml.zip (1.5 KB)