7 Replies Latest reply on Oct 11, 2006 2:24 AM by emailmsgbox

Cluster Membership after Network Failure

dfisher Sep 22, 2006 1:06 PM

I'm using JBoss 4.0.4 and I can't seem to get my cluster configuration right.
I have two nodes, each using this TCP config:

 <Config>
   <TCP bind_addr="X.X.X.1" start_port="7800" loopback="true" conn_expire_time="5000"/>
   <TCPPING initial_hosts="X.X.X.1[7800],X.X.X.2[7800]" port_range="1" timeout="3500"
            num_initial_members="2" up_thread="true" down_thread="true"/>
   <MERGE2 min_interval="5000" max_interval="10000"/>
   <FD_SOCK down_thread="false" up_thread="false"/>
   <FD timeout="2500" shun="true" max_tries="5" up_thread="false" down_thread="false"/>
   <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
   <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
                  retransmit_timeout="3000"/>
   <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false"/>
   <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
               print_local_addr="true" down_thread="true" up_thread="true"/>
   <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
 </Config>

If I pull the network cable from one of the nodes, wait a minute, then plug it back in, cluster membership is never rebuilt on either node.
At that point farming stops working and I have to restart one of the nodes.

Here is a snippet of a consolidated server log:

node-1 2006-09-22 11:18:32,100 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] Suspected member: node-2:7800 (additional data: 17 bytes)
node-2 2006-09-22 11:18:32,203 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Suspected member: node-1:7800 (additional data: 17 bytes)
node-2 2006-09-22 11:18:32,212 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 4, delta: -1) : [X.X.X.2:-1]
node-2 2006-09-22 11:18:32,216 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (X.X.X.2:-1) received membershipChanged event:
node-2 2006-09-22 11:18:32,217 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead members: 1 ([X.X.X.1:-1])
node-2 2006-09-22 11:18:32,217 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members : 0 ([])
node-2 2006-09-22 11:18:32,218 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 1 ([X.X.X.2:-1])
node-1 2006-09-22 11:18:34,633 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 4, delta: -1) : [X.X.X.1:-1]
node-1 2006-09-22 11:18:34,634 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (X.X.X.1:-1) received membershipChanged event:
node-1 2006-09-22 11:18:34,635 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead members: 1 ([X.X.X.2:-1])
node-1 2006-09-22 11:18:34,635 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members : 0 ([])
node-1 2006-09-22 11:18:34,635 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 1 ([X.X.X.1:-1])
node-2 2006-09-22 11:18:34,892 INFO [org.jboss.cache.TreeCache] viewAccepted(): [node-2:7810|2] [node-2:7810]
node-1 2006-09-22 11:18:36,139 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] Suspected member: node-2:7800 (additional data: 17 bytes)
node-1 2006-09-22 11:23:52,531 INFO [org.jboss.cache.TreeCache] viewAccepted(): [node-1:7810|2] [node-1:7810]
node-2 2006-09-22 11:24:05,025 INFO [org.jboss.cache.TreeCache] viewAccepted(): [node-2:7810|0] [node-2:7810]
node-2 2006-09-22 11:24:05,025 INFO [org.jboss.cache.TreeCache] new cache is null (may be first member in cluster)
node-1 2006-09-22 11:24:05,059 INFO [org.jboss.cache.TreeCache] viewAccepted(): [node-1:7810|0] [node-1:7810]
node-1 2006-09-22 11:24:05,059 INFO [org.jboss.cache.TreeCache] new cache is null (may be first member in cluster)

And here is a snippet of the jgroups log on node-1:

2006-09-22 11:18:15,537 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node-2:7810 (own address=node-1:7810)
2006-09-22 11:18:15,541 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node-2:7800 (additional data: 17 bytes) (own address=node-1:7800 (additional data: 17 bytes))
2006-09-22 11:18:15,541 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node-2:7800 (additional data: 17 bytes) (number=0)
2006-09-22 11:18:16,365 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
2006-09-22 11:18:19,149 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7810: [0 : 9 (9)], node-2:7810: [0 : 4 (4)]] (num_gossip_runs=1, max_gossip_runs=3)
2006-09-22 11:18:19,150 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task terminating (num_gossip_runs=0, max_gossip_runs=3)
2006-09-22 11:18:25,166 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#19 (19), node-2:7800 (additional data: 17 bytes)#87 (87) from node-1:7800 (additional data: 17 bytes)
2006-09-22 11:18:28,082 DEBUG [org.jgroups.protocols.FD] [node-1:7800 (additional data: 17 bytes)]: received no heartbeat ack from node-2:7800 (additional data: 17 bytes) for 6 times (15000 milliseconds), suspecting it
2006-09-22 11:18:28,082 DEBUG [org.jgroups.protocols.FD] mbr=node-2:7800 (additional data: 17 bytes) (size=1)
2006-09-22 11:18:30,586 DEBUG [org.jgroups.protocols.FD] mbr=node-2:7810 (size=1)
2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node-2:7800 (additional data: 17 bytes) (own address=node-1:7800 (additional data: 17 bytes))
2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node-2:7800 (additional data: 17 bytes) (number=0)
2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7800 (additional data: 17 bytes)]] to group
2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] task done
2006-09-22 11:18:30,591 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7800 (additional data: 17 bytes)], from=node-1:7800 (additional data: 17 bytes))]
2006-09-22 11:18:32,098 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=node-2:7800 (additional data: 17 bytes)
2006-09-22 11:18:32,098 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
2006-09-22 11:18:32,099 DEBUG [org.jgroups.protocols.pbcast.GMS] VID=4, current members=(node-1:7800 (additional data: 17 bytes), node-2:7800 (additional data: 17 bytes)), new_mbrs=(), old_mbrs=(), suspected_mbrs=(node-2:7800 (additional data: 17 bytes))
2006-09-22 11:18:32,099 DEBUG [org.jgroups.protocols.pbcast.GMS] new view is [node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]
2006-09-22 11:18:32,099 DEBUG [org.jgroups.protocols.pbcast.GMS] mcasting view {[node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]} (1 mbrs)

2006-09-22 11:18:32,099 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7800 (additional data: 17 bytes)
2006-09-22 11:18:33,098 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7800 (additional data: 17 bytes), node-2:7800 (additional data: 17 bytes)], pingable_mbrs=[node-1:7800 (additional data: 17 bytes)], local_addr=node-1:7800 (additional data: 17 bytes)
2006-09-22 11:18:34,631 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]
2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node-1:7800 (additional data: 17 bytes)] view is [node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]
2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.pbcast.NAKACK] removing node-2:7800 (additional data: 17 bytes) from received_msgs (not member anymore)
2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [node-2:7800 (additional data: 17 bytes)], after adjustment: [], stopped: true
2006-09-22 11:18:34,633 DEBUG [org.jgroups.protocols.FD_SOCK] VIEW_CHANGE received: [node-1:7800 (additional data: 17 bytes)]
2006-09-22 11:18:34,634 DEBUG [org.jgroups.protocols.FD_SOCK] socket to null was reset
2006-09-22 11:18:34,634 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
2006-09-22 11:18:36,138 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr node-2:7800 (additional data: 17 bytes) is not a member !
2006-09-22 11:18:36,139 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7800 (additional data: 17 bytes)
2006-09-22 11:18:38,818 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7800 (additional data: 17 bytes): [0 : 21 (21)]] (num_gossip_runs=3, max_gossip_runs=3)
2006-09-22 11:18:38,819 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#21 (21) from node-1:7800 (additional data: 17 bytes)
2006-09-22 11:18:38,819 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7800 (additional data: 17 bytes)#21 (21)
2006-09-22 11:18:38,819 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 270
2006-09-22 11:18:39,098 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7800 (additional data: 17 bytes)#21]
2006-09-22 11:18:39,099 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
2006-09-22 11:18:39,099 DEBUG [org.jgroups.protocols.pbcast.NAKACK] received digest [node-1:7800 (additional data: 17 bytes): [-1 : 21 (21)]]
2006-09-22 11:22:58,567 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7810#9 (9), node-2:7810#4 (4) from node-1:7810
2006-09-22 11:22:58,571 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
2006-09-22 11:22:58,580 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
2006-09-22 11:22:58,581 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
2006-09-22 11:22:58,581 DEBUG [org.jgroups.protocols.FD] task done
2006-09-22 11:22:58,581 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=node-2:7810
2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.GMS] VID=2, current members=(node-1:7810, node-2:7810), new_mbrs=(), old_mbrs=(), suspected_mbrs=(node-2:7810)
2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.GMS] new view is [node-1:7810|2] [node-1:7810]
2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.GMS] mcasting view {[node-1:7810|2] [node-1:7810]} (1 mbrs)
2006-09-22 11:23:00,077 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7810
2006-09-22 11:23:01,084 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
2006-09-22 11:23:01,084 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
2006-09-22 11:23:01,084 DEBUG [org.jgroups.protocols.FD] task done
2006-09-22 11:23:04,540 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
2006-09-22 11:23:26,158 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#23 (23) from node-1:7800 (additional data: 17 bytes)
2006-09-22 11:23:26,159 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7800 (additional data: 17 bytes)#23 (23)
2006-09-22 11:23:26,159 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 4955
2006-09-22 11:23:26,165 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#23 (24) from node-1:7800 (additional data: 17 bytes)
2006-09-22 11:23:26,165 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7800 (additional data: 17 bytes)#23 (24)
2006-09-22 11:23:26,165 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=org.jgroups.protocols.pbcast.STABLE$StabilitySendTask@d1ebcd, delay is 5216
2006-09-22 11:23:28,629 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
2006-09-22 11:23:28,629 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
2006-09-22 11:23:28,630 DEBUG [org.jgroups.protocols.FD] task done
2006-09-22 11:23:31,118 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7800 (additional data: 17 bytes)#23]
2006-09-22 11:23:31,118 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
2006-09-22 11:23:31,118 DEBUG [org.jgroups.protocols.pbcast.NAKACK] received digest [node-1:7800 (additional data: 17 bytes): [-1 : 23 (23)]]
2006-09-22 11:23:33,045 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
2006-09-22 11:23:33,570 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7810: [0 : 10 (11)], node-2:7810: [0 : 4 (4)]] (num_gossip_runs=3, max_gossip_runs=3)
2006-09-22 11:23:33,637 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
2006-09-22 11:23:33,637 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
2006-09-22 11:23:33,638 DEBUG [org.jgroups.protocols.FD] task done
2006-09-22 11:23:38,098 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
2006-09-22 11:23:44,754 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
2006-09-22 11:23:46,158 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
2006-09-22 11:23:46,158 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
2006-09-22 11:23:46,158 DEBUG [org.jgroups.protocols.FD] task done
2006-09-22 11:23:48,666 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
2006-09-22 11:23:48,667 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
2006-09-22 11:23:48,667 DEBUG [org.jgroups.protocols.FD] task done
2006-09-22 11:23:49,514 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
2006-09-22 11:23:51,174 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
2006-09-22 11:23:51,174 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
2006-09-22 11:23:51,175 DEBUG [org.jgroups.protocols.FD] task done
2006-09-22 11:23:52,443 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
2006-09-22 11:23:52,530 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node-1:7810|2] [node-1:7810]
2006-09-22 11:23:52,530 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node-1:7810] view is [node-1:7810|2] [node-1:7810]
2006-09-22 11:23:52,531 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
2006-09-22 11:23:52,531 DEBUG [org.jgroups.protocols.pbcast.NAKACK] removing node-2:7810 from received_msgs (not member anymore)
2006-09-22 11:23:52,531 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [node-2:7810], after adjustment: [], stopped: true
2006-09-22 11:23:52,534 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
2006-09-22 11:23:52,549 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7810#10 (11), node-2:7810#4 (4) from node-1:7810
2006-09-22 11:23:52,549 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7810#10 (11), node-2:7810#4 (4)
2006-09-22 11:23:52,549 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 141
2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
2006-09-22 11:23:52,699 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7810#10, node-2:7810#4]
2006-09-22 11:23:52,699 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
2006-09-22 11:23:52,699 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest (digest=[node-1:7810: [-1 : 10 (11)], node-2:7810: [-1 : 4 (4)]]) which does not match my own digest ([node-1:7810: [-1 : -1]): ignoring digest and re-initializing own digest
2006-09-22 11:23:53,950 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=node-2:7810
2006-09-22 11:23:53,951 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr node-2:7810 is not a member !
2006-09-22 11:23:53,951 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7810
2006-09-22 11:23:57,443 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
2006-09-22 11:23:59,535 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
2006-09-22 11:24:00,963 DEBUG [org.jgroups.protocols.FD] node-2:7810 is not in [node-1:7810] ! Telling it to leave group
2006-09-22 11:24:00,963 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-1:7810], from=node-2:7810)]
2006-09-22 11:24:00,963 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
2006-09-22 11:24:00,976 DEBUG [org.jgroups.protocols.FD] [NOT_MEMBER] I'm being shunned; exiting
2006-09-22 11:24:00,979 WARN [org.jgroups.protocols.pbcast.NAKACK] [node-1:7810] discarded message from non-member node-2:7810
2006-09-22 11:24:00,980 DEBUG [org.jgroups.protocols.pbcast.NAKACK] contents for node-1:7810:
sent_msgs: [0 - 13]
received_msgs:
node-1:7810: received_msgs: [], delivered_msgs: [0 - 13]
2006-09-22 11:24:01,492 DEBUG [org.jgroups.protocols.pbcast.GMS] changed role to org.jgroups.protocols.pbcast.ClientGmsImpl
2006-09-22 11:24:05,055 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are []
2006-09-22 11:24:05,055 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] no initial members discovered: creating group as first member
2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node-1:7810] view is [node-1:7810|0] [node-1:7810]
2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.GMS] node-1:7810 changed role to org.jgroups.protocols.pbcast.CoordGmsImpl
2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.GMS] node-1:7810 changed role to org.jgroups.protocols.pbcast.CoordGmsImpl
2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] created group (first member). My view is [node-1:7810|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
2006-09-22 11:24:05,057 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [], after adjustment: [], stopped: true
2006-09-22 11:24:05,058 DEBUG [org.jgroups.protocols.MERGE2] merge task started
2006-09-22 11:24:05,058 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] GET_STATE: first member (no state)
2006-09-22 11:24:12,723 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
2006-09-22 11:24:18,476 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7810, coord_addr=node-2:7810]]
2006-09-22 11:24:25,524 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
2006-09-22 11:24:28,436 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7810, coord_addr=node-2:7810]]
2006-09-22 11:24:35,345 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
2006-09-22 11:24:38,985 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7810: [0 : 0] (num_gossip_runs=3, max_gossip_runs=3)
2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7810, coord_addr=node-2:7810]]
2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7810#0 (-1) from node-1:7810
2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7810#0 (-1)
2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 502
2006-09-22 11:24:39,765 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7810#0]
2006-09-22 11:24:39,765 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
2006-09-22 11:24:39,765 DEBUG [org.jgroups.protocols.pbcast.NAKACK] received digest [node-1:7810: [-1 : 0]

I have tried the FD config with and without shun; neither option results in the cluster membership being updated.
Any ideas on what I am doing wrong?
Thanks.

1. Re: Cluster Membership after Network Failure

belaban Sep 23, 2006 2:27 AM (in response to dfisher)

Set shun=false in both FD and GMS, and try this with 2.4 CR2
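
As a rough sketch of that change (assuming the rest of the stack from the original post stays as it is, and only the shun attribute of FD is flipped), the two affected elements would look something like this:

```xml
<!-- Sketch only: same attributes as in the original post, with shun switched off in FD. -->
<FD timeout="2500" shun="false" max_tries="5" up_thread="false" down_thread="false"/>

<!-- GMS already had shun="false" in the original post; repeated here for completeness. -->
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
            print_local_addr="true" down_thread="true" up_thread="true"/>
```

Both nodes would need the same setting, since each node runs its own copy of the stack.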
2. Re: Cluster Membership after Network Failure

dfisher Sep 25, 2006 4:24 PM (in response to dfisher)

I downloaded JGroups 2.4 CR2 and replaced the JBoss jgroups jar with the jgroups-all jar.
I now get this exception when node-2 joins the partition:

2006-09-25 15:59:59,426 WARN [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] failed unserializing message buffer (msg=[dst: <null>, src: X.X.X.2:7800 (2 headers), size = 304 bytes])
java.io.StreamCorruptedException: invalid stream header
at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:737)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:253)
at org.jboss.invocation.MarshalledValueInputStream.<init>(MarshalledValueInputStream.java:74)
at org.jboss.ha.framework.server.HAPartitionImpl.objectFromByteBuffer(HAPartitionImpl.java:144)
at org.jboss.ha.framework.server.HAPartitionImpl.handle(HAPartitionImpl.java:967)
at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:623)
at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:508)
at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:331)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:763)
at org.jgroups.JChannel.up(JChannel.java:1078)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:377)
at org.jgroups.stack.ProtocolStack.receiveUpEvent(ProtocolStack.java:393)
at org.jgroups.stack.Protocol.passUp(Protocol.java:538)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:163)
at org.jgroups.stack.UpHandler.run(Protocol.java:60)

Is there something else I need to do to upgrade JGroups?
Copying in the concurrent jar that ships with JGroups caused a whole slew of new exceptions.
Thanks.
3. Re: Cluster Membership after Network Failure

belaban Sep 25, 2006 6:20 PM (in response to dfisher)

Get the latest JGroups (CVS head). The fixes for
http://jira.jboss.com/jira/browse/JGRP-217 and
http://jira.jboss.com/jira/browse/JGRP-304 address the issue you're seeing.
4. Re: Cluster Membership after Network Failure

dfisher Sep 26, 2006 11:15 AM (in response to dfisher)

I checked out HEAD and ran 'ant jgroups-core.jar'
This jar produces the same exception I previously posted.
Is this the correct checkout?

cvs -z3 -d:pserver:anonymous@javagroups.cvs.sourceforge.net:/cvsroot/javagroups co -P -r HEAD JGroups
5. Re: Cluster Membership after Network Failure

belaban Sep 27, 2006 4:22 AM (in response to dfisher)

Oops, this is probably a bug in JBoss 4.0.3 that we fixed. Check with the JBossAS folks (on the forum) and set jgroups.marshalling.compatible to true, as listed in http://wiki.jboss.org/wiki/Wiki.jsp?page=SystemProps
6. Re: Cluster Membership after Network Failure

dfisher Sep 27, 2006 5:15 PM (in response to dfisher)

Setting jgroups.marshalling.compatible=true did the trick.
Cluster membership now recovers as expected.
Thanks for all your help.
7. Re: Cluster Membership after Network Failure

emailmsgbox Oct 11, 2006 2:24 AM (in response to dfisher)

I tried using -Djgroups.marshalling.compatible=true on jgroups 2.4rc2,
but it did not work.
Is it only in the latest build?