7 Replies Latest reply on Oct 11, 2006 2:24 AM by emailmsgbox

    Cluster Membership after Network Failure

    dfisher

      I'm using version 4.0.4 and I can't seem to get my cluster configuration right.
      I have 2 nodes each using the TCP config:

       <Config>
       <TCP bind_addr="X.X.X.1" start_port="7800" loopback="true" conn_expire_time="5000"/>
       <TCPPING initial_hosts="X.X.X.1[7800],X.X.X.2[7800]" port_range="1" timeout="3500"
       num_initial_members="2" up_thread="true" down_thread="true"/>
       <MERGE2 min_interval="5000" max_interval="10000"/>
       <FD_SOCK down_thread="false" up_thread="false"/>
       <FD timeout="2500" shun="true" max_tries="5" up_thread="false" down_thread="false" />
       <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false" />
       <pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
       retransmit_timeout="3000"/>
       <pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false" />
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
       print_local_addr="true" down_thread="true" up_thread="true"/>
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
       </Config>
      


      If I pull the network cable from one of the nodes, wait a minute, then plug it back in, the cluster membership is never rebuilt on either node.
      At that point farming doesn't work and I have to restart one of the nodes.

      Here is a snippet of a consolidated server log:

      node-1 2006-09-22 11:18:32,100 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] Suspected member: node-2:7800 (additional data: 17 bytes)
      node-2 2006-09-22 11:18:32,203 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Suspected member: node-1:7800 (additional data: 17 bytes)
      node-2 2006-09-22 11:18:32,212 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 4, delta: -1) : [X.X.X.2:-1]
      node-2 2006-09-22 11:18:32,216 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (X.X.X.2:-1) received membershipChanged event:
      node-2 2006-09-22 11:18:32,217 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead members: 1 ([X.X.X.1:-1])
      node-2 2006-09-22 11:18:32,217 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members : 0 ([])
      node-2 2006-09-22 11:18:32,218 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 1 ([X.X.X.2:-1])
      node-1 2006-09-22 11:18:34,633 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view for partition DefaultPartition (id: 4, delta: -1) : [X.X.X.1:-1]
      node-1 2006-09-22 11:18:34,634 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] I am (X.X.X.1:-1) received membershipChanged event:
      node-1 2006-09-22 11:18:34,635 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] Dead members: 1 ([X.X.X.2:-1])
      node-1 2006-09-22 11:18:34,635 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] New Members : 0 ([])
      node-1 2006-09-22 11:18:34,635 INFO [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.DefaultPartition] All Members : 1 ([X.X.X.1:-1])
      node-2 2006-09-22 11:18:34,892 INFO [org.jboss.cache.TreeCache] viewAccepted(): [node-2:7810|2] [node-2:7810]
      node-1 2006-09-22 11:18:36,139 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] Suspected member: node-2:7800 (additional data: 17 bytes)
      node-1 2006-09-22 11:23:52,531 INFO [org.jboss.cache.TreeCache] viewAccepted(): [node-1:7810|2] [node-1:7810]
      node-2 2006-09-22 11:24:05,025 INFO [org.jboss.cache.TreeCache] viewAccepted(): [node-2:7810|0] [node-2:7810]
      node-2 2006-09-22 11:24:05,025 INFO [org.jboss.cache.TreeCache] new cache is null (may be first member in cluster)
      node-1 2006-09-22 11:24:05,059 INFO [org.jboss.cache.TreeCache] viewAccepted(): [node-1:7810|0] [node-1:7810]
      node-1 2006-09-22 11:24:05,059 INFO [org.jboss.cache.TreeCache] new cache is null (may be first member in cluster)


      And here is a snippet of the jgroups log on node-1:


      2006-09-22 11:18:15,537 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node-2:7810 (own address=node-1:7810)
      2006-09-22 11:18:15,541 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node-2:7800 (additional data: 17 bytes) (own address=node-1:7800 (additional data: 17 bytes))
      2006-09-22 11:18:15,541 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node-2:7800 (additional data: 17 bytes) (number=0)
      2006-09-22 11:18:16,365 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
      2006-09-22 11:18:19,149 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7810: [0 : 9 (9)], node-2:7810: [0 : 4 (4)]] (num_gossip_runs=1, max_gossip_runs=3)
      2006-09-22 11:18:19,150 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task terminating (num_gossip_runs=0, max_gossip_runs=3)
      2006-09-22 11:18:25,166 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#19 (19), node-2:7800 (additional data: 17 bytes)#87 (87) from node-1:7800 (additional data: 17 bytes)
      2006-09-22 11:18:28,082 DEBUG [org.jgroups.protocols.FD] [node-1:7800 (additional data: 17 bytes)]: received no heartbeat ack from node-2:7800 (additional data: 17 bytes) for 6 times (15000 milliseconds), suspecting it
      2006-09-22 11:18:28,082 DEBUG [org.jgroups.protocols.FD] mbr=node-2:7800 (additional data: 17 bytes) (size=1)
      2006-09-22 11:18:30,586 DEBUG [org.jgroups.protocols.FD] mbr=node-2:7810 (size=1)
      2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node-2:7800 (additional data: 17 bytes) (own address=node-1:7800 (additional data: 17 bytes))
      2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node-2:7800 (additional data: 17 bytes) (number=0)
      2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7800 (additional data: 17 bytes)]] to group
      2006-09-22 11:18:30,590 DEBUG [org.jgroups.protocols.FD] task done
      2006-09-22 11:18:30,591 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7800 (additional data: 17 bytes)], from=node-1:7800 (additional data: 17 bytes))]
      2006-09-22 11:18:32,098 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=node-2:7800 (additional data: 17 bytes)
      2006-09-22 11:18:32,098 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
      2006-09-22 11:18:32,099 DEBUG [org.jgroups.protocols.pbcast.GMS] VID=4, current members=(node-1:7800 (additional data: 17 bytes), node-2:7800 (additional data: 17 bytes)), new_mbrs=(), old_mbrs=(), suspected_mbrs=(node-2:7800 (additional data: 17 bytes))
      2006-09-22 11:18:32,099 DEBUG [org.jgroups.protocols.pbcast.GMS] new view is [node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]
      2006-09-22 11:18:32,099 DEBUG [org.jgroups.protocols.pbcast.GMS] mcasting view {[node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]} (1 mbrs)

      2006-09-22 11:18:32,099 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7800 (additional data: 17 bytes)
      2006-09-22 11:18:33,098 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7800 (additional data: 17 bytes), node-2:7800 (additional data: 17 bytes)], pingable_mbrs=[node-1:7800 (additional data: 17 bytes)], local_addr=node-1:7800 (additional data: 17 bytes)
      2006-09-22 11:18:34,631 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]
      2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node-1:7800 (additional data: 17 bytes)] view is [node-1:7800 (additional data: 17 bytes)|4] [node-1:7800 (additional data: 17 bytes)]
      2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
      2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.pbcast.NAKACK] removing node-2:7800 (additional data: 17 bytes) from received_msgs (not member anymore)
      2006-09-22 11:18:34,632 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [node-2:7800 (additional data: 17 bytes)], after adjustment: [], stopped: true
      2006-09-22 11:18:34,633 DEBUG [org.jgroups.protocols.FD_SOCK] VIEW_CHANGE received: [node-1:7800 (additional data: 17 bytes)]
      2006-09-22 11:18:34,634 DEBUG [org.jgroups.protocols.FD_SOCK] socket to null was reset
      2006-09-22 11:18:34,634 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
      2006-09-22 11:18:36,138 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr node-2:7800 (additional data: 17 bytes) is not a member !
      2006-09-22 11:18:36,139 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7800 (additional data: 17 bytes)
      2006-09-22 11:18:38,818 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7800 (additional data: 17 bytes): [0 : 21 (21)]] (num_gossip_runs=3, max_gossip_runs=3)
      2006-09-22 11:18:38,819 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#21 (21) from node-1:7800 (additional data: 17 bytes)
      2006-09-22 11:18:38,819 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7800 (additional data: 17 bytes)#21 (21)
      2006-09-22 11:18:38,819 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 270
      2006-09-22 11:18:39,098 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7800 (additional data: 17 bytes)#21]
      2006-09-22 11:18:39,099 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
      2006-09-22 11:18:39,099 DEBUG [org.jgroups.protocols.pbcast.NAKACK] received digest [node-1:7800 (additional data: 17 bytes): [-1 : 21 (21)]]
      2006-09-22 11:22:58,567 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7810#9 (9), node-2:7810#4 (4) from node-1:7810
      2006-09-22 11:22:58,571 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
      2006-09-22 11:22:58,580 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
      2006-09-22 11:22:58,581 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
      2006-09-22 11:22:58,581 DEBUG [org.jgroups.protocols.FD] task done
      2006-09-22 11:22:58,581 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
      2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=node-2:7810
      2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
      2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.GMS] VID=2, current members=(node-1:7810, node-2:7810), new_mbrs=(), old_mbrs=(), suspected_mbrs=(node-2:7810)
      2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.GMS] new view is [node-1:7810|2] [node-1:7810]
      2006-09-22 11:23:00,076 DEBUG [org.jgroups.protocols.pbcast.GMS] mcasting view {[node-1:7810|2] [node-1:7810]} (1 mbrs)
      2006-09-22 11:23:00,077 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7810
      2006-09-22 11:23:01,084 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
      2006-09-22 11:23:01,084 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
      2006-09-22 11:23:01,084 DEBUG [org.jgroups.protocols.FD] task done
      2006-09-22 11:23:04,540 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
      2006-09-22 11:23:26,158 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#23 (23) from node-1:7800 (additional data: 17 bytes)
      2006-09-22 11:23:26,159 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7800 (additional data: 17 bytes)#23 (23)
      2006-09-22 11:23:26,159 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 4955
      2006-09-22 11:23:26,165 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7800 (additional data: 17 bytes)#23 (24) from node-1:7800 (additional data: 17 bytes)
      2006-09-22 11:23:26,165 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7800 (additional data: 17 bytes)#23 (24)
      2006-09-22 11:23:26,165 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=org.jgroups.protocols.pbcast.STABLE$StabilitySendTask@d1ebcd, delay is 5216
      2006-09-22 11:23:28,629 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
      2006-09-22 11:23:28,629 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
      2006-09-22 11:23:28,630 DEBUG [org.jgroups.protocols.FD] task done
      2006-09-22 11:23:31,118 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7800 (additional data: 17 bytes)#23]
      2006-09-22 11:23:31,118 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
      2006-09-22 11:23:31,118 DEBUG [org.jgroups.protocols.pbcast.NAKACK] received digest [node-1:7800 (additional data: 17 bytes): [-1 : 23 (23)]]
      2006-09-22 11:23:33,045 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
      2006-09-22 11:23:33,570 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7810: [0 : 10 (11)], node-2:7810: [0 : 4 (4)]] (num_gossip_runs=3, max_gossip_runs=3)
      2006-09-22 11:23:33,637 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
      2006-09-22 11:23:33,637 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
      2006-09-22 11:23:33,638 DEBUG [org.jgroups.protocols.FD] task done
      2006-09-22 11:23:38,098 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
      2006-09-22 11:23:44,754 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
      2006-09-22 11:23:46,158 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
      2006-09-22 11:23:46,158 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
      2006-09-22 11:23:46,158 DEBUG [org.jgroups.protocols.FD] task done
      2006-09-22 11:23:48,666 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
      2006-09-22 11:23:48,667 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
      2006-09-22 11:23:48,667 DEBUG [org.jgroups.protocols.FD] task done
      2006-09-22 11:23:49,514 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
      2006-09-22 11:23:51,174 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[node-1:7810, node-2:7810], pingable_mbrs=[node-1:7810], local_addr=node-1:7810
      2006-09-22 11:23:51,174 DEBUG [org.jgroups.protocols.FD] broadcasting SUSPECT message [suspected_mbrs=[node-2:7810]] to group
      2006-09-22 11:23:51,175 DEBUG [org.jgroups.protocols.FD] task done
      2006-09-22 11:23:52,443 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
      2006-09-22 11:23:52,530 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node-1:7810|2] [node-1:7810]
      2006-09-22 11:23:52,530 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node-1:7810] view is [node-1:7810|2] [node-1:7810]
      2006-09-22 11:23:52,531 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
      2006-09-22 11:23:52,531 DEBUG [org.jgroups.protocols.pbcast.NAKACK] removing node-2:7810 from received_msgs (not member anymore)
      2006-09-22 11:23:52,531 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [node-2:7810], after adjustment: [], stopped: true
      2006-09-22 11:23:52,534 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
      2006-09-22 11:23:52,549 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7810#10 (11), node-2:7810#4 (4) from node-1:7810
      2006-09-22 11:23:52,549 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7810#10 (11), node-2:7810#4 (4)
      2006-09-22 11:23:52,549 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 141
      2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
      2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
      2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
      2006-09-22 11:23:52,553 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-2:7810], from=node-1:7810)]
      2006-09-22 11:23:52,699 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7810#10, node-2:7810#4]
      2006-09-22 11:23:52,699 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
      2006-09-22 11:23:52,699 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest (digest=[node-1:7810: [-1 : 10 (11)], node-2:7810: [-1 : 4 (4)]]) which does not match my own digest ([node-1:7810: [-1 : -1]): ignoring digest and re-initializing own digest
      2006-09-22 11:23:53,950 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr=node-2:7810
      2006-09-22 11:23:53,951 ERROR [org.jgroups.protocols.pbcast.CoordGmsImpl] mbr node-2:7810 is not a member !
      2006-09-22 11:23:53,951 DEBUG [org.jgroups.blocks.RequestCorrelator] suspect=node-2:7810
      2006-09-22 11:23:57,443 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[]
      2006-09-22 11:23:59,535 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
      2006-09-22 11:24:00,963 DEBUG [org.jgroups.protocols.FD] node-2:7810 is not in [node-1:7810] ! Telling it to leave group
      2006-09-22 11:24:00,963 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node-1:7810], from=node-2:7810)]
      2006-09-22 11:24:00,963 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
      2006-09-22 11:24:00,976 DEBUG [org.jgroups.protocols.FD] [NOT_MEMBER] I'm being shunned; exiting
      2006-09-22 11:24:00,979 WARN [org.jgroups.protocols.pbcast.NAKACK] [node-1:7810] discarded message from non-member node-2:7810
      2006-09-22 11:24:00,980 DEBUG [org.jgroups.protocols.pbcast.NAKACK] contents for node-1:7810:
      sent_msgs: [0 - 13]
      received_msgs:
      node-1:7810: received_msgs: [], delivered_msgs: [0 - 13]
      2006-09-22 11:24:01,492 DEBUG [org.jgroups.protocols.pbcast.GMS] changed role to org.jgroups.protocols.pbcast.ClientGmsImpl
      2006-09-22 11:24:05,055 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] initial_mbrs are []
      2006-09-22 11:24:05,055 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] no initial members discovered: creating group as first member
      2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node-1:7810] view is [node-1:7810|0] [node-1:7810]
      2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
      2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.GMS] node-1:7810 changed role to org.jgroups.protocols.pbcast.CoordGmsImpl
      2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.GMS] node-1:7810 changed role to org.jgroups.protocols.pbcast.CoordGmsImpl
      2006-09-22 11:24:05,056 DEBUG [org.jgroups.protocols.pbcast.ClientGmsImpl] created group (first member). My view is [node-1:7810|0], impl is org.jgroups.protocols.pbcast.CoordGmsImpl
      2006-09-22 11:24:05,057 DEBUG [org.jgroups.protocols.FD] suspected_mbrs: [], after adjustment: [], stopped: true
      2006-09-22 11:24:05,058 DEBUG [org.jgroups.protocols.MERGE2] merge task started
      2006-09-22 11:24:05,058 DEBUG [org.jgroups.protocols.pbcast.STATE_TRANSFER] GET_STATE: first member (no state)
      2006-09-22 11:24:12,723 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
      2006-09-22 11:24:18,476 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7810, coord_addr=node-2:7810]]
      2006-09-22 11:24:25,524 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
      2006-09-22 11:24:28,436 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7810, coord_addr=node-2:7810]]
      2006-09-22 11:24:35,345 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7800 (additional data: 17 bytes), coord_addr=node-2:7800 (additional data: 17 bytes)]]
      2006-09-22 11:24:38,985 DEBUG [org.jgroups.protocols.pbcast.STABLE] mcasting digest [node-1:7810: [0 : 0] (num_gossip_runs=3, max_gossip_runs=3)
      2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=node-2:7810, coord_addr=node-2:7810]]
      2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.pbcast.STABLE] received digest node-1:7810#0 (-1) from node-1:7810
      2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.pbcast.STABLE] sending stability msg node-1:7810#0 (-1)
      2006-09-22 11:24:39,253 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability_task=null, delay is 502
      2006-09-22 11:24:39,765 DEBUG [org.jgroups.protocols.pbcast.STABLE] stability vector is [node-1:7810#0]
      2006-09-22 11:24:39,765 DEBUG [org.jgroups.protocols.pbcast.STABLE] cancelling stability task (running=false)
      2006-09-22 11:24:39,765 DEBUG [org.jgroups.protocols.pbcast.NAKACK] received digest [node-1:7810: [-1 : 0]


      I have tried the FD config with and without shun; neither option results in the cluster membership being updated.
      Any ideas on what I am doing wrong?
      Thanks.

        • 1. Re: Cluster Membership after Network Failure
          belaban

          Set shun=false in both FD and GMS, and try this with JGroups 2.4 CR2.
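
          For reference, a minimal sketch of that change applied to the config in the original post. It is a fragment rather than a complete stack, and it assumes every attribute other than shun stays as posted. Only the FD value actually changes, since GMS already had shun="false" there.

          ```xml
          <!-- Fragment only: the two protocols that carry a shun attribute.
               All other attributes are copied unchanged from the original post. -->
          <FD timeout="2500" shun="false" max_tries="5" up_thread="false" down_thread="false" />
          <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
                      print_local_addr="true" down_thread="true" up_thread="true"/>
          ```

          The same edit would need to be made on both nodes so that the two stacks stay identical.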

          • 2. Re: Cluster Membership after Network Failure
            dfisher

            I downloaded JGroups 2.4 CR2 and replaced the JBoss jgroups jar with the jgroups-all jar.
            I now get this exception when node-2 joins the partition:


            2006-09-25 15:59:59,426 WARN [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] failed unserializing message buffer (msg=[dst: <null>, src: X.X.X.2:7800 (2 headers), size = 304 bytes])
            java.io.StreamCorruptedException: invalid stream header
            at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:737)
            at java.io.ObjectInputStream.<init>(ObjectInputStream.java:253)
            at org.jboss.invocation.MarshalledValueInputStream.<init>(MarshalledValueInputStream.java:74)
            at org.jboss.ha.framework.server.HAPartitionImpl.objectFromByteBuffer(HAPartitionImpl.java:144)
            at org.jboss.ha.framework.server.HAPartitionImpl.handle(HAPartitionImpl.java:967)
            at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:623)
            at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:508)
            at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:331)
            at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:763)
            at org.jgroups.JChannel.up(JChannel.java:1078)
            at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:377)
            at org.jgroups.stack.ProtocolStack.receiveUpEvent(ProtocolStack.java:393)
            at org.jgroups.stack.Protocol.passUp(Protocol.java:538)
            at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:163)
            at org.jgroups.stack.UpHandler.run(Protocol.java:60)


            Is there something else I need to do to upgrade JGroups?
            Copying the concurrent jar that ships with JGroups caused a whole slew of new exceptions.
            Thanks.


            • 3. Re: Cluster Membership after Network Failure
              belaban

              Get the latest JGroups (CVS head). The fixes for
              http://jira.jboss.com/jira/browse/JGRP-217 and
              http://jira.jboss.com/jira/browse/JGRP-304 address the issue you're seeing.

              • 4. Re: Cluster Membership after Network Failure
                dfisher

                I checked out HEAD and ran 'ant jgroups-core.jar'
                This jar produces the same exception I previously posted.
                Is this the correct checkout?


                cvs -z3 -d:pserver:anonymous@javagroups.cvs.sourceforge.net:/cvsroot/javagroups co -P -r HEAD JGroups



                • 5. Re: Cluster Membership after Network Failure
                  belaban

                  Oops, this is probably a bug in JBoss 4.0.3 that we fixed. Check with the JBossAS folks (on the forum) and set jgroups.marshalling.compatible to true, as listed in http://wiki.jboss.org/wiki/Wiki.jsp?page=SystemProps

                  • 6. Re: Cluster Membership after Network Failure
                    dfisher

                    Setting jgroups.marshalling.compatible=true did the trick.
                    Cluster membership now recovers as expected.
                    Thanks for all your help.

                    • 7. Re: Cluster Membership after Network Failure
                      emailmsgbox

                      I tried using -Djgroups.marshalling.compatible=true on JGroups 2.4rc2,
                      but it did not work.
                      Is it available only in the latest build?