JBoss AS 6 cluster keeps splitting into separate clusters, problem with JGroups?
stewart_g Nov 12, 2012 6:44 AM
Hi,
I have spent a lot of time searching the community and issues sites on jboss.org, but I haven't found any article on a fix for the split-brain issue on AS 6. We recently had a network issue where our cluster of 2 hosts split into 2 separate clusters. Even though communication between the two was re-established, they never merged back into one cluster with a single singleton master. The only way to resolve this was to stop one of the nodes and start it back up, at which point it rejoins the cluster and becomes a slave.
Have any patches or packages been released that resolve this issue?
I did find the following during my searches:
https://community.jboss.org/thread/168601
https://issues.jboss.org/browse/JBAS-9456
It seems no work has been done on them, though.
Here is the current JGroups setup I have configured on both JBoss instances:
<config>
<UDP
singleton_name="udp"
mcast_port="${jboss.jgroups.udp.mcast_port:45688}"
mcast_addr="${jboss.jgroups.udp.mcast_addr,jboss.partition.udpGroup:228.11.11.11}"
bind_port="${jboss.jgroups.udp.bind_port:55200}"
tos="8"
ucast_recv_buf_size="20000000"
ucast_send_buf_size="640000"
mcast_recv_buf_size="25000000"
mcast_send_buf_size="640000"
loopback="true"
discard_incompatible_packets="true"
enable_bundling="false"
ip_ttl="${jgroups.udp.ip_ttl:2}"
thread_naming_pattern="cl"
timer.num_threads="12"
enable_diagnostics="${jboss.jgroups.enable_diagnostics:true}"
diagnostics_addr="${jboss.jgroups.diagnostics_addr:224.0.75.75}"
diagnostics_port="${jboss.jgroups.diagnostics_port:7500}"
thread_pool.enabled="true"
thread_pool.min_threads="20"
thread_pool.max_threads="200"
thread_pool.keep_alive_time="5000"
thread_pool.queue_enabled="true"
thread_pool.queue_max_size="1000"
thread_pool.rejection_policy="discard"
oob_thread_pool.enabled="true"
oob_thread_pool.min_threads="20"
oob_thread_pool.max_threads="200"
oob_thread_pool.keep_alive_time="1000"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.rejection_policy="discard"/>
<PING timeout="2000" num_initial_members="3"/>
<MERGE2 max_interval="100000" min_interval="20000"/>
<FD_SOCK start_port="${jboss.jgroups.udp.fd_sock_port:54200}"/>
<FD timeout="6000" max_tries="5"/>
<VERIFY_SUSPECT timeout="10000"/>
<BARRIER/>
<pbcast.NAKACK use_mcast_xmit="true" gc_lag="0"
retransmit_timeout="300,600,1200,2400,4800"
discard_delivered_msgs="true"/>
<UNICAST timeout="300,600,1200,2400,3600"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
max_bytes="400000"/>
<VIEW_SYNC avg_send_interval="10000"/>
<pbcast.GMS print_local_addr="true"
join_timeout="3000"
view_bundling="true"
view_ack_collection_timeout="5000"
resume_task_timeout="7500"/>
<UFC max_credits="2000000" ignore_synchronous_response="true"/>
<MFC max_credits="2000000" ignore_synchronous_response="true"/>
<FRAG2 frag_size="60000"/>
<pbcast.STREAMING_STATE_TRANSFER/>
<!--pbcast.STATE_TRANSFER/-->
<pbcast.FLUSH timeout="0" start_flush_timeout="10000"/>
</config>
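For reference, here is my rough reading of what the failure-detection settings above should add up to, as a small Python sketch. It assumes FD suspects a member only after about timeout x max_tries without a heartbeat, and that VERIFY_SUSPECT then waits up to its own timeout before the member is excluded. The function names are mine, and I may be misreading how these protocols interact.

```python
# Values copied from the JGroups config above (all in milliseconds).
FD_TIMEOUT_MS = 6000
FD_MAX_TRIES = 5
VERIFY_SUSPECT_TIMEOUT_MS = 10000
MERGE2_MIN_INTERVAL_MS = 20000
MERGE2_MAX_INTERVAL_MS = 100000


def fd_suspect_delay_ms(timeout_ms, max_tries):
    """Assumed delay before FD suspects a silent member: timeout * max_tries."""
    return timeout_ms * max_tries


def exclusion_delay_ms(timeout_ms, max_tries, verify_timeout_ms):
    """Assumed worst case from last heartbeat to exclusion from the view."""
    return fd_suspect_delay_ms(timeout_ms, max_tries) + verify_timeout_ms


if __name__ == "__main__":
    suspect = fd_suspect_delay_ms(FD_TIMEOUT_MS, FD_MAX_TRIES)
    exclude = exclusion_delay_ms(
        FD_TIMEOUT_MS, FD_MAX_TRIES, VERIFY_SUSPECT_TIMEOUT_MS
    )
    print("FD suspects after about", suspect / 1000, "s")
    print("Exclusion after about", exclude / 1000, "s")
    print(
        "MERGE2 checks every",
        MERGE2_MIN_INTERVAL_MS / 1000,
        "to",
        MERGE2_MAX_INTERVAL_MS / 1000,
        "s",
    )
```

If that reading is right, a member should stay in the view for tens of seconds after its last heartbeat, and MERGE2 should look for sub-clusters to merge every 20 to 100 seconds.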
And here is the output from the logs when the issue occurs:
SLAVE HOST: JASDalServ3
2012-11-08 18:56:02,128 INFO [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (VERIFY_SUSPECT.TimerThread,VirtualServCluster-HAPartition,JASDalServ3:1099) Suspected member: JASDalServ4:1099
2012-11-08 18:56:02,253 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-2,null) Block processed at JASDalServ3:1099
2012-11-08 18:56:02,925 INFO [org.jboss.ha.framework.server.ClusterPartition.lifecycle.VirtualServCluster] (Incoming-4,null) New cluster view for partition VirtualServCluster (id: 2, delta: -1, merge: false) : [JASDalServ3:1099]
2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) dead members: [JASDalServ4:1099]
2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) membership changed from 2 to 1
2012-11-08 18:56:02,925 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-4,null) Received new cluster view: [JASDalServ3:1099|2] [JASDalServ3:1099]
2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) Unblock processed at JASDalServ3:1099
2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (AsynchViewChangeHandler Thread) Begin notifyListeners, viewID: 2
2012-11-08 18:56:02,940 INFO [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) I am (JASDalServ3:1099) received membershipChanged event:
2012-11-08 18:56:02,940 INFO [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) Dead members: 1 ([JASDalServ4:1099])
2012-11-08 18:56:02,940 INFO [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) New Members : 0 ([])
2012-11-08 18:56:02,940 INFO [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) All Members : 1 ([JASDalServ3:1099])
2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) purgeDeadMembers, [JASDalServ4:1099]
2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) election result =true, electionPolicy=org.jboss.ha.singleton.HASingletonElectionPolicySimple
2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=false, viewID=115705935, partition=VirtualServCluster
2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) startNewMaster, isMasterNode=false
2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) election result =true, electionPolicy=DistributedReplicantManager
2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=false, viewID=115705935, partition=VirtualServClus
2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) startNewMaster, isMasterNode=false
2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonController] (AsynchViewChangeHandler Thread) starting singleton, mSingleton=null, mSingletonMBean=TSServices6:service=Mosaic
2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) startSingleton() : elected for master singleton node
2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonController] (AsynchViewChangeHandler Thread) Calling operation: startSingleton(), on target: 'TSServices6:service=Mosaic'
MASTER HOST: JASDalServ4
2012-11-08 18:56:04,138 WARN [org.jgroups.protocols.FD] (OOB-46,null) I was suspected by JASDalServ3:1099; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2012-11-08 18:56:04,154 WARN [org.jgroups.protocols.FD] (OOB-47,null) I was suspected by JASDalServ3:1099; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2012-11-08 18:56:04,169 WARN [org.jgroups.protocols.pbcast.GMS] (OOB-43,null) JASDalServ4:1099: not member of view [JASDalServ3:1099|2] [JASDalServ3:1099]; discarding it
Could anyone shed some light on what is going on here? I don't understand why, less than 1 second after the SUSPECT message is sent out, the node has already decided to split off and form its own cluster, and why it then never merges back in.
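To put a number on that gap, here is a small Python sketch that takes two timestamps from the JASDalServ3 log above (the "Suspected member" line and the "New cluster view" line) and prints the difference. The helper name is just for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the JBoss log lines above, e.g. 2012-11-08 18:56:02,128
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gap_ms(earlier, later):
    """Return the whole number of milliseconds between two log timestamps."""
    start = datetime.strptime(earlier, LOG_TIME_FORMAT)
    end = datetime.strptime(later, LOG_TIME_FORMAT)
    return (end - start) // timedelta(milliseconds=1)


# Copied from the JASDalServ3 log excerpt.
suspected_at = "2012-11-08 18:56:02,128"  # "Suspected member: JASDalServ4:1099"
new_view_at = "2012-11-08 18:56:02,925"   # "New cluster view ... [JASDalServ3:1099]"

if __name__ == "__main__":
    print(gap_ms(suspected_at, new_view_at), "ms")
```

For these two lines the gap comes out at 797 ms.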
Thanks