1 Reply Latest reply on Feb 13, 2013 7:58 AM by rhusar

    JBoss AS 6 cluster keeps splitting into separate clusters, problem with jgroup?

    stewart_g

      Hi,

       

      I have spent a lot of time searching the community and issues sites of jboss.org, but I haven't found any article on a fix for the split-brain issue on AS 6. We recently had a network issue where our cluster (2 hosts) split into 2 separate clusters, and even though connectivity was re-established between the two, they never merged back into a single cluster. The only way to resolve this was to stop one of the nodes and start it back up, at which point it rejoins the cluster and becomes a slave.

       

      Have any patches or packages been released that resolve this issue?

       

      I did find the following during my searches:

       

      https://community.jboss.org/thread/168601

       

      https://issues.jboss.org/browse/JBAS-9456

       

      It seems no work has been done on them, though.

       

      Here is the current JGroups setup I have configured on both JBoss instances:

       

      <config>
         <UDP
              singleton_name="udp"
              mcast_port="${jboss.jgroups.udp.mcast_port:45688}"
              mcast_addr="${jboss.jgroups.udp.mcast_addr,jboss.partition.udpGroup:228.11.11.11}"
              bind_port="${jboss.jgroups.udp.bind_port:55200}"
              tos="8"
              ucast_recv_buf_size="20000000"
              ucast_send_buf_size="640000"
              mcast_recv_buf_size="25000000"
              mcast_send_buf_size="640000"
              loopback="true"
              discard_incompatible_packets="true"
              enable_bundling="false"
              ip_ttl="${jgroups.udp.ip_ttl:2}"
              thread_naming_pattern="cl"
              timer.num_threads="12"
              enable_diagnostics="${jboss.jgroups.enable_diagnostics:true}"
              diagnostics_addr="${jboss.jgroups.diagnostics_addr:224.0.75.75}"
              diagnostics_port="${jboss.jgroups.diagnostics_port:7500}"

              thread_pool.enabled="true"
              thread_pool.min_threads="20"
              thread_pool.max_threads="200"
              thread_pool.keep_alive_time="5000"
              thread_pool.queue_enabled="true"
              thread_pool.queue_max_size="1000"
              thread_pool.rejection_policy="discard"

              oob_thread_pool.enabled="true"
              oob_thread_pool.min_threads="20"
              oob_thread_pool.max_threads="200"
              oob_thread_pool.keep_alive_time="1000"
              oob_thread_pool.queue_enabled="false"
              oob_thread_pool.rejection_policy="discard"/>
         <PING timeout="2000" num_initial_members="3"/>
         <MERGE2 max_interval="100000" min_interval="20000"/>
         <FD_SOCK start_port="${jboss.jgroups.udp.fd_sock_port:54200}"/>
         <FD timeout="6000" max_tries="5"/>
         <VERIFY_SUSPECT timeout="10000"/>
         <BARRIER/>
         <pbcast.NAKACK use_mcast_xmit="true" gc_lag="0"
                  retransmit_timeout="300,600,1200,2400,4800"
                  discard_delivered_msgs="true"/>
         <UNICAST timeout="300,600,1200,2400,3600"/>
         <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                  max_bytes="400000"/>
         <VIEW_SYNC avg_send_interval="10000"/>
         <pbcast.GMS print_local_addr="true"
                  join_timeout="3000"
                  view_bundling="true"
                  view_ack_collection_timeout="5000"
                  resume_task_timeout="7500"/>
         <UFC max_credits="2000000" ignore_synchronous_response="true"/>
         <MFC max_credits="2000000" ignore_synchronous_response="true"/>
         <FRAG2 frag_size="60000"/>
         <pbcast.STREAMING_STATE_TRANSFER/>
         <!--pbcast.STATE_TRANSFER/-->
         <pbcast.FLUSH timeout="0" start_flush_timeout="10000"/>
      </config>

       

      And here is the output from the logs when the issue occurs:

       

      SLAVE HOST: JASDalServ3
      2012-11-08 18:56:02,128 INFO  [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (VERIFY_SUSPECT.TimerThread,VirtualServCluster-HAPartition,JASDalServ3:1099) Suspected member: JASDalServ4:1099
      2012-11-08 18:56:02,253 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-2,null) Block processed at JASDalServ3:1099
      2012-11-08 18:56:02,925 INFO  [org.jboss.ha.framework.server.ClusterPartition.lifecycle.VirtualServCluster] (Incoming-4,null) New cluster view for partition VirtualServCluster (id: 2, delta: -1, merge: false) : [JASDalServ3:1099]
      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) dead members: [JASDalServ4:1099]
      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) membership changed from 2 to 1
      2012-11-08 18:56:02,925 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-4,null) Received new cluster view: [JASDalServ3:1099|2] [JASDalServ3:1099]
      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) Unblock processed at JASDalServ3:1099
      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (AsynchViewChangeHandler Thread) Begin notifyListeners, viewID: 2
      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) I am (JASDalServ3:1099) received membershipChanged event:
      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) Dead members: 1 ([JASDalServ4:1099])
      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) New Members : 0 ([])
      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) All Members : 1 ([JASDalServ3:1099])
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) purgeDeadMembers, [JASDalServ4:1099]
      2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) election result =true, electionPolicy=org.jboss.ha.singleton.HASingletonElectionPolicySimple
      2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=false, viewID=115705935, partition=VirtualServCluster
      2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) startNewMaster, isMasterNode=false
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) election result =true, electionPolicy=DistributedReplicantManager
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=false, viewID=115705935, partition=VirtualServClus
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) startNewMaster, isMasterNode=false
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonController] (AsynchViewChangeHandler Thread) starting singleton, mSingleton=null, mSingletonMBean=TSServices6:service=Mosaic
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) startSingleton() : elected for master singleton node
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonController] (AsynchViewChangeHandler Thread) Calling operation: startSingleton(), on target: 'TSServices6:service=Mosaic'

       

      MASTER HOST: JASDalServ4
      2012-11-08 18:56:04,138 WARN  [org.jgroups.protocols.FD] (OOB-46,null) I was suspected by JASDalServ3:1099; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2012-11-08 18:56:04,154 WARN  [org.jgroups.protocols.FD] (OOB-47,null) I was suspected by JASDalServ3:1099; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2012-11-08 18:56:04,169 WARN  [org.jgroups.protocols.pbcast.GMS] (OOB-43,null) JASDalServ4:1099: not member of view [JASDalServ3:1099|2] [JASDalServ3:1099]; discarding it

       

      Could anyone shed some light on what is going on here? I don't understand why, less than a second after the SUSPECT message is sent out, the node has already decided to split off and form its own cluster, and then never merges back in.

       

      Thanks

        • 1. Re: JBoss AS 6 cluster keeps splitting into separate clusters, problem with jgroup?
          rhusar

          Both of the linked issues are unrelated: they complain about the HASingleton service not restarting, whereas your problem is the cluster splitting and not merging afterwards.

           

          You could look into known issues with the JGroups version you are using and try updating to the latest micro release.
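
           

          To find out which version you are actually running, the JGroups jar can print it from the command line (the jar path below is just an example; locate the jar in your own AS 6 install):

          ```shell
          # Prints e.g. "Version: 2.6.x" -- adjust the jar path to your installation
          java -cp /path/to/your/jboss-6/common/lib/jgroups.jar org.jgroups.Version
          ```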

           

          Moreover, you can lower the MERGE2 timeouts so that a merge is attempted more often. Also turn on DEBUG logging for the merge-related protocols to see what is happening.
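
           

          For example, in your stack you could shorten the MERGE2 intervals (the values below are illustrative, not tuned recommendations) and add a log4j category for merge activity:

          ```xml
          <!-- In the JGroups stack: attempt a merge every 10-30 s instead of 20-100 s -->
          <MERGE2 max_interval="30000" min_interval="10000"/>

          <!-- In jboss-log4j.xml: DEBUG for the merge and membership protocols -->
          <category name="org.jgroups.protocols.MERGE2">
            <priority value="DEBUG"/>
          </category>
          <category name="org.jgroups.protocols.pbcast.GMS">
            <priority value="DEBUG"/>
          </category>
          ```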

           

          It could also be a network issue, with multicast traffic not being delivered correctly in both directions. Try using a TCP stack instead of UDP.
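
           

          A minimal sketch of a TCP-based stack for a 2-node cluster is below; the hostnames and ports are placeholders for your environment, and the remaining protocols would mirror your existing UDP stack (note TCPPING lists the members explicitly, so no multicast discovery is needed):

          ```xml
          <!-- Sketch only: substitute your own hosts/ports and carry over the rest of your stack -->
          <config>
             <TCP singleton_name="tcp"
                  bind_port="7600"/>
             <TCPPING timeout="3000"
                      initial_hosts="JASDalServ3[7600],JASDalServ4[7600]"
                      port_range="1"
                      num_initial_members="2"/>
             <MERGE2 max_interval="30000" min_interval="10000"/>
             <FD_SOCK/>
             <FD timeout="6000" max_tries="5"/>
             <VERIFY_SUSPECT timeout="1500"/>
             <pbcast.NAKACK use_mcast_xmit="false"
                      retransmit_timeout="300,600,1200,2400,4800"
                      discard_delivered_msgs="true"/>
             <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                      max_bytes="400000"/>
             <pbcast.GMS print_local_addr="true"
                      join_timeout="3000"
                      view_bundling="true"/>
             <FRAG2 frag_size="60000"/>
          </config>
          ```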