1 Reply Latest reply on Feb 13, 2013 7:58 AM by Radoslav Husar

    JBoss AS 6 cluster keeps splitting into separate clusters, problem with JGroups?

    Stewart Gracie Newbie

      Hi

       

      I have spent a lot of time searching the community and issues sites of jboss.org, but I haven't found any article on a fix for the split-brain issue on AS 6. We recently had a network issue during which our two-host cluster split into two separate clusters. Even though messaging between the two hosts was re-established, they never merged back into a single cluster with one singleton. The only way to resolve this was to stop one of the two nodes and start it back up, at which point it rejoins the cluster and becomes a slave.

       

      Have any patches or packages been released that resolve this issue?

       

      I did find the following during my searches:

       

      https://community.jboss.org/thread/168601

       

      https://issues.jboss.org/browse/JBAS-9456

       

      It seems that no work has been done on them, though.

       

      Here is the current JGroups setup I have configured on both JBoss instances:

       

      <config>
        <UDP
             singleton_name="udp"
             mcast_port="${jboss.jgroups.udp.mcast_port:45688}"
             mcast_addr="${jboss.jgroups.udp.mcast_addr,jboss.partition.udpGroup:228.11.11.11}"
             bind_port="${jboss.jgroups.udp.bind_port:55200}"
             tos="8"
             ucast_recv_buf_size="20000000"
             ucast_send_buf_size="640000"
             mcast_recv_buf_size="25000000"
             mcast_send_buf_size="640000"
             loopback="true"
             discard_incompatible_packets="true"
             enable_bundling="false"
             ip_ttl="${jgroups.udp.ip_ttl:2}"
             thread_naming_pattern="cl"
             timer.num_threads="12"
             enable_diagnostics="${jboss.jgroups.enable_diagnostics:true}"
             diagnostics_addr="${jboss.jgroups.diagnostics_addr:224.0.75.75}"
             diagnostics_port="${jboss.jgroups.diagnostics_port:7500}"

             thread_pool.enabled="true"
             thread_pool.min_threads="20"
             thread_pool.max_threads="200"
             thread_pool.keep_alive_time="5000"
             thread_pool.queue_enabled="true"
             thread_pool.queue_max_size="1000"
             thread_pool.rejection_policy="discard"

             oob_thread_pool.enabled="true"
             oob_thread_pool.min_threads="20"
             oob_thread_pool.max_threads="200"
             oob_thread_pool.keep_alive_time="1000"
             oob_thread_pool.queue_enabled="false"
             oob_thread_pool.rejection_policy="discard"/>
        <PING timeout="2000" num_initial_members="3"/>
        <MERGE2 max_interval="100000" min_interval="20000"/>
        <FD_SOCK start_port="${jboss.jgroups.udp.fd_sock_port:54200}"/>
        <FD timeout="6000" max_tries="5"/>
        <VERIFY_SUSPECT timeout="10000"/>
        <BARRIER/>
        <pbcast.NAKACK use_mcast_xmit="true" gc_lag="0"
                 retransmit_timeout="300,600,1200,2400,4800"
                 discard_delivered_msgs="true"/>
        <UNICAST timeout="300,600,1200,2400,3600"/>
        <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                 max_bytes="400000"/>
        <VIEW_SYNC avg_send_interval="10000"/>
        <pbcast.GMS print_local_addr="true"
                 join_timeout="3000"
                 view_bundling="true"
                 view_ack_collection_timeout="5000"
                 resume_task_timeout="7500"/>
        <UFC max_credits="2000000" ignore_synchronous_response="true"/>
        <MFC max_credits="2000000" ignore_synchronous_response="true"/>
        <FRAG2 frag_size="60000"/>
        <pbcast.STREAMING_STATE_TRANSFER/>
        <!--pbcast.STATE_TRANSFER/-->
        <pbcast.FLUSH timeout="0" start_flush_timeout="10000"/>
      </config>
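
      For reference, here is a rough sketch of the failure-detection delay these settings seem to imply. It rests on assumptions taken from my reading of the JGroups documentation, not verified against the version shipped with AS 6: that FD suspects a member after max_tries unanswered heartbeats of timeout ms each, that VERIFY_SUSPECT then waits its own timeout before confirming, and that MERGE2 looks for other partitions at a random interval between min_interval and max_interval. FD_SOCK may react sooner if a socket closes. The function names are illustrative only.

```python
# Values copied from the <FD>, <VERIFY_SUSPECT> and <MERGE2> elements above.
FD_TIMEOUT_MS = 6000
FD_MAX_TRIES = 5
VERIFY_SUSPECT_TIMEOUT_MS = 10000
MERGE2_MIN_INTERVAL_MS = 20000
MERGE2_MAX_INTERVAL_MS = 100000


def fd_suspect_delay_ms(timeout_ms: int, max_tries: int) -> int:
    """Assumption: FD raises SUSPECT after max_tries missed heartbeats of timeout_ms each."""
    return timeout_ms * max_tries


def exclusion_delay_ms(timeout_ms: int, max_tries: int, verify_ms: int) -> int:
    """Assumption: VERIFY_SUSPECT adds its full timeout before the member is dropped."""
    return fd_suspect_delay_ms(timeout_ms, max_tries) + verify_ms


if __name__ == "__main__":
    suspect = fd_suspect_delay_ms(FD_TIMEOUT_MS, FD_MAX_TRIES)
    excluded = exclusion_delay_ms(FD_TIMEOUT_MS, FD_MAX_TRIES, VERIFY_SUSPECT_TIMEOUT_MS)
    print(f"FD suspects after roughly {suspect / 1000:.0f} s of silence")
    print(f"Member dropped from the view after roughly {excluded / 1000:.0f} s")
    print(
        "MERGE2 checks for other partitions every "
        f"{MERGE2_MIN_INTERVAL_MS / 1000:.0f} to {MERGE2_MAX_INTERVAL_MS / 1000:.0f} s"
    )
```

      If those assumptions hold, the network would have had to be unreachable for some tens of seconds before the first log line below was written.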

       

      And here is the output from the logs when the issue occurs:

       

      SLAVE HOST: JASDalServ3

      2012-11-08 18:56:02,128 INFO  [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (VERIFY_SUSPECT.TimerThread,VirtualServCluster-HAPartition,JASDalServ3:1099) Suspected member: JASDalServ4:1099

      2012-11-08 18:56:02,253 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-2,null) Block processed at JASDalServ3:1099

      2012-11-08 18:56:02,925 INFO  [org.jboss.ha.framework.server.ClusterPartition.lifecycle.VirtualServCluster] (Incoming-4,null) New cluster view for partition VirtualServCluster (id: 2, delta: -1, merge: false) : [JASDalServ3:1099]

      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) dead members: [JASDalServ4:1099]

      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) membership changed from 2 to 1

      2012-11-08 18:56:02,925 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-4,null) Received new cluster view: [JASDalServ3:1099|2] [JASDalServ3:1099]

      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) Unblock processed at JASDalServ3:1099

      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (AsynchViewChangeHandler Thread) Begin notifyListeners, viewID: 2

      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) I am (JASDalServ3:1099) received membershipChanged event:

      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) Dead members: 1 ([JASDalServ4:1099])

      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) New Members : 0 ([])

      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) All Members : 1 ([JASDalServ3:1099])

      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) purgeDeadMembers, [JASDalServ4:1099]

      2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) election result =true, electionPolicy=org.jboss.ha.singleton.HASingletonElectionPolicySimple

      2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=false, viewID=115705935, partition=VirtualServCluster

      2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) startNewMaster, isMasterNode=false

      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) election result =true, electionPolicy=DistributedReplicantManager

      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=false, viewID=115705935, partition=VirtualServClus

      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) startNewMaster, isMasterNode=false

      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonController] (AsynchViewChangeHandler Thread) starting singleton, mSingleton=null, mSingletonMBean=TSServices6:service=Mosaic

      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) startSingleton() : elected for master singleton node

      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonController] (AsynchViewChangeHandler Thread) Calling operation: startSingleton(), on target: 'TSServices6:service=Mosaic'

       

      MASTER HOST: JASDalServ4

      2012-11-08 18:56:04,138 WARN  [org.jgroups.protocols.FD] (OOB-46,null) I was suspected by JASDalServ3:1099; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      2012-11-08 18:56:04,154 WARN  [org.jgroups.protocols.FD] (OOB-47,null) I was suspected by JASDalServ3:1099; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      2012-11-08 18:56:04,169 WARN  [org.jgroups.protocols.pbcast.GMS] (OOB-43,null) JASDalServ4:1099: not member of view [JASDalServ3:1099|2] [JASDalServ3:1099]; discarding it

       

      Could anyone shed some light on what is going on here? I don't understand why, less than one second after the SUSPECT message is sent out, the node has already decided to split off and form its own cluster, and why it then never merges back in.
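
      To make the timing concrete, here is a small sketch that measures the gap between the "Suspected member" line and the "New cluster view" line in the slave log above. The helper names are my own, and it assumes the first 23 characters of every log line are the timestamp.

```python
from datetime import datetime, timedelta

# Layout of the timestamp prefix in the server log, e.g. "2012-11-08 18:56:02,128".
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
LOG_TS_LENGTH = 23


def parse_ts(line: str) -> datetime:
    """Parse the leading timestamp of a log line (assumed to be its first 23 characters)."""
    return datetime.strptime(line[:LOG_TS_LENGTH], LOG_TS_FORMAT)


def gap_ms(earlier_line: str, later_line: str) -> int:
    """Whole milliseconds elapsed between two log lines written on the same host."""
    return (parse_ts(later_line) - parse_ts(earlier_line)) // timedelta(milliseconds=1)


# Both lines are copied from the JASDalServ3 log above.
suspected = (
    "2012-11-08 18:56:02,128 INFO  [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] "
    "(VERIFY_SUSPECT.TimerThread,VirtualServCluster-HAPartition,JASDalServ3:1099) "
    "Suspected member: JASDalServ4:1099"
)
new_view = (
    "2012-11-08 18:56:02,925 INFO  [org.jboss.ha.framework.server.ClusterPartition.lifecycle.VirtualServCluster] "
    "(Incoming-4,null) New cluster view for partition VirtualServCluster "
    "(id: 2, delta: -1, merge: false) : [JASDalServ3:1099]"
)

if __name__ == "__main__":
    print(gap_ms(suspected, new_view))  # 797
```

      I have only compared lines from the same host, since the clocks on the two servers may not be in sync.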

       

      Thanks