1 Reply Latest reply on Feb 13, 2013 7:58 AM by rhusar

    JBoss AS 6 cluster keeps splitting into separate clusters, problem with jgroup?

    stewart_g

      Hi,

       

      I have spent a lot of time searching the community and issues sites of jboss.org, but I haven't found any article on a fix for the split-brain issue on AS 6. We recently had a network issue where our cluster (2 hosts) split into 2 separate clusters, and even though connectivity was re-established between the two, they never merged back into a single cluster. The only way to resolve this was to stop one of the nodes and start it back up, at which point it rejoins the cluster and becomes a slave.

       

      Have any patches or packages been released that resolve this issue?

       

      I did find the following during my searches:

       

      https://community.jboss.org/thread/168601

       

      https://issues.jboss.org/browse/JBAS-9456

       

      It seems no work has been done on them, though.

       

      Here is the current JGroups setup I have configured on both JBoss instances:

       

      <config>
         <UDP
              singleton_name="udp"
              mcast_port="${jboss.jgroups.udp.mcast_port:45688}"
              mcast_addr="${jboss.jgroups.udp.mcast_addr,jboss.partition.udpGroup:228.11.11.11}"
              bind_port="${jboss.jgroups.udp.bind_port:55200}"
              tos="8"
              ucast_recv_buf_size="20000000"
              ucast_send_buf_size="640000"
              mcast_recv_buf_size="25000000"
              mcast_send_buf_size="640000"
              loopback="true"
              discard_incompatible_packets="true"
              enable_bundling="false"
              ip_ttl="${jgroups.udp.ip_ttl:2}"
              thread_naming_pattern="cl"
              timer.num_threads="12"
              enable_diagnostics="${jboss.jgroups.enable_diagnostics:true}"
              diagnostics_addr="${jboss.jgroups.diagnostics_addr:224.0.75.75}"
              diagnostics_port="${jboss.jgroups.diagnostics_port:7500}"

              thread_pool.enabled="true"
              thread_pool.min_threads="20"
              thread_pool.max_threads="200"
              thread_pool.keep_alive_time="5000"
              thread_pool.queue_enabled="true"
              thread_pool.queue_max_size="1000"
              thread_pool.rejection_policy="discard"

              oob_thread_pool.enabled="true"
              oob_thread_pool.min_threads="20"
              oob_thread_pool.max_threads="200"
              oob_thread_pool.keep_alive_time="1000"
              oob_thread_pool.queue_enabled="false"
              oob_thread_pool.rejection_policy="discard"/>
         <PING timeout="2000" num_initial_members="3"/>
         <MERGE2 max_interval="100000" min_interval="20000"/>
         <FD_SOCK start_port="${jboss.jgroups.udp.fd_sock_port:54200}"/>
         <FD timeout="6000" max_tries="5"/>
         <VERIFY_SUSPECT timeout="10000"/>
         <BARRIER/>
         <pbcast.NAKACK use_mcast_xmit="true" gc_lag="0"
                  retransmit_timeout="300,600,1200,2400,4800"
                  discard_delivered_msgs="true"/>
         <UNICAST timeout="300,600,1200,2400,3600"/>
         <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                  max_bytes="400000"/>
         <VIEW_SYNC avg_send_interval="10000"/>
         <pbcast.GMS print_local_addr="true"
                  join_timeout="3000"
                  view_bundling="true"
                  view_ack_collection_timeout="5000"
                  resume_task_timeout="7500"/>
         <UFC max_credits="2000000" ignore_synchronous_response="true"/>
         <MFC max_credits="2000000" ignore_synchronous_response="true"/>
         <FRAG2 frag_size="60000"/>
         <pbcast.STREAMING_STATE_TRANSFER/>
         <!--pbcast.STATE_TRANSFER/-->
         <pbcast.FLUSH timeout="0" start_flush_timeout="10000"/>
      </config>

       

      And here is the output from the logs when the issue occurs:

       

      SLAVE HOST: JASDalServ3
      2012-11-08 18:56:02,128 INFO  [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (VERIFY_SUSPECT.TimerThread,VirtualServCluster-HAPartition,JASDalServ3:1099) Suspected member: JASDalServ4:1099
      2012-11-08 18:56:02,253 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-2,null) Block processed at JASDalServ3:1099
      2012-11-08 18:56:02,925 INFO  [org.jboss.ha.framework.server.ClusterPartition.lifecycle.VirtualServCluster] (Incoming-4,null) New cluster view for partition VirtualServCluster (id: 2, delta: -1, merge: false) : [JASDalServ3:1099]
      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) dead members: [JASDalServ4:1099]
      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) membership changed from 2 to 1
      2012-11-08 18:56:02,925 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-4,null) Received new cluster view: [JASDalServ3:1099|2] [JASDalServ3:1099]
      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (Incoming-4,null) Unblock processed at JASDalServ3:1099
      2012-11-08 18:56:02,925 DEBUG [org.jboss.ha.framework.server.ClusterPartition.VirtualServCluster] (AsynchViewChangeHandler Thread) Begin notifyListeners, viewID: 2
      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) I am (JASDalServ3:1099) received membershipChanged event:
      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) Dead members: 1 ([JASDalServ4:1099])
      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) New Members : 0 ([])
      2012-11-08 18:56:02,940 INFO  [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) All Members : 1 ([JASDalServ3:1099])
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.core.framework.server.DistributedReplicantManagerImpl.VirtualServCluster] (AsynchViewChangeHandler Thread) purgeDeadMembers, [JASDalServ4:1099]
      2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) election result =true, electionPolicy=org.jboss.ha.singleton.HASingletonElectionPolicySimple
      2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=false, viewID=115705935, partition=VirtualServCluster
      2012-11-08 18:56:02,940 DEBUG [org.jboss.modcluster.ha.HAModClusterService] (AsynchViewChangeHandler Thread) startNewMaster, isMasterNode=false
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) election result =true, electionPolicy=DistributedReplicantManager
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) partitionTopologyChanged, isElectedNewMaster=true, isMasterNode=false, viewID=115705935, partition=VirtualServClus
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) startNewMaster, isMasterNode=false
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonController] (AsynchViewChangeHandler Thread) starting singleton, mSingleton=null, mSingletonMBean=TSServices6:service=Mosaic
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonSupport$HASingletonService] (AsynchViewChangeHandler Thread) startSingleton() : elected for master singleton node
      2012-11-08 18:56:02,940 DEBUG [org.jboss.ha.singleton.HASingletonController] (AsynchViewChangeHandler Thread) Calling operation: startSingleton(), on target: 'TSServices6:service=Mosaic'

       

      MASTER HOST: JASDalServ4
      2012-11-08 18:56:04,138 WARN  [org.jgroups.protocols.FD] (OOB-46,null) I was suspected by JASDalServ3:1099; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2012-11-08 18:56:04,154 WARN  [org.jgroups.protocols.FD] (OOB-47,null) I was suspected by JASDalServ3:1099; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      2012-11-08 18:56:04,169 WARN  [org.jgroups.protocols.pbcast.GMS] (OOB-43,null) JASDalServ4:1099: not member of view [JASDalServ3:1099|2] [JASDalServ3:1099]; discarding it

       

      Could anyone shed some light on what is going on here? I don't understand why, less than a second after the SUSPECT message is sent out, the node has already decided to split off and form its own cluster, and then never merges back in.

       

      Thanks

        • 1. Re: JBoss AS 6 cluster keeps splitting into separate clusters, problem with jgroup?
          rhusar

          Both of the linked issues are unrelated: they complain about the HASingleton service not restarting, whereas your problem is the cluster splitting and not merging afterwards.

           

          You could look into known issues with the JGroups version you are using and try updating to the latest micro release.
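
           

          To find out which version you are actually running, the JGroups jar can print it from the command line (the jar path below is just an example; locate the jar in your own AS 6 install):

          ```shell
          # Prints e.g. "Version: 2.6.x" -- adjust the jar path to your installation
          java -cp /path/to/your/jboss-6/common/lib/jgroups.jar org.jgroups.Version
          ```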

           

          Moreover, you can lower the MERGE2 timeouts so that a merge is attempted more often. Also turn on DEBUG logging for the merge-related protocols to see what is happening.
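
           

          For example, in your stack you could shorten the MERGE2 intervals (the values below are illustrative, not tuned recommendations) and add a log4j category for merge activity:

          ```xml
          <!-- In the JGroups stack: attempt a merge every 10-30 s instead of 20-100 s -->
          <MERGE2 max_interval="30000" min_interval="10000"/>

          <!-- In jboss-log4j.xml: DEBUG for the merge and membership protocols -->
          <category name="org.jgroups.protocols.MERGE2">
            <priority value="DEBUG"/>
          </category>
          <category name="org.jgroups.protocols.pbcast.GMS">
            <priority value="DEBUG"/>
          </category>
          ```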

           

          It could also be a network issue, with multicast traffic not being delivered correctly in both directions. Try using a TCP stack instead of UDP.
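
           

          A minimal sketch of a TCP-based stack for a 2-node cluster is below; the hostnames and ports are placeholders for your environment, and the remaining protocols would mirror your existing UDP stack (note TCPPING lists the members explicitly, so no multicast discovery is needed):

          ```xml
          <!-- Sketch only: substitute your own hosts/ports and carry over the rest of your stack -->
          <config>
             <TCP singleton_name="tcp"
                  bind_port="7600"/>
             <TCPPING timeout="3000"
                      initial_hosts="JASDalServ3[7600],JASDalServ4[7600]"
                      port_range="1"
                      num_initial_members="2"/>
             <MERGE2 max_interval="30000" min_interval="10000"/>
             <FD_SOCK/>
             <FD timeout="6000" max_tries="5"/>
             <VERIFY_SUSPECT timeout="1500"/>
             <pbcast.NAKACK use_mcast_xmit="false"
                      retransmit_timeout="300,600,1200,2400,4800"
                      discard_delivered_msgs="true"/>
             <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                      max_bytes="400000"/>
             <pbcast.GMS print_local_addr="true"
                      join_timeout="3000"
                      view_bundling="true"/>
             <FRAG2 frag_size="60000"/>
          </config>
          ```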