1 Reply Latest reply on Nov 5, 2007 12:08 PM by jboss_cody

    Not removing failed node from partition

    nsaunders

      Hi All,

      I'm having some problems configuring JGroups failure detection. I'm using JBoss [Zion] 4.0.5.GA (build: CVSTag=Branch_4_0 date=200610162339) and JGroups 2.2.9 beta2.

      My setup consists of 2 nodes, each with 8 interfaces, paired into 4 redundant groups per node.


                   |                          |
                   | App 1                    | App 1
                   |                          |
            -----------------          -----------------
       App 3|               |   Mgmt   |               | App 3
       -----|    NODE 1     |----------|    NODE 2     |-----
            |               |          |               |
            -----------------          -----------------
                   |                          |
                   |                          |
                   |         Database         |
                   ----------------------------


      I've used the -b option to bind all traffic to the management LAN, and tweaked my application to bind only to the App 1 LAN. What I'm now trying to do is run the healthcheck/failure detection explicitly over the Database LAN.
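      (For context, the interface the channel itself uses is pinned via bind_addr on the UDP transport — this is effectively what -b sets. A minimal sketch, assuming JGroups 2.2.x attribute names; the address shown is Node 2's management interface from my setup, used purely as an example:)

```xml
<!-- Illustrative fragment only: bind_addr pins the channel's traffic
     to one interface. Other attributes omitted for brevity. -->
<UDP mcast_addr="228.1.2.3" mcast_port="45566"
     bind_addr="192.168.105.155"
     ip_ttl="8" ip_mcast="true"/>
```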

      I've edited my cluster-service.xml file, which looks like this (edited for brevity):

      <Config>
       <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45566"
       ip_ttl="${jgroups.mcast.ip_ttl:8}" ip_mcast="true"
       mcast_recv_buf_size="2000000" mcast_send_buf_size="640000"
       ucast_recv_buf_size="2000000" ucast_send_buf_size="640000"
       loopback="false"/>
       <PING timeout="2000" num_initial_members="3"
       up_thread="true" down_thread="true"/>
       <MERGE2 min_interval="10000" max_interval="20000"/>
       <FD_SOCK srv_sock_bind_addr="192.168.104.56" down_thread="false" up_thread="false"/>
       <VERIFY_SUSPECT timeout="3000" num_msgs="3"
       up_thread="true" down_thread="true"/>
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
       max_xmit_size="8192"
       up_thread="true" down_thread="true"/>
       <UNICAST timeout="300,600,1200,2400,4800" down_thread="true"/>
       <pbcast.STABLE desired_avg_gossip="20000" max_bytes="400000"
       up_thread="true" down_thread="true"/>
       <FRAG frag_size="8192"
       down_thread="true" up_thread="true"/>
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
       shun="true" print_local_addr="true"/>
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
       </Config>


      This is the config on Node 2; 192.168.104.56 is this node's VIP on the database interface.

      I have removed the standard FD protocol, as it was failing to detect a failure when the database connection was broken. Based on what I read at http://www.redhat.com/docs/manuals/jboss/jboss-eap-4.2/doc/Server_Configuration_Guide/Failure_Detection_Protocols-FD.html ("Regular traffic from a node counts as if it is a live. So, the are-you-alive messages are only sent when there is no regular traffic to the node for sometime."), I assumed FD must have been seeing other JGroups traffic across the management LAN and so still counted the node as alive.
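      (For reference, if FD were kept in the stack, re-adding it alongside FD_SOCK might look like the sketch below. The timeout/max_tries/shun attributes are standard JGroups 2.x FD attributes, but the values are illustrative guesses, not something I've tested on this setup:)

```xml
<!-- Hypothetical sketch: heartbeat-based FD with an explicit retry
     limit, running above FD_SOCK. Values are illustrative only. -->
<FD timeout="10000" max_tries="5" shun="true"
    up_thread="false" down_thread="false"/>
<FD_SOCK srv_sock_bind_addr="192.168.104.56"
    up_thread="false" down_thread="false"/>
```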

      The problem is that now the second node correctly detects the failure and promotes itself to master, but getCurrentView() still returns two nodes:

      Fri Jul 20 14:34:17 BST 2007 MasterNode=true 192.168.105.158:1099 192.168.105.155:1099


      (192.168.105.158 and 192.168.105.155 are the management interfaces of Nodes 1 and 2 respectively.)

      What's interesting is that when I plugged Node 1 back in, Node 2 briefly removed it from the partition before merging it back in:

      Fri Jul 20 14:35:05 BST 2007 MasterNode=true 192.168.105.158:1099 192.168.105.155:1099
      Fri Jul 20 14:35:11 BST 2007 MasterNode=true 192.168.105.158:1099 192.168.105.155:1099
      Fri Jul 20 14:35:16 BST 2007 MasterNode=true 192.168.105.155:1099
      Fri Jul 20 14:35:22 BST 2007 MasterNode=true 192.168.105.155:1099 192.168.105.158:1099
      Fri Jul 20 14:35:29 BST 2007 MasterNode=true 192.168.105.155:1099 192.168.105.158:1099


      Any advice or assistance would be greatly appreciated!

      Kind Regards,

      Neil Saunders.



        • 1. Re: Not removing failed node from partition
          jboss_cody

          I am having a similar problem to this.

          Why are my deadMembers not being removed?


          purgeDeadMembers, [192.168.202.x:1099]
          2007-11-04 17:00:03,287 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] trying to remove deadMember 192.168.202.x:1099 for key DCacheBridge-DefaultJGBridge
          2007-11-04 17:00:03,287 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] 192.168.202.x:1099 was NOT removed!!!
          2007-11-04 17:00:03,288 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] trying to remove deadMember 192.168.202.x:1099 for key jboss.ha:service=HASingletonDeployer
          2007-11-04 17:00:03,288 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] 192.168.202.x:1099 was NOT removed!!!
          2007-11-04 17:00:03,288 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] trying to remove deadMember 192.168.202.x:1099 for key HAJNDI
          2007-11-04 17:00:03,288 DEBUG [org.jboss.ha.framework.server.DistributedReplicantManagerImpl.jboss1] 192.168.202.x:1099 was NOT removed!!!
          2007-11-04 17:00:03,289 DEBUG [org.jboss.ha.framework.interfaces.HAPartition.jboss1] End notifyListeners, viewID: 11


          Any reply is appreciated...