0 Replies Latest reply on Aug 28, 2007 2:15 AM by nicolaou

    Shunning Problems

    nicolaou

      Hi,

      I have a cluster set up with two nodes on different machines. The setup worked fine from the beginning, but lately I have been facing the following problem. A process runs on the cluster every morning and keeps the servers busy for about 10 minutes. Every now and then, though not always, one of the nodes is suspected and shunned from the cluster, and it never rejoins the group, even when I restart the node. At the start of the problem, node 2 does not leave the group, logging that it is waiting for an EXIT message, and then it logs that it is the coord and is being suspected, which I do not understand.

      The log file of node 1 (192.168.202.56) is:

      2007-08-23 11:36:50,088 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.56:37355|28] [192.168.202.56:37355]
      2007-08-23 11:36:50,729 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.56:37357|28] [192.168.202.56:37357]
      2007-08-23 11:36:55,505 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.56:37352|16] [192.168.202.56:37352]
      2007-08-23 11:36:57,801 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[192.168.202.57:32932 (additional data: 19 bytes), 192.168.202.56:37348 (additional data: 19 bytes)], pingable_mbrs=[192.168.202.56:37348 (additional data: 19 bytes)], local_addr=192.168.202.56:37348 (additional data: 19 bytes)
      2007-08-23 11:36:58,374 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Suspected member: 192.168.202.57:32932 (additional data: 19 bytes)
      2007-08-23 11:37:00,303 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[192.168.202.57:32932 (additional data: 19 bytes), 192.168.202.56:37348 (additional data: 19 bytes)], pingable_mbrs=[192.168.202.56:37348 (additional data: 19 bytes)], local_addr=192.168.202.56:37348 (additional data: 19 bytes)
      2007-08-23 11:37:02,805 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[192.168.202.57:32932 (additional data: 19 bytes), 192.168.202.56:37348 (additional data: 19 bytes)], pingable_mbrs=[192.168.202.56:37348 (additional data: 19 bytes)], local_addr=192.168.202.56:37348 (additional data: 19 bytes)
      2007-08-23 11:37:03,306 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Suspected member: 192.168.202.57:32932 (additional data: 19 bytes)
      2007-08-23 11:37:05,571 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.56:37355|29] [192.168.202.56:37355, 192.168.202.57:32943]
      2007-08-23 11:37:05,571 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.56:37357|29] [192.168.202.56:37357, 192.168.202.57:32941]
      2007-08-23 11:37:05,595 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.56:37352|17] [192.168.202.56:37352, 192.168.202.57:32939]
      2007-08-23 11:37:05,604 INFO [org.jboss.cache.TreeCache] locking the / subtree to return the in-memory (transient) state
      2007-08-23 11:37:05,604 INFO [org.jboss.cache.TreeCache] locking the / subtree to return the in-memory (transient) state
      2007-08-23 11:37:05,613 INFO [org.jboss.cache.statetransfer.StateTransferGenerator_1241] returning the state for tree rooted in /(1024 bytes)
      2007-08-23 11:37:05,615 INFO [org.jboss.cache.statetransfer.StateTransferGenerator_1241] returning the state for tree rooted in /(1024 bytes)
      2007-08-23 11:37:05,632 INFO [org.jboss.cache.TreeCache] locking the / subtree to return the in-memory (transient) state
      2007-08-23 11:37:05,633 INFO [org.jboss.cache.statetransfer.StateTransferGenerator_1241] returning the state for tree rooted in /(1024 bytes)
      2007-08-23 11:37:06,821 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[192.168.202.57:32932 (additional data: 19 bytes), 192.168.202.56:37348 (additional data: 19 bytes)], pingable_mbrs=[192.168.202.56:37348 (additional data: 19 bytes)], local_addr=192.168.202.56:37348 (additional data: 19 bytes)
      2007-08-23 11:37:08,308 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Suspected member: 192.168.202.57:32932 (additional data: 19 bytes)
      2007-08-23 11:37:09,322 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[192.168.202.57:32932 (additional data: 19 bytes), 192.168.202.56:37348 (additional data: 19 bytes)], pingable_mbrs=[192.168.202.56:37348 (additional data: 19 bytes)], local_addr=192.168.202.56:37348 (additional data: 19 bytes)
      2007-08-23 11:37:11,824 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[192.168.202.57:32932 (additional data: 19 bytes), 192.168.202.56:37348 (additional data: 19 bytes)], pingable_mbrs=[192.168.202.56:37348 (additional data: 19 bytes)], local_addr=192.168.202.56:37348 (additional data: 19 bytes)
      2007-08-23 11:37:13,312 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Suspected member: 192.168.202.57:32932 (additional data: 19 bytes)
      2007-08-23 11:37:14,325 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[192.168.202.57:32932 (additional data: 19 bytes), 192.168.202.56:37348 (additional data: 19 bytes)], pingable_mbrs=[192.168.202.56:37348 (additional data: 19 bytes)], local_addr=192.168.202.56:37348 (additional data: 19 bytes)
      2007-08-23 11:37:16,826 WARN [org.jgroups.protocols.FD] ping_dest is null: members=[192.168.202.57:32932 (additional data: 19 bytes), 192.168.202.56:37348 (additional data: 19 bytes)], pingable_mbrs=[192.168.202.56:37348 (additional data: 19 bytes)], local_addr=192.168.202.56:37348 (additional data: 19 bytes)

      The log file of node 2 (192.168.202.57) is:

      2007-08-23 11:36:59,608 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.56:37352|17] [192.168.202.56:37352, 192.168.202.57:32939]
      2007-08-23 11:36:59,616 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.56:37357|29] [192.168.202.56:37357, 192.168.202.57:32941]
      2007-08-23 11:36:59,618 INFO [org.jboss.cache.TreeCache] viewAccepted(): [192.168.202.56:37355|29] [192.168.202.56:37355, 192.168.202.57:32943]
      2007-08-23 11:36:59,629 INFO [org.jboss.cache.TreeCache] received the state (size=1024 bytes)
      2007-08-23 11:36:59,629 INFO [org.jboss.cache.TreeCache] received the state (size=1024 bytes)
      2007-08-23 11:36:59,667 INFO [org.jboss.cache.TreeCache] received the state (size=1024 bytes)
      2007-08-23 11:37:01,813 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
      2007-08-23 11:37:04,316 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
      2007-08-23 11:37:04,815 WARN [org.jgroups.protocols.pbcast.CoordGmsImpl] I am the coord and I'm being am suspected -- will probably leave shortly
      2007-08-23 11:37:04,815 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] Suspected member: 192.168.202.57:32932 (additional data: 19 bytes)
      2007-08-23 11:37:06,817 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)

      The log file of node 2 (192.168.202.57) after the restart is:

      -------------------------------------------------------
      GMS: address is 192.168.202.57:32999 (additional data: 19 bytes)
      -------------------------------------------------------
      2007-08-27 18:51:16,824 ERROR [org.jgroups.protocols.pbcast.ClientGmsImpl] suspect() should not be invoked on an instance of org.jgroups.protocols.pbcast.ClientGmsImpl
      2007-08-27 18:51:16,825 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Suspected member: 192.168.202.57:32932 (additional data: 19 bytes)
      2007-08-27 18:51:19,207 WARN [org.jgroups.protocols.pbcast.ClientGmsImpl] handleJoin(192.168.202.57:32999 (additional data: 19 bytes)) failed, retrying
      2007-08-27 18:51:21,826 ERROR [org.jgroups.protocols.pbcast.ClientGmsImpl] suspect() should not be invoked on an instance of org.jgroups.protocols.pbcast.ClientGmsImpl
      2007-08-27 18:51:21,826 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Suspected member: 192.168.202.57:32932 (additional data: 19 bytes)
      2007-08-27 18:51:26,828 ERROR [org.jgroups.protocols.pbcast.ClientGmsImpl] suspect() should not be invoked on an instance of org.jgroups.protocols.pbcast.ClientGmsImpl
      2007-08-27 18:51:26,828 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Suspected member: 192.168.202.57:32932 (additional data: 19 bytes)
      2007-08-27 18:51:28,211 WARN [org.jgroups.protocols.pbcast.ClientGmsImpl] handleJoin(192.168.202.57:32999 (additional data: 19 bytes)) failed, retrying

      The cluster configuration in the cluster-service.xml file looks like this:


      <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45566" ip_ttl="8" ip_mcast="true" mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
      ucast_send_buf_size="800000" ucast_recv_buf_size="150000" loopback="false" />
      <PING timeout="2000" num_initial_members="3" up_thread="true" down_thread="true" />
      <MERGE2 min_interval="10000" max_interval="20000" />
      <FD shun="true" up_thread="true" down_thread="true" timeout="2500" max_tries="5" />
      <VERIFY_SUSPECT timeout="3000" num_msgs="3" up_thread="true" down_thread="true" />
      <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800" max_xmit_size="8192" up_thread="true" down_thread="true" />
      <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10" down_thread="true" />
      <pbcast.STABLE desired_avg_gossip="20000" up_thread="true" down_thread="true" />
      <FRAG frag_size="8192" down_thread="true" up_thread="true" />
      <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true" />
      <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />
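
      For reference, here is a rough sketch of how I read the failure-detection settings above. It assumes that FD suspects a member once max_tries consecutive heartbeats have each gone unanswered for timeout milliseconds, and that VERIFY_SUSPECT then waits up to its own timeout before the suspicion is passed on. The class and method names are hypothetical and only illustrate the arithmetic; they are not part of JGroups.

```java
// Hypothetical helper, not part of JGroups: estimates how long a node can be
// unresponsive before it is suspected, given the FD and VERIFY_SUSPECT
// values from the configuration above.
class FdWindow {

    // Assumption: FD suspects a member after maxTries heartbeats in a row
    // each go unanswered for timeoutMs milliseconds.
    static long suspectAfterMillis(long timeoutMs, int maxTries) {
        return timeoutMs * maxTries;
    }

    // Assumption: VERIFY_SUSPECT adds at most its own timeout on top of
    // the FD window before the suspicion reaches GMS.
    static long excludeAfterMillis(long fdTimeoutMs, int fdMaxTries, long verifyTimeoutMs) {
        return suspectAfterMillis(fdTimeoutMs, fdMaxTries) + verifyTimeoutMs;
    }

    public static void main(String[] args) {
        long fdTimeoutMs = 2500;     // FD timeout="2500"
        int fdMaxTries = 5;          // FD max_tries="5"
        long verifyTimeoutMs = 3000; // VERIFY_SUSPECT timeout="3000"

        System.out.println("suspected after ~"
                + suspectAfterMillis(fdTimeoutMs, fdMaxTries) + " ms");
        System.out.println("excluded after ~"
                + excludeAfterMillis(fdTimeoutMs, fdMaxTries, verifyTimeoutMs) + " ms");
    }
}
```

      If that reading is right, a node that stops answering heartbeats for roughly 12.5 seconds (for example during a long pause while the morning process runs) would be suspected, and about 3 seconds later it would be excluded.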



      Thanks,

      Christos.