1 Reply Latest reply on Jan 2, 2009 7:58 AM by Manik Surtani

    Cluster hangs when a member is non-responsive

    Amrit Jassal Newbie

      I have a JBoss Cache cluster with 2 nodes. When one of the nodes is overloaded or is running into OOM (out of memory) issues, the other node also becomes non-responsive. A thread dump on the healthy (non-OOM) instance shows JBoss Cache threads waiting on a lock (excerpt below).

      Do I need to tweak the failure detection protocol somehow?

      Configuration:

      Version: 2.2.1.GA
      Codename: Poblano

      Replication mode: REPL_ASYNC



      <!-- UDP: if you have a multihomed machine,
      set the bind_addr attribute to the appropriate NIC IP address -->
      <!-- UDP: On Windows machines, because of the media sense feature
      being broken with multicast (even after disabling media sense)
      set the loopback attribute to true -->
      <UDP mcast_addr="228.8.8.8" mcast_port="45567"
           bind_addr="127.0.0.1" bind_to_all_interfaces="false"
           ip_ttl="64" ip_mcast="true" mcast_send_buf_size="150000"
           mcast_recv_buf_size="80000" ucast_send_buf_size="150000"
           ucast_recv_buf_size="80000" loopback="false" />
      <PING timeout="2000" num_initial_members="3" />
      <MERGE2 min_interval="10000" max_interval="20000" />
      <FD_SOCK/>
      <VERIFY_SUSPECT timeout="1500" />
      <pbcast.NAKACK gc_lag="50"
           retransmit_timeout="600,1200,2400,4800" />
      <pbcast.STABLE desired_avg_gossip="20000" />
      <FRAG frag_size="8192" />
      <pbcast.GMS join_timeout="5000" shun="true" print_local_addr="true" />
      <pbcast.STATE_TRANSFER />




      Thread dump:

      java.lang.Thread.State: WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)
      - parking to wait for <0x00002aaacd330d30> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1889)
      at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:254)
      at org.jgroups.blocks.BasicConnectionTable$Connection.send(BasicConnectionTable.java:499)
      at org.jgroups.blocks.BasicConnectionTable.send(BasicConnectionTable.java:322)
      at org.jgroups.protocols.TCP.send(TCP.java:55)
      at org.jgroups.protocols.BasicTCP.sendToSingleMember(BasicTCP.java:209)
      at org.jgroups.protocols.BasicTCP.sendToAllMembers(BasicTCP.java:194)
      at org.jgroups.protocols.TP.doSend(TP.java:1476)
      at org.jgroups.protocols.TP.send(TP.java:1466)
      at org.jgroups.protocols.TP.down(TP.java:1187)
      at org.jgroups.protocols.Discovery.down(Discovery.java:373)
      at org.jgroups.protocols.MERGE2.down(MERGE2.java:175)
      at org.jgroups.protocols.FD_SOCK.down(FD_SOCK.java:367)
      at org.jgroups.protocols.VERIFY_SUSPECT.down(VERIFY_SUSPECT.java:95)
      at org.jgroups.protocols.pbcast.NAKACK.send(NAKACK.java:787)
      at org.jgroups.protocols.pbcast.NAKACK.down(NAKACK.java:589)
      at org.jgroups.protocols.UNICAST.down(UNICAST.java:462)
      at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:316)
      at org.jgroups.protocols.FRAG.down(FRAG.java:136)
      at org.jgroups.protocols.pbcast.GMS.down(GMS.java:858)
      at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:200)
      at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:457)
      at org.jgroups.JChannel.downcall(JChannel.java:1474)
      at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:780)
      at org.jgroups.blocks.RequestCorrelator.sendRequest(RequestCorrelator.java:303)
      at org.jgroups.blocks.GroupRequest.sendRequest(GroupRequest.java:545)
      at org.jgroups.blocks.GroupRequest.execute(GroupRequest.java:228)
      at org.jgroups.blocks.MessageDispatcher.castMessage(MessageDispatcher.java:457)
      at org.jboss.cache.marshall.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:102)
      at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:403)
      at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:375)
      at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:380)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:143)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:117)
      at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:89)
      at org.jboss.cache.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:139)
      at org.jboss.cache.interceptors.ReplicationInterceptor.visitPutKeyValueCommand(ReplicationInterceptor.java:86)
      at org.jboss.cache.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:92)
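To illustrate what the top frames of the dump seem to show: the thread is parked inside LinkedBlockingQueue.put(), called from BasicConnectionTable$Connection.send(). That is what a bounded queue does when it is full and nothing is draining it, which would fit a peer that has stopped reading. The sketch below reproduces only that generic java.util.concurrent behaviour. It is not JGroups code; the class name, method names, capacity, and timeouts are hypothetical and chosen purely for illustration.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedSendQueueDemo {

    /** Builds a bounded queue and fills it to capacity, like a send queue nobody drains. */
    static LinkedBlockingQueue<byte[]> fullQueue(int capacity) {
        LinkedBlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            queue.add(new byte[0]);
        }
        return queue;
    }

    /** Returns true if a thread calling put() on a full queue is still blocked after waitMillis. */
    public static boolean putBlocksWhenFull(int capacity, long waitMillis) {
        LinkedBlockingQueue<byte[]> queue = fullQueue(capacity);
        Thread sender = new Thread(() -> {
            try {
                queue.put(new byte[0]); // parks here, as in the thread dump
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.setDaemon(true);
        sender.start();
        try {
            sender.join(waitMillis); // wait a bounded time for the sender to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        boolean stillBlocked = sender.isAlive();
        sender.interrupt(); // release the parked thread so the demo can exit
        return stillBlocked;
    }

    /** Returns true if a timed offer() on a full queue gives up instead of blocking forever. */
    public static boolean offerGivesUpWhenFull(int capacity, long timeoutMillis) {
        LinkedBlockingQueue<byte[]> queue = fullQueue(capacity);
        try {
            return !queue.offer(new byte[0], timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("put() still blocked: " + putBlocksWhenFull(2, 200));
        System.out.println("offer() gave up: " + offerGivesUpWhenFull(2, 200));
    }
}
```

If this is what is happening here, every caller that replicates through the same connection would queue up behind the blocked put() until the slow member is removed from the view or starts reading again.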