0 Replies Latest reply on Nov 9, 2010 5:14 AM by aditi.andhare

SuspectException seen by one node when the other node in the cluster goes down

aditi.andhare Nov 9, 2010 5:14 AM

Hi all,

We are using the following configuration:

jGroups 3.2.0GA

jboss cache 3.2.0 GA

jboss AS 5.1.0 GA

I have two nodes in my clustered setup. Consider the case when there is continous load on both of these nodes. Now I stop one of the nodes in the cluster. The other node which is still active sees the SuspectException for a very few transactions hence affecting transactions performed by the active node. Here is the stack trace:

org.jboss.cache.SuspectException: Suspected member: 10.17.221.19:59378
        at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:764)
        at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:716)
        at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:721)
        at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:161)
        at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:135)
        at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:107)
        at org.jboss.cache.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:160)
        at org.jboss.cache.interceptors.ReplicationInterceptor.visitPutDataMapCommand(ReplicationInterceptor.java:113)
        at org.jboss.cache.commands.write.PutDataMapCommand.acceptVisitor(PutDataMapCommand.java:104)
        at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
        at org.jboss.cache.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:131)
        at org.jboss.cache.commands.AbstractVisitor.visitPutDataMapCommand(AbstractVisitor.java:60)
        at org.jboss.cache.commands.write.PutDataMapCommand.acceptVisitor(PutDataMapCommand.java:104)
        at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
        at org.jboss.cache.interceptors.TxInterceptor.attachGtxAndPassUpChain(TxInterceptor.java:301)
        at org.jboss.cache.interceptors.TxInterceptor.handleDefault(TxInterceptor.java:283)
        at org.jboss.cache.commands.AbstractVisitor.visitPutDataMapCommand(AbstractVisitor.java:60)
        at org.jboss.cache.commands.write.PutDataMapCommand.acceptVisitor(PutDataMapCommand.java:104)
        at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
        at org.jboss.cache.interceptors.CacheMgmtInterceptor.visitPutDataMapCommand(CacheMgmtInterceptor.java:97)
        at org.jboss.cache.commands.write.PutDataMapCommand.acceptVisitor(PutDataMapCommand.java:104)
        at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
        at org.jboss.cache.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:178)
        at org.jboss.cache.interceptors.InvocationContextInterceptor.visitPutDataMapCommand(InvocationContextInterceptor.java:64)
        at org.jboss.cache.commands.write.PutDataMapCommand.acceptVisitor(PutDataMapCommand.java:104)
        at org.jboss.cache.interceptors.InterceptorChain.invoke(InterceptorChain.java:287)
        at org.jboss.cache.invocation.CacheInvocationDelegate.invokePut(CacheInvocationDelegate.java:705)
        at org.jboss.cache.invocation.CacheInvocationDelegate.put(CacheInvocationDelegate.java:519)
        at org.jboss.cache.invocation.NodeInvocationDelegate.addChild(NodeInvocationDelegate.java:337)
        at com.openwave.servicebroker.util.CacheUtil.createNode(CacheUtil.java:204)
        at com.openwave.servicebroker.ServiceBrokerImpl.serviceRequest(ServiceBrokerImpl.java:838)
        at servicebroker.ServiceBroker$Processor$serviceRequest.process(ServiceBroker.java:809)
        at servicebroker.ServiceBroker$Processor.process(ServiceBroker.java:626)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:252)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

The code statement which causes the above exception is not in a transaction. Here is the FD configurations in the conf files:

         <FD_SOCK/>
         <FD max_tries="20" shun="false" timeout="60000"/>
         <VERIFY_SUSPECT timeout="1500"/>

Upon analysis of logs I see that this exception is always caused during the short time gap when one node is suspected and when the cluster recieves a new view.

2010-10-29 03:04:12,480 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.ClusterOne] (VERIFY_SUSPECT.TimerThread,ClusterOne,10.17.221.18:48782) Suspected member: 10.17.221.19:34650
2010-10-29 03:04:12,534 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.ClusterOne] (Incoming-17,10.17.221.18:48782) New cluster view for partition ClusterOne (id: 4, delta: -1) : [10.17.221.18:1099]

So my question is:

1. Is this an expected behaviour?

2. Whenever a member in a cluster goes down, will all the active transactions seen by the other active members in the cluster fail due to the Suspect Exception?

3. Or are there any configuration settings that I am missing out here?

Thanks in advance for all your help.

Aditi