Cluster hangs waiting for State Transfer lock
gustavonalle · Aug 17, 2012 8:53 PM

Hi, we are currently observing some strange behaviour in Infinispan.
Basically we have a 4-node cluster which is stopped and started simultaneously by Puppet, and sometimes during startup the whole cluster hangs (the more nodes in the cluster, the easier it is to reproduce). A thread dump shows the problematic section:
"AsyncHttpClient-Callback" daemon prio=10 tid=0x0000000000a8d800 nid=0x39f8 in Object.wait() [0x00007fa8962df000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000007dd22ac48> (a java.lang.Object)
at org.infinispan.statetransfer.StateTransferLockImpl.acquireLockForWriteCommand(StateTransferLockImpl.java:309)
- locked <0x00000007dd22ac48> (a java.lang.Object)
at org.infinispan.statetransfer.StateTransferLockImpl.acquireForCommand(StateTransferLockImpl.java:128)
at org.infinispan.interceptors.StateTransferLockInterceptor.handleWriteCommand(StateTransferLockInterceptor.java:176)
at org.infinispan.interceptors.StateTransferLockInterceptor.visitPutKeyValueCommand(StateTransferLockInterceptor.java:152)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
at org.infinispan.interceptors.CacheMgmtInterceptor.visitPutKeyValueCommand(CacheMgmtInterceptor.java:125)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:132)
at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:91)
at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:62)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:77)
at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:345)
at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1006)
at org.infinispan.CacheImpl.put(CacheImpl.java:702)
at org.infinispan.CacheImpl.put(CacheImpl.java:694)
at org.infinispan.CacheSupport.put(CacheSupport.java:128)
Looking at the TRACE logs on more than one node, they show a replication failure, always with the same sender:
23:53:55.801 [asyncTransportThread-0] TRACE o.i.remoting.rpc.RpcManagerImpl - replication exception:
org.infinispan.remoting.transport.jgroups.SuspectException: One or more nodes have left the cluster while replicating command StateTransferControlCommand{cache=c.t.e.search.s.service.permissions.serviceCache, type=APPLY_STATE, sender=uk2x-rscnode12-52574, viewId=4, state=0, locks=N/A}
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:456) ~[sc-2012.1.1-SNAPSHOT.jar:na]
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:161) ~[sc-2012.1.1-SNAPSHOT.jar:na]
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:183) ~[sc-2012.1.1-SNAPSHOT.jar:na]
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:240) ~[sc-2012.1.1-SNAPSHOT.jar:na]
at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:78) ~[sc-2012.1.1-SNAPSHOT.jar:na]
at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:274) [sc-2012.1.1-SNAPSHOT.jar:na]
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) [na:1.7.0_03]
at java.util.concurrent.FutureTask.run(FutureTask.java:166) [na:1.7.0_03]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) [na:1.7.0_03]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) [na:1.7.0_03]
at java.lang.Thread.run(Thread.java:722) [na:1.7.0_03]
If I stop ONLY the sender, the whole cluster resurrects and the write locks are released on all nodes. Otherwise, the whole cluster remains frozen.
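Since simultaneous restarts seem to trigger the problem, the only mitigation we have come up with so far is staggering the starts so the nodes don't all kick off state transfer at the same instant. A rough sketch of what the Puppet-invoked start script could do (the NODE_INDEX variable and the delay value are assumptions on our side, not anything Infinispan provides):

```shell
#!/bin/sh
# Hypothetical stagger before starting the Infinispan node.
# NODE_INDEX is assumed to be set per host (e.g. derived from the hostname).
NODE_INDEX=${NODE_INDEX:-0}
DELAY=$(( NODE_INDEX * 2 ))   # seconds of stagger; value is illustrative
sleep "$DELAY"
echo "node ${NODE_INDEX}: starting after ${DELAY}s stagger"
```

This only reduces the window for the race rather than fixing it, so we'd prefer to understand the underlying cause.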
The caches are REPL_ASYNC, using the embedded jgroups-udp.xml. Any suggestions on where I should look to troubleshoot this, or any configuration changes to avoid this cluster-wide locking?
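For reference, the relevant cache configuration is along these lines (a sketch in Infinispan 5.x-style XML; the cache name and timeout values here are illustrative assumptions, not copied verbatim from our config):

```xml
<!-- Sketch of the cache configuration; name and timeouts are illustrative -->
<namedCache name="serviceCache">
   <clustering mode="replication">
      <!-- asynchronous replication (REPL_ASYNC) -->
      <async/>
      <!-- state transfer settings that may be relevant to the hang -->
      <stateRetrieval fetchInMemoryState="true" timeout="240000"/>
   </clustering>
</namedCache>
```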
Thanks,
Gustavo