Flow Control (FC) blocks cluster during failure detection
richtaylor Jun 1, 2010 2:32 PM
JBoss 5.1.0
JDK 1.6.0_16-b01
We have three servers clustered for web sessions and JBoss Cache in production. Things typically run great, but occasionally (every few weeks, under load) the entire cluster locks up and stops responding to new requests for approximately 40-60 seconds. That may not sound like much, but it trips our alarms, and our entire product stops responding to everyone.
I've narrowed it down to a long full GC on one of the servers (call it server A); the other servers block while trying to talk to the server doing the GC. The blocked servers unblock when either the GC finishes or server A is removed from the cluster by the others (via FD, after approx. 35-40 seconds).
It is essentially what is described here:
https://community.jboss.org/message/260697#260697
And I have read this:
http://community.jboss.org/wiki/JGroupsFC
I've tried lowering the FD timeouts and sure enough, the cluster unblocks when "server A" is removed from the cluster.
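As a rough back-of-the-envelope for why the eviction takes tens of seconds: assuming FD suspects a member after about timeout × max_tries of missed heartbeats, and VERIFY_SUSPECT then adds its own timeout before the member is excluded, the window can be estimated as below. The class name, method name, and the numbers are placeholders for illustration, not our actual configuration.

```java
public class FdWindow {
    // Rough estimate only: assumes FD suspects after timeout * max_tries and
    // VERIFY_SUSPECT adds one more timeout before the member is excluded.
    static long estimateMillis(long fdTimeoutMillis, int fdMaxTries, long verifySuspectMillis) {
        return fdTimeoutMillis * fdMaxTries + verifySuspectMillis;
    }

    public static void main(String[] args) {
        // Placeholder values, not taken from a real configuration.
        System.out.println(estimateMillis(6000, 5, 1500)); // prints 31500
    }
}
```

Lowering either factor shrinks the window, at the cost of more false suspicions when a node is merely slow.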
The really long full GC is a separate issue that we're working on. In theory, fixing it would prevent most of these cluster-wide blocks. However, there may be other things that would cause FC to hang the entire cluster while waiting for credits.
My question is: can we safely tweak FC (using max_block_time or max_block_times) to prevent the entire cluster from blocking while one node is busy (e.g. in a full GC)? Or does the very nature of the problem mean that the cluster (at least requests to servers that write to the session etc.) will be blocked until FD kicks the paused node out of the cluster?
I tried a low max_block_time, but that didn't seem to have the right effect. After looking at the source code in FC.java, I found max_block_times (plural), tried that, and it seemed to work like a charm. But it seems to bypass the credit process when no credits are available. I'm trying to understand the ramifications of using max_block_times as a workaround.
Example stack trace. All of our jboss-web threads end up in this state on the blocked servers in the cluster (not server A, obviously):
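To make the bypass concrete, here is a minimal sketch of credit-based sending with a maximum block time. It is not the FC.java code; the class and method names (CreditGate, acquire, replenish) are made up for illustration, and it only models the behaviour I think I'm seeing: a sender waits for credits, and once the time limit expires it sends anyway without consuming any credits.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical model of a credit gate; not the JGroups implementation.
public class CreditGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition creditsAvailable = lock.newCondition();
    private final long maxBlockMillis; // <= 0 means block until credits arrive
    private long credits;

    public CreditGate(long initialCredits, long maxBlockMillis) {
        this.credits = initialCredits;
        this.maxBlockMillis = maxBlockMillis;
    }

    /** Returns true if credits were consumed, false if the wait timed out and the send bypasses credits. */
    public boolean acquire(long size) throws InterruptedException {
        lock.lock();
        try {
            long remainingNanos = TimeUnit.MILLISECONDS.toNanos(maxBlockMillis);
            while (credits < size) {
                if (maxBlockMillis <= 0) {
                    creditsAvailable.await();          // unbounded wait: the cluster-wide hang
                } else if (remainingNanos <= 0) {
                    return false;                      // give up waiting, send without credits
                } else {
                    remainingNanos = creditsAvailable.awaitNanos(remainingNanos);
                }
            }
            credits -= size;
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Called when the receiver grants more credits. */
    public void replenish(long amount) {
        lock.lock();
        try {
            credits += amount;
            creditsAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CreditGate gate = new CreditGate(100, 200);
        System.out.println(gate.acquire(80)); // true: enough credits
        System.out.println(gate.acquire(80)); // false: receiver is "paused", wait times out
    }
}
```

The trade-off this exposes is the one I'm asking about: the timeout keeps request threads from piling up, but every bypassed send is a message the paused receiver still has to absorb later, which is exactly what flow control is meant to limit.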
"ajp-0.0.0.0-8009-12" daemon prio=5 tid=0x0000000169723000 nid=0x176e84000 waiting on condition [0x0000000176e82000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x0000000133c14b98> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2054)
	at org.jgroups.protocols.FC.handleDownMessage(FC.java:551)
	at org.jgroups.protocols.FC.down(FC.java:426)
	at org.jgroups.protocols.FRAG2.down(FRAG2.java:154)
	at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:209)
	at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:291)
	at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:461)
	at org.jgroups.JChannel.downcall(JChannel.java:1540)
	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:791)
	at org.jgroups.blocks.RequestCorrelator.sendRequest(RequestCorrelator.java:304)
	at org.jgroups.blocks.GroupRequest.sendRequest(GroupRequest.java:531)
	at org.jgroups.blocks.GroupRequest.execute(GroupRequest.java:227)
	at org.jgroups.blocks.MessageDispatcher.castMessage(MessageDispatcher.java:468)
	at org.jboss.cache.marshall.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:397)
	at org.jboss.cache.marshall.CommandAwareRpcDispatcher$ReplicationTask.call(CommandAwareRpcDispatcher.java:355)
	at org.jboss.cache.util.concurrent.WithinThreadExecutor.submit(WithinThreadExecutor.java:82)
	at org.jboss.cache.marshall.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:210)
	at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:744)
	at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:712)
	at org.jboss.cache.RPCManagerImpl.callRemoteMethods(RPCManagerImpl.java:717)
	at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:161)
	at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:135)
	at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:107)
	at org.jboss.cache.interceptors.ReplicationInterceptor.runPreparePhase(ReplicationInterceptor.java:192)
	at org.jboss.cache.interceptors.ReplicationInterceptor.visitPrepareCommand(ReplicationInterceptor.java:72)
	at org.jboss.cache.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:68)
	at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
	at org.jboss.cache.interceptors.NotificationInterceptor.visitPrepareCommand(NotificationInterceptor.java:50)
	at org.jboss.cache.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:68)
	at org.jboss.cache.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
	at org.jboss.cache.interceptors.TxInterceptor.handleCommitRollback(TxInterceptor.java:539)
	at org.jboss.cache.interceptors.TxInterceptor.runCommitPhase(TxInterceptor.java:572)
	at org.jboss.cache.interceptors.TxInterceptor$RemoteSynchronizationHandler.afterCompletion(TxInterceptor.java:969)
	at org.jboss.cache.interceptors.TxInterceptor$LocalSynchronizationHandler.afterCompletion(TxInterceptor.java:1156)
	at org.jboss.cache.interceptors.OrderedSynchronizationHandler.afterCompletion(OrderedSynchronizationHandler.java:92)
	at org.jboss.cache.transaction.DummyTransaction.notifyAfterCompletion(DummyTransaction.java:307)
	at org.jboss.cache.transaction.DummyTransaction.commit(DummyTransaction.java:96)
	at org.jboss.cache.transaction.DummyBaseTransactionManager.commit(DummyBaseTransactionManager.java:109)
	at org.jboss.web.tomcat.service.session.distributedcache.impl.jbc.BatchingManagerImpl.endBatch(BatchingManagerImpl.java:70)
	at org.jboss.web.tomcat.service.session.JBossCacheManager.processSessionRepl(JBossCacheManager.java:1967)
	at org.jboss.web.tomcat.service.session.JBossCacheManager.storeSession(JBossCacheManager.java:309)
	- locked <0x000000013b7f5168> (a org.jboss.web.tomcat.service.session.AttributeBasedClusteredSession)
	at org.jboss.web.tomcat.service.session.InstantSnapshotManager.snapshot(InstantSnapshotManager.java:51)
	at org.jboss.web.tomcat.service.session.ClusteredSessionValve.handleRequest(ClusteredSessionValve.java:147)
	at org.jboss.web.tomcat.service.session.ClusteredSessionValve.invoke(ClusteredSessionValve.java:94)
	at org.jboss.web.tomcat.service.session.JvmRouteValve.invoke(JvmRouteValve.java:88)
	at org.jboss.web.tomcat.service.session.LockingValve.invoke(LockingValve.java:62)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:433)
	at com.foo.servlets.valves.InsecureSessionValve.invoke(InsecureSessionValve.java:58)
	at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
	at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
	at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
	at org.apache.coyote.ajp.AjpProcessor.process(AjpProcessor.java:436)
	at org.apache.coyote.ajp.AjpProtocol$AjpConnectionHandler.process(AjpProtocol.java:384)
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
	at java.lang.Thread.run(Thread.java:637)