9 Replies Latest reply on Dec 5, 2005 11:01 AM by belaban

    Issues with cluster node crashes with REPL_SYNC session repl

    brian.stansberry

      Opening this forum thread to talk about JBAS-2170 in particular and effects of nodes crashes with REPL_SYNC session replication in general.

        • 1. Re: Issues with cluster node crashes with REPL_SYNC session
          brian.stansberry

          A fundamental problem is shown in the stack trace below. A tx has been marked for rollback because one of the cluster nodes crashed and didn't respond to a prepare() call. The PessimisticLockInterceptor is attempting to rollback changes made in a tx. The rollback results in a nodeModified() event, which the JBossCacheService interprets as being to due to external modification of a session. It does a get() on the session node, but this call is not associated with the original tx context, so it cannot obtain a read lock. The JBossCacheService retries the 2 more times before finally throwing an exception. Only when the exception is caught does the PessimisticLockInterceptor move on and release the locks associated with the tx. This all takes 45 seconds with the default 15 second lock timeout.

          Effect is the original calling thread waits 45 seconds, or if the user gives up and tries another request, the session is inaccessible for 45 secs.

          This should be fixable by having JBossCacheService ignore nodeModified() events that result from its own activity. Ben has some code like that in HEAD.

          But that's a session-replication-only fix and this kind of thing could affect any TreeCache usage with cache listeners.

          2005-11-23 23:57:08,183 ERROR [org.jboss.cache.interceptors.PessimisticLockInterceptor] undo operation failed
          java.lang.RuntimeException: JBossCacheService: exception occurred in cache get after retry ...
           at org.jboss.web.tomcat.tc5.session.JBossCacheService._get(JBossCacheService.java:415)
           at org.jboss.web.tomcat.tc5.session.JBossCacheService.nodeDirty(JBossCacheService.java:679)
           at org.jboss.web.tomcat.tc5.session.JBossCacheService.nodeModified(JBossCacheService.java:670)
           at org.jboss.cache.TreeCache.notifyNodeModified(TreeCache.java:4519)
           at org.jboss.cache.TreeCache._put(TreeCache.java:3715)
           at jrockit.reflect.CompiledMethodInvoker.invoke0(Ljava.lang.Object;[Ljava.lang.Object;)Ljava.lang.Object;(Unknown Source)
           at jrockit.reflect.CompiledMethodInvoker.invokeCompiled(JLjava.lang.Object;[Ljava.lang.Object;)Ljava.lang.Object;(Native Method)
           at jrockit.reflect.CompiledMethodInvoker.invoke(Ljava.lang.Object;[Ljava.lang.Object;)Ljava.lang.Object;(Unknown Source)
           at java.lang.reflect.Method.invoke(Ljava.lang.Object;[Ljava.lang.Object;J)Ljava.lang.Object;(Unknown Source)
           at org.jgroups.blocks.MethodCall.invoke(MethodCall.java:321)
           at org.jboss.cache.interceptors.PessimisticLockInterceptor.rollback(PessimisticLockInterceptor.java:336)
           at org.jboss.cache.interceptors.PessimisticLockInterceptor.access$200(PessimisticLockInterceptor.java:31)
           at org.jboss.cache.interceptors.PessimisticLockInterceptor$SynchronizationHandler.afterCompletion(PessimisticLockInterceptor.java:418)
           at org.jboss.cache.interceptors.OrderedSynchronizationHandler.afterCompletion(OrderedSynchronizationHandler.java:80)
           at org.jboss.tm.TransactionImpl.doAfterCompletion(TransactionImpl.java:1508)
           at org.jboss.tm.TransactionImpl.completeTransaction(TransactionImpl.java:1180)
           at org.jboss.tm.TransactionImpl.commit(TransactionImpl.java:351)
           at org.jboss.tm.TxManager.commit(TxManager.java:224)
           at org.jboss.web.tomcat.tc5.session.JBossCacheManager.endTransaction(JBossCacheManager.java:636)
           at org.jboss.web.tomcat.tc5.session.JBossCacheManager.processSessionRepl(JBossCacheManager.java:627)
           at org.jboss.web.tomcat.tc5.session.JBossCacheManager.storeSession(JBossCacheManager.java:375)
           at org.jboss.web.tomcat.tc5.session.InstantSnapshotManager.snapshot(InstantSnapshotManager.java:38)
           at org.jboss.web.tomcat.tc5.session.ClusteredSessionValve.invoke(ClusteredSessionValve.java:91)
           at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:59)
           at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
           at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
           at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
           at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
           at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:856)
           at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:744)
           at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
           at org.apache.tomcat.util.net.MasterSlaveWorkerThread.run(MasterSlaveWorkerThread.java:112)
          Caused by: org.jboss.cache.lock.TimeoutException: read lock for /JSESSION/localhost/jbento-httpsession/CSgtHykPDCiOphzIH0qGpw** could not be acquired by Thread[http-172.16.141.205-8080-32,5,jboss] after 15000 ms. Locks: Read lock owners: {}
          Write lock owner: GlobalTransaction:<172.16.141.205:33942>:4252
          , lock info: write owner=GlobalTransaction:<172.16.141.205:33942>:4252 (activeReaders=0, activeWriter=Thread[http-172.16.141.205-8080-32,5,jboss], waitingReaders=0, waitingWriters=0, waitingUpgrader=0)
           at org.jboss.cache.lock.IdentityLock.acquireReadLock(IdentityLock.java:211)
           at org.jboss.cache.TreeNode.acquireReadLock(TreeNode.java:459)
           at org.jboss.cache.TreeNode.acquire(TreeNode.java:430)
           at org.jboss.cache.interceptors.PessimisticLockInterceptor.lock(PessimisticLockInterceptor.java:227)
           at org.jboss.cache.interceptors.PessimisticLockInterceptor.invoke(PessimisticLockInterceptor.java:164)
           at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:41)
           at org.jboss.cache.interceptors.UnlockInterceptor.invoke(UnlockInterceptor.java:35)
           at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:41)
           at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:56)
           at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:4339)
           at org.jboss.cache.TreeCache.get(TreeCache.java:2897)
           at org.jboss.cache.TreeCache.get(TreeCache.java:2880)
           at jrockit.reflect.CompiledMethodInvoker.invoke0(Ljava.lang.Object;[Ljava.lang.Object;)Ljava.lang.Object;(Unknown Source)
           at jrockit.reflect.CompiledMethodInvoker.invokeCompiled(JLjava.lang.Object;[Ljava.lang.Object;)Ljava.lang.Object;(Native Method)
           at jrockit.reflect.CompiledMethodInvoker.invoke(Ljava.lang.Object;[Ljava.lang.Object;)Ljava.lang.Object;(Unknown Source)
           at java.lang.reflect.Method.invoke(Ljava.lang.Object;[Ljava.lang.Object;J)Ljava.lang.Object;(Unknown Source)
           at org.jboss.mx.interceptor.ReflectedDispatcher.invoke(ReflectedDispatcher.java:141)
           at org.jboss.mx.server.AbstractMBeanInvoker.invoke(AbstractMBeanInvoker.java:174)
           at org.jboss.mx.util.MBeanProxyExt.invoke(MBeanProxyExt.java:84)
           at $Proxy27.get(Lorg.jboss.cache.Fqn;Ljava.lang.Object;)Ljava.lang.Object;(Unknown Source)
           at org.jboss.web.tomcat.tc5.session.JBossCacheService._get(JBossCacheService.java:402)
           at org.jboss.web.tomcat.tc5.session.JBossCacheService.nodeDirty(JBossCacheService.java:679)
           at org.jboss.web.tomcat.tc5.session.JBossCacheService.nodeModified(JBossCacheService.java:670)
           at org.jboss.cache.TreeCache.notifyNodeModified(TreeCache.java:4519)
           at org.jboss.cache.TreeCache._put(TreeCache.java:3715)
           at jrockit.reflect.CompiledMethodInvoker.invoke0(Ljava.lang.Object;[Ljava.lang.Object;)Ljava.lang.Object;(Unknown Source)
           at jrockit.reflect.CompiledMethodInvoker.invokeCompiled(JLjava.lang.Object;[Ljava.lang.Object;)Ljava.lang.Object;(Native Method)
           at jrockit.reflect.CompiledMethodInvoker.invoke(Ljava.lang.Object;[Ljava.lang.Object;)Ljava.lang.Object;(Unknown Source)
           at java.lang.reflect.Method.invoke(Ljava.lang.Object;[Ljava.lang.Object;J)Ljava.lang.Object;(Unknown Source)
           at org.jgroups.blocks.MethodCall.invoke(MethodCall.java:321)
           at org.jboss.cache.interceptors.PessimisticLockInterceptor.rollback(PessimisticLockInterceptor.java:336)
           at org.jboss.cache.interceptors.PessimisticLockInterceptor.access$200(PessimisticLockInterceptor.java:31)
           at org.jboss.cache.interceptors.PessimisticLockInterceptor$SynchronizationHandler.afterCompletion(PessimisticLockInterceptor.java:418)


          • 2. Re: Issues with cluster node crashes with REPL_SYNC session

            This is good detective work! Glad that you found out the root. Bela and I have been troubleshooting it for 2 hours last week without success. :-(

            • 3. Re: Issues with cluster node crashes with REPL_SYNC session

              So this is a bigger issue of (relatively) long running transaction context and the node event handling (it can come from same thread or different thread).

              Should we propagate the gtx (global tx, if any) along the event then?

              • 4. Re: Issues with cluster node crashes with REPL_SYNC session
                brian.stansberry

                 

                "ben.wang@jboss.com" wrote:
                So this is a bigger issue of (relatively) long running transaction context and the node event handling (it can come from same thread or different thread).

                Should we propagate the gtx (global tx, if any) along the event then?


                When you say propagate the gtx along with the event, I'm guessing you mean something like changing the TreeCacheListener interfaces to something like

                public void nodeModified(Fqn fqn, GlobalTransaction gtx)


                (or more likely pass an Event object that encapsulates the two params).

                The problem there is the listener can't do anything with the gtx; we'd have to overload the TreeCache API with versions that take a gtx as an arg.

                We could propagate it as a ThreadLocal. But if the listener wanted to call back into the cache on a different thread, they would have to propagate the gtx to that thread.

                If we just focus on the case where the callback into the cache is coming from the same thread that generated the event, I think the key is just making sure that the lock interceptor code finds the gtx if the transaction associated with the thread is STATUS_ROLLING_BACK.

                I've opened a separate forum thread (http://www.jboss.com/index.html?module=bb&op=viewtopic&t=73404) for this specific issue, as it's a global JBossCache issue. I figure going forward on this thread I'll raise any further issues more specific to HttpSession replication.

                • 5. Re: Issues with cluster node crashes with REPL_SYNC session
                  brian.stansberry

                   

                  "bstansberry@jboss.com" wrote:

                  This should be fixable by having JBossCacheService ignore nodeModified() events that result from its own activity. Ben has some code like that in HEAD.

                  But that's a session-replication-only fix and this kind of thing could affect any TreeCache usage with cache listeners.


                  Just to document what we've learned through testing....

                  Ignoring nodeModified() events generated by yourself doesn't solve much even in the session-replication case, as that only works on the cluster node that initiates the replication. In the case of other nodes that are doing a rollback due to a tx timeout (after the initiator of a global tx crashes after a sending a prepare() msg), it's not possible to use this technique. The issue has to be fixed at the JBossCache level.

                  • 6. Re: Issues with cluster node crashes with REPL_SYNC session
                    brian.stansberry

                    I've fixed JBCACHE-359 on the JBossCache_1_2_4_SP1 branch. With that fix, I was able to run the JBento test with 300 threads and kill the coordinator. The cluster behaved properly. Session replications that were initiated between the time the coordinator was killed and when a new view was accepted failed (this is expected), but no further failures occured. This worked with both the default JGroups config in tc5-cluster-service.xml and with one that largely matches fc-fast-minimal-threads.xml.

                    I'm going to leave JBAS-2170 open for a bit in case other issues crop up as we do further testing.

                    • 7. Re: Issues with cluster node crashes with REPL_SYNC session
                      brian.stansberry

                       

                      "bstansberry@jboss.com" wrote:
                      Session replications that were initiated between the time the coordinator was killed and when a new view was accepted failed (this is expected), but no further failures occured.


                      One open question is what to do about these replication attempts that failed. With an InstantSnapshotManager, it could be possible for the snapshot manager to retry the replication. This would probably be a bad idea with IntervalSnapshotManager; the http request that initiated the replication has already returned, so the risk of concurrency issues is higher.

                      • 8. Re: Issues with cluster node crashes with REPL_SYNC session

                        1. Yes, event bundled with gtx will need to wait untill JBoss Cache 2.0. Or else, we need to create an additional TreeCacheListenerV2 (:-)) interface (if this turns out to be critical).

                        2. As for re-try, it is probably ok for IntervalSnapshotManager not to re-try since people that use that probably can tolerate that already. But for InstantSnapshotManager? What do other people think?

                        • 9. Re: Issues with cluster node crashes with REPL_SYNC session
                          belaban

                          I don't see that we get an exception or TX rollback when we PREPARE over A,B,C,D and only receive responses from A,B and C, but *not* from D. This is captured in http://jira.jboss.com/jira/browse/JBCACHE-358.