2 Replies Latest reply on Oct 17, 2014 1:58 PM by acwest

    TimeoutException and deadlock detection with a twist

    phillip.atkinson

      We've been seeing many TimeoutExceptions using Infinispan (6.0.2, replicated, READ_COMMITTED, pessimistic, sync, use batching). After reading a few other similar posts here and the UserGuide in general, I enabled deadlockDetection and ran my tests again (to try to narrow down the cause, the cluster only had 1 node). We still saw the TimeoutExceptions, and the logs had messages that looked like this:

       

      ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (RulesEngine_Work-9) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [-1796858716] for requestor [DldGlobalTransaction{coinToss=-5668811704046327045, lockIntention=-1796858716, affectedKeys=[], locksAtOrigin=[]} GlobalTransaction:<iems-PowerEdge-T110-II-963>:531973:local]! Lock held by [DldGlobalTransaction{coinToss=8319531552131518214, lockIntention=null, affectedKeys=[], locksAtOrigin=[]} GlobalTransaction:<iems-PowerEdge-T110-II-963>:531966:local]

       

      My questions are not so much about how to fix this, but about what this log message is telling me. It looks like LockManager is trying to get a lock on key -1796858716, but it times out trying to do so because another thread has the lock (both transactions say "local"). The "lockIntention" looks like it should be the key we're trying to lock on, but the held-lock looks like it is saying that the key that is holding the lock is "null".

       

      So in addition to any other insight into this log message you might be able to give me, my questions are:

       

      1. How can there be a lock on key "null"? I have code that checks for nulls before trying to lock anything. I also enabled trace logging for infinispan, and I don't see any log message like "Setting local lock intention to null" (from DldGlobalTransaction.setLockIntention )

      2. Even if somehow there is a lock on the key "null", why does this interfere when another thread tries to lock on a different key?

       

      Thanks in advance!

        • 1. Re: TimeoutException and deadlock detection with a twist
          william.burns

          Phillip Atkinson wrote:

           

          We've been seeing many TimeoutExceptions using Infinispan (6.0.2, replicated, READ_COMMITTED, pessimistic, sync, use batching). After reading a few other similar posts here and the UserGuide in general, I enabled deadlockDetection and ran my tests again (to try to narrow down the cause, the cluster only had 1 node). We still saw the TimeoutExceptions, and the logs had messages that looked like this:

           

          ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (RulesEngine_Work-9) ISPN000136: Execution error: org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [-1796858716] for requestor [DldGlobalTransaction{coinToss=-5668811704046327045, lockIntention=-1796858716, affectedKeys=[], locksAtOrigin=[]} GlobalTransaction:<iems-PowerEdge-T110-II-963>:531973:local]! Lock held by [DldGlobalTransaction{coinToss=8319531552131518214, lockIntention=null, affectedKeys=[], locksAtOrigin=[]} GlobalTransaction:<iems-PowerEdge-T110-II-963>:531966:local]

           

          My questions are not so much about how to fix this, but about what this log message is telling me. It looks like LockManager is trying to get a lock on key -1796858716, but it times out trying to do so because another thread has the lock (both transactions say "local"). The "lockIntention" looks like it should be the key we're trying to lock on, but the held-lock looks like it is saying that the key that is holding the lock is "null".

           

          So in addition to any other insight into this log message you might be able to give me, my questions are:

           

          1. How can there be a lock on key "null"? I have code that checks for nulls before trying to lock anything. I also enabled trace logging for infinispan, and I don't see any log message like "Setting local lock intention to null" (from DldGlobalTransaction.setLockIntention )

          2. Even if somehow there is a lock on the key "null", why does this interfere when another thread tries to lock on a different key?

           

          Thanks in advance!

          Lock intention is only set on the transaction when it is trying to acquire a lock so we can detect the deadlock.  The fact that the other thread has null means it isn't currently trying to acquire a lock, it already has all the ones it has asked for.

           

          What I would do is enable tracing and when you find this error logged, look for other occurrences of the GlobalTransaction id when it was started so you can find which thread is still holding onto the lock and then from there you can figure out why it was held for such a long duration.  Usually this is just caused by not completing the transaction in a timely manner, especially since you are running in pessimistic mode, you may have a long duration period between when you acquired the lock and the call to commit/rollback.  It could also be just a lot of contention on the same lock, but this isn't as likely.

          • 2. Re: TimeoutException and deadlock detection with a twist
            acwest

            Something that would be VERY useful in these situations is some indication of what is currently holding the lock. I presume that "Lock held by [DldGlobalTransaction{coinToss=8319531552131518214, lockIntention=null, affectedKeys=[], locksAtOrigin=[]} GlobalTransaction:<iems-PowerEdge-T110-II-963>:531966:local]" is the useful part for this, but nothing in it seems to map to anything. Is the number 531966 at the end a thread id? A thread name would be a lot more useful