1 2 Previous Next 17 Replies Latest reply on Sep 2, 2005 3:12 AM by belaban

    Problems with eviction

    skipy

      I have the following message in the log a lot of times:

      2005-08-26 23:12:55,003 ERROR [.jboss.cache.lock.IdentityLock] () read lock for /subscriber/user/2629 could not be acquired
      by Thread[Thread-12,5,main] after 15000 ms. Locks: Read lock owners: []
      Write lock owner: <10.0.20.90:40519>:13
      , lock info: write owner=<10.0.20.90:40519>:13 (org.jboss.cache.lock.LockStrategyReadCommitted@938729)
      2005-08-26 23:12:55,004 WARN [ss.cache.eviction.LRUAlgorithm] () evictCacheNode(): evictCacheNode time out. Will try later.
      2
      


      As I can understand, this happens after state fetch, i.e. state was fetched from another node and eviction was initiated after some time. So, I can't understand how this node can be locked if no operations with it were performed.

      Regards,
      Eugene

        • 1. Re: Problems with eviction
          belaban

          well, obviously member 10.0.20.90:40519 held a lock, so you need to investigate what heppens on 10.0.20.90:40519. If 10.0.20.90:40519 is the coordinator, then it will read-lock its tree briefly to return the state to a new member. Otherwise, read-locks are generally acquired for get() methods, write-locks for put() and remove() methods.
          The eviction policy also tries to acquire a read-lock when it need to evict an element from memory.
          TreeCache.printLockInfo() dumps all locks held by a cache.

          • 2. Re: Problems with eviction
            skipy

            Well, Bela, I would like to clarify what's going on. It's not about eviction (I switched off eviction and still have the same problem).

            I have two servers - A (10.0.20.90) and B (10.0.20.91). I try to read data on server B (thus, try to obtain read lock). In this situation I have the message above (on server B - 10.0.20.91). I.e., when I try to obtain READ lock on server B, I have a message, that WRITE lock exist with owner - server A. I don't understand how server can own a lock on another server. As far as I understand, all locks are local, since they are implemented using Doug Lea's library. Can you explain how it can be?

            Thank you in advance!

            Regards,
            Eugene

            • 3. Re: Problems with eviction
              belaban

              When you replicate across the cluster (e.g. at the end of a TX), you briefly acquire a lock on each node in the cluster.
              This might be why.
              You can call TreeCache.printLockInfo() to see at any time who's holding locks.

              • 4. Re: Problems with eviction
                skipy

                I know this. I don't understand why write lock persists AFTER replication is finished (if it was caused by replication). Because this lock (and read lock aquisition error!) exists till cache stop. I can try to read value 5 seconds later, 1 minute later, 1 day later - the node will still be locked.

                Regards,
                Eugene

                • 5. Re: Problems with eviction
                  skipy

                  Bela, one more clarification (I'm proceeding with investigations of what's going on).

                  We are testing a kind of failover solution. Thus, we kill JVM and start it again. If we will do so with server B, some of it's write locks could remain on server A. And after start of server A we will have an error I was talking about.

                  We could release write lock on the node manually if we would be able to obtain this node. But the only way to obtain the node is to call get(Fqn) on TreeCache which will cause read lock setting attempt -> exception. So, the question is - can we get somehow the node without acquiring read lock on it?

                  Regards,
                  Eugene aka Skipy

                  • 6. Re: Problems with eviction
                    belaban

                    It would be good if you could describe the exact scenario that leads to those locks remaining held after the end of a TX. There is a JIRA issue for this (http://jira.jboss.com/jira/browse/JBCACHE-10), but I haven't yet gotten to it, because this use case is very rare. It only happens when the committer of a TX dies after the PREPARE round and before the ROLLBACK or COMMIT phase.
                    As a workaround, you can forcefully remove a lock by calling TreeCache.releaseAllLocks(Fqn fqn). This will release all read or write locks (even if not held by you !) in the entire subtree given

                    • 7. Re: Problems with eviction
                      skipy

                      Bela, thank you for answer!

                      It seem to me, that there could be at least two sources of such problems.

                      1. JVM crashes within the replication stage. We kill it under high load (a lot of replications), but sometimes we should do this 5-7 times to have such effect. This is right what JBCache-10 you mentioned is about.

                      2. This problem can be connected to transaction manager. We use JOTM. As I can understand from it's sources - if transaction is timed out, it's removed from thread local, and when rollback is called, IllegalStateException arises. Thus, transaction is not rolled back and nodes remain blocked. For now we are investigating this possibility.

                      We can avoid second problem by playing with timeouts, but the first still remain. It would be VERY nice if TreeCache would be able to release all locks from crashed members, since doing this manualy is quite heavy procedure. I'll try TreeCache.removeAllLocks(Fqn) within a couple of hours.

                      Regards,
                      Eugene aka Skipy

                      • 8. Re: Problems with eviction
                        skipy

                        I've found place in logs where fail occures:

                        I'm writing value under node /subscriber/user/2642, then trying to commit:

                        2005-08-31 13:25:22,892 tcpConnection-6802-11 DEBUG [rrent.locks.dist.AbstractDLock] (t:3624064 u:2642 s:subscriber b:jbroke
                        r) Writing value for FQN=/subscriber/user/2642
                        2005-08-31 13:25:22,892 tcpConnection-6802-11 DEBUG [.jboss.cache.lock.IdentityLock] (t:3624064 u:2642 s:subscriber b:jbroke
                        r) acquireReadLock(): caller <10.0.20.90:45654>:9 already owns lock for /subscriber
                        2005-08-31 13:25:22,892 tcpConnection-6802-11 DEBUG [.jboss.cache.lock.IdentityLock] (t:3624064 u:2642 s:subscriber b:jbroke
                        r) acquireReadLock(): caller <10.0.20.90:45654>:9 already owns lock for /subscriber/user
                        2005-08-31 13:25:22,893 tcpConnection-6802-11 DEBUG [.jboss.cache.lock.IdentityLock] (t:3624064 u:2642 s:subscriber b:jbroke
                        r) upgrading RL to WL for <10.0.20.90:45654>:9, timeout=15000, locks: Read lock owners: [<10.0.20.90:45654>:9]
                        Write lock owner: null
                        2005-08-31 13:25:22,893 tcpConnection-6802-11 DEBUG [.jboss.cache.lock.IdentityLock] (t:3624064 u:2642 s:subscriber b:jbroke
                        r) upgrading lock for /subscriber/user/2642
                        2005-08-31 13:25:22,893 tcpConnection-6802-11 DEBUG [rrent.locks.dist.AbstractDLock] (t:3624064 u:2642 s:subscriber b:jbroke
                        r) Value written for FQN=/subscriber/user/2642
                        2005-08-31 13:25:22,893 tcpConnection-6802-11 DEBUG [rrent.locks.dist.AbstractDLock] (t:3624064 u:2642 s:subscriber b:jbroke
                        r) Commiting TX
                        
                        // 10 seconds pause...
                        
                        2005-08-31 13:25:32,898 tcpConnection-6802-11 WARN [ceptors.ReplicationInterceptor] (t:3624064 u:2642 s:subscriber b:jbroke
                        r) runPreparePhase() failed. Transaction is marked as rolled back
                        org.jboss.cache.lock.TimeoutException: rsp=sender=10.0.20.91:34864, retval=null, received=false, suspected=false
                         at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:2205)
                         at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:2227)
                         at org.jboss.cache.interceptors.ReplicationInterceptor.runPreparePhase(ReplicationInterceptor.java:478)
                         at org.jboss.cache.interceptors.ReplicationInterceptor$SynchronizationHandler.beforeCompletion(ReplicationInterceptor
                        .java:382)
                         at org.jboss.cache.interceptors.OrderedSynchronizationHandler.beforeCompletion(OrderedSynchronizationHandler.java:72)
                         at org.objectweb.jotm.SubCoordinator.doBeforeCompletion(SubCoordinator.java:1487)
                         at org.objectweb.jotm.SubCoordinator.commit_one_phase(SubCoordinator.java:416)
                         at org.objectweb.jotm.TransactionImpl.commit(TransactionImpl.java:239)
                         at org.objectweb.jotm.Current.commit(Current.java:488)
                         at vc.xsys.common.concurrent.locks.dist.AbstractDLock.acquireLock(AbstractDLock.java:201)
                         at vc.xsys.common.concurrent.locks.dist.RWWriteLock.tryLock(RWWriteLock.java:75)
                         at vc.xsys.ui.subscriber.mail.MailUserSaveCommand.execute(MailUserSaveCommand.java:65)
                         at vc.xsys.ui.subscriber.SubscriberCommand.execute(SubscriberCommand.java:78)
                         at vc.xsys.framework.servlet.Action.execute(Action.java:250)
                         at vc.xsys.framework.servlet.AbstractActionConfiguration.executeAction(AbstractActionConfiguration.java:191)
                         at vc.xsys.framework.servlet.ControllerServlet.doRequest(ControllerServlet.java:240)
                         at vc.xsys.framework.servlet.ControllerServlet.doGet(ControllerServlet.java:170)
                         at javax.servlet.http.HttpServlet.service(HttpServlet.java:126)
                         at javax.servlet.http.HttpServlet.service(HttpServlet.java:103)
                         at com.caucho.server.http.FilterChainServlet.doFilter(FilterChainServlet.java:96)
                         at com.caucho.server.http.Invocation.service(Invocation.java:315)
                         at com.caucho.server.http.CacheInvocation.service(CacheInvocation.java:135)
                         at com.caucho.server.http.RunnerRequest.handleRequest(RunnerRequest.java:346)
                         at com.caucho.server.http.RunnerRequest.handleConnection(RunnerRequest.java:274)
                         at com.caucho.server.TcpConnection.run(TcpConnection.java:139)
                         at java.lang.Thread.run(Thread.java:534)
                        2


                        There was no '/subscriber/user/2642' entry in the cache on 10.0.20.91.

                        Regards,
                        Eugene aka Skipy

                        • 9. Re: Problems with eviction
                          skipy

                          But after this fail the following error apper on 10.0.20.91:

                          2005-08-31 13:28:31,292 tcpConnection-6802-7 ERROR [.jboss.cache.lock.IdentityLock] (t:22896576 u:2642 s:subscriber b:jbroke
                          r) read lock for /subscriber/user/2642 could not be acquired by <10.0.20.91:34864>:60 after 15000 ms. Locks: Read lock owners
                          : []
                          Write lock owner: <10.0.20.90:45654>:9
                          , lock info: write owner=<10.0.20.90:45654>:9 (org.jboss.cache.lock.LockStrategyReadCommitted@e80786)
                          


                          I.e., inspite of replication fail write lock was set on both servers and wasn't released.

                          • 10. Re: Problems with eviction
                            skipy

                            BTW, I try to use TreeCache.removeAllLocks(Fqn). It works as expected. Call does through interceptors, thus, read lock can't be acquired on the node by Fqn specified. This method can't help in write lock release.

                            Regards,
                            Eugene aka Skipy

                            • 11. Re: Problems with eviction
                              belaban

                              Yes, you could use TreeCache.peek() and then call Node.releaseAllForce().

                              • 12. Re: Problems with eviction
                                belaban

                                Okay, here's what *should* happen:
                                - TX is committed
                                - TxManager calls beforeCompletion(), which runs the PREPARE phase. This essentially replicates the changes to all nodes except the originating node. This involves acquiring locks on all other nodes.
                                - Now some node cannot acquire the locks, so it throws an exception
                                - The originator of the PREPARE marks the TX as rolled back
                                - The TxManager should now call afterCompletion(status) where the status is failed. The originator now runs the ROLLBACK phase, telling all nodes to roll back the modifications for a given TX. This involves running undo-operations, and releasing the locks on all nodes affected by the given TX.
                                You should debug this (ReplicationInterceptor): if the TxManager *doesn't* call the afterCompletion() method, then we will have hanging locks. Can you verify this ?

                                • 13. Re: Problems with eviction
                                  belaban

                                  By the way: the spec requires that afterCompletion() be called no matter what the outcome (success or failure).

                                  • 14. Re: Problems with eviction
                                    skipy

                                    Bela, It seems that you are right when supposing that afterCompletion is not called. As I can inderstand from sources, exceptions thrown in beforeCompletion are not catched, thus rollback is not performed.

                                    We will try to fix this in JOTM sources and check if the problem will persist. Of course, I'll infrom you about results.

                                    Thank you for your help!

                                    Regards,
                                    Eugene aka Skipy

                                    1 2 Previous Next