1 Reply Latest reply on Jan 4, 2008 12:13 PM by Manik Surtani

    JBCACHE-1251

    Brian Stansberry Master

      Thread for discussion of http://jira.jboss.com/jira/browse/JBCACHE-1251.

      Some further analysis of the problem follows; I don't really know the proper fix, as I'm not familiar enough with all the subtle cases this code handles.

      Analyzing this in the debugger, the problem appears to be inconsistent handling of deleted/invalidated nodes between PLU.acquireLocksWithTimeout() and PLI.lock():

      In acquireLocksWithTimeout():

       do
       {
          // this is an additional check to make sure we don't try for too long.
          if (!firstTry && System.currentTimeMillis() > cutoffTime)
          {
             throw new TimeoutException("Unable to acquire lock on Fqn " + fqn + " after " + timeout + " millis");
          }
          created = lock(ctx, fqn, lockType, createIfNotExists, timeout, acquireLockOnParent, reverseRemoveCheck);
          firstTry = false;
       }
       while (createIfNotExists && cache.peek(fqn, true) == null); // keep trying until we have the lock (fixes concurrent remove())
      


      The cache.peek(fqn, true) call in the while condition is a convenience form of cache.peek(fqn, true, false). That final "false" argument means an "invalid" tombstone is treated as absent, so the loop keeps spinning for as long as the peek finds only such a tombstone.

      This is what is happening. The question is why a new "valid" node isn't created in the lock() call.
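      The spin can be modeled with a minimal, self-contained sketch. This is an illustrative model only, not JBoss Cache source; the peek() here merely mimics the includeDeleted/includeInvalid behaviour described above:

```java
// Simplified model of why the do/while in acquireLocksWithTimeout() can spin:
// peek(fqn, includeDeleted, includeInvalid=false) returns null for an invalid
// tombstone, so the loop condition stays true even though lock() never
// recreates the node. (Hypothetical stand-in code, not JBoss Cache source.)
import java.util.HashMap;
import java.util.Map;

public class PeekSpinDemo
{
   static class Node
   {
      final String fqn;
      boolean valid;
      Node(String fqn, boolean valid) { this.fqn = fqn; this.valid = valid; }
   }

   static final Map<String, Node> tree = new HashMap<>();

   // Mirrors the described peek(fqn, includeDeleted, includeInvalid) semantics:
   // with includeInvalid == false, an invalid tombstone looks absent.
   static Node peek(String fqn, boolean includeDeleted, boolean includeInvalid)
   {
      Node n = tree.get(fqn);
      if (n == null) return null;
      if (!n.valid && !includeInvalid) return null;
      return n;
   }

   public static void main(String[] args)
   {
      tree.put("/a/b", new Node("/a/b", false)); // an invalidated tombstone

      // The two-arg convenience form peek(fqn, true) is peek(fqn, true, false):
      System.out.println(peek("/a/b", true, false));         // null -> loop keeps spinning
      System.out.println(peek("/a/b", true, true) != null);  // true -> node exists, just invalid
   }
}
```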

      Looking at lock, a couple things pop out:

      1) Creating a new node will only happen if the current node doesn't exist. In this case the current node does exist; it's just invalid. I don't see any logic in the loop to handle this.

      2) There's logic to try to detect deleted/orphan nodes that seems odd:

       // make sure the lock we acquired isn't on a deleted node/is an orphan!!
       // look into invalidated nodes as well
       NodeSPI repeek = cache.peek(currentNode.getFqn(), true, true);
       if (currentNode != repeek)
       {
          if (log.isTraceEnabled())
             log.trace("Was waiting for and obtained a lock on a node that doesn't exist anymore! Attempting lock acquisition again.");
          // we have an orphan!! Lose the unnecessary lock and re-acquire the lock (and potentially recreate the node).
      


      The idea here seems to be to re-look up the current node from the cache while ignoring deleted/invalid nodes. If the node we have in hand isn't the one the cache finds, there's an issue to deal with.

      Shouldn't the cache.peek call be cache.peek(currentNode.getFqn(), false, false); -- i.e. ignore deleted/invalid nodes?

      Hmm, maybe not, I think I see the idea. You're *only* checking for nodes that are "floating in space". Nodes that are still part of the main cache structure, no matter what their status, are OK.

      In that case, I think issue #1 is the problem -- there's no mechanism to create a new node if the existing one is invalid, or to somehow "resurrect" the invalid one.

      OT: on the repeek thing -- that seems like quite a bit of overhead since it's repeated on every node from the root down -- for every call. Could that check be limited to the last node in the hierarchy?

      Alternative (which requires SPI change) is to add a boolean "orphan" flag to a node, or to make AbstractNode.deleted an enum with "FALSE, DELETING, DELETED" or something. Some way to change the state of the node itself when it's been cut loose from the tree.
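      The enum alternative could look something like the sketch below. This is illustrative only; DeletionState and isOrphan() are hypothetical names, not existing JBoss Cache API:

```java
// Sketch of replacing the boolean AbstractNode.deleted with a three-state
// enum, so a node can record in its own state that it has been cut loose
// from the tree. (Hypothetical names for illustration.)
public class DeletionStateDemo
{
   enum DeletionState { FALSE, DELETING, DELETED }

   static class Node
   {
      DeletionState deleted = DeletionState.FALSE;

      // A node fully removed from the tree is an orphan.
      boolean isOrphan() { return deleted == DeletionState.DELETED; }
   }

   public static void main(String[] args)
   {
      Node n = new Node();
      n.deleted = DeletionState.DELETING;   // removal in progress
      System.out.println(n.isOrphan());     // false
      n.deleted = DeletionState.DELETED;    // cut loose from the tree
      System.out.println(n.isOrphan());     // true
   }
}
```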

        • 1. Re: JBCACHE-1251
          Manik Surtani Master

           

          "bstansberry@jboss.com" wrote:

          This is what is happening. The question is why a new "valid" node isn't created in the lock() call.


          Yes, this is the root of the bug. In the place where we create nodes when the node does not exist and createIfNotExists == true, we should also mark the node valid when createIfNotExists == true and the node exists but is invalid.
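          A minimal sketch of that rule, using a simplified stand-in Node rather than the real NodeSPI (names and structure here are assumptions for illustration, not the actual fix):

```java
// Sketch of the described fix: when createIfNotExists is true, lock() should
// both create a missing node AND re-validate an existing-but-invalid one, so
// the caller's do/while loop can terminate. (Simplified model, not JBoss
// Cache source.)
public class ResurrectDemo
{
   static class Node
   {
      boolean valid;
      Node(boolean valid) { this.valid = valid; }
   }

   // Returns the node the lock was acquired on, applying the proposed rule.
   static Node lock(Node existing, boolean createIfNotExists)
   {
      if (existing == null)
      {
         // existing behaviour: create the node if it is missing
         return createIfNotExists ? new Node(true) : null;
      }
      if (!existing.valid && createIfNotExists)
      {
         // proposed addition: "resurrect" the invalid tombstone
         existing.valid = true;
      }
      return existing;
   }

   public static void main(String[] args)
   {
      Node tombstone = new Node(false);    // an invalidated node
      Node locked = lock(tombstone, true);
      System.out.println(locked.valid);    // true -> the loop can now exit
   }
}
```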

          "bstansberry@jboss.com" wrote:

          Hmm, maybe not, I think I see the idea. You're *only* checking for nodes that are "floating in space". Nodes that are still part of the main cache structure, no matter what their status, are OK.


          Yes, this is needed to prevent problems with concurrent deletes and puts.

          "bstansberry@jboss.com" wrote:

          OT: on the repeek thing -- that seems like quite a bit of overhead since it's repeated on every node from the root down -- for every call. Could that check be limited to the last node in the hierarchy?


          Not really - with READ_COMMITTED, a delete could happen on a parent concurrently with an add on a child. That said, this could be optimised for REPEATABLE_READ, keeping the per-level check only for READ_COMMITTED.

          In the end, the ugliness of all this comes from locks being a part of a node: if a node does not exist, attempting to create it (or concurrently delete it) means that 2 txs may acquire locks on different lock objects, on different nodes, sharing the same Fqn. This stuff will go away in 3.x, as I hope to use a single LockManager that maps locks to Fqn objects. The lock will still be accessible from the Node - to be compatible with current code - but Node.getLock() will ask the NodeManager for the lock by passing in the Node's Fqn.
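          The "one canonical lock per Fqn" idea could be sketched like this. The LockManager API shown is an assumption for illustration, not the actual 3.x interface:

```java
// Sketch of a single LockManager that maps Fqns to lock objects, so two
// transactions racing on the same Fqn always contend on the same lock even
// if their Node instances differ. (Hypothetical API, not the 3.x code.)
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

public class FqnLockManagerDemo
{
   static class LockManager
   {
      private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

      // One canonical lock per Fqn, regardless of which Node instance asks.
      ReentrantLock getLock(String fqn)
      {
         return locks.computeIfAbsent(fqn, k -> new ReentrantLock());
      }
   }

   public static void main(String[] args)
   {
      LockManager lm = new LockManager();
      // Two lookups for the same Fqn yield the same lock object:
      System.out.println(lm.getLock("/a/b") == lm.getLock("/a/b")); // true
   }
}
```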

          "bstansberry@jboss.com" wrote:

          Alternative (which requires SPI change) is to add a boolean "orphan" flag to a node, or to make AbstractNode.deleted an enum with "FALSE, DELETING, DELETED" or something. Some way to change the state of the node itself when it's been cut loose from the tree.


          Invalid nodes may live longer than the course of a single invocation, and represent more than a partial state of deletion - see JBCACHE-1155.