7 Replies Latest reply on Nov 10, 2016 3:24 AM by hchiorean

    [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node

    illia.khokholkov
      Problem

       

      In our application we use shallow open-scoped locks. The application had been running successfully for some time, but then one of the nodes became permanently corrupted. Manual testing shows that the corruption manifests itself as follows:

      • Attempt to lock the node and get an error saying that the node is already locked.

       

      javax.jcr.lock.LockException: The node at '/test' is already locked
          at org.modeshape.jcr.RepositoryLockManager.lock(RepositoryLockManager.java:393)
          at org.modeshape.jcr.JcrLockManager.lock(JcrLockManager.java:276)
          at org.modeshape.jcr.JcrLockManager.lock(JcrLockManager.java:240)
      

       

      • Attempt to unlock the node, having sufficient privileges to do so, and get an error saying that the node is not locked.

       

      javax.jcr.lock.LockException: The node at location '/test' is not locked
          at org.modeshape.jcr.RepositoryLockManager.unlock(RepositoryLockManager.java:446)
          at org.modeshape.jcr.JcrLockManager.unlock(JcrLockManager.java:308)
          at org.modeshape.jcr.JcrLockManager.unlock(JcrLockManager.java:286)
      

       

      Notable things about the problematic node (where "lockManager" is an instance of "javax.jcr.lock.LockManager" and "node" is an instance of "javax.jcr.Node"):

      • lockManager.isLocked("/test") - returns "false"
      • lockManager.holdsLock("/test") - returns "false"
      • lockManager.getLockTokens() - returns an empty array

       

      • node.isLocked() - returns "false"
      • node.getProperty(JcrLexicon.LOCK_OWNER.getString()) - returns a value representing the owner
      • node.getProperty(JcrLexicon.LOCK_IS_DEEP.getString()) - returns "false"
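      The observations above amount to a mismatch: both the lock manager and the node report "not locked", yet the lock properties are still stored on the node. A minimal sketch of that check in plain Java, with no JCR dependency, is shown here. The `LockObservation` class and its field names are hypothetical and only mirror the calls listed above; the values would be filled in by the caller from the real API.

```java
/**
 * Hypothetical value holder for the observations listed above.
 * Not part of JCR or ModeShape; the caller copies the results of the real calls into it.
 */
final class LockObservation {
    final boolean managerSaysLocked;     // result of lockManager.isLocked(path)
    final boolean nodeSaysLocked;        // result of node.isLocked()
    final boolean hasLockOwnerProperty;  // whether jcr:lockOwner is present on the node
    final boolean hasLockIsDeepProperty; // whether jcr:lockIsDeep is present on the node

    LockObservation(boolean managerSaysLocked, boolean nodeSaysLocked,
                    boolean hasLockOwnerProperty, boolean hasLockIsDeepProperty) {
        this.managerSaysLocked = managerSaysLocked;
        this.nodeSaysLocked = nodeSaysLocked;
        this.hasLockOwnerProperty = hasLockOwnerProperty;
        this.hasLockIsDeepProperty = hasLockIsDeepProperty;
    }

    /** True when lock properties are left behind on a node that is reported as unlocked. */
    boolean hasOrphanedLockProperties() {
        return !managerSaysLocked && !nodeSaysLocked
                && (hasLockOwnerProperty || hasLockIsDeepProperty);
    }
}
```

      For the node described here, `new LockObservation(false, false, true, true).hasOrphanedLockProperties()` evaluates to `true`, which is the state reported in this post.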

       

      Notes

       

      We do utilize DB locking:

       

      "clustering" : {
          "clusterName" : "${...}",
          "configuration" : "${...}",
          "locking" : "db"
      }
      

       

      The typical lock usage pattern for node locking looks like this (when locking, lock timeout is set to 10 minutes):

       

      lockManager.lock(...);
      try {
          // do something
      
      } finally {
          lockManager.unlock(...);
      }
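      The pattern above can be wrapped in a small helper so that every call site acquires and releases the lock the same way. This is only a sketch: `NodeLocker` and `LockedSection` are hypothetical names, and `NodeLocker` stands in for the two `javax.jcr.lock.LockManager` calls so that the example compiles without a JCR implementation. With the real API, a shallow open-scoped lock with a 10-minute hint would presumably be requested as `lockManager.lock(path, false, false, 600, ownerInfo)` and released with `lockManager.unlock(path)`.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in for the LockManager lock/unlock calls used in this thread. */
interface NodeLocker {
    void lock(String absPath, long timeoutHintSeconds) throws Exception;
    void unlock(String absPath) throws Exception;
}

final class LockedSection {
    private LockedSection() {}

    /** Locks the node, runs the work, and always attempts to unlock afterwards. */
    static <T> T run(NodeLocker locker, String absPath, long timeoutHintSeconds,
                     Callable<T> work) throws Exception {
        // Acquire outside the try block: if locking fails, unlock must not be attempted.
        locker.lock(absPath, timeoutHintSeconds);
        try {
            return work.call();
        } finally {
            // Runs whether the work returned normally or threw.
            locker.unlock(absPath);
        }
    }
}
```

      Note that an exception thrown by `unlock` in the `finally` block replaces any exception thrown by the work itself, so a failed unlock is worth logging separately when diagnosing problems like the one described here.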
      

       

      Questions

       

      1. How is it possible that "LockManager#lock(...)" and "LockManager#unlock(...)" see inconsistent state and essentially contradict each other?
      2. How could a single node become corrupted in such a way? So far, out of many nodes, that one is the only problematic one. Additionally, I am unable to simulate this state so I cannot confirm/deny that this is a bug.
      3. I need to unlock the node, what are my options? My attempts to remove the "JcrLexicon.LOCK_OWNER" and "JcrLexicon.LOCK_IS_DEEP" properties failed because, per the JCR 2.0 specification, protected properties cannot be removed by the client. I could fork the ModeShape source code to lift the restriction on protected properties, remove those properties, and see if things get back to normal, but I would rather not do that.
      4. If nothing else, would a backup/restore procedure work for me, assuming I manually edit the JSON file produced by the backup to remove the properties that should not exist? Speaking of backup/restore, does version history get preserved? Do I need a clean DB schema (in terms of Oracle), or can I attempt to restore into the existing one?

       

      Any help is greatly appreciated. It would be awesome if rhauch and hchiorean could take a look at this as well. Thank you.

        • 1. Re: [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node
          hchiorean

          First, DB locking and JCR locking are completely different things: DB locking is used for enforcing exclusive write semantics in a transactional sense, while JCR locking is a JCR spec feature.

           

          JCR lock corruption essentially means that some internal locking state (i.e. nodes under /jcr:system/mode:locks) hasn't been cleaned up or contains invalid parent-child references. How this can happen is impossible to tell without a reproducible test case. However, if you're clustering, there's a good chance that clustering is causing this problem via some sort of bug.

           

          The only thing I can think of to try and fix this issue (as per question 4) is to do a backup and, if you know the key of the node which cannot be unlocked, look at the child references under /jcr:system/mode:locks for a lock with that key and attempt to remove it. Backup/restore should preserve the entire node structure, including the system area.

          • 2. Re: [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node
            zcc39r

            when locking, lock timeout is set to 10 minutes

            BTW, is lock timeout supported by ModeShape?

            • 3. Re: [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node
              hchiorean

              Yes, lock timeout is supported both via the garbage collection task (modeshape/JcrRepository.java at master · ModeShape/modeshape · GitHub) and when attempting to lock something which has an expired lock. In the latter case we had a number of bugs which we fixed in 5.2.0.Final, so earlier versions may behave inconsistently.

              • 4. Re: [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node
                illia.khokholkov

                Thank you for providing your feedback, it is greatly appreciated. In terms of backup/restore, could you please take another look at the following question that was on the list of my initial questions?

                Do I need a clean DB schema (in terms of Oracle), or can I attempt to restore into the existing one?

                To clarify, will the following sequence of steps work for me (assuming the application that uses ModeShape is offline, so no traffic will go to the database)?

                1. Back up from schema_1.
                2. Remove offending node/properties.
                3. Restore to schema_1.

                So, can I use the same Oracle schema or do I need to have a new one for the restore procedure to work as expected?

                • 5. Re: [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node
                  hchiorean

                  When restoring a repository from a previous backup, the current code will attempt to clean/remove all existing repository data before restoring. In other words, restoring only works from an empty schema.

                  If you want to be on the safe side though, especially if you suspect data may have been corrupted, I would recommend explicitly dropping the schema first.

                  • 6. Re: [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node
                    zcc39r

                    lock timeout is supported

                    I mean the timeout parameter of the lock method. So ModeShape does not ignore this hint, right? But when I try to "discover actual timeout by inspecting the returned Lock object", getSecondsRemaining() returns Long.MAX_VALUE irrespective of the timeout hint and the amount of time passed since lock creation, at least for open-scoped locks within the same thread/transaction that created the lock. Should this behaviour be considered a bug?

                    • 7. Re: [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node
                      hchiorean

                      Yes, the getSecondsRemaining() method should take into account the timeoutHint set when creating the lock, so feel free to open a JIRA for this.
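                      For reference, the behaviour being asked for can be sketched in a few lines. This is not ModeShape's implementation; it assumes the lock's creation time and the timeout hint are both available, and `LockTimeout` is a hypothetical helper name. It follows the JCR contract for `Lock.getSecondsRemaining()`: a negative value once the lock has timed out, and `Long.MAX_VALUE` when the timeout is infinite or unknown.

```java
/** Hypothetical helper; illustrates the expected arithmetic only. */
final class LockTimeout {
    private LockTimeout() {}

    /**
     * @param createdAtMillis    wall-clock time at which the lock was created
     * @param timeoutHintSeconds timeout requested when locking; Long.MAX_VALUE means no timeout
     * @param nowMillis          current wall-clock time
     * @return seconds left, negative if already expired, Long.MAX_VALUE if the lock never expires
     */
    static long secondsRemaining(long createdAtMillis, long timeoutHintSeconds, long nowMillis) {
        if (timeoutHintSeconds == Long.MAX_VALUE) {
            return Long.MAX_VALUE;
        }
        // Clamp at zero so a clock that moved backwards cannot extend the timeout.
        long elapsedSeconds = Math.max(0L, (nowMillis - createdAtMillis) / 1000L);
        return timeoutHintSeconds - elapsedSeconds;
    }
}
```

                      With a 600-second hint, one minute after creation this yields 540, and eleven minutes after creation it yields -60 rather than Long.MAX_VALUE.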