7 Replies Latest reply on Mar 13, 2017 10:00 AM by illia.khokholkov

[ModeShape 5.x] Permanently locked JCR node

illia.khokholkov Mar 2, 2017 6:37 PM

The issue described in this post is similar to [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node. I think I have found a way to reproduce it every so often. The gist of it is that, when stressed with concurrent writes in a clustered environment, a given lockable JCR node can become permanently locked: neither the implicit unlock upon lock timeout, nor repeated attempts to lock/unlock the node, nor taking down the cluster and bringing it back online resolves the issue.

Here is an overview of the test case located at modeshape-cluster-test/NodeCorruptionTest.java at 2b386d4fd5f654bb595ed788a2a8597adff0606c · dnillia/modeshape-cluster-t… :

Set up a clustered environment that consists of 5 members. The data will be persisted in the H2 database backed by filesystem. JGroups communication is enabled.
Create a lockable parent node.
Check whether the parent node can be locked/unlocked. If not, check whether the node claims to be unlocked but still has the lockOwner and isDeep properties set. If so, interrupt the test.
Having 25 threads in the pool, submit 25 tasks, where each one will attempt to add a new child node. Every task will be attempted 5 times. Delay across attempts is preconfigured. Here is what is happening in a single task:
1. Lock parent node for at most 15 seconds using shallow, open-scoped JCR lock.
2. Add child node.
3. Save session.
4. Unlock parent node.
Once all tasks are done:
1. Wait 20 seconds for the parent lock, if any, to expire.
2. Check whether the parent node can be locked/unlocked. Attempt this operation up to 5 times. If none of the attempts succeeds, that indicates a corruption of the node.
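
For readers who do not want to open the repository, the per-task flow above (lock, add child, save, unlock, with retries) can be sketched roughly as follows. This is not the code from NodeCorruptionTest: FakeParent is a hypothetical in-memory stand-in for the lockable parent node, and the class and method names are made up for illustration. It only shows the shape of the retry loop and the fact that the unlock sits in a finally block.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class LockedWriteSketch {

    /** Hypothetical in-memory stand-in for the lockable parent node (not a JCR or ModeShape type). */
    static final class FakeParent {
        private final AtomicBoolean locked = new AtomicBoolean(false);
        private final List<String> children = new CopyOnWriteArrayList<>();

        boolean tryLock() { return locked.compareAndSet(false, true); } // step 1
        void addChildAndSave(String name) { children.add(name); }       // steps 2 and 3
        void unlock() { locked.set(false); }                            // step 4
        boolean isLocked() { return locked.get(); }
        int childCount() { return children.size(); }
    }

    /** One task: lock, add child, save, unlock; retried up to maxAttempts times. */
    static boolean addChildWithRetry(FakeParent parent, String name, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (parent.tryLock()) {
                try {
                    parent.addChildAndSave(name);
                    return true;
                } finally {
                    parent.unlock(); // always release, even if the write fails
                }
            }
            try {
                Thread.sleep(delayMillis); // preconfigured delay between attempts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    /** Submits `tasks` writers to a pool of `threads` threads; returns how many succeeded. */
    static int runConcurrentWriters(FakeParent parent, int threads, int tasks, int maxAttempts, long delayMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger succeeded = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            String name = "child-" + i;
            pool.submit(() -> {
                if (addChildWithRetry(parent, name, maxAttempts, delayMillis)) {
                    succeeded.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return succeeded.get();
    }
}
```

In the real test the stand-in would be replaced by JCR calls. With the standard JCR 2.0 API, a shallow, open-scoped lock with a 15-second timeout hint would presumably be requested through LockManager.lock(absPath, false, false, 15, null), followed by Session.save() and LockManager.unlock(absPath). After all writers finish, the parent is expected to be unlocked; the reported problem is that on ModeShape it sometimes is not.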

The steps necessary to run the test case are presented below:

git clone https://github.com/dnillia/modeshape-cluster-test.git
cd modeshape-cluster-test
git checkout 2b386d4fd5f654bb595ed788a2a8597adff0606c
mvn clean install -pl modeshape-cluster-test-common,modeshape-cluster-test-standalone -Dsurefire.failIfNoSpecifiedTests=false -Dtest=NodeCorruptionTest -Ddb.url="jdbc:h2:file:./h2/content/db;DB_CLOSE_DELAY=-1" -DenableAssertions=false -DtrimStackTrace=false

Important details regarding the consistency of the provided test case:

Do not erase the H2 filesystem DB between runs: make sure that the db.url property is provided and that its value points somewhere outside the target directory.
The corruption does not happen all the time. On my machine and the other one I tried, the node becomes permanently locked on the second attempt to run the test. Please run the same test multiple times and the corruption of the parent node should occur. Once it happens, every subsequent attempt to run the same test should fail.

For the record, I was able to reproduce this issue even with a single member in the cluster, but that happened only once or twice out of dozens of attempts. Therefore, the provided test has a significant number of members in the cluster, which should make the potential problem more evident and more often reproducible. As I mentioned earlier, please do run the test multiple times. hchiorean, it would be great if you could find some time to look into this issue and confirm/deny whether it represents a bug. Thank you.

1. Re: [ModeShape 5.x] Permanently locked JCR node

hchiorean Mar 3, 2017 2:19 AM (in response to illia.khokholkov)

The one thing I would suggest is trying your scenario using db locking, not the default JGroups locking. You can enable that by adding a "locking" : "db" attribute to the clustering section of the configuration. JGroups locking is unreliable in a cluster and will most likely be removed in the next major version of ModeShape.

If the problem still persists, you should retry your test case once [MODE-2670] Internal repository locks are not released if user transactions are created and rolled back from different t… is resolved since that will change a few things around locking and transactions. Should the same issue still be present even after that, feel free to log a JIRA attaching this discussion.

As far as I'm concerned, after I fix MODE-2670 I will focus on something else because these complex test cases simply take up too much of my time. If in the meantime someone else in the community proposes a fix, I'm more than happy to review any PRs.
2. Re: [ModeShape 5.x] Permanently locked JCR node

illia.khokholkov Mar 3, 2017 9:23 AM (in response to hchiorean)

Thank you for the feedback. The submitted test case had DB locking enabled for clustering, as configured in modeshape-cluster-test/test-repository-h2.json at 2b386d4fd5f654bb595ed788a2a8597adff0606c · dnillia/modeshape-cluster-t… . I will definitely retry running the test after [MODE-2670] Internal repository locks are not released if user transactions are created and rolled back from different t… is resolved. And, as you suggested, I will log a bug if the problem persists. My apologies for the complex test case, but that was the only way for me to somewhat consistently reproduce the issue, which by itself is rather complex, I would think.

3. Re: [ModeShape 5.x] Permanently locked JCR node

illia.khokholkov Mar 6, 2017 7:30 PM (in response to illia.khokholkov)

hchiorean, in case you do find time to look further into the issue, here is a stack trace that I am getting when corruption happens:

Caused by: java.lang.NullPointerException
    at org.modeshape.jcr.cache.NodeKey.hashCode(NodeKey.java:202)
    at java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1012)
    at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:1006)
    at org.modeshape.jcr.cache.document.WorkspaceCache.lambda$loadFromDocumentStore$31(WorkspaceCache.java:340)
    at java.util.ArrayList.forEach(ArrayList.java:1249)
    at org.modeshape.jcr.cache.document.WorkspaceCache.loadFromDocumentStore(WorkspaceCache.java:335)
    at org.modeshape.jcr.cache.document.WritableSessionCache.lockNodes(WritableSessionCache.java:1546)
    at org.modeshape.jcr.cache.document.WritableSessionCache.save(WritableSessionCache.java:681)
    at org.modeshape.jcr.RepositoryLockManager.lock(RepositoryLockManager.java:382)
    ... 33 more

This is different from [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node, because in that case I was getting LockException as opposed to NullPointerException.

4. Re: [ModeShape 5.x] Permanently locked JCR node

illia.khokholkov Mar 6, 2017 7:44 PM (in response to illia.khokholkov)

Here is another form in which permanent corruption manifests itself:

Caused by: org.modeshape.jcr.cache.DocumentAlreadyExistsException: 750ed5b317f1e7mode:lock-750ed5b7505d648e7134af-0a34-43be-af80-52348209f175
    at org.modeshape.jcr.cache.document.WritableSessionCache.persistChanges(WritableSessionCache.java:1324)
    at org.modeshape.jcr.cache.document.WritableSessionCache.save(WritableSessionCache.java:690)
    at org.modeshape.jcr.RepositoryLockManager.lock(RepositoryLockManager.java:382)
    ... 32 more

5. Re: [ModeShape 5.x] Permanently locked JCR node

illia.khokholkov Mar 7, 2017 10:13 PM (in response to illia.khokholkov)

After consuming the fix for [MODE-2670] Internal repository locks are not released if user transactions are created and rolled back from different t…, I have encountered more problems and ended up logging [MODE-2672] The node may become permanently locked after concurrent writes - JBoss Issue Tracker. hchiorean, it would be great if you could find some time to look at the issue logged, thank you.
6. Re: [ModeShape 5.x] Permanently locked JCR node

hchiorean Mar 9, 2017 8:09 AM (in response to illia.khokholkov)

illia.khokholkov I've reopened MODE-2670 because there's an additional issue with clustering initialization that I missed.

With the changes I have locally, I ran NodeCorruptionTest with 10 nodes in the cluster several times, and I never got a failure on my machine. So if you still see the same issues described above, you'll have to investigate this in your own environment.
7. Re: [ModeShape 5.x] Permanently locked JCR node

illia.khokholkov Mar 13, 2017 10:00 AM (in response to hchiorean)

Thank you for the quick update, your help is greatly appreciated. However, I am still having issues with the permanent corruption of the node, at least on my machine (see comment [1]). I will try running the same test case on some other machines that are available to me.

[1] [MODE-2672] The node may become permanently locked after concurrent writes in cluster - JBoss Issue Tracker