The issue described in this post is similar to [ModeShape 5.2.0.Final, Oracle 11g] Corrupted node. I think I have found a way to reproduce it every so often. In short, when stressed with concurrent writes in a clustered environment, a lockable JCR node can become permanently locked: neither the implicit unlock upon lock timeout, nor repeated attempts to lock/unlock the node, nor taking the cluster down and bringing it back online resolves the issue.
Here is an overview of the test case located at modeshape-cluster-test/NodeCorruptionTest.java at 2b386d4fd5f654bb595ed788a2a8597adff0606c · dnillia/modeshape-cluster-t… :
- Set up a clustered environment that consists of 5 members. Data is persisted in an H2 database backed by the filesystem. JGroups communication is enabled.
- Create a lockable parent node.
- Check whether the parent node can be locked/unlocked. If not, check whether the node claims to be unlocked but still has the lockOwner and isDeep properties set. If so, abort the test.
- With 25 threads in the pool, submit 25 tasks, each of which attempts to add a new child node. Every task is attempted 5 times, with a preconfigured delay between attempts. A single task does the following:
  - Lock the parent node for at most 15 seconds using a shallow, open-scoped JCR lock.
  - Add a child node.
  - Save the session.
  - Unlock the parent node.
- Once all tasks are done:
  - Wait 20 seconds for the parent lock, if any, to expire.
  - Check whether the parent node can be locked/unlocked. Try this up to 5 times. If none of the attempts succeeds, the node is considered corrupted.
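For readers who do not want to open the repository, the per-task flow above can be sketched roughly as follows. This is not the code from NodeCorruptionTest: the `ParentNode` interface, the `InMemoryParent` stand-in, and all method names are hypothetical, introduced only so the sketch runs without a repository. The comments name the JCR 2.0 calls each step would presumably map to. The sketch also assumes that a task stops retrying once it succeeds, and that the unlock happens in a `finally` block, neither of which is stated above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class LockStressSketch {

    /** Hypothetical abstraction over the JCR calls; not part of ModeShape or the test project. */
    interface ParentNode {
        // JCR 2.0: LockManager.lock(absPath, false /* isDeep */, false /* isSessionScoped */, timeoutSeconds, null)
        void lock(long timeoutSeconds) throws Exception;
        void addChild(String name) throws Exception; // JCR 2.0: Node.addNode(name)
        void save() throws Exception;                // JCR 2.0: Session.save()
        void unlock() throws Exception;              // JCR 2.0: LockManager.unlock(absPath)
    }

    /** In-memory stand-in so the sketch runs without a repository; the timeout is ignored here. */
    static class InMemoryParent implements ParentNode {
        private final Semaphore lock = new Semaphore(1);
        final AtomicInteger children = new AtomicInteger();

        public void lock(long timeoutSeconds) {
            // Fail fast when the node is already locked, instead of blocking.
            if (!lock.tryAcquire()) throw new IllegalStateException("parent is already locked");
        }
        public void addChild(String name) { children.incrementAndGet(); }
        public void save() { }
        public void unlock() { lock.release(); }
        boolean isUnlocked() { return lock.availablePermits() == 1; }
    }

    /** One task: lock, add child, save, unlock; retried with a fixed delay between attempts. */
    static boolean addChildWithRetries(ParentNode parent, String name, int maxAttempts, long delayMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                parent.lock(15); // at most 15 seconds, shallow, open-scoped
                try {
                    parent.addChild(name);
                    parent.save();
                } finally {
                    parent.unlock(); // only reached if the lock was acquired
                }
                return true;
            } catch (Exception e) {
                Thread.sleep(delayMillis); // wait before the next attempt
            }
        }
        return false;
    }

    /** Submits all tasks to a fixed-size pool and returns how many of them added a child. */
    static int runTasks(ParentNode parent, int threads, int tasks, int maxAttempts, long delayMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final String childName = "child-" + i;
                results.add(pool.submit(() -> addChildWithRetries(parent, childName, maxAttempts, delayMillis)));
            }
            int succeeded = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) succeeded++;
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        InMemoryParent parent = new InMemoryParent();
        int succeeded = runTasks(parent, 25, 25, 5, 20);
        System.out.println(succeeded + " of 25 tasks added a child; children=" + parent.children.get());
    }
}
```

With the in-memory stand-in the parent always ends up unlocked, which is exactly the invariant the real test finds violated against a clustered repository.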
The steps necessary to run the test case are presented below:
git clone https://github.com/dnillia/modeshape-cluster-test.git
cd modeshape-cluster-test
git checkout 2b386d4fd5f654bb595ed788a2a8597adff0606c
mvn clean install -pl modeshape-cluster-test-common,modeshape-cluster-test-standalone -Dsurefire.failIfNoSpecifiedTests=false -Dtest=NodeCorruptionTest -Ddb.url="jdbc:h2:file:./h2/content/db;DB_CLOSE_DELAY=-1" -DenableAssertions=false -DtrimStackTrace=false
Important details regarding the consistency of the provided test case:
- Do not erase the H2 filesystem DB between runs. To ensure this, provide the db.url property and point it somewhere outside the target directory.
- The corruption does not happen every time. On my machine and on another one I tried, the node becomes permanently locked on the second run of the test. Please run the same test multiple times until the corruption of the parent node occurs. Once it happens, every subsequent run of the same test should fail.
For the record, I was able to reproduce this issue even with a single member in the cluster, but that happened only once or twice out of dozens of attempts. The provided test therefore uses a larger number of cluster members, which should make the problem more evident and easier to reproduce. As I mentioned earlier, please do run the test multiple times.
hchiorean, it would be great if you could find some time to look into this issue and confirm or deny whether it represents a bug. Thank you.