2 Replies Latest reply on Jul 28, 2015 12:25 PM by ma6rl

    InvalidItemStateException when performing multiple updates on a node

    ma6rl

      I'm seeing the following error from time to time in my environment:

       

      Caused by: javax.jcr.InvalidItemStateException: This session tried to save changes to node with key 'Cannot locate child node: 4c24b267505d649a2c9bcd-b65b-494b-a15d-e791d40f6ac4 within parent: 4c24b267505d64df01b090-ef0d-4e4e-b80a-b52a832c988a', but it was removed by another session.
      

       

      I'm using ModeShape 4.3 (via the WildFly subsystem) and have configured Infinispan to use READ_COMMITTED isolation and PESSIMISTIC locking with NON_DURABLE_XA transactions. The issue normally occurs when running in a resource-constrained environment. I am still trying to pin down the exact cause and to provide a test case to help debug it. I believe it occurs when multiple sequential updates are made to the same node in different sessions/transactions within a short time period. I have not seen the issue on faster machines, and I only started seeing it after fixing a performance problem in my custom ModeShape authentication code that had been slowing down node updates. I can also work around the issue by throttling the updates: currently, if I limit them to one update every 100 ms, I do not see the issue.
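
      A minimal sketch of that kind of throttling, for illustration only: `UpdateThrottle` and `run` are hypothetical names, and the `Runnable` stands in for whatever opens a session, updates the node, and saves. It only enforces a minimum gap between the end of one update and the start of the next, and it assumes all updates to the node go through the same throttle instance.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper (not part of ModeShape or JCR): enforces a minimum
 * gap between the end of one update and the start of the next.
 */
final class UpdateThrottle {
    private final long minIntervalNanos;
    private long lastFinishedNanos;
    private boolean hasRun = false;

    UpdateThrottle(long minIntervalMillis) {
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    /** Runs the update, first waiting until the minimum gap has elapsed. */
    synchronized void run(Runnable update) throws InterruptedException {
        if (hasRun) {
            // Re-check after each sleep so the full gap has really elapsed.
            long remaining = minIntervalNanos - (System.nanoTime() - lastFinishedNanos);
            while (remaining > 0) {
                TimeUnit.NANOSECONDS.sleep(remaining);
                remaining = minIntervalNanos - (System.nanoTime() - lastFinishedNanos);
            }
        }
        try {
            update.run();
        } finally {
            // Measure the gap from when the update finished (i.e. after the save).
            lastFinishedNanos = System.nanoTime();
            hasRun = true;
        }
    }
}
```

      With a 100 ms limit this would be used as `new UpdateThrottle(100)` and then `throttle.run(() -> updateNode(path))` for each update, where `updateNode` is a placeholder for the real session/save logic.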

       

      My working theory is that there may be some sort of lag in the LazyCachedNode implementation that causes an out-of-date view of the node to be presented to the new session/transaction if it starts too soon after the previous session was saved and its transaction committed.

       

      I am aware of [MODE-2216] Move operation on cluster causes InvalidItemStateException exception - JBoss Issue Tracker, which addressed some issues in this area last year. My question is: should I re-open that issue, or create a new one and link it? I am also going to try to provide a test case to help you track this down, but so far I have failed to create an isolated, repeatable test that demonstrates the issue. This is mostly because it is timing-related, which may make it hard to test, or to prove it is fixed, even if we do manage to reproduce it.