5 Replies Latest reply on Nov 1, 2013 4:25 PM by fharms

    Inconsistent cache between 2 cluster nodes

    fharms

      Hi

       

      We are from time to time experiencing inconsistencies in JBoss Cache between two nodes. We use JBoss Cache as the second-level cache for Hibernate.

      We have 2 nodes, both set up with a JBoss Cache in a cluster using "READ_COMMITTED" as the isolation level and "INVALIDATION_SYNC" as the cache mode.

       

      A snapshot of the jboss cache configuration:


      <entry><key>mvcc-entity</key>
         <value>      
            <bean name="MVCCEntityCache" class="org.jboss.cache.config.Configuration">
               
               <!-- Mode of communication with peer caches.        
                    INVALIDATION_SYNC is highly recommended as the mode for use
                    with entity and collection caches.     -->
               <property name="cacheMode">INVALIDATION_SYNC</property> 
               <!-- Name of cluster. Needs to be the same for all members -->
               <property name="clusterName">${jboss.partition.name:DefaultPartition}-mvcc-entity</property>        
               <!-- Specifies the number of shared locks to use for write locks acquired. -->
               <property name="concurrencyLevel">100000</property>
               <!-- Whether or not to fetch state on joining a cluster. -->
               <property name="fetchInMemoryState">true</property>
               <!-- Must match the value of "useRegionBasedMarshalling" -->
               <property name="inactiveOnStartup">true</property>
               <!-- The isolation level used for transactions. -->
               <property name="isolationLevel">READ_COMMITTED</property>
               <!-- We have no asynchronous notification listeners -->
               <property name="listenerAsyncPoolSize">0</property>
               <!-- Max number of milliseconds to wait for a lock acquisition; matches the transaction timeout of 5 min -->
               <property name="lockAcquisitionTimeout">300000</property>
               <!-- Specifies whether parent nodes are locked when inserting or removing children. -->
               <property name="lockParentForChildInsertRemove">false</property>
      


      Problem :

      We see this problem when 2 clients are routed to different servers: client #1 adds a new item to a collection on server #1, and the collection on server #2 is invalidated as expected.

      Client #2 now reads this collection on server #2, which forces it to read the data from the database into the cache (putForExternalRead). Concurrently, client #1 adds another item to the same collection on server #1, which again invalidates the collection on server #2; this update finishes before client #2 has inserted all its data into the cache. Client #2 then finishes populating the cache, but does not see the changes from client #1; in fact it simply inserts its own, now stale, data.

       

      Since putForExternalRead only updates the cache if the entry is not already present, it will never read the latest update into the cache, and you end up with a cache that is inconsistent both between the nodes and between the cache and the database.
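      To make the race concrete, here is a small self-contained Java analogy (it uses a plain ConcurrentHashMap instead of JBoss Cache, since putForExternalRead behaves like putIfAbsent; the class name and key are made up for illustration):


      import java.util.concurrent.ConcurrentHashMap;

      public class StalePutForExternalReadDemo {
          public static void main(String[] args) {
              // The second-level cache on server #2, modelled as a simple map.
              ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

              // Client #2 has read the collection from the database (version v1)
              // but has not yet stored it in the cache.
              String staleDbSnapshot = "collection-v1";

              // Meanwhile client #1 commits a change on server #1; the resulting
              // invalidation message removes the entry on server #2.
              cache.remove("collection");

              // Client #2 now completes its putForExternalRead. Like putIfAbsent,
              // it writes only because the entry is absent, and it writes v1
              // even though the database already contains v2.
              cache.putIfAbsent("collection", staleDbSnapshot);

              // The cache on server #2 stays stale until the next invalidation.
              System.out.println(cache.get("collection")); // prints collection-v1
          }
      }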

       

      I have tried to illustrate the scenario in this simple diagram:


      We are running JBoss Application Server 5.1, JBoss Cache 3.2.5.GA, and the Hibernate JBossCache2.x integration 3.3.2.GA_CP03.

       

      Is this a bug, or is it caused by a misconfiguration?

       

      Thanks!

      /Flemming

        • 1. Re: Inconsistent cache between 2 cluster nodes
          brian.stansberry

          Hi Flemming,

           

          You may have better luck with questions like this on the Hibernate forums, as this is more an issue with the Hibernate integration than with JBC itself.

           

          There were some fixes in later Hibernate releases related to these kinds of issues with collection caching:

           

          [HHH-3817] JBC second level cache integration can cache stale collection data - Hibernate JIRA

          [HHH-4944] putFromLoad calls could store stale data - Hibernate JIRA

           

          I'm not sure those fixes would directly address what you are seeing, but it's definitely in the area.

          • 2. Re: Re: Inconsistent cache between 2 cluster nodes
            fharms

            We have upgraded to Hibernate 3.5.6.Final and hibernate-entitymanager 3.4.0.GA. This seems to have solved the issue within the same node, but the error still exists in cluster mode.


            On a standalone server, Hibernate registers a timestamp in PutFromLoadValidator before reading from the database. Once the database read is completed, a lock is acquired and the data is put in the cache. If the key has been invalidated in the meantime, we are not allowed to acquire the lock, and the cache is not updated. When a key is invalidated, PutFromLoadValidator is also called to register that an invalidation has happened.


            In a 2-node cluster the invalidation is not registered in PutFromLoadValidator on the other node, because the synchronization happens at the level below Hibernate (JBoss Cache). Since the invalidation is not registered on the other node, that node is allowed to put stale data into the cache.
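            The following is a simplified, self-contained model of the bookkeeping described above. It is not the real PutFromLoadValidator; the class and method names are ours and only mimic the idea, but it shows why the put is rejected when the invalidation is registered locally, and why it slips through when the invalidation only happens at the cache layer:


            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;

            // Simplified stand-in for the pending-put / invalidation bookkeeping
            // (hypothetical names, for illustration only).
            class SimplePutFromLoadValidator {
                private final Map<Object, Long> pendingPuts = new ConcurrentHashMap<>();
                private final Map<Object, Long> invalidations = new ConcurrentHashMap<>();

                // Called before the database read starts.
                void registerPendingPut(Object key) {
                    pendingPuts.put(key, System.nanoTime());
                }

                // Called when a key is invalidated *on this node*.
                void invalidateKey(Object key) {
                    invalidations.put(key, System.nanoTime());
                }

                // Called just before writing the loaded data to the cache.
                boolean acquirePutFromLoadLock(Object key) {
                    Long registered = pendingPuts.get(key);
                    Long invalidated = invalidations.get(key);
                    // Allow the put only if no invalidation was registered after the read started.
                    return registered != null && (invalidated == null || invalidated < registered);
                }
            }

            public class ClusterGapDemo {
                public static void main(String[] args) {
                    SimplePutFromLoadValidator validator = new SimplePutFromLoadValidator();

                    validator.registerPendingPut("collection");
                    // A remote invalidation now arrives via JBoss Cache, but on node #2
                    // nothing calls validator.invalidateKey("collection"); that is the gap.
                    boolean allowed = validator.acquirePutFromLoadLock("collection");
                    System.out.println("put allowed = " + allowed); // true, so stale data goes in
                }
            }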


            PutFromLoadValidator exists in a similar form for Infinispan in Hibernate 4.2, so we doubt that an upgrade to Infinispan will solve the problem.


            We have created a figure that tries to explain the problem.

            • 3. Re: Inconsistent cache between 2 cluster nodes
              brian.stansberry

              Apologies; I can only talk conceptually today, as opposed to digging into the old code and seeing what's happening.

               

              In case I've gotten confused about which node is which in the scenario, in the following Node 2 is doing the putfromload and Node 1 is where the actual changes are happening.

               

              1) Invalidation events happening in the JBC layer *should* result in callbacks to the Node 2 PutFromLoadValidator stuff. So if the invalidation from Node 1 arrives before Node 2 checks if the putfromload is still valid, that should prevent the putfromload ever occurring. If this isn't what's happening, then there's an issue in the 2LC layer.

               

              2) If the putfromload is accepted as valid and proceeds into the JBC layer before the invalidation message arrives from Node 1, that's where it gets tricky. If the putfromload updates JBC first, then the invalidation message will just clear out the putfromload data and all is well. If the invalidation arrives first and then the putfromload overwrites it, that's where there's a problem.

               

              The putfromload could hold a lock in the 2LC layer (i.e. in PutFromLoadValidator) until the cache is updated. If the callback from the incoming invalidation discussed in step 1 had to acquire that lock, then we could ensure that the putfromload would be complete before the invalidation occurs on Node 2. The thread carrying the invalidation message would block until putfromload was done.

               

              The risky part there is the chance of deadlock -- The putfromload thread acquires a lock in PutFromLoadValidator and then goes to get a lock in JBC; the invalidation message thread gets a lock in JBC and then as part of notification handling needs to get a lock in PutFromLoadValidator. Locks acquired in opposite order == deadlock potential.

               

              I believe though the call putfromload makes into JBC is meant to be "fail fast" -- i.e. if the deadlock scenario occurred, the putfromload wouldn't block trying to get the JBC lock but would just cleanly fail to update, leading to the stale data not being stored. The call stack would unwind, the PutFromLoadValidator lock would be released, and the invalidation thread could proceed.
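              As a rough illustration of that "fail fast" behaviour (a sketch only, with made-up lock names and plain java.util.concurrent locks, not the actual 2LC or JBC code): the putfromload thread holds the validator lock and only *tries* to take the cache lock, so if an incoming invalidation already holds the cache lock the put gives up cleanly instead of deadlocking.


              import java.util.concurrent.TimeUnit;
              import java.util.concurrent.locks.ReentrantLock;

              public class FailFastSketch {
                  // Hypothetical stand-ins for the PutFromLoadValidator lock and the JBC node lock.
                  static final ReentrantLock validatorLock = new ReentrantLock();
                  static final ReentrantLock cacheLock = new ReentrantLock();

                  // putfromload path: validator lock first, then a non-blocking attempt on the cache lock.
                  static boolean putFromLoad(String key, String value) throws InterruptedException {
                      validatorLock.lock();
                      try {
                          if (!cacheLock.tryLock(0, TimeUnit.MILLISECONDS)) {
                              // The invalidation thread holds the cache lock: give up cleanly,
                              // leaving the cache without the possibly stale entry.
                              return false;
                          }
                          try {
                              // ... write the value into the cache here ...
                              return true;
                          } finally {
                              cacheLock.unlock();
                          }
                      } finally {
                          validatorLock.unlock();
                      }
                  }

                  // invalidation path: cache lock first, then the validator lock for the notification callback.
                  static void invalidate(String key) {
                      cacheLock.lock();
                      try {
                          validatorLock.lock();   // blocks until any in-flight putFromLoad has finished
                          try {
                              // ... remove the entry and register the invalidation ...
                          } finally {
                              validatorLock.unlock();
                          }
                      } finally {
                          cacheLock.unlock();
                      }
                  }

                  public static void main(String[] args) throws InterruptedException {
                      System.out.println("putFromLoad succeeded = " + putFromLoad("collection", "v1"));
                  }
              }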

               

              I suspect this is the way it's all meant to work, as I thought about these scenarios a lot. But maybe there's a bug or maybe I didn't think it through as well as I thought. Hopefully though the above is helpful in understanding what should be happening.

              • 4. Re: Inconsistent cache between 2 cluster nodes
                fharms

                Thanks Brian for the answer; that will actually help me a lot in my investigation.

                 

                In the meantime we have tried to upgrade to Infinispan core 5.1 and hibernate-infinispan 3.6.1, and that actually seems to solve the issue.

                I'm trying to understand why; once I know, I will let you know.

                • 5. Re: Re: Inconsistent cache between 2 cluster nodes
                  fharms

                  I was too fast in concluding that it works with the Infinispan 4.2.1 and Hibernate 3.6.9 cache integration, because we believe the problem is what is described under #1:


                  "Invalidation events happening in the JBC layer *should* result in callbacks to the Node 2 PutFromLoadValidator stuff. So if the invalidation from Node 1 arrives before Node 2 checks if the putfromload is still valid, that should prevent the putfromload ever occurring. If this isn't what's happening, then there's an issue in the 2LC layer."


                  When an invalidation event arrives on node #2, there is no callback to the 2LC layer that makes sure invalidate is called. Another problem we discovered: if a collection is invalidated on node #1 twice in a row, the second time no invalidation event is broadcast to node #2. That is a problem because, in between invalidation events #1 and #2, PutFromLoadValidator is called on node #2, which makes it read stale data from the cache.


                  A colleague of mine has come up with a patch both for JBoss Cache 3.6 and the Hibernate JBoss Cache integration, and one for Infinispan 4.2.1.


                  1) To solve the issue of no callback to the 2LC layer, we added a cache listener that calls putFromLoadValidator.invalidateKey (a sketch of this listener follows below).

                  2) To solve the issue of the invalidation event not being sent when the entry was already invalidated, we have patched JBoss Cache and Infinispan in InvalidateCommand.java.
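                  For reference, the listener in (1) looks roughly like the sketch below, written against the JBoss Cache 3.x notification API (please double-check @NodeInvalidated against 3.2.5; the KeyInvalidationCallback interface and the bridge class name are placeholders for the real PutFromLoadValidator wiring, so this is an outline rather than the actual patch):


                  import org.jboss.cache.notifications.annotation.CacheListener;
                  import org.jboss.cache.notifications.annotation.NodeInvalidated;
                  import org.jboss.cache.notifications.event.NodeInvalidatedEvent;

                  // Placeholder for the real PutFromLoadValidator hook; only invalidateKey is needed here.
                  interface KeyInvalidationCallback {
                      void invalidateKey(Object key);
                  }

                  @CacheListener
                  public class InvalidationToValidatorBridge {

                      private final KeyInvalidationCallback validator;

                      public InvalidationToValidatorBridge(KeyInvalidationCallback validator) {
                          this.validator = validator;
                      }

                      // Fired when JBoss Cache invalidates a node because of a remote modification.
                      @NodeInvalidated
                      public void nodeInvalidated(NodeInvalidatedEvent event) {
                          if (!event.isPre()) {
                              // Tell the 2LC layer that this Fqn can no longer be trusted, so any
                              // in-flight putFromLoad for it is rejected.
                              validator.invalidateKey(event.getFqn());
                          }
                      }
                  }

                  // Registration, e.g. during cache region setup:
                  //   cache.addCacheListener(new InvalidationToValidatorBridge(putFromLoadValidator));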


                  Since we are now invalidating more often, this could lead to potential performance or other issues, but it is hard to see how we can prevent the cache from storing stale data without sending the events.


                  I would appreciate feedback on the patches.


                  I have attached both patches for JBoss Cache, "hibernate-jbosscache-3.5.6.Final.diff" and "jbosscache-3.2.5.GA.diff", and the patches for Infinispan 4.2.1 and Hibernate ORM 3.6.9, "hibernate-infinispan-cache-3.6.9.final.diff" and "infinispan-core-4.2.1.FINAL.diff".

                   

                   

                  Thanks!


                  br

                  Flemming