5 Replies Latest reply on Jul 27, 2011 6:18 AM by Manik Surtani

    Locks not working as expected in DIST_SYNC cache

    Forman Loop Newbie

      Well, we made it past the joining, but we seem to be missing something on locking.  It seems that we cannot achieve cluster-wide locking.

       

      We are running Infinispan within a JBoss AS 5.1.0.GA cluster of two systems.  Our application accepts JMS messages and places them in the Infinispan cache.  Listeners to the cache on each system awaken and try to lock the just-added entry, process it, and delete it.

       

      The logic in the listener is as follows:

      public void newEntry( CacheEntryCreatedEvent cece ) {
          TransactionManager tm = cache.getAdvancedCache().getTransactionManager();
          try {
              tm.begin();
              String key = cece.getKey().toString();
              cache.getAdvancedCache().lock(key);  // Hoping that this is a cluster-wide lock - one node gets it, the other waits
              if (cache.containsKey(key)) {        // May have been removed by the listener on the other node
                  // ... publish on a particular JMS topic ...
                  cache.remove(key);
              }
              tm.commit();                         // Releases the lock
          } catch (Exception e) {
              // Roll back so the lock is released even on failure
              try { tm.rollback(); } catch (Exception ignored) { }
          }
      }

       

      Unfortunately, we are regularly getting errors like the following:

      2011-07-05 15:33:28,129 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (OOB-1,MY_CLUSTER,MyHost-15542) Execution error:

      org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [ID:JBM-5d328a73-3b57-4a43-b3f4-acc677b1e1e3] for requestor [GlobalTransaction:<OtherHost-43220>:9:remote]! Lock held by [GlobalTransaction:<MyHost-15542>:5:local]

          at org.infinispan.container.EntryFactoryImpl.acquireLock(EntryFactoryImpl.java:228)

          at org.infinispan.container.EntryFactoryImpl.wrapEntryForWriting(EntryFactoryImpl.java:155)

          ...

       

      This confuses us.  The work we do while the key is locked is sub-second work. 

      We often see that both nodes get the error on the same key.

       

      The cache is configured as follows:

          GlobalConfiguration gc = GlobalConfiguration.getClusteredDefault();
          gc.setClusterName( "MY_CLUSTER" );
          Properties p = new Properties();
          p.setProperty("configurationFile", "jgroups-tcp.xml");  // Not sure this is required
          gc.setTransportProperties(p);

          Configuration c = new Configuration();
          c.setCacheMode( Configuration.CacheMode.DIST_SYNC );    // ********
          c.setUseEagerLocking(true);                             // Seems necessary
          c.setSyncCommitPhase(true);
          c.setSyncRollbackPhase(true);
      //    c.setUseReplQueue(false);
          c.setIsolationLevel(IsolationLevel.READ_COMMITTED);
          c.setTransactionManagerLookupClass("org.infinispan.transaction.lookup.GenericTransactionManagerLookup" );

          MarshallerFactory foo = (MarshallerFactory) Thread.currentThread().getContextClassLoader()
              .loadClass("org.jboss.marshalling.river.RiverMarshallerFactory").newInstance();

          EmbeddedCacheManager cm = new DefaultCacheManager( gc, c );
          cache = cm.getCache( "MY_CACHE" );

       

      What have we missed?  It seems that the cluster-wide locking is not working for us.

       

      Is our model not a good one for Infinispan use?  Is there something we could do to better match Infinispan's strengths?

      Are we just not configuring things correctly to achieve cluster-wide locking?

      Are we misusing the transactions or lock calls?

       

      Any hints will be appreciated.

        • 1. Re: Locks not working as expected in DIST_SYNC cache
          Manik Surtani Master

          Are you using Infinispan purely as a distributed lock to synchronize access to a JMS topic?  This is abuse of Infinispan.  It isn't a distributed lock API, but rather a distributed data structure.  However, if you really wanted to hack Infinispan to be used as a distributed lock API, you would want to make sure all lock acquisitions are both ordered and pessimistic.  For this, I suggest using a customised JGroups config file (copy the one shipped with Infinispan and modify) to add a total order protocol to the JGroups stack (SEQUENCER does this).  This will slow things down, but will guarantee that cluster-wide lock acquisitions happen in order and you won't see the deadlocks that cause your problem above.
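          For anyone following along, here is a sketch of what that might look like in a copy of the shipped JGroups stack file.  The protocol names around SEQUENCER and its exact position are illustrative only - placement within the stack matters, so check the JGroups manual for where SEQUENCER must sit relative to the reliable-multicast and GMS protocols:

          ```xml
          <!-- Illustrative fragment of a copied jgroups-tcp.xml; only the SEQUENCER line is new. -->
          <config xmlns="urn:org:jgroups">
             <TCP bind_port="7800" />
             <!-- ... discovery, failure detection, reliable delivery, etc. as shipped ... -->
             <pbcast.GMS join_timeout="3000" />
             <!-- Total-order protocol: all multicasts are delivered in the same order on every node -->
             <SEQUENCER />
             <!-- ... flow control, fragmentation, as shipped ... -->
          </config>
          ```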

          • 2. Re: Locks not working as expected in DIST_SYNC cache
            Forman Loop Newbie

            Manik,

             

            Thanks for your reply.

             

            This is not just a distributed lock.  It is basically a backing store for JMS.  We accept JMS messages from various sources and need to reliably deliver them to our clients, again through JMS.  Only one system should publish any particular message to the clients.

             

            Placing the messages into the cache is done by locking the key and then doing the putIfAbsent within a transaction.  A CacheListener awakens on a CacheEntryCreated event.  The listener then locks the key and, if the cache still contains it (it may have been deleted by the other system), publishes the message and deletes the cache entry, all within a transaction.  This should allow only one system to publish the message to the clients.

             

            As in the original post, the cluster-wide locking we were expecting does not seem to happen.

             

            We experimented with SEQUENCER and also with CENTRAL_LOCK/PEER_LOCK locking.  Thanks for the pointers.  However, we are getting deadlock exceptions.  Config details below.

             

            Then we experimented with setting eagerLockSingleNode to true and we seem to get the desired behavior – surprisingly.  We are working in a two-system cluster, and eagerLockSingleNode is documented to acquire the lock on only a single remote node.  It seems to have given us the cluster-wide (two-system) locking behavior we need.  Are we fooling ourselves?

             

            We are configuring the cache with a file patterned off of the example config files with our additions and minor mods:

             

            <?xml version="1.0" encoding="UTF-8"?>
            <infinispan
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="urn:infinispan:config:4.2 http://www.infinispan.org/schemas/infinispan-config-4.2.xsd"
                  xmlns="urn:infinispan:config:4.2">

               <global>
                  <globalJmxStatistics
                        enabled="true"
                        jmxDomain="org.infinispan"
                        cacheManagerName="SampleCacheManager"/>
                  <transport
                        clusterName="infinispan-cluster"
                        machineId="m1-DJR"
                        rackId="r1" nodeName="Node-A-DJR">
                     <properties>
                        <property name="configurationFile" value="jgroups-tcp.xml" />
                     </properties>
                  </transport>
               </global>

               <default>
                  <locking
                     isolationLevel="READ_COMMITTED"
                     lockAcquisitionTimeout="20000"
                     writeSkewCheck="false"
                     concurrencyLevel="5000"
                     useLockStriping="false"
                  />
                  <transaction
                        transactionManagerLookupClass="org.infinispan.transaction.lookup.GenericTransactionManagerLookup"
                        syncRollbackPhase="true"
                        syncCommitPhase="true"
                        useEagerLocking="true"
                        eagerLockSingleNode="true"
                        cacheStopTimeout="30000" />
                  <deadlockDetection enabled="true" spinDuration="1000"/>
                  <jmxStatistics enabled="true"/>
                  <clustering mode="replication">
                     <stateRetrieval
                        timeout="20000"
                        fetchInMemoryState="false"
                        alwaysProvideInMemoryState="false"
                     />
                     <sync replTimeout="20000"/>
                  </clustering>
               </default>

               <namedCache name="OUR_CACHE">
                  <clustering mode="distribution">
                     <sync/>
                     <hash
                        numOwners="2"
                        rehashWait="120000"
                        rehashRpcTimeout="600000"
                     />
                     <l1
                        enabled="false"
                        lifespan="600000"
                     />
                  </clustering>
               </namedCache>
            </infinispan>

             

            The jgroups-tcp.xml is taken from the examples as well.  It is in there that we added the SEQUENCER block:

               <SEQUENCER ergonomics="false"
                          level="TRACE" />

            We tried this block at both the top and the bottom of the protocol stack, but continued to have deadlock issues.  We saw SEQUENCER in the logged JGroups configuration, but nowhere else.

             

            How can we best achieve our goals?

            • 3. Re: Locks not working as expected in DIST_SYNC cache
              Manik Surtani Master

              Hmm, putIfAbsent and similar atomic operations should not be used within the scope of a transaction as this can provide pretty unexpected behaviour.  See this thread for more details.
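              To illustrate the point with a local, JDK-only analogue (not the Infinispan API - the class and method names below are hypothetical): once a key is held under an explicit lock, a plain check-then-put already gives the "only one winner" behaviour, so the atomic putIfAbsent buys nothing extra and can be replaced with an ordinary put inside the locked section.

              ```java
              import java.util.concurrent.ConcurrentHashMap;
              import java.util.concurrent.locks.ReentrantLock;

              // Single-JVM sketch of "explicit lock + plain put" instead of
              // putIfAbsent inside a transaction.  Names are illustrative.
              public class GuardedPut {
                  public static final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
                  private static final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

                  // Analogue of acquiring the per-key lock (cf. advancedCache.lock(key))
                  private static ReentrantLock lockFor(String key) {
                      return locks.computeIfAbsent(key, k -> new ReentrantLock());
                  }

                  // Insert only if absent, guarded by the explicit lock.
                  // Under the lock, containsKey + put is safe; no atomic op needed.
                  public static boolean insertIfAbsent(String key, String value) {
                      ReentrantLock lock = lockFor(key);
                      lock.lock();
                      try {
                          if (cache.containsKey(key)) {
                              return false;           // another caller won; nothing to do
                          }
                          cache.put(key, value);      // plain put suffices under the lock
                          return true;
                      } finally {
                          lock.unlock();              // always release the lock
                      }
                  }
              }
              ```

              The same shape applies to the listener side: lock the key, check containsKey, publish, remove - the lock makes the check-then-act sequence safe without relying on the atomic operations.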

              • 4. Re: Locks not working as expected in DIST_SYNC cache
                Forman Loop Newbie

                Manik,

                 

                Thanks for the pointer and the warning about putIfAbsent.  We read many of the posts and had high hopes that replacing putIfAbsent() with a simpler put() would suddenly make everything work.  Unfortunately, we continued to have problems - some new and solvable (e.g. a ClassNotFoundException on UUID) and others related to locking.

                 

                We're back to wondering if we are trying to make Infinispan do things for which it was not designed.  It seems that having a few writers to the cache, many readers and fairly long data lifetimes is a supported paradigm.

                 

                We have two writers, two readers and data lifetimes in the milliseconds.  We anticipated that eager locking would slow us down, but we expected it to work.  We can't get it to work.

                 

                Are we trying to put a square peg in a round hole?

                 

                Thanks again.

                • 5. Re: Locks not working as expected in DIST_SYNC cache
                  Manik Surtani Master

                  So you still get deadlock exceptions when using a put() instead of a putIfAbsent()?  I noticed that your deadlock detection spin duration is very high.  That means it will take quite a while to detect such deadlocks.  Try something like 100ms.
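                  Concretely, that would mean changing the deadlockDetection line in the configuration posted above to something like the following (spinDuration is in milliseconds):

                  ```xml
                  <deadlockDetection enabled="true" spinDuration="100"/>
                  ```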