3 Replies Latest reply on Jun 16, 2015 2:44 AM by rvansa

    cannot lock entry after entry owner has been disconnected

    jseparovic

      Hi,

       

      I have a 3-node replicated clustered cache backing a high-frequency "lock then process" application. Each iteration tries to lock a record in the cache; if that succeeds, it does some work and then (sometimes) writes back. If the lock fails, it moves on to the next record.

       

      return new ConfigurationBuilder()
        .locking()
        .concurrencyLevel(3)
        .lockAcquisitionTimeout(3000L)
        .isolationLevel(IsolationLevel.REPEATABLE_READ)
        .useLockStriping(false)
        .clustering()
        .cacheMode(CacheMode.REPL_SYNC)
        .transaction()
        .transactionManagerLookup(new JBossTransactionManagerLookup())
        .lockingMode(LockingMode.PESSIMISTIC)
        .transactionMode(TransactionMode.TRANSACTIONAL)
        .autoCommit(false)
        .build();

       

      *version: jboss-as-clustering-infinispan-7.2.0.Final.jar
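
      A minimal local sketch of the "lock then process" loop described above, assuming plain java.util.concurrent locks as a stand-in for the cache's per-entry locks. It is not Infinispan code: the names LockThenProcess, lockFor and runIteration are hypothetical, and the 50 ms timeout stands in for the 3000 ms lockAcquisitionTimeout only to keep the demo quick.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

final class LockThenProcess {
    // Local stand-in for per-entry locks; a clustered cache coordinates these across nodes.
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Returns the lock for a key, creating it on first use (hypothetical helper). */
    ReentrantLock lockFor(String key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    /** Processes every record whose lock is acquired within the timeout; skips the rest. */
    List<String> runIteration(List<String> keys, long timeoutMillis) {
        List<String> processed = new ArrayList<>();
        for (String key : keys) {
            ReentrantLock lock = lockFor(key);
            boolean locked;
            try {
                locked = lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop the iteration if the worker thread is interrupted
            }
            if (!locked) {
                continue; // lock failed: move on to the next record
            }
            try {
                processed.add(key); // placeholder for "do stuff, then maybe write back"
            } finally {
                lock.unlock();
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        LockThenProcess app = new LockThenProcess();
        AtomicBoolean release = new AtomicBoolean(false);

        // Another thread holds the lock on "b", playing the role of a competing node.
        Thread other = new Thread(() -> {
            ReentrantLock lock = app.lockFor("b");
            lock.lock();
            try {
                while (!release.get()) {
                    Thread.yield();
                }
            } finally {
                lock.unlock();
            }
        });
        other.setDaemon(true);
        other.start();
        while (!app.lockFor("b").isLocked()) {
            Thread.yield(); // wait until the competing holder really owns the lock
        }

        System.out.println(app.runIteration(Arrays.asList("a", "b", "c"), 50L)); // prints [a, c]
        release.set(true);
    }
}
```

      With "b" held by another thread, the iteration processes only a and c, which mirrors the "if lock fails, move on" behaviour.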

       

       

       

      When all 3 nodes are up, everything operates as expected. But when I "ifdown" the owner of the cache entry, the other 2 nodes cannot obtain a lock on that entry until the owner comes back up:

       

      After bringing down the owner, I get a SuspectException, and on the next attempt I get a TimeoutException. The TimeoutException then repeats until the owner node comes back up. (This happens on both non-owner nodes.)

       

      14-Jun-2015 03:40:28,801 DEBUG [CacheContainer] (pool-12-thread-1) TX: lock failed on 5a594cd4-405c-4c50-9086-5deb0bda6571 : org.infinispan.remoting.transport.jgroups.SuspectException : Suspected member: node1/mycache

      Lock info: AbstractPerEntryLockContainer{locks={}}

       

      14-Jun-2015 03:40:28,801 DEBUG [Controller] (pool-12-thread-1) Couldn't Lock: 5a594cd4-405c-4c50-9086-5deb0bda6571

       

      14-Jun-2015 03:40:28,801 DEBUG [CacheContainer] (pool-12-thread-1) TX: rollback


      14-Jun-2015 03:40:33,803 DEBUG [Controller] (pool-12-thread-1) CacheContainer lock info: AbstractPerEntryLockContainer{locks={}}

       

      14-Jun-2015 03:40:33,803 DEBUG [CacheContainer] (pool-12-thread-1) TX: begin

       

      14-Jun-2015 03:40:33,804 DEBUG [CacheContainer] (pool-12-thread-1) TX: attempting lock on 5a594cd4-405c-4c50-9086-5deb0bda6571

       

      14-Jun-2015 03:40:36,809 DEBUG [CacheContainer] (pool-12-thread-1) TX: lock failed on 5a594cd4-405c-4c50-9086-5deb0bda6571 : org.infinispan.util.concurrent.TimeoutException : Could not acquire lock on 5a594cd4-405c-4c50-9086-5deb0bda6571 on behalf of transaction GlobalTransaction:<node2:mycache>:9:local. Lock is being held by null

       

       

      Any ideas on how to handle the stale lock once the SuspectException is raised? Should this be handled by Infinispan?

       

      Cheers,

       

      Jason Separovic

        • 1. Re: cannot lock entry after entry owner has been disconnected
          rvansa

          SuspectExceptions should be handled transparently. I'm not sure whether you only see this in the Infinispan logs or whether it is thrown to the application; the application should only see replication exceptions if the dead node does not reply soon enough (or a TimeoutException on lock acquisition, though I don't think that applies here). Once the node is suspected, a rebalance should take place and another node should become the owner (writes should actually be possible even during the rebalance). Marking the node as dead usually takes about 10 to 60 seconds, depending on your JGroups configuration. So in your case you should see exceptions for several seconds after the ifdown, but not for long.
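
          If the failures really are transient as described above, the application could ride them out with a bounded retry. The following is only a sketch under that assumption: TransientRetry and retry are hypothetical names, not part of Infinispan, and real code would catch only the specific exception types it considers transient rather than every Exception.

```java
import java.util.concurrent.Callable;

final class TransientRetry {
    /**
     * Hypothetical helper: runs the attempt until it returns true or maxAttempts is used up,
     * pausing between attempts. Exceptions from the attempt are assumed to be transient.
     */
    static boolean retry(Callable<Boolean> attempt, int maxAttempts, long pauseMillis) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                if (Boolean.TRUE.equals(attempt.call())) {
                    return true;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // give up promptly when interrupted
            } catch (Exception e) {
                // Assumption: the failure is transient (e.g. a member was just suspected).
            }
            if (i < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated lock attempt that fails twice before succeeding.
        boolean ok = retry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated transient failure");
            }
            return true;
        }, 5, 10L);
        System.out.println("succeeded=" + ok + " after " + calls[0] + " attempts");
    }
}
```

          A retry like this only bounds how long the application waits; it does not help if the lock never becomes available.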

          • 2. Re: cannot lock entry after entry owner has been disconnected
            jseparovic

            Based on my JGroups config, I see the SuspectException about 6 seconds after issuing ifdown on node1.

             

            But node2 and node3 then get TimeoutExceptions continuously, with "Lock is being held by null". (One test bed has had this null lock since Saturday.)

             

                        <stack name="tcp">
                            <transport type="TCP" socket-binding="jgroups-tcp" diagnostics-socket-binding="jgroups-diagnostics-tcp"/>
                            <protocol type="TCPPING">
                                <property name="initial_hosts">node1[7600],node2[7600],node3[7600]</property>
                                <property name="num_initial_members">2</property>
                                <property name="port_range">0</property>
                                <property name="timeout">2000</property>
                            </protocol>
                            <protocol type="MERGE2"/>
                            <protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
                            <protocol type="FD">
                                <property name="timeout">2000</property>
                                <property name="max_tries">3</property>
                            </protocol>
                            <protocol type="VERIFY_SUSPECT"/>
                            <protocol type="BARRIER"/>
                            <protocol type="pbcast.NAKACK"/>
                            <protocol type="UNICAST2"/>
                            <protocol type="pbcast.STABLE"/>
                            <protocol type="pbcast.GMS"/>
                            <protocol type="UFC"/>
                            <protocol type="MFC"/>
                            <protocol type="FRAG2"/>
                        </stack>
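
            As a rough cross-check of the 6-second figure, the FD settings in this stack can be multiplied out: a member that stops answering heartbeats is suspected after roughly timeout * max_tries. This is a simplified estimate that assumes FD is the protocol that detects the ifdown, and it ignores any extra delay from VERIFY_SUSPECT or view installation; the class and method names are illustrative.

```java
final class FdSuspicionEstimate {
    /** Approximate time before FD suspects a silent member: timeout * max_tries. */
    static long suspicionDelayMillis(long timeoutMillis, int maxTries) {
        return timeoutMillis * maxTries;
    }

    public static void main(String[] args) {
        // Values taken from the <protocol type="FD"> element: timeout=2000, max_tries=3.
        long delay = suspicionDelayMillis(2000L, 3);
        System.out.println("FD suspects a silent member after about " + delay + " ms"); // about 6000 ms
    }
}
```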

            • 3. Re: cannot lock entry after entry owner has been disconnected
              rvansa

              OK, the SuspectExceptions should definitely be handled, and "Lock is being held by null" looks a bit strange (after the lock attempt fails, the lock does not appear to be held by anyone). Please file a JIRA with logging set to TRACE level for org.infinispan.