2 Replies Latest reply on Jul 15, 2013 4:56 AM by zxeka

    Lock.tryLock() on the suspected node hangs forever after a false failure detection

    zxeka Newbie

      Hello,

       

      We are using jgroups-3.0.3.Final as a cluster-wide locking implementation in a cluster of two nodes.

      Our (simplified) JGroups configuration is as follows:

      <config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">
         <TCP bind_port="7800" ... />
         <TCPPING ... />
         <MERGE2  min_interval="10000" max_interval="30000"/>
         <FD_SOCK/>
         <FD timeout="3000" max_tries="3" />
         <VERIFY_SUSPECT timeout="1500" />
         ...
         <PEER_LOCK/>
      </config>

      We perform lock/unlock as follows:

      Lock lock = getLockService().getLock("mylock");
      try
      {
         lock.tryLock();
         //do something
      }
      finally
      {
         lock.unlock();
      }

      We are experiencing false failure detections several times a day, probably because the FD timeout is too low. Worse, locks that were requested during such a false detection often hang forever.

      The scenario is this:
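
As a rough back-of-the-envelope check on the timeouts above, the sketch below estimates how long a node can stay silent before it is excluded. It assumes that FD suspects a member after max_tries unanswered heartbeats of timeout ms each and that VERIFY_SUSPECT then waits its own timeout before confirming. FD_SOCK can raise a suspicion on its own and is ignored here. The class and method names are hypothetical.

```java
public class FdBudgetSketch {
    /**
     * Approximate silence (in ms) a member can afford before being excluded,
     * ASSUMING detection time = FD timeout * max_tries + VERIFY_SUSPECT timeout.
     */
    static long worstCaseDetectionMs(long fdTimeoutMs, int maxTries, long verifySuspectMs) {
        return fdTimeoutMs * maxTries + verifySuspectMs;
    }

    public static void main(String[] args) {
        // Values taken from the configuration in this post.
        System.out.println(worstCaseDetectionMs(3000, 3, 1500)); // prints 10500
    }
}
```

Under these assumptions, any pause of roughly 10.5 seconds (a long GC or a network stall, for instance) would be enough to trigger a false detection.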

      1. We have a cluster view of {A,B|1}.
      2. A failure is detected although both nodes are alive (false FD).
      3. Node A suspects Node B and installs a new view, {A|2}.
      4. The suspected Node B still has the view {A,B|1}.
      5. Node B tries to obtain the lock "mylock".
      6. Node A discards the grant-lock messages from Node B, because B is in a different view.
      7. A view merge is performed, and a new view, {A,B|3}, is created.
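
The steps above can be illustrated with a toy model. This is not JGroups code and does not reflect how PEER_LOCK is implemented; `Peer`, `installView`, `onLockRequest` and `requestLock` are hypothetical names. It only shows why a requester that waits without a timeout never returns when the request is silently dropped, while a timed wait does.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ViewMismatchSketch {
    /** Toy peer: answers a lock request only if the sender is in its current view. */
    static final class Peer {
        private volatile Set<String> view;

        Peer(Set<String> view) { this.view = view; }

        void installView(Set<String> newView) { this.view = newView; }

        /** Grants the request if the sender is a member; otherwise drops it silently. */
        void onLockRequest(String sender, CompletableFuture<Boolean> reply) {
            if (view.contains(sender)) {
                reply.complete(true);
            }
            // Sender not in view: the request is discarded and the reply never completes.
        }
    }

    /** Waits at most timeoutMs for a grant; returns false if no reply arrives in time. */
    static boolean requestLock(Peer peer, String sender, long timeoutMs) {
        CompletableFuture<Boolean> reply = new CompletableFuture<>();
        peer.onLockRequest(sender, reply);
        try {
            return reply.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return false; // an untimed reply.get() would block forever here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Peer a = new Peer(Set.of("A", "B"));     // view {A,B|1}
        a.installView(Set.of("A"));              // A excludes B: view {A|2}
        System.out.println(requestLock(a, "B", 200)); // prints false
        a.installView(Set.of("A", "B"));         // merge: view {A,B|3}
        System.out.println(requestLock(a, "B", 200)); // prints true
    }
}
```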

      Problem: the thread that tries to acquire "mylock" hangs at the lock.tryLock() line, and every subsequent attempt to acquire "mylock" fails as well.

      We switched to tryLock(long time, TimeUnit unit) with a timeout, and that seems to have solved the problem.

      Question: Does this mean that the JGroups implementation of Lock.tryLock() without a timeout has a bug and should be avoided?
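
A minimal sketch of the timed variant is shown below. It is written against the standard java.util.concurrent.locks.Lock interface, and a ReentrantLock stands in for the lock returned by getLockService().getLock("mylock") so that the example runs without a cluster. The helper name `runWithLock` and the 5-second timeout are assumptions made for illustration. Unlike the snippet earlier in this post, it checks the return value of tryLock and calls unlock() only when the lock was actually acquired.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TimedLockSketch {
    /**
     * Runs the task under the lock if it can be acquired within the timeout.
     * Returns true if the task ran, false on timeout or interruption.
     */
    static boolean runWithLock(Lock lock, long timeout, TimeUnit unit, Runnable task) {
        boolean acquired = false;
        try {
            acquired = lock.tryLock(timeout, unit);
            if (acquired) {
                task.run();
            }
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        } finally {
            if (acquired) {
                lock.unlock(); // release only what we actually hold
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for getLockService().getLock("mylock").
        Lock lock = new ReentrantLock();
        boolean ran = runWithLock(lock, 5, TimeUnit.SECONDS,
                () -> System.out.println("do something"));
        System.out.println("acquired=" + ran);
    }
}
```

When the helper returns false, the caller can decide whether to retry, for example after the views have merged, or to give up.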

      Thanks.