2 Replies Latest reply on Jul 15, 2013 4:56 AM by zxeka

    Lock.tryLock() on the suspected node hangs forever after a false failure detection




      We are using jgroups-3.0.3.Final as a cluster-wide locking implementation in a two-node cluster.

      Our (simplified) JGroups configuration is as follows:

      <config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.0.xsd">

         <TCP bind_port="7800" ... />

         <TCPPING ... />

         <MERGE2  min_interval="10000" max_interval="30000"/>


         <FD timeout="3000" max_tries="3" />

         <VERIFY_SUSPECT timeout="1500" />




      We perform lock/unlock as follows:

      Lock lock = getLockService().getLock("mylock");

      lock.tryLock();

      try {

         //do something

      } finally {

         lock.unlock();

      }

      We are experiencing false failure detections several times a day, probably because the FD timeout is too low. What is worse, we often end up with several locks hanging forever if they were requested during such a false failure detection.

      The scenario is this:
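To put a number on "too low": a rough estimate of how long a node may stay unresponsive before it is excluded, assuming FD suspects a member after max_tries unanswered heartbeats of timeout ms each and VERIFY_SUSPECT then waits its own timeout before confirming. That reading of the two protocols is an assumption, and the class and method names are hypothetical.

```java
final class FdTimingEstimate {

    /**
     * Rough estimate in milliseconds, assuming FD suspects after maxTries
     * missed heartbeats and VERIFY_SUSPECT adds its own wait on top.
     */
    static long msUntilExcluded(long fdTimeoutMs, int maxTries, long verifySuspectTimeoutMs) {
        return fdTimeoutMs * maxTries + verifySuspectTimeoutMs;
    }

    public static void main(String[] args) {
        // Values from the configuration above: FD timeout=3000, max_tries=3; VERIFY_SUSPECT timeout=1500.
        System.out.println(msUntilExcluded(3000, 3, 1500) + " ms"); // prints "10500 ms"
    }
}
```

Under these assumptions, any pause longer than roughly 10 seconds (a long GC or a short network hiccup, for example) could be enough to trigger a false suspicion.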

      1. We have a cluster view of {A,B|1}.
      2. A failure is detected although both nodes are alive (false FD).
      3. Node A suspects Node B and creates a new view, {A|2}.
      4. The suspected Node B is still in view {A,B|1}.
      5. Node B tries to obtain the lock "mylock".
      6. Node A discards the grant-lock messages from Node B, because B is in a different view.
      7. A view merge is performed, and a new view is created: {A,B|3}.
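The steps above can be replayed in a toy model. This is not JGroups code: the Coordinator class and its methods are hypothetical and only mimic the behaviour described in step 6, where a request from a node outside the current view is dropped without a reply.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

final class ViewMismatchSketch {

    /** Toy coordinator (node A): answers lock requests only from members of its current view. */
    static final class Coordinator {
        private Set<String> view = new HashSet<>(Arrays.asList("A", "B")); // step 1: {A,B|1}

        void installView(String... members) {
            view = new HashSet<>(Arrays.asList(members));
        }

        /** Completes the reply if the sender is in the view; otherwise drops the request silently. */
        void onLockRequest(String sender, CompletableFuture<Boolean> reply) {
            if (view.contains(sender)) {
                reply.complete(true);
            }
        }
    }

    /** Replays steps 1-7 and returns the reply that node B is still waiting on. */
    static CompletableFuture<Boolean> replayScenario() {
        Coordinator a = new Coordinator();
        a.installView("A");                 // steps 2-3: A suspects B and installs {A|2}
        CompletableFuture<Boolean> reply = new CompletableFuture<>();
        a.onLockRequest("B", reply);        // steps 5-6: B's request is dropped
        a.installView("A", "B");            // step 7: merge to {A,B|3}; the old request is not replayed
        return reply;
    }

    public static void main(String[] args) {
        CompletableFuture<Boolean> reply = replayScenario();
        System.out.println("reply arrived: " + reply.isDone()); // prints "reply arrived: false"
    }
}
```

In this model the caller on B would block forever on an untimed wait, which matches what we observe, but whether the real implementation behaves this way internally is a guess.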

      Problem: the thread that tries to get "mylock" hangs on the lock.tryLock(); line, and every subsequent attempt to get "mylock" fails as well.

      We switched to tryLock(long time, TimeUnit unit) with a timeout, and that seems to have solved the problem.

      Question: Does this mean that the JGroups implementation of Lock.tryLock() without a timeout has a bug and should be avoided?
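A minimal sketch of the timed variant, written against the standard java.util.concurrent.locks.Lock interface. The helper name runWithLock and the 5-second timeout are hypothetical, and a ReentrantLock stands in for the lock returned by getLockService().getLock("mylock") so that the example runs without a cluster.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

final class TimedLocking {

    /**
     * Runs the action while holding the lock, waiting at most the given time.
     * Returns true if the action ran, false if the lock was not obtained in time.
     */
    static boolean runWithLock(Lock lock, long timeout, TimeUnit unit, Runnable action)
            throws InterruptedException {
        if (!lock.tryLock(timeout, unit)) {
            return false; // give up instead of waiting indefinitely
        }
        try {
            action.run();
            return true;
        } finally {
            lock.unlock(); // reached only when the lock was actually acquired
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the cluster-wide lock; in our setup this would come from the lock service.
        Lock lock = new ReentrantLock();
        boolean ran = runWithLock(lock, 5, TimeUnit.SECONDS,
                () -> System.out.println("doing something"));
        System.out.println("lock acquired: " + ran);
    }
}
```

Checking the boolean result matters here: calling unlock() after a failed tryLock would release a lock the thread never held.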