3 Replies Latest reply on Nov 15, 2004 9:41 AM by tinachen

    lock.TimeoutException crash the replicate-sync cluster

    tinachen

      Hi all,
      I am hitting an exception that causes two cache instances to crash in a replicate-sync cluster.
      These are the test steps:
      1. Start two instances in one cluster configured in replicate-sync mode.
      2. Start loading data in instance_1.
      3. While instance_1 is loading data, kill instance_2.
      4. instance_1 crashes with the following exception:

      org.jboss.util.NestedRuntimeException:
      rsp=sender=WHOUATL2XBXL51:1422, retval=null, received=false, suspected=true; - nested throwable: (org.jboss.cache.lock.TimeoutException: rsp=sender=WHOUATL2XBXL51:1422, retval=null, received=false, suspected=true)
      at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:3184)
      at org.jboss.cache.TreeCache.put(TreeCache.java:1741)
      at org.jboss.cache.aop.TreeCacheAop._putObject(TreeCacheAop.java:286)
      at org.jboss.cache.aop.TreeCacheAop.putObject(TreeCacheAop.java:132)
      at com.jpmorgan.ccs.impl.test.RepTest_1._add(RepTest_1.java:291)
      at com.jpmorgan.ccs.impl.test.RepTest_1.loadData(RepTest_1.java:128)
      at com.jpmorgan.ccs.impl.test.RepTest_1.main(RepTest_1.java:224)
      Caused by: org.jboss.cache.lock.TimeoutException: rsp=sender=WHOUATL2XBXL51:1422, retval=null, received=false, suspected=true
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:2145)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:2167)
      at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:89)
      at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:39)
      at org.jboss.cache.interceptors.TransactionInterceptor.invoke(TransactionInterceptor.java:53)
      at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:3181)
      ... 6 more

      It looks like instance_1 failed to get the lock from instance_2 when it updated the cache. But I have already commented out the "SyncReplTimeout" and "LockAcquisitionTimeout" attributes in the cache config file.
      Is there any way to keep instance_1 from crashing when instance_2 is terminated?

      Thank you very much
      Tina

        • 1. Re: lock.TimeoutException crash the replicate-sync cluster
          belaban

          What do you mean by crash? Termination of the VM? I ask because what you describe is a regular scenario, in which the first box waits until (a) it gets a response from the second box or (b) the second box is suspected.

          Bela

          • 2. Re: lock.TimeoutException crash the replicate-sync cluster
            norbert

            The exception you get is expected behavior of synchronous replication.

            With synchronous replication the caller is notified of all communication errors that occur during replication. This is intended behavior: your calling thread is notified so that it can take whatever action it needs.

            TimeoutExceptions occur as long as the sending member assumes the receiving member is still alive and therefore keeps sending messages to it. If the receiving member does not respond within a given timeout, the communication stack (JGroups) assumes it may still be alive but unresponsive (this state is called 'suspected'). Since in this situation JGroups cannot guarantee that messages will reach all members of the group, it notifies the caller by throwing a TimeoutException (see 'suspected=true' in the exception's message string). If the 'suspected' member does not respond within another timeout period, JGroups decides it has died and removes it from the group. From that point on it no longer tries to send messages to this host, and no more TimeoutExceptions occur.

            If you don't want your calling thread to be notified of such replication-errors, use asynchronous replication instead.
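
A calling thread that wants to react to this specific error first has to recognize it. The following is only a minimal sketch: it assumes the wrapper exception exposes its nested exception through getCause() (the "Caused by:" line in the trace above suggests so), and it matches the class name as a string so that it compiles without JBoss Cache on the classpath. The class ReplicationTimeouts and both method names are hypothetical, not part of any library.

```java
public class ReplicationTimeouts {
    // Class name copied from the stack trace in the original post.
    static final String TIMEOUT_CLASS = "org.jboss.cache.lock.TimeoutException";

    /**
     * Returns true if t, or any exception in its cause chain, has exactly the
     * class name above. Subclasses are not matched (a simplification).
     */
    static boolean causedByReplicationTimeout(Throwable t) {
        int depth = 0;
        // The depth bound guards against a cyclic cause chain.
        while (t != null && depth++ < 32) {
            if (TIMEOUT_CLASS.equals(t.getClass().getName())) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }

    /**
     * Best-effort check for the "suspected=true" marker in any message of the
     * cause chain. This relies on the message format shown in the trace, which
     * is an assumption and may differ between versions.
     */
    static boolean mentionsSuspectedMember(Throwable t) {
        int depth = 0;
        while (t != null && depth++ < 32) {
            String msg = t.getMessage();
            if (msg != null && msg.contains("suspected=true")) {
                return true;
            }
            t = t.getCause();
        }
        return false;
    }
}
```

With JBoss Cache on the classpath, an instanceof check against the real exception class would be the more robust choice than comparing names.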

            • 3. Re: lock.TimeoutException crash the replicate-sync cluster
              tinachen

              Thanks Bela and Norbert,
              I solved the problem in my code by catching the exception and then redoing the operation after another timeout period, to make sure the update is applied in the remaining instances of the cluster.
              Thanks again.
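
The catch-and-retry approach described above could look roughly like the sketch below. It is not the poster's actual code, which was not shown: the class RetryingWriter, the method retry, and the parameters maxAttempts and waitMillis are all hypothetical, and it uses java.util.function.Supplier (Java 8 or later) to stay independent of the cache API.

```java
import java.util.function.Supplier;

public class RetryingWriter {
    /**
     * Runs op, retrying after waitMillis whenever it throws a RuntimeException,
     * up to maxAttempts attempts in total. Rethrows the last failure if every
     * attempt fails or if the waiting thread is interrupted.
     */
    static <T> T retry(Supplier<T> op, int maxAttempts, long waitMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts, give up
                }
                try {
                    // Wait so the group has time to drop the suspected member.
                    Thread.sleep(waitMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    break;
                }
            }
        }
        throw last;
    }
}
```

In the scenario from this thread, the supplier would wrap the cache write seen in the stack trace, and waitMillis would be chosen to be at least as long as the timeout after which a suspected member is removed from the group. Any checked exception declared by the cache call would have to be handled or wrapped inside the supplier. Retrying on every RuntimeException is a simplification; a stricter version would retry only on the replication timeout.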