3 Replies Latest reply on Nov 15, 2004 9:41 AM by tinachen

lock.TimeoutException crash the replicate-sync cluster

tinachen Nov 12, 2004 12:41 PM

Hi, All:
I hit an exception that causes two cache instances to crash in a replicate-sync cluster.
These are the test steps:
1. Start two instances in one cluster configured in replicate-sync mode.
2. Start loading data in instance_1.
3. While instance_1 is loading data, kill instance_2.
4. instance_1 crashes with the following exception:

org.jboss.util.NestedRuntimeException: rsp=sender=WHOUATL2XBXL51:1422, retval=null, received=false, suspected=true; - nested throwable: (org.jboss.cache.lock.TimeoutException: rsp=sender=WHOUATL2XBXL51:1422, retval=null, received=false, suspected=true)
at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:3184)
at org.jboss.cache.TreeCache.put(TreeCache.java:1741)
at org.jboss.cache.aop.TreeCacheAop._putObject(TreeCacheAop.java:286)
at org.jboss.cache.aop.TreeCacheAop.putObject(TreeCacheAop.java:132)
at com.jpmorgan.ccs.impl.test.RepTest_1._add(RepTest_1.java:291)
at com.jpmorgan.ccs.impl.test.RepTest_1.loadData(RepTest_1.java:128)
at com.jpmorgan.ccs.impl.test.RepTest_1.main(RepTest_1.java:224)
Caused by: org.jboss.cache.lock.TimeoutException: rsp=sender=WHOUATL2XBXL51:1422, retval=null, received=false, suspected=true
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:2145)
at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:2167)
at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:89)
at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:39)
at org.jboss.cache.interceptors.TransactionInterceptor.invoke(TransactionInterceptor.java:53)
at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:3181)
... 6 more

It looks like instance_1 failed to get the lock from instance_2 when it updated the cache. But I have already commented out the "SyncReplTimeout" and "LockAcquisitionTimeout" attributes in the cache config file.
Is there any way to keep instance_1 from crashing when instance_2 is terminated?

Thank you very much
Tina

1. Re: lock.TimeoutException crash the replicate-sync cluster

belaban Nov 15, 2004 3:07 AM (in response to tinachen)

What do you mean by crash? Termination of the VM? Because what you describe is a regular scenario, in which the first box waits until (a) it gets a response from the second box or (b) the second box is suspected.

Bela
2. Re: lock.TimeoutException crash the replicate-sync cluster

norbert Nov 15, 2004 4:15 AM (in response to tinachen)

The exception you get is expected behavior of synchronous replication.

With synchronous replication the caller is notified of all communication errors that occur during replication. This is intended behavior: your calling thread is notified so that it can take whatever action it needs.

TimeoutExceptions occur as long as the sending member assumes the receiving member is still alive and so keeps sending messages to it. If the receiving member does not respond within a given timeout, the communication stack (JGroups) assumes it may still be alive but unresponsive (this state is called 'suspected'). Since JGroups cannot guarantee in this situation that messages will reach all members of the group, it notifies the caller by throwing a TimeoutException (see 'suspected=true' in the exception's message string). If the 'suspected' member does not respond within another timeout period, JGroups decides it has died and removes it from the group. From that point on it no longer tries to send messages to this host, and no more TimeoutExceptions occur.

If you don't want your calling thread to be notified of such replication-errors, use asynchronous replication instead.
3. Re: lock.TimeoutException crash the replicate-sync cluster

tinachen Nov 15, 2004 9:41 AM (in response to tinachen)

Thanks, Bela and Norbert:
The problem in my code was solved by catching the exception and then redoing the operation after another timeout period, to make sure the update is applied on the remaining instances in the cluster.
Thanks again.
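
For anyone who finds this thread later, the catch-and-retry approach could look roughly like the sketch below. It is not my actual test code. To keep it self-contained, ReplicationTimeoutException is a stand-in for org.jboss.cache.lock.TimeoutException, and the names withRetry, maxAttempts and pauseMillis are made up for illustration. It walks the cause chain because, as the stack trace above shows, the timeout arrives wrapped in a NestedRuntimeException. It also uses newer Java syntax (generics, lambdas) than was available when this was posted.

```java
import java.util.concurrent.Callable;

public class ReplicationRetry {

    /** Stand-in for org.jboss.cache.lock.TimeoutException (assumption, so the sketch compiles alone). */
    public static class ReplicationTimeoutException extends RuntimeException {
        public ReplicationTimeoutException(String message) {
            super(message);
        }
    }

    /** Returns true if t, or any exception in its cause chain, is the replication timeout. */
    static boolean isReplicationTimeout(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof ReplicationTimeoutException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Runs the operation, retrying only when it fails with a replication timeout.
     * Sleeps pauseMillis between attempts so the dead member has time to be
     * removed from the group; rethrows the last timeout once maxAttempts is used up.
     */
    public static <T> T withRetry(Callable<T> operation, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (!isReplicationTimeout(e)) {
                    throw e; // unrelated failure: do not retry
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated put: the first two calls fail the way the trace above does (wrapped timeout).
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("wrapped",
                        new ReplicationTimeoutException("suspected=true"));
            }
            return "stored";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In real code the lambda body would be the cache call (putObject in my test), and the pause would be chosen to be at least as long as the time the group needs to drop the suspected member. Retrying is only safe here because repeating the same put gives the same result.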
