Sandeep, are you sure you didn't change anything else, like enabling L1? We do have a known issue when L1 is enabled that can cause inconsistencies when a node is added (https://issues.jboss.org/browse/ISPN-1830), but I'm not aware of any problems when nodes leave the cluster.
Can you send more details? A runnable test would be ideal.
Also, your VERIFY_SUSPECT timeout seems awfully high. Are you sure state transfer has even started by the time your test ends when you run with 300000?
Thanks, Dan. I will disable the L1 cache and try again. I am also going to reduce the VERIFY_SUSPECT timeout.
The reason I kept such large timeouts is that in a VMware environment, lower timeout values led to suspect exceptions, which then caused some issues for us. (We have been using distributed mode from version 5.0 onwards and have seen many improvements since the beginning.)
Could you help me with the following questions:
If we bring a node down and make a call to the server as soon as the node goes down, we see that the server waits for the VERIFY_SUSPECT timeout and only then returns the call.
The call is basically a series of gets/puts/removes and DistributedExecutorService submitEverywhere() calls.
I have seen that in some cases any of these calls can fail. What is the right way to handle this? Should we retry the put/get/remove?
What is the correct behavior that you expect: should the call return without issues, or should we send an exception to the user asking them to retry later, once all bookkeeping activities are done?
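
To make the retry option concrete, here is a minimal sketch of what a client-side retry wrapper could look like. It is plain Java and is not part of the Infinispan API: the class name RetryingCall, the method withRetries, and the attempt and delay values are all hypothetical, and a ConcurrentHashMap stands in for the cache. Retrying like this is only safe when the wrapped operation can be repeated without side effects, which needs checking for puts and removes in particular.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not an Infinispan class: retries a call a bounded
// number of times, doubling the wait between attempts.
public final class RetryingCall {
    private RetryingCall() {}

    public static <T> T withRetries(Callable<T> operation, int maxAttempts,
                                    long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                Thread.sleep(delay); // back off before the next attempt
                delay *= 2;
            }
        }
        // All attempts failed: surface the last failure to the caller,
        // who can then tell the user to retry later.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the cache; a real cache also implements Map-style get/put.
        Map<String, String> cache = new ConcurrentHashMap<>();
        withRetries(() -> cache.put("k", "v1"), 3, 100L);
        System.out.println(withRetries(() -> cache.get("k"), 3, 100L));
    }
}
```

With a wrapper like this, the caller sees either a normal result or, after the last attempt, the original exception, so both behaviors asked about above remain possible depending on how many attempts are allowed.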