2 Replies Latest reply on Apr 26, 2012 2:00 AM by sandkuma

Failover testing on 5.1.4.cr1

sandkuma Apr 23, 2012 1:37 AM

Hi,

I was testing failover on a 3-node setup with numOwners 2.

I bring one node down and then call DefaultExecutorService submitEverywhere; the call also includes quite a few gets and puts. With 5.1.2.final, the behavior I found was that while the cluster views were being established, the call would hit an exception on a put or submitEverywhere.

In 5.1.4.cr1 I do not get an exception when running the same scenario, but I get inconsistent data when the call returns, and this continues on retry too.

Is this something that you all have also faced, and will it be fixed in a future iteration?

<VERIFY_SUSPECT timeout="100000"/>

I do not see the data inconsistency when I use the following configuration:

<VERIFY_SUSPECT timeout="300000"/>

regards

Sandeep

1. Re: Failover testing on 5.1.4.cr1

dan.berindei Apr 25, 2012 7:39 AM (in response to sandkuma)

Sandeep, are you sure you didn't change anything else, like enabling L1? We do have a known issue when L1 is enabled that can cause inconsistencies when a node is added (https://issues.jboss.org/browse/ISPN-1830), but I'm not aware of any problems when nodes leave the cluster.

Can you send more details? A runnable test would be ideal.

Also, your VERIFY_SUSPECT timeout seems awfully high. Are you sure state transfer has even started by the time your test ends when you run with 300000?
2. Re: Failover testing on 5.1.4.cr1

sandkuma Apr 26, 2012 2:00 AM (in response to dan.berindei)

Thanks Dan, I will disable the L1 cache and try again. I am also going to reduce the VERIFY_SUSPECT timeout.
The reason I kept such huge timeouts is that in a VMware environment I found that lower timeout values led to suspect exceptions, and then we had some issues. (We have been using distributed mode from version 5.0 onwards and have seen many improvements since the beginning.)
Could you help me with the following doubts:
If we bring a node down and call the server as soon as the node goes down, we see that the server waits for the VERIFY_SUSPECT timeout and then returns the call.
The call is basically a series of gets/puts/removes and DistributedExecutorService submitEverywhere() calls.
I have seen that in some cases any of these calls can fail. What is the right way to handle this? Do we retry the put/get/remove?
What is the expected behavior: should the call return without issues, or should we send an exception to the user asking them to retry later, once all bookkeeping activities are done?
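
For illustration, the "retry on put/get/remove" option asked about above could look roughly like the sketch below. It is a generic Java helper, not an Infinispan API and not a recommendation from the Infinispan team: `RetryingCacheCall`, `withRetry`, `maxAttempts` and `pauseMillis` are hypothetical names, and catching every `Exception` is an assumption made for brevity.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retries an operation a bounded number of times.
final class RetryingCacheCall {

    private RetryingCacheCall() {}

    // Runs op; on failure waits pauseMillis and tries again, up to maxAttempts in total.
    // If every attempt fails, the exception from the last attempt is rethrown.
    static <T> T withRetry(Callable<T> op, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (InterruptedException e) {
                // Do not retry an interrupted call; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                // Assumption: any other failure is treated as transient.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }
}
```

A call site might then read `RetryingCacheCall.withRetry(() -> cache.get(key), 3, 500L)`, where `cache` and `key` stand in for the application's own objects. Blind retries are only safe for idempotent operations: a retried put or remove, or a resubmitted submitEverywhere task, may already have taken effect on some nodes before the failure was reported.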