JBPAPP-863 -- FC blocks during slow failure detection
brian.stansberry Jun 19, 2008 3:17 PM
Discussion of http://jira.jboss.com/jira/browse/JBPAPP-863 .
Removing FC is not a general purpose option as it is highly likely that will lead to OOM problems. Removing it in a 2 node TCP-based cluster like this test uses is OK, as FC serves no purpose in such a setup. See "Why is FC needed on top of TCP?" section at http://wiki.jboss.org/wiki/JGroupsFC . But I don't think the purpose of these tests is to test a specialized config, so don't think we should do it.
To an extent this result is showing a configuration tradeoff in the config of the FD protocol. The default config we ship with has this:
<FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
The total time it takes to detect a failure and remove a dead member is (FD.timeout * FD.max_tries) + VERIFY_SUSPECT.timeout = (10000 * 5) + 1500 = 51500 ms. That is, unless FD_SOCK detects the failure almost immediately, which it doesn't in this pull-the-plug scenario.
51.5 secs is a long time. Pretty much your "cca 1 minute after other node unplugging." During that time, FC will block waiting for credits from the unreachable node (once it runs out of credits). Once the unreachable peer is removed from the group, FC will unblock.
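To illustrate why senders block, here is a minimal sketch of credit-based flow control. The CreditTracker class and its methods are invented for illustration; this is not JGroups' actual FC implementation, just the idea: each peer grants the sender a credit balance, each send decrements the balances, and the sender blocks once any peer's balance is exhausted, until credits are replenished or the peer is removed from the view.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of FC-style credit bookkeeping (not JGroups code).
public class CreditTracker {
    private final Map<String, Long> credits = new HashMap<>();

    public CreditTracker(long initialCredits, String... peers) {
        for (String p : peers)
            credits.put(p, initialCredits);
    }

    // Block until every current peer has enough credits, then decrement.
    // In the pull-the-plug scenario, the unreachable peer never replenishes,
    // so senders sit here until the view change removes it from the map.
    public synchronized void sendMessage(long size) throws InterruptedException {
        while (credits.values().stream().anyMatch(c -> c < size))
            wait();  // woken by replenish() or peerRemoved()
        credits.replaceAll((peer, c) -> c - size);
    }

    // A peer returned credits (normally carried on a replenishment message).
    public synchronized void replenish(String peer, long amount) {
        credits.merge(peer, amount, Long::sum);
        notifyAll();
    }

    // View change: a dead member was removed; stop waiting on its credits.
    public synchronized void peerRemoved(String peer) {
        credits.remove(peer);
        notifyAll();
    }
}
```

The key point is the last method: nothing unblocks the sender except a credit replenishment or a view change, which is why the blockage lasts exactly as long as failure detection takes.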
We have a high timeout to prevent false suspicions when nodes don't reply to heartbeats due to temporary CPU spikes, long GC pauses, etc. Perhaps 51500 ms is excessive; let's discuss this with the group on Monday's clustering conf call. Users can always reduce this config to some lower value > X if they feel their extended CPU spikes or GC pauses will not exceed X. I wouldn't recommend less than 30 seconds or so; it used to be 15, and we constantly heard about false suspicion issues.
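For example, a variant that stays just above that ~30-second floor (values are illustrative, not a recommendation for any particular deployment) would be:

```xml
<!-- (6000 * 5) + 1500 = 31500 ms, i.e. ~31.5 s to detect and remove a dead member -->
<FD timeout="6000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
```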
That's the core issue. Now a side issue:
The fact that one thread is trying to lock the session while another is trying to replicate it proves that there are multiple requests for the session. Those requests stack up and put a very high replication load on the system. The question is why multiple requests are occurring; I want to understand this. Some possibilities:
1) Test servlet is flushing the response (i.e. not waiting for JBossWeb to do it after the replication), so the test client reads the response, considers the request complete and makes another request. Dominik, is this possible?
2) Test client waits X seconds for a response; if it doesn't get one, it retries the request. This is analogous to a user getting impatient and hitting the refresh button. If this is what is happening, we need to know what X is and decide how reasonable it is; perhaps make it a random value inside a range (since users don't all hit refresh after exactly 10 secs).
3) mod_jk is itself retrying the request, either on the same node or as an attempted failover (it looks like the same node). Do you have worker.xxx.reply_timeout configured in workers.properties?
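For reference, reply_timeout is set per worker in workers.properties. A hypothetical fragment (worker name, host, and values invented for illustration):

```
worker.node1.type=ajp13
worker.node1.host=node1.example.com
worker.node1.port=8009
# If this is set below the ~51.5 s failure-detection window, mod_jk gives
# up on the blocked request after 20 s, which could produce the retried /
# duplicate requests discussed above.
worker.node1.reply_timeout=20000
```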