I have a 4-node JBoss (3.2.6) cluster. We finally got failovers to work correctly (when JBoss on the master node stops) after following the suggestions in
We found that failovers also occur on the cluster when JBoss pauses for a long period on the master node (Node 1). When this happens, Node 1 is SUSPECTED and another node (Node 2) takes over as the master. After a minute or so, Node 1 stops being the master. The approximate timeline is:
Node 1 Paused: 00:00
Node 1 Unpaused: 00:18
Node 2 becomes the master: 00:30
Node 1 quits being the master: 01:30
After this, JMS is no longer functional.
Our assumption is that when Node 2 became the master, the MDBs moved to Node 2 and connected to JMS, which was still on Node 1. JMS then also moved to Node 2 while the MDBs were still trying to connect to JMS on Node 1, which caused messaging to stop working.
We are also not sure why JBoss paused for such a long time. We think GC might be the culprit: there was a lot of activity on the system when all this happened, and we are using the default GC, so a collection might have paused the application for 18 seconds while it was running. We are considering switching to "UseConcMarkSweepGC", which is supposed to reduce the time the application is paused.
So I have two questions:
1) Is there a way to make failovers happen smoothly even if JBoss pauses for some time?
2) Are we right in assuming that GC might have paused the application for that long? If so, will using a different GC help?
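
For question 2, one way we could check the pause theory ourselves is a small watchdog thread inside the JVM. The following is only a rough sketch, not something we have running: the class name PauseDetector, its methods, and the 100 ms / 1000 ms values are all made up for illustration. It sleeps for a short interval and reports whenever the sleep takes much longer than requested, which would suggest the whole JVM was stalled (by GC or anything else). It is written for a recent Java version; on the 1.4-era JVM used with JBoss 3.2.6 it would need System.currentTimeMillis() and raw collections instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of JBoss or the JDK.
final class PauseDetector {

    // True if a sleep of intervalMs overshot by more than thresholdMs.
    static boolean isPause(long elapsedMs, long intervalMs, long thresholdMs) {
        return (elapsedMs - intervalMs) > thresholdMs;
    }

    // Sleeps `iterations` times and returns the overshoot (in ms) of every
    // sleep that exceeded the threshold. Stops early if interrupted.
    static List<Long> detectPauses(long intervalMs, long thresholdMs, int iterations) {
        List<Long> pauses = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                break;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
            if (isPause(elapsedMs, intervalMs, thresholdMs)) {
                pauses.add(elapsedMs - intervalMs);
            }
        }
        return pauses;
    }

    public static void main(String[] args) {
        // Assumed values: check every 100 ms, report stalls longer than 1 s.
        List<Long> pauses = detectPauses(100L, 1000L, 50);
        for (long overshoot : pauses) {
            System.out.println("JVM appears to have stalled for about " + overshoot + " ms");
        }
    }
}
```

If something like this logged a stall of roughly 18 seconds at the time Node 1 was SUSPECTED, that would support the pause theory, although it would not by itself prove GC was the cause. Turning on GC logging with -verbose:gc should give a direct record to compare against.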