Advice please: JBoss timings
rockshore Jun 9, 2014 9:10 AM
Hi all, first timer here so please be gentle!
We have two machines each running a JBoss EAP 5.1 application server. The servers are in a cluster.
Machine 1 is started with:
~/jboss-eap-5.1/jboss-as/bin/run.sh -c all -g CDMLive -b 0.0.0.0 -Djboss.messaging.ServerPeerID=1
Machine 2 is started with:
~/jboss-eap-5.1/jboss-as/bin/run.sh -c all -g CDMLive -b 0.0.0.0 -Djboss.messaging.ServerPeerID=2
We deploy a web application to the servers as an exploded WAR in the JBoss server/all/deploy directory on both machines. The web application's jboss-web.xml specifies an HA Singleton deployment as follows:
<depends>jboss.ha:service=HASingletonDeployer,type=Barrier</depends>
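For context, here is a minimal sketch of where that element sits in WEB-INF/jboss-web.xml. It assumes nothing else needs to be configured there; any other elements a real descriptor might carry are left out.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: a real descriptor may contain further elements. -->
<jboss-web>
    <!-- Hold this web application back until the HA singleton barrier
         starts, which happens only on the node currently elected as
         the cluster's singleton master. -->
    <depends>jboss.ha:service=HASingletonDeployer,type=Barrier</depends>
</jboss-web>
```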
It is important that the web application only runs on one node at a time as running two instances of it at the same time causes issues.
Our problem is that both machines are virtual machines, and a backup operation causes them to become heavily loaded, lock up completely, or lose network connectivity for a period of time. We cannot avoid this at the moment. Typically, if the lockup occurs on the first JBoss node, which is running the web application, the second JBoss node notices and writes a message to its log. For example:
2014-04-15 13:22:49,533 WARN [org.jboss.messaging.core.impl.clusterconnection.ClusterConnectionManager] (Thread-46) Connection failure detected. Clean up and retry connection. maxRetry: -1 retryInterval: 5000
2014-04-15 13:22:51,544 ERROR [org.jboss.messaging.core.impl.clusterconnection.ClusterConnectionManager] (Thread-46) Retrying ConnectionInfo org.jboss.messaging.core.impl.clusterconnection.ClusterConnectionManager$ConnectionInfo@481e2e4e failed after maxmum retry: 0
We would therefore like to know which settings to change so that the JBoss servers wait longer before concluding that a node has dropped out of the cluster, and/or are more lenient when communication with the other server fails. This should allow the "lockup" to pass without the other node accidentally starting a second instance of the web application. We accept that this will also increase the time the cluster takes to recognise a genuine fault.
Many thanks in advance for any help.