HAJMS Failover problem running within a Veritas cluster

sheckler Apr 17, 2008 7:59 AM

Hi,
Let me try to describe the architecture at the customer's site and the observed behavior, which the customer has classified as an error.

A JBoss cluster (JBoss version 3.2.8SP1) is running within a Veritas cluster on Solaris.
There are two zones, each containing one application server and one Oracle server. Zone A hosts the HAJMS master and the Oracle server used by HAJMS. Zone B hosts the second JBoss cluster node, which is not the HAJMS master, and the standby database.

Now zone A is killed, and the following happens on the second JBoss cluster node:

- HAJMS Failover of running JMS Clients like MDBs etc. -> ok
- Ping database failure of connection pool -> ok
- a new cluster view is detected with only one member -> ok
- the node does not become HAJMS master -> not ok

After 1 minute the standby DB is available and the connection pool errors stop.

- The node never becomes HAJMS master and has to be restarted, as JMS is no longer available.
This means downtime and manual intervention.

Apparently the JMS master change does not work when the database is unavailable at the same time.

My hope is that some JMS configuration parameter can avoid this situation. Does anyone have an idea?

1. Re: HAJMS Failover problem running within a Veritas cluster

brian.stansberry Apr 22, 2008 12:29 PM (in response to sheckler)

Hmm, perhaps you can come up with some mechanism to emit a JMX notification when the connection pool (i.e. the standby db) is available. Then add a barrier (http://wiki.jboss.org/wiki/BarrierController). Then the JMS services depend on the barrier.

Very convoluted. And when the node becomes master, you'll get noise in the logs complaining about unsatisfied dependencies. But then as the barrier gets deployed, the dependencies will be satisfied and the JMS server will start.

Another possible direction is to use a custom subclass of HASingletonController that overrides startSingleton() and blocks at the beginning waiting for the connection pool to become available before continuing. Again, you need some sort of signal that the pool is available.
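To make the "block until the pool is available" idea a bit more concrete, here is a minimal sketch of just the waiting part. It deliberately does not touch any JBoss classes: PoolAvailabilityGate, Probe and awaitAvailable are made-up names for illustration, and what the probe actually checks (for example, borrowing a connection from the datasource and running a ping query) is an assumption left to the caller. It is not tested against 3.2.8.

```java
// Hypothetical helper, not part of JBoss: polls a caller-supplied probe
// until it reports success or a timeout expires.
final class PoolAvailabilityGate {

    /** Hypothetical callback; an implementation might try to get a pooled connection. */
    interface Probe {
        boolean isAvailable() throws Exception;
    }

    private PoolAvailabilityGate() {
    }

    /**
     * Blocks until the probe succeeds or timeoutMillis has elapsed.
     * Returns true if the probe succeeded, false on timeout or interruption.
     */
    static boolean awaitAvailable(Probe probe, long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (check(probe)) {
                return true;
            }
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(pollMillis, remaining));
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // A probe that throws (e.g. an SQLException while the standby db is
    // still coming up) is treated as "not available yet".
    private static boolean check(Probe probe) {
        try {
            return probe.isAvailable();
        } catch (Exception e) {
            return false;
        }
    }
}
```

In the subclass approach described above, the overridden startSingleton() would presumably call something like awaitAvailable(...) first and only then delegate to the superclass. The timeout matters: without one, a database that never comes back would leave the singleton start blocked forever.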