7 Replies Latest reply on Sep 30, 2008 9:16 AM by pkaisernow

Failover on Shutdown

msaheb Jun 11, 2007 5:33 AM

The clustering mechanism in JBoss Messaging handles failover by allowing a switch to another JMS server in the cluster only if the initial JMS server on which the connection was created crashes. However, if the initial JMS server is stopped with a shutdown, failover to a second server does not occur. Consequently, any client holding a connection to the stopped server will not see its connection move to another server in the same JMS cluster.
I suggest treating a shutdown, and not only a crash, as a case that triggers failover.

1. Re: Failover on Shutdown

timfox Jun 11, 2007 12:37 PM (in response to msaheb)

The main problem with the suggestion is the following:

In JBM each node maintains its own set of "partial queues". So if you have a distributed queue A, for example, deployed across the cluster, then A is actually made up of a set of n queues (n = number of nodes in cluster), each node maintaining its own queue.

So, in normal cluster operation, each partial queue will have its own set of messages in it.

On failover, another node in the cluster takes over the partial queues of the failed node and merges them into its own set of queues. We need to do this so the client can transparently carry on its session on the new node, e.g. it needs to know about messages it has consumed but not acked so they can subsequently be acked on the new node.

Shutting down a node in the cluster is quite different from failing over. With a shutdown, it is likely the sys admin is just taking the node down temporarily and will restart it again shortly. In this case we *do not* want another node to take over and merge the queues of the node that is being taken down.

Imagine the pathological case where the sysadmin takes every node in the cluster down one by one. By the time the sysadmin takes down the last node in the cluster, that node will have merged in *all* the queues from the other nodes! Not only will this be slow, but it means that when the cluster is brought back up, only one of the nodes (the last one to be taken down) will have all the messages.

This is obviously not something we want.
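
To make the pathological case concrete, here is a toy simulation. It is only a sketch: the class and method names are made up for illustration, the "failover node" is simply assumed to be the first node still running, and messages are reduced to counts. It is not how JBM chooses a failover node internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model: every node starts with its own partial queue of the same size. */
class RollingShutdownDemo {

    /**
     * Shuts nodes down in order 0..nodes-1, merging each departing node's
     * partial queue into a remaining node (as a failover would).
     * Returns how many messages each node owns when it goes down,
     * i.e. what it would come back with on restart.
     */
    static Map<Integer, Integer> shutDownOneByOne(int nodes, int messagesPerNode) {
        Map<Integer, Integer> live = new LinkedHashMap<>();
        for (int n = 0; n < nodes; n++) {
            live.put(n, messagesPerNode);
        }
        Map<Integer, Integer> ownedAtRestart = new LinkedHashMap<>();
        for (int n = 0; n < nodes; n++) {
            int held = live.remove(n);
            if (live.isEmpty()) {
                // Last node standing: nobody left to merge into.
                ownedAtRestart.put(n, held);
            } else {
                // Assumption: the first remaining node acts as failover node.
                int target = live.keySet().iterator().next();
                live.merge(target, held, Integer::sum);
                ownedAtRestart.put(n, 0);
            }
        }
        return ownedAtRestart;
    }

    public static void main(String[] args) {
        // Three nodes, ten messages each.
        System.out.println(shutDownOneByOne(3, 10)); // prints {0=0, 1=0, 2=30}
    }
}
```

With three nodes the last one ends up owning all 30 messages and the other two restart empty, which is the imbalance described above.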

Another question we should ask is "is it possible to trigger the client side failover without triggering the server side failover?" - i.e. could we fail over the clients on shutdown without failing over the server side (i.e. merging queues)?

The answer to this is also no, because we would then lose transparent failover: a client that was failed over with unacked messages in its session would be unable to ack them on the failover node, since the queues hadn't been merged.

So in summary, it seems pretty clear to me that forcing failover on shutdown is not going to work with transparent failover.

However, one thing we could maybe do is introduce a "JBoss MQ style mode", where we disable automatic failover on the server side - so if a server node fails then its queues are *not* merged into another node's queues.

Client side automatic failover can already be disabled by setting "supportsFailover" to false on the connection factory.

By doing so, a client application could just catch the exception delivered to the connection's ExceptionListener, look up the connection factory again using HAJNDI, and recreate the connection etc., a la JBoss MQ. This would work, but it means none of the automatic failover functionality would be used, which seems a shame to me.
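
The manual reconnect pattern could look roughly like the sketch below. To keep it self-contained it does not use the JMS API: `ClientConnection` and `ReconnectingClient` are hypothetical stand-ins. In a real client the failure hook would be `javax.jms.ExceptionListener.onException`, and the supplier would wrap the HAJNDI lookup of the connection factory plus `createConnection()`.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a JMS connection; not the javax.jms type. */
interface ClientConnection {
    void setFailureListener(Runnable onFailure);
}

/** Rebuilds the connection from scratch whenever the current one reports a failure. */
class ReconnectingClient {
    // Stand-in for "look up the factory via HAJNDI and create a connection".
    private final Supplier<ClientConnection> lookupAndConnect;
    private final int maxAttempts;
    private ClientConnection current;
    private int connectCount = 0;

    ReconnectingClient(Supplier<ClientConnection> lookupAndConnect, int maxAttempts) {
        this.lookupAndConnect = lookupAndConnect;
        this.maxAttempts = maxAttempts;
    }

    /** Tries up to maxAttempts times; a real client would pause between attempts. */
    synchronized void connect() {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                ClientConnection c = lookupAndConnect.get();
                // On failure, start over with a fresh lookup.
                c.setFailureListener(this::connect);
                current = c;
                connectCount++;
                return;
            } catch (RuntimeException e) {
                last = e;
            }
        }
        throw new IllegalStateException(
                "no node reachable after " + maxAttempts + " attempts", last);
    }

    synchronized int connectCount() {
        return connectCount;
    }
}
```

Sessions, consumers and producers would also have to be recreated after each reconnect, and any unacked messages from the old session are not carried over, which is exactly the part that transparent failover would otherwise handle.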
2. Re: Failover on Shutdown

clebert.suconic Jun 26, 2007 6:44 PM (in response to msaheb)

Can't we create a shutdown method where the sysadmin would be able to give an option for the shutdown type he wants to perform?

Say, if the sysadmin decides to take the system down when the load is low, he could choose to merge the queues because he plans to keep the node out for a long period.

The sysadmin already has that option anyway... he would just have to kill the server.. this would be just a cleaner way to do it.

I thought about these options on Shutdown (consider these options as a brainstorm for now):

I - Keep Trying:
(Tell clients to keep trying to reconnect until the server is back; the user could choose this for quick restarts.) We would need to send a message to every active connection before doing this, closing the valve on clients, and they would keep trying to connect until the server is back.

II - Shutdown as failover
(Merge queues as it would happen on a regular crash)

III - Just Shutdown
(And throw exceptions to clients.. so client would have to catch exception and retry "a la MQ")
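
As a rough sketch of how the three options differ, they can be written down as an enum. All names here are hypothetical and only illustrate the brainstorm; nothing like this exists in JBoss Messaging as far as this thread shows.

```java
/** What a connected client experiences when its node is shut down. */
enum ClientAction { RETRY_SAME_NODE, FAIL_OVER, RECEIVE_EXCEPTION }

/** Hypothetical shutdown modes matching options I to III above. */
enum ShutdownMode {
    // I: no merge; clients wait and reconnect to the same node once it is back.
    KEEP_TRYING(false, ClientAction.RETRY_SAME_NODE),
    // II: queues are merged into another node, as on a crash.
    SHUTDOWN_AS_FAILOVER(true, ClientAction.FAIL_OVER),
    // III: no merge; clients get an exception and must reconnect themselves.
    JUST_SHUTDOWN(false, ClientAction.RECEIVE_EXCEPTION);

    final boolean mergeQueues;
    final ClientAction clientAction;

    ShutdownMode(boolean mergeQueues, ClientAction clientAction) {
        this.mergeQueues = mergeQueues;
        this.clientAction = clientAction;
    }

    String describe() {
        return name() + ": mergeQueues=" + mergeQueues + ", client=" + clientAction;
    }
}
```

Only option II touches the partial queues on the server side; the other two differ purely in what the client is told to do.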
3. Re: Failover on Shutdown

timfox Jun 26, 2007 7:26 PM (in response to msaheb)

For now we just need to add the flag as specified in the JIRA task.

If true, then it shuts down without failing over.
4. Re: Failover on Shutdown

jay.howell Sep 12, 2008 10:35 AM (in response to msaheb)

Here's the JIRA task that actually implements the feature request.
https://jira.jboss.org/jira/browse/JBMESSAGING-1230

Note that this is only out for 1.4.0 sp3 cp02 and 1.4.1.

Jay:)
5. Re: Failover on Shutdown

pkaisernow Sep 26, 2008 8:10 AM (in response to msaheb)

I'm looking for this feature in the 1.4.0 sp3 package I downloaded. I don't see any reference to enableFailoverOnShutdown in the documentation or in the JMX console. Did it actually make the 12/13/07 release?
6. Re: Failover on Shutdown

timfox Sep 26, 2008 8:27 AM (in response to msaheb)

As Jay said:

"Here's the JIRA task that actually implements the feature request.
https://jira.jboss.org/jira/browse/JBMESSAGING-1230

Note that this is only out for 1.4.0 sp3 cp02 and 1.4.1."
7. Re: Failover on Shutdown

pkaisernow Sep 30, 2008 9:16 AM (in response to msaheb)

Sorry for being redundant.

I downloaded the CP03 source and have a question about the failoverOnNodeLeave attribute. The JMX interface denies changes to the value if the PostOffice service is started. I think it would be helpful to be able to change this attribute after the service is started.

When I looked through the PostOffice service implementation, I noticed the attribute is written in the constructor and read only once in one method, nodesLeft(). Other than as a basic service-level policy decision, I don't understand why this particular attribute cannot be changed at run-time. Is there some other behavior I'm not seeing?

Thanks,
Paul
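
To illustrate the question, here is a toy model of the two behaviours. It is not the JBoss Messaging PostOffice source: `PostOfficeSketch`, the `allowRuntimeChange` switch and the exception message are all assumptions made for the example. Only the attribute name failoverOnNodeLeave and the method name nodesLeft() come from the description above.

```java
/** Toy model of a service attribute; not the actual PostOffice implementation. */
class PostOfficeSketch {
    // volatile: a management thread may write while a cluster thread reads.
    private volatile boolean failoverOnNodeLeave;
    private volatile boolean started;
    // Hypothetical switch between the reported behaviour and the requested one.
    private final boolean allowRuntimeChange;

    PostOfficeSketch(boolean failoverOnNodeLeave, boolean allowRuntimeChange) {
        this.failoverOnNodeLeave = failoverOnNodeLeave;
        this.allowRuntimeChange = allowRuntimeChange;
    }

    void start() {
        started = true;
    }

    /** With allowRuntimeChange == false this rejects changes once started. */
    void setFailoverOnNodeLeave(boolean value) {
        if (started && !allowRuntimeChange) {
            throw new IllegalStateException("Cannot set attribute when service is started");
        }
        failoverOnNodeLeave = value;
    }

    /** Returns true if the queues of cleanly departed nodes should be merged. */
    boolean nodesLeft() {
        // Single read per event, so each event sees one consistent value.
        return failoverOnNodeLeave;
    }
}
```

In this simplified model nothing else depends on the value, so relaxing the guard is safe; whether the same holds for the real service is the open question.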
