5 Replies Latest reply on Nov 11, 2011 4:25 AM by rhusar

Cluster 1.1 - Failover and Timeout

lukasbradley Nov 9, 2011 9:12 PM

I'm a little confused about the failover from Apache to the JBoss nodes. Is it the cping and cpong values that need to be set? We have tried setting these lower, but that doesn't seem to get us what we want.

Failover from one node to another seems much slower than anticipated. Is it possible to get this into the one-second or sub-second range?

Thanks for any and all help.

1. Re: Cluster 1.1 - Failover and Timeout

rhusar Nov 10, 2011 3:42 AM (in response to lukasbradley)

Hi Lukas,

The defaults are set to fit most people's use cases; you shouldn't have to change any values.
What are you specifically trying to achieve?
What kind of failure are you simulating and anticipating?
And what do you mean by "slower than anticipated"?

Rado
2. Re: Cluster 1.1 - Failover and Timeout

lukasbradley Nov 10, 2011 10:33 AM (in response to rhusar)

Rado, thank you very much for taking the time.

We have two Apache 2.2 web servers running mod_cluster. They connect to two JBoss 5.1 nodes to distribute traffic. The requirement is that failover be "seamless" to the user when a node is taken offline or fails catastrophically.

Our goal is to have the user logged into one node, clicking around the site, with as close to a seamless experience as possible on failure. To test this, we take down the JBoss node the user is on, and then have the user click another link to perform an action. The expected behavior is that they are routed to the JBoss node that is still running within a second of the first node going offline.

This is not the case. We see repeated *attempts* to send the user to the node that was taken offline for the next 10-20 seconds.

I've used haproxy (a software proxy server and load balancer) as well as mod_proxy and mod_jk, and configured this behavior with them. I feel it should be possible to have sub-second failover between these nodes, but none of the configuration changes we have tried seem to work.

Again, thank you for your insights.
3. Re: Cluster 1.1 - Failover and Timeout

rhusar Nov 10, 2011 1:57 PM (in response to lukasbradley)

Hi Lukas,

> We have two Apache 2.2 web servers running mod_cluster. They connect to two JBoss 5.1 nodes to distribute traffic. The requirement is that failover be "seamless" to the user when a node is taken offline or fails catastrophically.

What you are describing is a standard use case for clustering. At the same time, to gain the ability to fail over you need to sacrifice a little performance.

First, you should make sure that you are able to fail over at all, whether you use mod_jk or just test failover manually. Sub-second may be unrealistic if you fail the server at the exact moment of a request. Sessions that hit that failure point exactly might have to wait a little longer, because the load balancer needs to become aware of the failure. The timeouts exist so that a short network outage does not prematurely remove the server from the cluster and trigger the failover logic (moving sessions around the cluster).

Does that make sense so far?

> node that was taken offline for the next 10-20 seconds.

So this sounds like you want a smaller/stricter timeout.

PS: in the event of shutdown/restart you should use "draining" in mod_cluster.

Rado
4. Re: Cluster 1.1 - Failover and Timeout

lukasbradley Nov 10, 2011 2:49 PM (in response to rhusar)

> First, you should make sure that you are able to fail over at all, whether you use mod_jk or just test failover manually. Sub-second may be unrealistic if you fail the server at the exact moment of a request. Sessions that hit that failure point exactly might have to wait a little longer, because the load balancer needs to become aware of the failure. The timeouts exist so that a short network outage does not prematurely remove the server from the cluster and trigger the failover logic (moving sessions around the cluster).

> Does that make sense so far?

Definitely, and we have confirmed this. We are failing over to the node that is active, but the time it takes is not satisfactory.

> node that was taken offline for the next 10-20 seconds.
> So this sounds like you want a smaller/stricter timeout.

> PS: in the event of shutdown/restart you should use "draining" in mod_cluster.

That is exactly what we want, but which mod_cluster timeout setting will reduce this? ping/pong?

Again, thank you for your help.
5. Re: Cluster 1.1 - Failover and Timeout

rhusar Nov 11, 2011 4:25 AM (in response to lukasbradley)

One more thing, how do you test the failover exactly? Tell me the commands.
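
If it helps to put a number on the gap, something like the rough Python sketch below could be run against the Apache front end while the JBoss node is taken down. It is only an illustration, not a mod_cluster tool: the URL in the usage comment is hypothetical, and the 0.1 s polling interval and 1 s per-request timeout are arbitrary choices that also limit how precisely the gap can be measured.

```python
import http.client
import time
import urllib.error
import urllib.request


def probe(url, timeout=1.0):
    """Return True if the URL answers without a 5xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True  # urlopen raises HTTPError for 4xx/5xx responses
    except urllib.error.HTTPError as exc:
        # A 503 from the balancer counts as a failed probe; 4xx does not.
        return exc.code < 500
    except (OSError, http.client.HTTPException):
        # Connection refused/reset, DNS errors, socket timeouts, bad responses.
        return False


def longest_outage(samples):
    """Longest span from a first failed probe to the next successful one.

    `samples` is a list of (timestamp_seconds, ok) pairs in time order.
    An outage still in progress at the end of the list is not counted.
    """
    longest = 0.0
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            longest = max(longest, ts - start)
            start = None
    return longest


def watch(url, duration=60.0, interval=0.1):
    """Poll `url` for `duration` seconds and return the (timestamp, ok) samples."""
    samples = []
    end = time.monotonic() + duration
    while time.monotonic() < end:
        samples.append((time.monotonic(), probe(url)))
        time.sleep(interval)
    return samples


# Usage (hypothetical URL): start this, then kill the JBoss node by hand.
#   samples = watch("http://localhost/myapp/index.jsp", duration=60.0)
#   print("longest outage: %.2f s" % longest_outage(samples))
```

Note that this probe does not carry a session cookie, so it measures when the balancer stops returning errors in general, not the sticky-session case of one logged-in user; for that, the same loop would need to reuse the session cookie.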

You might want to look at the docs at http://docs.jboss.org/mod_cluster/1.1.0/html_single/, in particular these two settings:

Reducing ping: time (in seconds) in which to wait for a pong answer to a ping.

Reducing nodeTimeout: timeout (in seconds) for proxy connections to a node. That is the time mod_cluster will wait for the back-end response before returning an error. It corresponds to timeout in the worker mod_proxy documentation. A value of -1 indicates no timeout. Note that mod_cluster always uses a cping/cpong before forwarding a request, and the connectiontimeout value used by mod_cluster is the ping value.
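
To make the cping/cpong idea a bit more concrete, here is a minimal Python sketch of a ping check with a timeout. It assumes the node is reached over AJP and uses the AJP13 CPing/CPong packet bytes as I understand them from the protocol description; it is not mod_cluster's own code, and the function names and the `ping_timeout` parameter are made up for illustration. The point is only that a smaller ping value bounds how long the check waits before the node is treated as unresponsive.

```python
import socket
import time

# AJP13 packets (assumed from the protocol description):
# server -> container: magic 0x1234, length 1, type 10 (CPing)
# container -> server: magic "AB",   length 1, type 9  (CPong)
CPING = bytes([0x12, 0x34, 0x00, 0x01, 0x0A])
CPONG = bytes([0x41, 0x42, 0x00, 0x01, 0x09])


def cping(sock, ping_timeout):
    """Send a CPing on a connected socket.

    Return True only if a complete CPong arrives within `ping_timeout`
    seconds in total; any error or timeout counts as "no pong".
    """
    deadline = time.monotonic() + ping_timeout
    try:
        sock.settimeout(ping_timeout)
        sock.sendall(CPING)
        reply = b""
        while len(reply) < len(CPONG):
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            sock.settimeout(remaining)
            chunk = sock.recv(len(CPONG) - len(reply))
            if not chunk:
                return False  # peer closed the connection
            reply += chunk
        return reply == CPONG
    except OSError:
        return False


def node_alive(host, port, ping_timeout):
    """Connect to an AJP port and run one CPing/CPong exchange."""
    try:
        with socket.create_connection((host, port), timeout=ping_timeout) as sock:
            return cping(sock, ping_timeout)
    except OSError:
        return False
```

For example, `node_alive("10.0.0.5", 8009, 1.0)` (hypothetical address, default AJP port) would give up after about a second for the connect and another second for the pong.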