1 2 Previous Next 25 Replies Latest reply on Jun 28, 2011 4:02 AM by ataylor

Improvements to HA

timfox Jun 1, 2010 7:49 AM

We have a few outstanding JIRAs for some improvements to our HA implementation, also some of these issues have been discussed on the forums several times and seem to be a common subject.

I've spent the last few days working out a plan for tackling the remaining HA tasks, so we can really round-off our implementation well. Hopefully these proposals will satisfy most users and give us a first class HA implementation that really works well and ticks the boxes for even the most demanding uses cases.

I'd like to start work implementing these changes ASAP, so hopefully get them in 2.2, once we agree on them.

*** Please give your comments! Feedback is essential. ***

The changes mainly revolve around:

1) After failover has occurred from live to backup, we need the ability to resync a new backup with a live node.

2) Split brain. Currently we have no "split brain" protection on the server. I.e. in the case of a network partition, two servers can both think they are live.

3) Currently failover is triggered from the client side. This can result in erroneous activation of backup servers in the case of temporary network outages. We should instead detect failover on the server side. (related to 2)

4) We should have the ability to reconnect automatically to a list of servers which can be specified by the user. This would be especially useful for bridges.

5) We should not require UDP on the client side in order for clients to be notified of cluster topology updates.

6) We should allow fully propagate topology information to clients so clients can be notifed when new backups are added to live nodes so they can have continuous high availability without having to bring the cluste down. (related to 1)

.. and a few more bits.

Here is how I propose tackling this:

Shared file system HA Server setup

Shared file system failover will be extended as to work as follows:

Instead of a live node being configured with a backup connector (backup-connector-ref), it will be the backup node that is configured with a connector to the live node with <live-connector-ref>.

It will be possible to start many backup nodes for any particular live node at any one time. In effect you can have a "pool" of backup nodes waiting to be live nodes for a particular live node.

Both the live and the backup nodes will periodically monitor a file on the shared file system. The live node will periodically write it's node id and timestamp into the file. The backup nodes will monitor this file, and if the live node timestamp is not updated within a timeout, the backup nodes will consider the live node to have failed, and one of them will become live. (We can have a two phase procedure here to make sure only one node becomes live).

This method does not require any exclusive file system level locking which is not implemented well with some file systems.

Using the shared file also protects from any "split-brain" effects in the case of a network partition.

When a new backup server is paired to a live, it's information will be propagated to the cluster and pushed out to any clients connected to any node. We won't require UDP at the client to receive cluster updates, these can be propagated down the normal connection.

Shared nothing (replicated) HA server setup

For shared nothing, again, it will be the backup node which will be configured with a <live-connector-ref/> element. When a backup node is started it will try and create a connection to the live node. The live node will allow only one connection from a backup node at any one time.

When the backup node connects to the live node a synchronization protocol will be performed to ensure the backup and live nodes data are in sync.

The synchronization protocol will work as follows:

Each type of data element (journal, paging, large messages) will have a sequence number associated with it. This always increments.

First we compare the sequence numbers in the backup to the live. If live > backup then we simply copy across any missing files. During this part of the process the live node does not need to be locked.

This can be repeated several times to minimise the subsequent locking phase. Next we have to consider any new records that might have been created while the copy was occurring. For this we will need to lock the live to prevent any new records being created during this period.

The locking period should be small since we do the bulk of the copying in the first stage which does not require locking.

This means synchronization can be done while the live server is live and working.

It will be possible to have a "pool" of backup nodes waiting to connect to a live node. This means that when a live fails and the backup becomes live, another backup will automatically reconnect to the new live and perform the sync protocol.

Again, when a new backup server is paired to a live, it's information will be propagated to the cluster and pushed out to any clients connected to any node. We won't require UDP at the client to receive cluster updates, these can be propagated down the normal connection.

When using shared nothing we also need to protect from "split-brain". This can be done by requiring a quorum of nodes in the cluster for it to continue to operate.

Each cluster-connection can be configured with a <quorum-size/> config element which determines the minimum number of nodes required for the cluster to continue to operate.

If a the replicating connection from the live to the backup dies, the backup node will detect this. This could occur because of live node failure or because of some temporary network failure (e.g. network partition).

To distinguish between the two, the backup node will detect the connection failure and then ping each member of the cluster it knows about. If it receives a pong from at least <quorum-size/> other nodes within a timeout, then it will assume the live node has indeed died and it can take over as live. Otherwise it will remain as a backup and assume there is a temporary network partition, and continue trying to reconnect to the live node. Once the partition has been fixed, the backup will reconnect to the live and perform the resync protocol once more than resume normal backup operations.

For shared nothing it will also be possible to configure each live node with a <await-backup/> element. If true, then when starting the live node, it will await for a backup to connect before fully activating. This prevents costly resync operations having to be performed on a normal startup.

Client side failover (HA clients)

If a client is configured to use HA failover, then the clients initial connection parameters (e.g. UDP or a static server address) is only used to find an *initial* member of the cluster. Once that member is contacted, the cluster topology will be downloaded so it won't be necessary to rely on UDP at the client to do this or to hardcode a list of live/backup pairs.

As new live / backup servers are added to the cluster this info will be propagated to any HA clients.

If a HA client detects failure of it's connection, then it will try to connect with the backup node for that node. When it reconnects if the backup node has not detected failure at the server side, the client will be informed of this. The client can then try to connect the backup for a configurable period of time to allow the backup to detect and become live. If this does not occur within a timeout, then it may be there was no real failover at the server and the client can try to reconnect backp to the original server and resume it's connection there.

Failover with non HA clients

Sometimes you want clients to failover from one live node to another live node. I.e. not live to backup. Clearly the data will not be available there (i.e. it's not HA) but this may not matter in some use cases.

In this case, it will be possible to configure a client with a list of nodes. On client detecting failure it will try each node in the list in turn until it reconnects. This uses case is very useful for bridges or MDBs.

This info will also be available by the connection if that is required.

Fail-back

Once a failover has occurred from live to backup, it might be desirable to fail-back the new live node to a new backup, for example to get the new live node onto a specific machine.

We can add a method to the management API to force a failover of the server. This will have same semantics as any other failover.

1. Re: Improvements to HA

clebert.suconic Jun 1, 2010 9:48 AM (in response to timfox)

I think this summarizes what we need to do. I agree 100% with what you said here.

There is also this discussion here:

http://community.jboss.org/thread/150768?start=15&tstart=0

That I'm not sure if we opened a JIRA. (I couldn't find it at least), which would also add a new item to the HA improvements.
Actions
2. Re: Improvements to HA

clebert.suconic Jun 1, 2010 4:15 PM (in response to clebert.suconic)

"Both the live and the backup nodes will periodically monitor a file on the shared file system. The live node will periodically write it's node id and timestamp into the file. The backup nodes will monitor this file, and if the live node timestamp is not updated within a timeout"

I was just wondering about one thing...

You will have a ping/pong between the backup and the server, right?

while (areYouAtillAlive())
{
sleep(1000);
}

activate();

AreYouStillAlive could be done through ping/pongs through a file...
or ping/pongs through Netty.

Why are you choosing a file? Wouldn't be better to just have a Netty channel between live and backup to verify if the server still up or not?
Actions
3. Re: Improvements to HA

leosbitto Jun 1, 2010 6:28 PM (in response to clebert.suconic)

Clebert Suconic wrote:

"Both the live and the backup nodes will periodically monitor a file on the shared file system. The live node will periodically write it's node id and timestamp into the file. The backup nodes will monitor this file, and if the live node timestamp is not updated within a timeout"

Why are you choosing a file? Wouldn't be better to just have a Netty channel between live and backup to verify if the server still up or not?

I think that this relates to the "split brain" problem. When a cluster node loses its SAN connection to the shared filesystem (including the shared file), it has to consider itself deactivated (this brings another question: what happens when the SAN connection gets restored?). However, all the Netty stuff happens through independent LAN connection, where you would not know what to do if the cluster nodes get disconnected...
Actions
4. Re: Improvements to HA

clebert.suconic Jun 1, 2010 8:06 PM (in response to leosbitto)

I think a ping between backup and live would suffice here. (even on shared journal).

If something wrong happens, you just make the server stop answering the ping/pong.

You could have the same sort of bugs with either implementation. (I.e. the live server still responding pings even though it's on some "undesirable" state).
Actions

5. Re: Improvements to HA

ataylor Jun 2, 2010 1:45 AM (in response to timfox)

Shared file system HA Server setup

Shared file system failover will be extended as to work as follows:

Instead of a live node being configured with a backup connector (backup-connector-ref), it will be the backup node that is configured with a connector to the live node with <live-connector-ref>.

It will be possible to start many backup nodes for any particular live node at any one time. In effect you can have a "pool" of backup nodes waiting to be live nodes for a particular live node.

This method does not require any exclusive file system level locking which is not implemented well with some file systems.

Using the shared file also protects from any "split-brain" effects in the case of a network partition.

So once a back up node becomes live the other back up nodes don't really care which node is live they just keep on waiting and checking the timestamp?

Also how about prioritizing the backup servers so you can have some control over what order they will try to come up in? Taking this a bit further why have 1 live and n backups, why not just have n prioritized servers and they always start using 2 phase to check whether they should come up or not. This means that if a live server dies and a back up comes up and you want to restart the live server (which typically you want to do as its on your fastest machine) you just restart it and kill the backup and since the live (highest priority) server is the main server it will restart?

6. Re: Improvements to HA

ataylor Jun 2, 2010 2:03 AM (in response to timfox)

Shared nothing (replicated) HA server setup

For shared nothing, again, it will be the backup node which will be configured with a <live-connector-ref/> element. When a backup node is started it will try and create a connection to the live node. The live node will allow only one connection from a backup node at any one time.

When the backup node connects to the live node a synchronization protocol will be performed to ensure the backup and live nodes data are in sync.

The synchronization protocol will work as follows:

Each type of data element (journal, paging, large messages) will have a sequence number associated with it. This always increments.

First we compare the sequence numbers in the backup to the live. If live > backup then we simply copy across any missing files. During this part of the process the live node does not need to be locked.

This can be repeated several times to minimise the subsequent locking phase. Next we have to consider any new records that might have been created while the copy was occurring. For this we will need to lock the live to prevent any new records being created during this period.

The locking period should be small since we do the bulk of the copying in the first stage which does not require locking.

This means synchronization can be done while the live server is live and working.

It will be possible to have a "pool" of backup nodes waiting to connect to a live node. This means that when a live fails and the backup becomes live, another backup will automatically reconnect to the new live and perform the sync protocol.
With the sync protocol how many rounds of synchronisation do u have, if the journal is being added to then it may never synchronise fully (which is why you have the subsequent locking phase) so do u just have a fixed number of attempts or some other way?

When using shared nothing we also need to protect from "split-brain". This can be done by requiring a quorum of nodes in the cluster for it to continue to operate.

Each cluster-connection can be configured with a <quorum-size/> config element which determines the minimum number of nodes required for the cluster to continue to operate.

If a the replicating connection from the live to the backup dies, the backup node will detect this. This could occur because of live node failure or because of some temporary network failure (e.g. network partition).

To distinguish between the two, the backup node will detect the connection failure and then ping each member of the cluster it knows about. If it receives a pong from at least <quorum-size/> other nodes within a timeout, then it will assume the live node has indeed died and it can take over as live. Otherwise it will remain as a backup and assume there is a temporary network partition, and continue trying to reconnect to the live node. Once the partition has been fixed, the backup will reconnect to the live and perform the resync protocol once more than resume normal backup operations.
You could configure how many times it tries to reconnect incase there has been a network partition but the live server has also gone down. i.e. check should i start, reconnect n times, check should i start etc etc

For shared nothing it will also be possible to configure each live node with a <await-backup/> element. If true, then when starting the live node, it will await for a backup to connect before fully activating. This prevents costly resync operations having to be performed on a normal startup.
Maybe this should be the default behaviour
Actions
7. Re: Improvements to HA

ataylor Jun 2, 2010 2:16 AM (in response to timfox)

Client side failover (HA clients)

If a client is configured to use HA failover, then the clients initial connection parameters (e.g. UDP or a static server address) is only used to find an *initial* member of the cluster. Once that member is contacted, the cluster topology will be downloaded so it won't be necessary to rely on UDP at the client to do this or to hardcode a list of live/backup pairs.

As new live / backup servers are added to the cluster this info will be propagated to any HA clients.

If a HA client detects failure of it's connection, then it will try to connect with the backup node for that node. When it reconnects if the backup node has not detected failure at the server side, the client will be informed of this. The client can then try to connect the backup for a configurable period of time to allow the backup to detect and become live. If this does not occur within a timeout, then it may be there was no real failover at the server and the client can try to reconnect backp to the original server and resume it's connection there.
Sounds good. Am I right in thinking tho that all the information on available nodes ina cluster can only be propogated to the client if the cluster is symmetrical? what if the node that the client is connected to only knows about one other node in the cluster, i.e. a in a chain cluster.
Actions
8. Re: Improvements to HA

ataylor Jun 2, 2010 2:18 AM (in response to timfox)

Failover with non HA clients

Sometimes you want clients to failover from one live node to another live node. I.e. not live to backup. Clearly the data will not be available there (i.e. it's not HA) but this may not matter in some use cases.

In this case, it will be possible to configure a client with a list of nodes. On client detecting failure it will try each node in the list in turn until it reconnects. This uses case is very useful for bridges or MDBs.

This info will also be available by the connection if that is required.
Perfect for MDB's as you only want to receive a single mesage and dont care where it comes from.
Actions
9. Re: Improvements to HA

ataylor Jun 2, 2010 2:21 AM (in response to timfox)

Fail-back

Once a failover has occurred from live to backup, it might be desirable to fail-back the new live node to a new backup, for example to get the new live node onto a specific machine.

We can add a method to the management API to force a failover of the server. This will have same semantics as any other failover.
If the servers were prioritised as in my previous post all you would need to do is restart live and use the managemnt console to stop the backup server.

As an aside we should add some sort of clustering management control for monitoring all this?
Actions
10. Re: Improvements to HA

leosbitto Jun 2, 2010 3:44 AM (in response to clebert.suconic)

Clebert Suconic wrote:

I think a ping between backup and live would suffice here. (even on shared journal).

If something wrong happens, you just make the server stop answering the ping/pong.

You could have the same sort of bugs with either implementation. (I.e. the live server still responding pings even though it's on some "undesirable" state).

So what do you propose to happen with the live server when loses its backup server (no ping/pong for some time)? And the same with the backup server? How do you prevent the undesirable situation when both servers are active (serving clients)? This could happen if both cluster nodes are working fine, just the network between them fails. With the shared filesystem it is clear - if any node loses the connection to the shared filesystem, it must consider itself deactivated and wait until it gets the connection to the shared filesystem working again (and then do some magic to find out whether it should remain inactive, or become active).
Actions
11. Re: Improvements to HA

timfox Jun 2, 2010 3:47 AM (in response to clebert.suconic)

Clebert Suconic wrote:

"Both the live and the backup nodes will periodically monitor a file on the shared file system. The live node will periodically write it's node id and timestamp into the file. The backup nodes will monitor this file, and if the live node timestamp is not updated within a timeout"

I was just wondering about one thing...

You will have a ping/pong between the backup and the server, right?

No, only for shared nothing.
Clebert Suconic wrote:

Why are you choosing a file? Wouldn't be better to just have a Netty channel between live and backup to verify if the server still up or not?
Because using a file gives you split brain protection for free
Actions
12. Re: Improvements to HA

timfox Jun 2, 2010 3:49 AM (in response to leosbitto)

Leos Bitto wrote:

Clebert Suconic wrote:

"Both the live and the backup nodes will periodically monitor a file on the shared file system. The live node will periodically write it's node id and timestamp into the file. The backup nodes will monitor this file, and if the live node timestamp is not updated within a timeout"

Why are you choosing a file? Wouldn't be better to just have a Netty channel between live and backup to verify if the server still up or not?

I think that this relates to the "split brain" problem. When a cluster node loses its SAN connection to the shared filesystem (including the shared file), it has to consider itself deactivated (this brings another question: what happens when the SAN connection gets restored?).

Correct, shared file gives you split brain protection for free
Leos Bitto wrote:
However, all the Netty stuff happens through independent LAN connection, where you would not know what to do if the cluster nodes get disconnected...
Yes, that's why for shared nothing we have do jump through extra hoops at faiover time by attempting to contact a quorum of members before assuming live. For shared store it's simpler.
Actions
13. Re: Improvements to HA

timfox Jun 2, 2010 3:59 AM (in response to ataylor)
Andy Taylor wrote:

Shared file system HA Server setup

Shared file system failover will be extended as to work as follows:

Instead of a live node being configured with a backup connector (backup-connector-ref), it will be the backup node that is configured with a connector to the live node with <live-connector-ref>.

It will be possible to start many backup nodes for any particular live node at any one time. In effect you can have a "pool" of backup nodes waiting to be live nodes for a particular live node.

Both the live and the backup nodes will periodically monitor a file on the shared file system. The live node will periodically write it's node id and timestamp into the file. The backup nodes will monitor this file, and if the live node timestamp is not updated within a timeout, the backup nodes will consider the live node to have failed, and one of them will become live. (We can have a two phase procedure here to make sure only one node becomes live).

This method does not require any exclusive file system level locking which is not implemented well with some file systems.

Using the shared file also protects from any "split-brain" effects in the case of a network partition.

When a new backup server is paired to a live, it's information will be propagated to the cluster and pushed out to any clients connected to any node. We won't require UDP at the client to receive cluster updates, these can be propagated down the normal connection.

So once a back up node becomes live the other back up nodes don't really care which node is live they just keep on waiting and checking the timestamp?

Yes. Although I'm thinking now of using exclusive file locks not timestamps due to difficulties with files getting into strange states when multiple writers are writing them.

Java exclusive file locks don't work properly on NFS, GFS etc. But this shouldn't be a problem for us since we don't support NFS, GFS anyway

Andy Taylor wrote:

Also how about prioritizing the backup servers so you can have some control over what order they will try to come up in? Taking this a bit further why have 1 live and n backups, why not just have n prioritized servers and they always start using 2 phase to check whether they should come up or not. This means that if a live server dies and a back up comes up and you want to restart the live server (which typically you want to do as its on your fastest machine) you just restart it and kill the backup and since the live (highest priority) server is the main server it will restart?
I think what you're talking about is "fail-back" (?)

I think you could do this without prioritisation as follows:

Live server A fails over to backup B
B becomes live and server C becomes backup for B
You now intervene and stop server C, and restart A manually
A now becomes backup for B
You kill server B and A becomes live again (this could be done via the management console)
You then restart B and it becomes backup for A, so you're back where you started.
Actions
14. Re: Improvements to HA

timfox Jun 2, 2010 4:02 AM (in response to ataylor)

Andy Taylor wrote:

With the sync protocol how many rounds of synchronisation do u have, if the journal is being added to then it may never synchronise fully (which is why you have the subsequent locking phase) so do u just have a fixed number of attempts or some other way?

Probably just have fixed number of attempts (like 2 or 3) or perhaps this could be configurable.

Andy Taylor wrote:

You could configure how many times it tries to reconnect incase there has been a network partition but the live server has also gone down. i.e. check should i start, reconnect n times, check should i start etc etc

I was thinking just use the reconnect-attempts param on the factory for this.

Andy Taylor wrote:

Maybe this should be the default behaviour
Agreed, in the default xml config for a live HA this should exist.
Actions

1 2 Previous Next

Go to original post