We have a few outstanding JIRAs for improvements to our HA implementation, and some of these issues have been discussed on the forums several times, so they seem to be a common subject.
I've spent the last few days working out a plan for tackling the remaining HA tasks, so we can really round off our implementation. Hopefully these proposals will satisfy most users and give us a first-class HA implementation that works well and ticks the boxes for even the most demanding use cases.
I'd like to start work implementing these changes ASAP, so that, once we agree on them, we can hopefully get them into 2.2.
*** Please give your comments! Feedback is essential. ***
The changes mainly revolve around:
1) After failover has occurred from live to backup, we need the ability to resync a new backup with a live node.
2) Split brain. Currently we have no "split brain" protection on the server. I.e. in the case of a network partition, two servers can both think they are live.
3) Currently failover is triggered from the client side. This can result in erroneous activation of backup servers in the case of temporary network outages. We should instead detect failover on the server side. (related to 2)
4) We should have the ability to reconnect automatically to a list of servers which can be specified by the user. This would be especially useful for bridges.
5) We should not require UDP on the client side in order for clients to be notified of cluster topology updates.
6) We should fully propagate topology information to clients, so clients can be notified when new backups are added to live nodes and can have continuous high availability without having to bring the cluster down. (related to 1)
.. and a few more bits.
Here is how I propose tackling this:
Shared file system HA Server setup
Shared file system failover will be extended to work as follows:
Instead of a live node being configured with a backup connector (backup-connector-ref), it will be the backup node that is configured with a connector to the live node with <live-connector-ref>.
It will be possible to start many backup nodes for any particular live node at any one time. In effect you can have a "pool" of backup nodes waiting to be live nodes for a particular live node.
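To make the proposed configuration change concrete, here is a hypothetical sketch of what a backup node's configuration might look like under this proposal. The connector name, host, and port are made up for illustration; only the <live-connector-ref> element itself comes from the proposal.

```xml
<!-- Sketch of a backup node's hornetq-configuration.xml under this
     proposal. "live-connector" and its params are illustrative. -->
<configuration>
   <backup>true</backup>
   <live-connector-ref connector-name="live-connector"/>
   <connectors>
      <connector name="live-connector">
         <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
         <param key="host" value="live-host"/>
         <param key="port" value="5445"/>
      </connector>
   </connectors>
</configuration>
```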
Both the live and the backup nodes will periodically monitor a file on the shared file system. The live node will periodically write its node id and a timestamp into the file. The backup nodes will monitor this file, and if the live node's timestamp is not updated within a timeout, the backup nodes will consider the live node to have failed, and one of them will become live. (We can have a two-phase procedure here to make sure only one node becomes live.)
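The timestamp check a backup would run could look something like the following minimal sketch. The class and method names and the timeout handling are illustrative, not the actual implementation; the point is simply that a backup considers the live dead only when the timestamp it reads from the shared file has not moved within the timeout.

```java
// Sketch of the shared-file liveness check a backup node could run.
// Names and the timeout value are illustrative, not real config.
public class LiveNodeMonitor {

    private final long timeoutMillis;
    private long lastSeenTimestamp = -1;
    private long lastChangeAt;

    public LiveNodeMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called by a backup each time it polls the shared file. */
    public boolean liveConsideredFailed(long timestampInFile, long now) {
        if (timestampInFile != lastSeenTimestamp) {
            // the live node wrote a fresh timestamp: it is still alive
            lastSeenTimestamp = timestampInFile;
            lastChangeAt = now;
            return false;
        }
        // timestamp unchanged: failed only once the timeout has elapsed
        return now - lastChangeAt > timeoutMillis;
    }
}
```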
This method does not require any exclusive file-system-level locking, which is not implemented well on some file systems.
Using the shared file also protects from any "split-brain" effects in the case of a network partition.
When a new backup server is paired with a live node, its information will be propagated to the cluster and pushed out to any clients connected to any node. We won't require UDP at the client to receive cluster updates; these can be propagated down the normal connection.
Shared nothing (replicated) HA server setup
For shared nothing, again, it will be the backup node which will be configured with a <live-connector-ref/> element. When a backup node is started it will try to create a connection to the live node. The live node will allow only one connection from a backup node at any one time.
When the backup node connects to the live node a synchronization protocol will be performed to ensure the backup and live nodes data are in sync.
The synchronization protocol will work as follows:
Each type of data element (journal, paging, large messages) will have a sequence number associated with it. This always increments.
First we compare the sequence numbers in the backup to the live. If live > backup then we simply copy across any missing files. During this part of the process the live node does not need to be locked.
This can be repeated several times to minimise the subsequent locking phase. Next we have to consider any new records that might have been created while the copy was occurring. For this we will need to lock the live to prevent any new records being created during this period.
The locking period should be small since we do the bulk of the copying in the first stage which does not require locking.
This means synchronization can be done while the live server is live and working.
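The two-phase protocol above can be sketched as follows, with the data files modelled simply as a sorted map of sequence number to contents. All names here are illustrative, and the lock is modelled as a plain Java monitor; this is not the actual replication code, just the shape of the algorithm: bulk copy with no lock, then a brief locked catch-up for records created during the copy.

```java
import java.util.Map;
import java.util.SortedMap;

// Sketch of the proposed two-phase sync protocol. Data files are
// modelled as sequence-number -> contents; names are illustrative.
public class SyncSketch {

    /** Phase 1: copy any files the backup is missing; live stays unlocked. */
    static void copyMissing(SortedMap<Long, String> live,
                            SortedMap<Long, String> backup) {
        for (Map.Entry<Long, String> e : live.entrySet()) {
            backup.putIfAbsent(e.getKey(), e.getValue());
        }
    }

    /** Full protocol: unlocked bulk copy, then a short locked catch-up. */
    static void sync(SortedMap<Long, String> live,
                     SortedMap<Long, String> backup,
                     Object liveLock) {
        copyMissing(live, backup);      // bulk of the work, no lock held
        copyMissing(live, backup);      // optional repeat to shrink the delta
        synchronized (liveLock) {       // briefly block new records
            copyMissing(live, backup);  // copy anything created meanwhile
        }
    }
}
```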
It will be possible to have a "pool" of backup nodes waiting to connect to a live node. This means that when a live fails and the backup becomes live, another backup will automatically reconnect to the new live and perform the sync protocol.
Again, when a new backup server is paired with a live node, its information will be propagated to the cluster and pushed out to any clients connected to any node. We won't require UDP at the client to receive cluster updates; these can be propagated down the normal connection.
When using shared nothing we also need to protect from "split-brain". This can be done by requiring a quorum of nodes in the cluster for it to continue to operate.
Each cluster-connection can be configured with a <quorum-size/> config element which determines the minimum number of nodes required for the cluster to continue to operate.
If the replicating connection from the live node to the backup dies, the backup node will detect this. This could occur because of live node failure or because of some temporary network failure (e.g. network partition).
To distinguish between the two, the backup node will detect the connection failure and then ping each member of the cluster it knows about. If it receives a pong from at least <quorum-size/> other nodes within a timeout, then it will assume the live node has indeed died and it can take over as live. Otherwise it will remain as a backup, assume there is a temporary network partition, and continue trying to reconnect to the live node. Once the partition has been fixed, the backup will reconnect to the live, perform the resync protocol once more, and then resume normal backup operations.
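The quorum decision itself boils down to a simple count, which could be sketched as below. The ping is abstracted as a predicate and the names are made up; the real implementation would ping over the cluster connections with a timeout.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of the quorum vote a backup runs after the replicating
// connection dies. Names and the ping abstraction are illustrative.
public class QuorumVote {

    /** Returns true if the backup should take over as live. */
    static boolean shouldBecomeLive(List<String> knownNodes,
                                    Predicate<String> pingSucceeds,
                                    int quorumSize) {
        int pongs = 0;
        for (String node : knownNodes) {
            if (pingSucceeds.test(node)) {
                pongs++;
            }
        }
        // Enough of the cluster reachable -> the live really died.
        // Otherwise assume a network partition and stay as backup.
        return pongs >= quorumSize;
    }
}
```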
For shared nothing it will also be possible to configure each live node with an <await-backup/> element. If true, then when starting, the live node will wait for a backup to connect before fully activating. This prevents costly resync operations having to be performed on a normal startup.
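Putting the two proposed shared-nothing elements together, the configuration might look roughly like this. The element placement and values are illustrative; only <quorum-size/> and <await-backup/> themselves come from the proposal.

```xml
<!-- Sketch of the proposed shared-nothing config (placement and
     values are illustrative) -->
<cluster-connection name="my-cluster">
   <quorum-size>3</quorum-size>
</cluster-connection>

<!-- On a live node: wait for a backup before fully activating -->
<await-backup>true</await-backup>
```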
Client side failover (HA clients)
If a client is configured to use HA failover, then the client's initial connection parameters (e.g. UDP or a static server address) are only used to find an *initial* member of the cluster. Once that member is contacted, the cluster topology will be downloaded, so it won't be necessary to rely on UDP at the client to do this or to hardcode a list of live/backup pairs.
As new live / backup servers are added to the cluster this info will be propagated to any HA clients.
If an HA client detects failure of its connection, then it will try to connect to the backup node for that live node. If, when it reconnects, the backup node has not detected failure on the server side, the client will be informed of this. The client can then keep trying to connect to the backup for a configurable period of time, to allow the backup to detect the failure and become live. If this does not occur within the timeout, then it may be that there was no real failure at the server, and the client can try to reconnect back to the original server and resume its connection there.
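The client-side retry loop described above might be sketched like this. The attempt count, the return values, and the way "backup is live" is probed are all illustrative; a real client would sleep between attempts and then actually resume its session on whichever server it ends up on.

```java
import java.util.function.BooleanSupplier;

// Sketch of the proposed client reconnect logic after a connection
// failure. Names and parameters are illustrative, not a real API.
public class ClientFailover {

    /**
     * Keep trying the backup until it activates; if it never does
     * within the allotted attempts, fall back to the original live.
     */
    static String reconnect(BooleanSupplier backupIsLive, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (backupIsLive.getAsBoolean()) {
                return "connected-to-backup";
            }
            // a real client would sleep here between attempts
        }
        // backup never went live: probably no real server-side failure
        return "retry-original-live";
    }
}
```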
Failover with non HA clients
Sometimes you want clients to failover from one live node to another live node. I.e. not live to backup. Clearly the data will not be available there (i.e. it's not HA) but this may not matter in some use cases.
In this case, it will be possible to configure a client with a list of nodes. On detecting failure, the client will try each node in the list in turn until it reconnects. This use case is very useful for bridges or MDBs.
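The try-each-in-turn behaviour is straightforward; a minimal sketch, with the actual connection attempt abstracted as a predicate and all names made up:

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch of non-HA failover across a user-supplied node list:
// on failure, try each node in order. Names are illustrative.
public class NodeListFailover {

    /** Returns the first node that accepts a connection, or null. */
    static String connectToAny(List<String> nodes, Predicate<String> connects) {
        for (String node : nodes) {
            if (connects.test(node)) {
                return node;
            }
        }
        return null;  // exhausted the list without reconnecting
    }
}
```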
This node list will also be obtainable from the connection, if that is required.
Once a failover has occurred from live to backup, it might be desirable to fail the new live node over to a new backup, for example to get the live role onto a specific machine.
We can add a method to the management API to force a failover of the server. This will have the same semantics as any other failover.