Failback in new HA
ataylor Oct 13, 2010 11:38 AM
We need to be able to support failback in the new HA code. Basically, this is the original live server resuming its role after it has failed over to a backup node. Typically this is for platforms where AS nodes have live and backup on the same server, as typically used by JEE applications. The other case is 'serious' messaging users who will create dedicated AS nodes for backup servers.
Firstly, I'm making a few changes to make this a bit simpler to implement. I'm moving all the file locking code into a node manager class to clean up HornetQServerImpl. Also, we currently create 3 lock files: live lock, backup lock and node id lock. Instead of this I will use 1 file and lock portions of it. I'll also use this file to hold the node id and the state of the current live server, which I will come to later. So the file will be like this:
byte 1 - server status (1 byte)
byte 2 - live lock segment
byte 3 - backup lock segment
bytes 4 to 19 - node id (16 bytes)
The locking algorithm will be pretty much the same, i.e. the backup waits on the live lock, potential backups wait on the backup lock, etc.
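To make the layout concrete, here is a minimal sketch of how one file could carry the status byte, the two lock segments and the node id, using java.nio region locks. The class name NodeFile and its methods are hypothetical, the offsets are zero-based (so "byte 1" is offset 0), and treating the 16-byte node id as a UUID is an assumption.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

// Hypothetical sketch of the single shared lock file; not the actual node manager.
final class NodeFile implements AutoCloseable {
    static final long STATE_POS = 0;        // byte 1: server status
    static final long LIVE_LOCK_POS = 1;    // byte 2: live lock segment
    static final long BACKUP_LOCK_POS = 2;  // byte 3: backup lock segment
    static final long NODE_ID_POS = 3;      // bytes 4 to 19: node id
    static final int NODE_ID_LEN = 16;      // assumption: node id is a UUID

    private final FileChannel channel;

    NodeFile(Path path) throws IOException {
        channel = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    // Blocks until the one-byte live segment is free, then locks it exclusively.
    FileLock lockLive() throws IOException {
        return channel.lock(LIVE_LOCK_POS, 1, false);
    }

    // Same idea for the backup segment; the two regions do not overlap.
    FileLock lockBackup() throws IOException {
        return channel.lock(BACKUP_LOCK_POS, 1, false);
    }

    void writeState(byte state) throws IOException {
        channel.write(ByteBuffer.wrap(new byte[] {state}), STATE_POS);
        channel.force(true); // push the status byte to disk
    }

    byte readState() throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(1);
        if (channel.read(buf, STATE_POS) != 1) {
            throw new IOException("status byte missing");
        }
        return buf.get(0);
    }

    void writeNodeId(UUID id) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(NODE_ID_LEN);
        buf.putLong(id.getMostSignificantBits());
        buf.putLong(id.getLeastSignificantBits());
        buf.flip();
        while (buf.hasRemaining()) {
            channel.write(buf, NODE_ID_POS + buf.position());
        }
    }

    UUID readNodeId() throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(NODE_ID_LEN);
        while (buf.hasRemaining()) {
            if (channel.read(buf, NODE_ID_POS + buf.position()) < 0) {
                throw new IOException("node id truncated");
            }
        }
        buf.flip();
        return new UUID(buf.getLong(), buf.getLong());
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Note that Java file locks are held on behalf of the whole JVM, so this only arbitrates between servers in separate processes. Two servers in the same JVM asking for the same region would get an OverlappingFileLockException rather than block.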
The server status will work as follows. There will be 3 statuses: L, meaning a server is live; P, meaning a server is paused, i.e. awaiting restart; and F, meaning a live server wants to fail back.
A live server starting will create the file, obtain the live lock and set the status in the file to L(ive). On normal shutdown it sets the status to P(aused).
So a backup starts and waits for the lock file to be created. When it is, the backup checks the state. If it is L then the live server must have crashed and the backup can come up. If it is P then the live server has been shut down normally, so we don't start up and just wait again for the file to be created.
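The backup's check can be sketched as a small decision function. The names BackupDecision, Action and decide are made up for illustration, and only the L and P rules come from the description above. Rejecting any other byte is an assumption of this sketch.

```java
// Hypothetical sketch of the backup's state check; names are illustrative only.
final class BackupDecision {
    static final byte LIVE = 'L';
    static final byte PAUSED = 'P';

    enum Action { ACTIVATE, KEEP_WAITING }

    static Action decide(byte state) {
        switch (state) {
            case LIVE:
                // Status still says live, so the live server must have crashed.
                return Action.ACTIVATE;
            case PAUSED:
                // Normal shutdown: do not take over, go back to waiting.
                return Action.KEEP_WAITING;
            default:
                // Assumption: other values are not handled at this point.
                throw new IllegalArgumentException("unexpected state: " + (char) state);
        }
    }
}
```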
Now that we have this, we can add management methods to allow users to stop servers and force failover, i.e. stop but leave the status as L(ive).
Now failback. If a live server is restarted we can check a flag, say allowFailback. If this is true then before obtaining the live lock (which is obviously locked) we set the status to F(ailback). Any potential backup nodes that are still awaiting the backup lock will only lock the backup lock if the status is not F. This means we can guarantee that the original live server takes over from the new live server.
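A rough in-memory sketch of the failback rule, with a Semaphore standing in for the backup lock segment. Everything here apart from the allowFailback flag and the F status is a made-up name, and the check-then-lock step is not atomic, so treat it as an illustration of the rule rather than a safe implementation.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch; the Semaphore stands in for the backup lock segment.
final class FailbackSketch {
    static final byte LIVE = 'L';
    static final byte FAILBACK = 'F';

    private volatile byte state = LIVE; // the new live server is running
    private final Semaphore backupLock = new Semaphore(1);

    // Restarted original live server: flag the failback before it
    // starts waiting on the live lock.
    void announceRestart(boolean allowFailback) {
        if (allowFailback) {
            state = FAILBACK;
        }
    }

    // Potential backup: only take the backup lock if no failback is pending.
    boolean tryBecomeBackup() {
        if (state == FAILBACK) {
            return false;
        }
        return backupLock.tryAcquire();
    }

    byte state() {
        return state;
    }
}
```

With allowFailback set to false the status stays L and a potential backup can still take the backup lock as before.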
At this point, the new live server can either be shut down manually or maybe we could have a special cluster message that forces it. The live server has now failed back.
This also makes the tests much cleaner as we can write an InVMNodeManager.
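As a guess at what an in-VM version might look like for tests, here is a sketch that swaps the file for Semaphores and a volatile status byte. The class and method names are invented, and modelling "no file yet" as P is an assumption.

```java
import java.util.UUID;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical in-VM stand-in for the file-based node manager, for tests only.
final class InVMNodeManagerSketch {
    static final byte LIVE = 'L';
    static final byte PAUSED = 'P';

    private final Semaphore liveLock = new Semaphore(1);
    private final UUID nodeId = UUID.randomUUID();
    private volatile byte state = PAUSED; // assumption: "no file yet" modelled as P

    // Live server start: obtain the live lock, then set the status to L.
    void startLive() throws InterruptedException {
        liveLock.acquire();
        state = LIVE;
    }

    // Normal shutdown: set the status to P, then give up the lock.
    void stopLive() {
        state = PAUSED;
        liveLock.release();
    }

    // Crash or forced failover: the lock goes away but the status stays L.
    void crashLive() {
        liveLock.release();
    }

    // Backup side: returns true if the backup obtained the live lock and
    // the status shows the live server did not shut down normally.
    boolean awaitFailover(long timeout, TimeUnit unit) throws InterruptedException {
        if (!liveLock.tryAcquire(timeout, unit)) {
            return false; // live server is still running
        }
        if (state == LIVE) {
            return true; // backup keeps the live lock and takes over
        }
        liveLock.release(); // paused: hand the lock back and keep waiting
        return false;
    }

    byte state() {
        return state;
    }

    UUID nodeId() {
        return nodeId;
    }
}
```

A test can then drive live and backup from one thread, e.g. startLive(), crashLive(), then awaitFailover(...) to check that the backup takes over.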