Failback to primary server with shared store
jmesnil Sep 14, 2010 11:06 AM
I am trying to figure out the best way to fail back to a primary server after it has crashed and failover has occurred.
Let's have a live node A and 2 backups B and C.
When A is live, we have:
A: is live, holds live.lock
B: is backup, holds backup.lock, waits for live.lock
C: is backup, waits to lock backup.lock
If A then crashes, B becomes the live node and C the main backup:
A: DOWN
B: is live, holds live.lock
C: is backup, holds backup.lock, waits for live.lock
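To make the lock handoff concrete, here is a minimal sketch of a node trying to take live.lock. It assumes the locks are plain advisory file locks on a POSIX file system, which is my assumption for illustration and not necessarily how the server implements them. The names try_lock and demo are hypothetical, and a temporary directory stands in for the shared store.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive lock on path.

    Returns the open file (which holds the lock) or None if another
    holder already has it. Hypothetical helper, POSIX only.
    """
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def demo():
    store = tempfile.mkdtemp()                  # stands in for the shared store
    live_lock = os.path.join(store, "live.lock")
    a = try_lock(live_lock)                     # A is live: it holds live.lock
    b_first = try_lock(live_lock)               # B cannot take it while A holds it
    a.close()                                   # A goes down: its lock is released
    b_second = try_lock(live_lock)              # B acquires live.lock and becomes live
    result = (a is not None, b_first is None, b_second is not None)
    b_second.close()
    return result
```

A real backup would block or poll on the lock rather than try once. The single attempt here only shows that the lock changes hands when the holder goes away.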
Now we want to restart server A and make it the live server again.
How can we do that?
The simple way is to make A a backup server and fail over repeatedly until it becomes live.
Let's change the configuration of A to flag it as a backup to the current live node B and restart it:
A: is backup, waits for backup.lock
B: is live, holds live.lock
C: is backup, holds backup.lock, waits for live.lock
To make A live again, we first have to kill B to fail over to C.
A will then be the main backup:
A: is backup, holds backup.lock, waits for live.lock
B: DOWN
C: is live, holds live.lock
We can now restart B. A holds backup.lock and is the next server in line to become live:
A: is backup, holds backup.lock, waits for live.lock
B: is backup, waits for backup.lock
C: is live, holds live.lock
Then we also kill C to fail over to A:
A: is live, holds live.lock
B: is backup, holds backup.lock, waits for live.lock
C: DOWN
Finally, we can restart server C to get back to the same state as initially:
A: is live, holds live.lock
B: is backup, holds backup.lock, waits for live.lock
C: is backup, waits to lock backup.lock
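The whole cycle above can be replayed with a toy model of the two locks. This is only a sketch: SharedStore and rolling_failback are hypothetical names, and I assume that backups waiting for backup.lock acquire it in arrival order (with a single waiter, as in this walkthrough, the order does not matter).

```python
from collections import deque


class SharedStore:
    """Toy model of the live.lock / backup.lock handoff (not real server code)."""

    def __init__(self):
        self.live = None          # node holding live.lock
        self.backup = None        # node holding backup.lock (next in line)
        self.waiting = deque()    # backups waiting to lock backup.lock

    def start_live(self, node):
        assert self.live is None, "live.lock is already held"
        self.live = node

    def start_backup(self, node):
        # A backup takes backup.lock if it is free, otherwise it waits for it.
        if self.backup is None:
            self.backup = node
        else:
            self.waiting.append(node)

    def kill(self, node):
        # Only the live node is killed in this walkthrough.
        assert node == self.live
        # The main backup takes live.lock and releases backup.lock,
        # which the longest-waiting backup (if any) then acquires.
        self.live = self.backup
        self.backup = self.waiting.popleft() if self.waiting else None

    def state(self):
        return (self.live, self.backup, list(self.waiting))


def rolling_failback():
    store = SharedStore()
    store.start_live("A")
    store.start_backup("B")
    store.start_backup("C")
    states = [store.state()]
    steps = [
        ("kill", "A"),      # A crashes, failover to B
        ("backup", "A"),    # restart A flagged as a backup
        ("kill", "B"),      # failover to C, A becomes the main backup
        ("backup", "B"),    # restart B
        ("kill", "C"),      # failover to A
        ("backup", "C"),    # restart C
    ]
    for action, node in steps:
        if action == "kill":
            store.kill(node)
        else:
            store.start_backup(node)
        states.append(store.state())
    return states
```

Each state is a tuple (live node, main backup, waiting backups). The last state returned by rolling_failback() equals the first one, which matches the walkthrough: A ends up live again with B as main backup and C waiting.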
This should work, assuming that the admin *changes the configuration of A to flag it as a backup*.
Indeed, this does not work if A is restarted as a live server.
Let's check why. A is down, and B is the live node:
A: DOWN
B: is live, holds live.lock
C: is backup, holds backup.lock, waits for live.lock
At that point, the client is connected to B and C is the backup in its topology.
If A is restarted as a live server, we would have:
A: is live, holds live.lock
B: DOWN
C: is backup, holds backup.lock, waits for live.lock
Assuming we cleanly stop server B (deleting file.lock), C would not be activated and would not become live.
A would become live *but* the client would never know about it!
The client was only connected to B. When B is stopped, the client will not fail over to C. And even if it could fail over to C, C is not activated and would not accept any connections.
So the client would not be able to know that A is the new live node.
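The difference between the two restart modes can be sketched from the client's point of view. I assume here that the client only ever tries the servers in the topology it received from the live node it was connected to. client_reconnect is a hypothetical helper, not a real client API.

```python
def client_reconnect(known_servers, accepting):
    """Return the first server in the client's topology that accepts connections.

    known_servers: servers the client learned about, in failover order.
    accepting: set of servers that are currently activated (live).
    """
    for server in known_servers:
        if server in accepting:
            return server
    return None


# A restarted as a live server, B stopped cleanly, C never activated:
# the client only knows B and C, so it has nowhere to go.
restarted_as_live = client_reconnect(["B", "C"], accepting={"A"})

# A restarted as a backup, B killed: C is activated and the client fails over.
first_failover = client_reconnect(["B", "C"], accepting={"C"})

# Connected to C, the client learns that A is C's backup, so when C is
# killed and A is activated the client can follow.
second_failover = client_reconnect(["C", "A"], accepting={"A"})
```

Here restarted_as_live is None while the two failover steps return "C" and then "A", which is the reason for going through the backup route.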
I don't yet see how we could make A the live node again without first flagging it as a backup and failing over again and again until it is activated.
This is the only way I see to let the clients know the correct topology at any given time.
Do you have any alternative ideas?