3 Replies Latest reply on Sep 15, 2010 3:25 AM by timfox

    Failback to primary server with shared store

    jmesnil

      I am trying to figure out the best way to fail back to a primary server after it has crashed and failover has occurred.

       

      Let's have a live node A and 2 backups B and C.

       

      When A is live, we have:

       

            A                          B                          C
      - is live                  - is backup                - is backup
      - holds live.lock          - holds backup.lock        - waits for backup.lock
                                 - waits for live.lock

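      To make the lock handling above concrete, here is a minimal, self-contained sketch of the idea using plain Java NIO file locks. It is not HornetQ's actual activation code; the lock file names follow the tables in this post and the shared directory path is made up:

      import java.io.File;
      import java.io.RandomAccessFile;
      import java.nio.channels.FileChannel;
      import java.nio.channels.FileLock;

      // Illustrative sketch only, not HornetQ's implementation.
      // Each node runs this against the same shared-store directory.
      public class SharedStoreNode {

          public static void main(String[] args) throws Exception {
              File dir = new File(args.length > 0 ? args[0] : "/shared/journal");
              dir.mkdirs();

              // 1. Queue on backup.lock: the node that gets it is the backup,
              //    the others block here ("waits for backup.lock" in the tables).
              FileChannel backupChannel = open(new File(dir, "backup.lock"));
              FileLock backupLock = backupChannel.lock();  // blocks until granted
              System.out.println("backup.lock held, waiting for live.lock");

              // 2. Block on live.lock: it is released when the live node stops or
              //    crashes (the OS frees the locks of a dead process), so acquiring
              //    it is the signal to activate as the live server.
              FileChannel liveChannel = open(new File(dir, "live.lock"));
              FileLock liveLock = liveChannel.lock();      // blocks until granted

              // 3. Give up backup.lock so the next waiting node moves up in line.
              backupLock.release();
              System.out.println("live.lock held (" + liveLock + "), activating as the live server");
          }

          private static FileChannel open(File file) throws Exception {
              return new RandomAccessFile(file, "rw").getChannel();
          }
      }

      A crash of the live node releases its file lock automatically, which is why the tables below never need an explicit handover step.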
       

      When A crashes, B becomes the live node and C the main backup:

       

            A                          B                          C
      - DOWN                     - is live                  - is backup
                                 - holds live.lock          - holds backup.lock
                                                            - waits for live.lock

       

      Now we want to restart server A and make it the live server again.

       

      How can we do that?

       

      The simple way is to make A a backup server and fail over again and again until it becomes live.

      Let's change the configuration of A to flag it as a backup to the current live node B and restart it:

       

            A                          B                          C
      - is backup                - is live                  - is backup
      - waits for backup.lock    - holds live.lock          - holds backup.lock
                                                            - waits for live.lock

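      For completeness, flagging A as a backup is just a configuration change before restarting it. The sketch below shows the embedded route; the setter names follow the 2.x Configuration API and the journal path is made up, so treat the details as assumptions rather than a recipe (the equivalent in hornetq-configuration.xml is the backup and shared-store elements):

      import org.hornetq.core.config.Configuration;
      import org.hornetq.core.config.impl.ConfigurationImpl;
      import org.hornetq.core.server.HornetQServer;
      import org.hornetq.core.server.HornetQServers;

      // Rough sketch: restart node A flagged as a backup over the same shared store.
      public class RestartAAsBackup {

          public static void main(String[] args) throws Exception {
              Configuration config = new ConfigurationImpl();
              config.setBackup(true);                        // A queues on backup.lock instead of going live
              config.setSharedStore(true);                   // same store as B and C
              config.setJournalDirectory("/shared/journal"); // made-up path, point it at the shared store

              HornetQServer server = HornetQServers.newHornetQServer(config);
              server.start();                                // stays passive until it acquires live.lock
          }
      }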
       

      To have A become live again, we will have to kill B to fail over to C.

      Then A will be the main backup:

       

            A                          B                          C
      - is backup                - DOWN                     - is live
      - holds backup.lock                                   - holds live.lock
      - waits for live.lock

       

      At this point we can restart B. A holds backup.lock and will be the next server in line to become live:

       

            A                          B                          C
      - is backup                - is backup                - is live
      - holds backup.lock        - waits for backup.lock    - holds live.lock
      - waits for live.lock

       

      Then we also kill C to fail over to A:

       

            A                          B                          C
      - is live                  - is backup                - DOWN
      - holds live.lock          - holds backup.lock
                                 - waits for live.lock

       

      Again, we can now restart server C to get back to the same state as initially:

       

            A                          B                          C
      - is live                  - is backup                - is backup
      - holds live.lock          - holds backup.lock        - waits for backup.lock
                                 - waits for live.lock

       

      This should work, assuming that the admin *changes the configuration of A to flag it as a backup*.

      Indeed, this does not work if A is restarted as a live server.

      Let's check why. A is down, and B is the live node:

       

            A                          B                          C
      - DOWN                     - is live                  - is backup
                                 - holds live.lock          - holds backup.lock
                                                            - waits for live.lock

       

      At that point, the client is connected to B and C is the backup in its topology.

       

      If A is restarted as a live server, we would have:

       

            A                          B                          C
      - is live                  - DOWN                     - is backup
      - holds live.lock                                     - holds backup.lock
                                                            - waits for live.lock

       

      Assuming we stop server B cleanly (which deletes its lock file), C would not be activated and would not become live.

      A would become live *but* the client would never know about it!

      The client was connected only to B. When B is stopped, the client will not fail over to C. And even if it did fail over, C is not activated and would not accept any connections.

      So the client has no way of knowing that A is the new live node.
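      To illustrate the client-side problem, here is a purely illustrative sketch (not the HornetQ client API): the reconnection logic only ever tries the nodes the client learned when it first connected, so a server that was never announced to it, like the restarted A, is simply not in the candidate list.

      import java.util.List;

      // Purely illustrative: a client that only knows the live node it connected
      // to (B) and the backup that was announced to it (C).
      public class NaiveFailoverClient {

          // Topology learned at connection time; node A is not in it.
          private final List<String> knownNodes = List.of("nodeB:5445", "nodeC:5445");

          public String reconnect() {
              for (String node : knownNodes) {
                  if (tryConnect(node)) {
                      return node;       // nodeA:5445 is never even attempted
                  }
              }
              throw new IllegalStateException("no known node accepted the connection");
          }

          private boolean tryConnect(String node) {
              // Placeholder for a real transport connection attempt.
              return false;
          }
      }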

       

      I don't yet see how we could make A the live node again without first flagging it as a backup and failing over again and again until it is activated.

      This is the only way I see to let the clients know the correct topology at any given time.

       

      Do you have any alternative idea?

        • 1. Re: Failback to primary server with shared store
          timfox

          To simplify this:

           

          Nodes A and B, A is live, B is backup

           

          A fails, B becomes live.

           

          If you now want A to resume as live, e.g. node A is a bigger box than node B so you don't want traffic on node B for too long:

           

          You set A to backup = true in the config, then start it.

           

          It now becomes the backup

           

          You then fail over from B to A; now A is live again.

           

          Now restart B as a backup.

          • 2. Re: Failback to primary server with shared store
            jmesnil

            Yes, that's what happens if you have only one backup to the primary server.

             

            My point was that if you have more than one backup, there are more steps.

            Either you have to stop all the waiting backups and restart the primary server as a backup to make it the next backup to fail over to.

            Or you have to fail over through all the backups until it becomes live again.

             

            We need to make sure the steps to fail back to the primary server are well documented.

            • 3. Re: Failback to primary server with shared store
              timfox

              Well... there is only ever a maximum of one backup. The waiting backups are not backups yet - this is a different state (C is wrongly labelled as a backup in your state diagrams).

               

              If you have some waiting backups, simply stop them first before doing failback.