3 Replies Latest reply on Sep 15, 2010 3:25 AM by timfox

    Failback to primary server with shared store

    jmesnil

      I am trying to figure out the best way to fail back to a primary server after it has crashed and failover has occurred.

       

      Let's have a live node A and 2 backups B and C.

       

      When A is live, we have:

       

            A                          B                          C
      - is live                  - is backup                - is backup
      - holds live.lock          - holds backup.lock        - waits for backup.lock
                                 - waits for live.lock

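      To make the lock handling above concrete, here is a minimal, self-contained sketch of the idea using plain Java NIO file locks. It is not HornetQ's actual activation code; the lock file names follow the tables in this post and the shared directory path is made up:

      import java.io.File;
      import java.io.RandomAccessFile;
      import java.nio.channels.FileChannel;
      import java.nio.channels.FileLock;

      // Illustrative sketch only, not HornetQ's implementation.
      // Each node runs this against the same shared-store directory.
      public class SharedStoreNode {

          public static void main(String[] args) throws Exception {
              File dir = new File(args.length > 0 ? args[0] : "/shared/journal");
              dir.mkdirs();

              // 1. Queue on backup.lock: the node that gets it is the backup,
              //    the others block here ("waits for backup.lock" in the tables).
              FileChannel backupChannel = open(new File(dir, "backup.lock"));
              FileLock backupLock = backupChannel.lock();  // blocks until granted
              System.out.println("backup.lock held, waiting for live.lock");

              // 2. Block on live.lock: it is released when the live node stops or
              //    crashes (the OS frees the locks of a dead process), so acquiring
              //    it is the signal to activate as the live server.
              FileChannel liveChannel = open(new File(dir, "live.lock"));
              FileLock liveLock = liveChannel.lock();      // blocks until granted

              // 3. Give up backup.lock so the next waiting node moves up in line.
              backupLock.release();
              System.out.println("live.lock held (" + liveLock + "), activating as the live server");
          }

          private static FileChannel open(File file) throws Exception {
              return new RandomAccessFile(file, "rw").getChannel();
          }
      }

      A crash of the live node releases its file lock automatically, which is why the tables below never need an explicit handover step.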
       

      When A crashes, B becomes the live node and C the main backup:

       

            A                          B                          C
      - DOWN                     - is live                  - is backup
                                 - holds live.lock          - holds backup.lock
                                                            - waits for live.lock

       

      Now we want to restart server A and make it the live server again.

       

      How can we do that?

       

      The simple way is to make A a backup server and fail over again and again until it becomes live.

      Let's change the configuration of A to flag it as a backup to the current live node B and restart it:

       

            A                          B                          C
      - is backup                - is live                  - is backup
      - waits for backup.lock    - holds live.lock          - holds backup.lock
                                                            - waits for live.lock

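      For completeness, flagging A as a backup is just a configuration change before restarting it. The sketch below shows the embedded route; the setter names follow the 2.x Configuration API and the journal path is made up, so treat the details as assumptions rather than a recipe (the equivalent in hornetq-configuration.xml is the backup and shared-store elements):

      import org.hornetq.core.config.Configuration;
      import org.hornetq.core.config.impl.ConfigurationImpl;
      import org.hornetq.core.server.HornetQServer;
      import org.hornetq.core.server.HornetQServers;

      // Rough sketch: restart node A flagged as a backup over the same shared store.
      public class RestartAAsBackup {

          public static void main(String[] args) throws Exception {
              Configuration config = new ConfigurationImpl();
              config.setBackup(true);                        // A queues on backup.lock instead of going live
              config.setSharedStore(true);                   // same store as B and C
              config.setJournalDirectory("/shared/journal"); // made-up path, point it at the shared store

              HornetQServer server = HornetQServers.newHornetQServer(config);
              server.start();                                // stays passive until it acquires live.lock
          }
      }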
       

      To have A become live again, we will have to kill B to fail over to C.

      Then A will be the main backup:

       

            A                          B                          C
      - is backup                - DOWN                     - is live
      - holds backup.lock                                   - holds live.lock
      - waits for live.lock

       

      At this point we can restart B. A holds backup.lock and will be the next server in line to become live:

       

            A                          B                          C
      - is backup                - is backup                - is live
      - holds backup.lock        - waits for backup.lock    - holds live.lock
      - waits for live.lock

       

      Then we also kill C to fail over to A:

       

            A                          B                          C
      - is live                  - is backup                - DOWN
      - holds live.lock          - holds backup.lock
                                 - waits for live.lock

       

      Again, we can now restart server C to get back to the same state as initially:

       

            A                          B                          C
      - is live                  - is backup                - is backup
      - holds live.lock          - holds backup.lock        - waits for backup.lock
                                 - waits for live.lock

       

      This should work, assuming that the admin *changes the configuration of A to flag it as a backup*.

      Indeed, this does not work if A is restarted as a live server.

      Let's check why. A is down, and B is the live node:

       

            A                          B                          C
      - DOWN                     - is live                  - is backup
                                 - holds live.lock          - holds backup.lock
                                                            - waits for live.lock

       

      At that point, the client is connected to B and C is the backup in its topology.

       

      If A is restarted as a live server, we would have:

       

            A                          B                          C
      - is live                  - DOWN                     - is backup
      - holds live.lock                                     - holds backup.lock
                                                            - waits for live.lock

       

      Assuming we stop server B cleanly (which deletes its lock file), C would not be activated and would not become live.

      A would become live *but* the client would never know about it!

      The client was connected only to B. When B is stopped, the client will not fail over to C. And even if it did fail over, C is not activated and would not accept any connections.

      So the client has no way of knowing that A is the new live node.
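      To illustrate the client-side problem, here is a purely illustrative sketch (not the HornetQ client API): the reconnection logic only ever tries the nodes the client learned when it first connected, so a server that was never announced to it, like the restarted A, is simply not in the candidate list.

      import java.util.List;

      // Purely illustrative: a client that only knows the live node it connected
      // to (B) and the backup that was announced to it (C).
      public class NaiveFailoverClient {

          // Topology learned at connection time; node A is not in it.
          private final List<String> knownNodes = List.of("nodeB:5445", "nodeC:5445");

          public String reconnect() {
              for (String node : knownNodes) {
                  if (tryConnect(node)) {
                      return node;       // nodeA:5445 is never even attempted
                  }
              }
              throw new IllegalStateException("no known node accepted the connection");
          }

          private boolean tryConnect(String node) {
              // Placeholder for a real transport connection attempt.
              return false;
          }
      }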

       

      I don't yet see how we could make A the live node again without first flagging it as a backup and failing over again and again until it is activated.

      This is the only way I see to let the clients know the correct topology at any given time.

       

      Do you have any alternative idea?

        • 1. Re: Failback to primary server with shared store
          timfox

          To simplify this:

           

          Nodes A and B, A is live, B is backup

           

          A fails, B becomes live.

           

          If you now want A to resume as live, e.g. node A is a bigger box than node B so you don't want traffic on node B for too long:

           

          You set A to backup = true in the config, then start it.

           

          It now becomes the backup

           

          You then fail over from B to A; now A is live again.

           

          Now restart B as a backup.

          • 2. Re: Failback to primary server with shared store
            jmesnil

            Yes, that's what happens if you have only one backup to the primary server.

             

            My point was that if you have more than one backup, there are more steps.

            Either you have to stop all the waiting backups and restart the primary server as a backup to make it the next backup to fail over to.

            Or you have to fail over through all the backups until it becomes live again.

             

            We need to make sure the steps to fail back to the primary server are well documented.

            • 3. Re: Failback to primary server with shared store
              timfox

              Well... there is only ever a maximum of one backup. The waiting backups are not backups yet - this is a different state (C is wrongly labelled as a backup in your state diagrams).

               

              If you have some waiting backups, simply stop them first before doing failback.