4 Replies Latest reply on Oct 14, 2010 9:53 AM by ataylor

    Failback in new HA

    ataylor

      We need to be able to support failback in the new HA code. Basically this is the original live server resuming its role after it has been failed over to a backup node. Typically this is for platforms where AS nodes will have live/backup on the same server, as typically used by JEE applications. The other case is 'serious' messaging users that will create dedicated AS nodes for backup servers.

       

      Firstly I'm making a few changes to make this a bit simpler to implement. I'm moving all the file locking code into a NodeManager class to clean up HornetQServerImpl. Also, we currently create 3 lock files: a live lock, a backup lock and a node id lock. Instead of this I will be using 1 file and locking portions of it. I'll also be using this file to hold the node id and the state of the current live server, which I will come to later. So the file will be laid out like this (a rough sketch in Java follows the layout):

       

      byte 1 - server status

      byte 2 - live lock segment

      byte 3 - backup lock segment

      bytes 4 to 19 - node id
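
      As a rough sketch of how that single file and its lock segments could work with java.nio region locks (class and constant names here are mine, not the actual NodeManager code):

      import java.io.RandomAccessFile;
      import java.nio.ByteBuffer;
      import java.nio.channels.FileChannel;
      import java.nio.channels.FileLock;

      // Hypothetical sketch of the single lock file described above; names are
      // illustrative, not the real HornetQ code.
      public class LockFileSketch {
          private static final long STATUS_POS = 0;      // byte 1: server status (L, P or F)
          private static final long LIVE_LOCK_POS = 1;   // byte 2: live lock segment
          private static final long BACKUP_LOCK_POS = 2; // byte 3: backup lock segment
          private static final long NODE_ID_POS = 3;     // bytes 4 to 19: 16-byte node id

          private final FileChannel channel;

          public LockFileSketch(String file) throws Exception {
              channel = new RandomAccessFile(file, "rw").getChannel();
          }

          // Block until this node holds the one-byte live lock segment.
          public FileLock lockLive() throws Exception {
              return channel.lock(LIVE_LOCK_POS, 1, false);
          }

          // Block until this node holds the one-byte backup lock segment.
          public FileLock lockBackup() throws Exception {
              return channel.lock(BACKUP_LOCK_POS, 1, false);
          }

          // Read the status byte; it is deliberately not covered by either lock.
          public byte readStatus() throws Exception {
              ByteBuffer buf = ByteBuffer.allocate(1);
              channel.read(buf, STATUS_POS);
              return buf.get(0);
          }

          // Overwrite the status byte and flush it so other nodes see it.
          public void writeStatus(byte status) throws Exception {
              channel.write(ByteBuffer.wrap(new byte[] { status }), STATUS_POS);
              channel.force(false);
          }
      }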

       

      The locking algorithm will be pretty much the same, i.e. the backup waits on the live lock, potential backups wait on the backup lock, etc.

       

      The server status will work as follows: there will be 3 statuses, L meaning a server is live, P meaning a server is paused i.e. awaiting restart, and F meaning a live server wants to fail back.

       

      A live server starting will create the file, obtain the live lock and set the status in the file to L(ive). On normal shutdown it sets the status to P(aused).
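
      A minimal sketch of that lifecycle, assuming the hypothetical LockFileSketch helpers above:

      import java.nio.channels.FileLock;

      // Sketch of the live server lifecycle described above.
      public class LiveLifecycleSketch {
          static final byte LIVE = 'L', PAUSED = 'P', FAILING_BACK = 'F';

          private final LockFileSketch lockFile;
          private FileLock liveLock;

          public LiveLifecycleSketch(LockFileSketch lockFile) {
              this.lockFile = lockFile;
          }

          public void start() throws Exception {
              liveLock = lockFile.lockLive(); // creates/locks the file: we are live
              lockFile.writeStatus(LIVE);     // L: a live server is running
          }

          public void stopCleanly() throws Exception {
              lockFile.writeStatus(PAUSED);   // P: clean shutdown, backups stay passive
              liveLock.release();
          }
      }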

       

      So a backup starts and waits for the lock file to be created. When it is, it checks the state: if it is L then the live server must have crashed and the backup can come up; if it is P then the live server has been shut down normally and we don't start up, we just wait again for the file to be created.
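
      One way that backup loop could look, again using the hypothetical helpers above (the real activation code may well poll the file rather than block on the lock):

      import java.nio.channels.FileLock;

      // Sketch of the backup's activation check described above.
      public class BackupActivationSketch {
          public void awaitFailover(LockFileSketch lockFile) throws Exception {
              while (true) {
                  FileLock liveLock = lockFile.lockLive(); // granted once the live server lets go or dies
                  if (lockFile.readStatus() == 'L') {
                      return; // status still L: the live server crashed, so activate
                  }
                  liveLock.release(); // status P: clean shutdown, go back to waiting
                  Thread.sleep(2000); // hypothetical retry delay
              }
          }
      }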

       

      Now we have this we can add management methods to allow users to stop servers and force failover, i.e. stop but leave the status as L(ive).

       

      Now failback: if a live server is restarted we can check a flag, say allowFailback. If this is true then, before obtaining the live lock (which is obviously locked), we set the status to F(ailback). Any potential backup nodes that are still awaiting the backup lock will only lock the backup lock if the status is not F. This means we can guarantee the original live server taking over from the new live server.
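
      Sketched out with the same hypothetical helpers (allowFailback here is just a flag passed in, however it ends up being configured):

      import java.nio.channels.FileLock;

      // Sketch of the failback handshake described above.
      public class FailbackSketch {
          // Run on the restarted original live server.
          public void restartAsLive(LockFileSketch lockFile, boolean allowFailback) throws Exception {
              if (allowFailback) {
                  lockFile.writeStatus((byte) 'F'); // F: tell waiting backups a failback is pending
              }
              FileLock liveLock = lockFile.lockLive(); // blocks until the activated backup stops
              lockFile.writeStatus((byte) 'L');        // we are live again
          }

          // Run on a node wanting to register as a backup; note the check and the
          // lock are not atomic, so real code would need to re-check the status.
          public boolean tryRegisterAsBackup(LockFileSketch lockFile) throws Exception {
              if (lockFile.readStatus() == 'F') {
                  return false; // an ex-live server is failing back, stand down
              }
              lockFile.lockBackup();
              return true;
          }
      }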

       

      At this point, the new live server can either be shut down manually or maybe we could have a special cluster message that forces it. The live server has now failed back.

       

      This also makes the tests much cleaner as we can write an InVMNodeManager.

        • 1. Re: Failback in new HA
          jmesnil

          Andy Taylor wrote:

           

          Now failback: if a live server is restarted we can check a flag, say allowFailback. If this is true then, before obtaining the live lock (which is obviously locked), we set the status to F(ailback). Any potential backup nodes that are still awaiting the backup lock will only lock the backup lock if the status is not F. This means we can guarantee the original live server taking over from the new live server.

          I don't understand this part.

           

          After failover, the backup node was activated, took the lock on the live byte and released the lock on the backup byte.

          (Another backup node may then take the lock on the backup byte).

          I don't understand in your paragraph who is setting the status to F. I suppose it is the activated backup, since it is the one currently locking the live byte. Is that correct?

          So when the live server is restarted, we need the backup to release the live lock through the management API. When that's the case, the activated backup could rejoin the list of backup nodes waiting to take the lock on the backup byte.

          Is that how you envision it?

          • 2. Re: Failback in new HA
            ataylor

            After failover, the backup node was activated, took the lock on the live byte and released the lock on the backup byte.

            (Another backup node may then take the lock on the backup byte).

            The backup node will only release the backup lock once it has decided to failover.

             

            I don't understand in your paragraph who is setting the status to F. I suppose it is the activated backup, since it is the one currently locking the live byte. Is that correct?

            A live server that has been restarted sets this to F and then waits for the live lock. This means that when the backup is killed no other backup will take the live lock, meaning we guarantee that failback occurs.

             

            So when the live server is restarted, we need the backup to release the live lock through the management API. When that's the case, the activated backup could rejoin the list of backup nodes waiting to take the lock on the backup byte.

            Is that how you envision it?

            Yes, or we could use an extra byte in the lock file to notify a backup to stop: the restarted live server would set this and the backup would check it every n seconds.
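
            That polled stop byte might look something like this (the offset and the interval are made-up values, and stopServer stands in for whatever shutdown hook the server exposes):

            import java.nio.ByteBuffer;
            import java.nio.channels.FileChannel;

            // Sketch of the polled "stop" byte idea: the restarted live server
            // writes a flag byte and the activated backup checks it periodically.
            public class StopBytePollerSketch {
                private static final long STOP_POS = 20; // hypothetical spare byte after the node id

                public void pollForStop(FileChannel channel, Runnable stopServer) throws Exception {
                    ByteBuffer buf = ByteBuffer.allocate(1);
                    while (true) {
                        Thread.sleep(5000); // "every n seconds"
                        buf.clear();
                        channel.read(buf, STOP_POS);
                        if (buf.get(0) != 0) {
                            stopServer.run(); // shut down so the ex-live server can fail back
                            return;
                        }
                    }
                }
            }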

            • 3. Re: Failback in new HA
              jmesnil

              Andy Taylor wrote:

              I don't understand in your paragraph who is setting the status to F. I suppose it is the activated backup, since it is the one currently locking the live byte. Is that correct?

              A live server that has been restarted sets this to F and then waits for the live lock. This means that when the backup is killed no other backup will take the live lock, meaning we guarantee that failback occurs.

              Ok, if I understood you correctly, the status byte is not locked by the active node.

              When the live server is restarted, it can change this byte value even though the activated backup is holding the live lock. Is that correct?

              • 4. Re: Failback in new HA
                ataylor

                Ok, if I understood you correctly, the status byte is not locked by the active node.

                When the live server is restarted, it can change this byte value even though the activated backup is holding the live lock. Is that correct?

                Yes, that's correct. However, the only time a non-live node will change it is when an ex-live server wants to fail back.