6 Replies Latest reply on Aug 30, 2009 3:36 PM by bmmc

    Backup to Primary fail-back procedure?

    bmmc

      We have tested failover from the primary node to the backup node, and clients reconnect well enough; however, we are not clear on what the expected procedure is for getting the clients back to the primary once it is restored.

      If we recover the primary and then shut down the backup (for example), should clients attempt to reconnect to the primary automatically? Possibly I am just missing a configuration step somewhere (I hope so), but that doesn't seem to happen.

      I vaguely remember reading somewhere that currently the backup does not try to sync messages/sessions/connections back to the primary when it is recovered, so what should the failback procedure be to have zero (or at least minimal) downtime for the clients?

      Thanks for the advice in advance.

      BMMc.

        • 1. Re: Backup to Primary fail-back procedure?
          clebert.suconic

          That is something being improved in the next release.

          For now, you need to copy the data manually after failover, for example by redirecting the traffic on that node to another node on the cluster, and then restarting the whole pair.

          I know this is not ideal for GA, but it is being worked on at the moment.
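
As a rough illustration of the "redirect to another node" idea, here is a minimal sketch of application-side node selection: try a list of nodes in preference order and use the first one that answers. This is not the messaging client's API; `NodeSelector`, `firstReachable`, the host names, and the reachability probe are all hypothetical, and a real application would plug in its own connection attempt as the probe.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper; not part of any messaging library.
final class NodeSelector {
    private NodeSelector() {}

    // Returns the first node, in preference order, that the probe reports as reachable.
    static Optional<String> firstReachable(List<String> nodesInPreferenceOrder,
                                           Predicate<String> isReachable) {
        for (String node : nodesInPreferenceOrder) {
            if (isReachable.test(node)) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }
}

final class NodeSelectorDemo {
    public static void main(String[] args) {
        // Assumed host names, for illustration only.
        List<String> nodes = Arrays.asList("primary-host", "backup-host", "cluster-node-2");

        // Simulate the maintenance window: the pair is down while data is copied.
        Set<String> down = new HashSet<>(Arrays.asList("primary-host", "backup-host"));

        Optional<String> target = NodeSelector.firstReachable(nodes, n -> !down.contains(n));
        System.out.println(target.orElse("no node reachable"));
    }
}
```

Keeping the probe as a `Predicate` means the selection logic can be tested without a running broker; whether clients can actually be moved this way while the applications stay up depends on the client library, which this sketch does not address.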

          • 3. Re: Backup to Primary fail-back procedure?
            bmmc

            I re-read the doc section, but I am still confused as to how exactly we should be failing clients back. I don't see much discussion of it in the doc.

            Should we shut down the clients, copy the data from the backup node to the primary node, and then start the clients back up, expecting them to connect to the primary?

            If these nodes are participating in a cluster and we cleanly shut down the backup after a primary has catastrophically failed, will the clients cleanly try to reconnect to another node in the cluster? If this were possible, I think it would solve most problems, because it would give us time to move the data from backup to primary without the clients having downtime, while still remaining ACID.

            I don't think you are suggesting that we move the data file while clients are still connected to the backup, because I don't see how you would ever get an accurate snapshot of the state while objects are still shifting around.

            Any help would be appreciated,

            BMMc.

            • 4. Re: Backup to Primary fail-back procedure?
              bmmc

              I think this is the statement I don't fully understand:

              "redirecting the traffic on that node to another node on the cluster"

              How do you administratively direct the clients to another node on the cluster while the applications are still running? I think that is all I am really missing.

              Thanks,

              BMMc.

              • 5. Re: Backup to Primary fail-back procedure?
                timfox

                Hi bmmc-

                There currently is no way to re-instate a live node with a new backup node while it is running. So, if a node fails over to its backup, that backup node becomes live and has to live for some time with no backup, until you can take it down and bring it back up with a backup. (Actually that is no different from ActiveMQ)

                Before GA we're going to be making some significant changes in the area of replication and failover, and I hope to address this.

                I agree that we really need a seamless process for adding a backup to a live node so we can continue with zero downtime after failure.
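
To make the lifecycle described in this reply concrete, here is a minimal sketch that models it as a small state machine. It is only a model of the behaviour stated above (failover leaves the promoted node without a backup, and a backup can only be re-attached by taking the node down and bringing the pair back up); `PairState` and `PairLifecycle` are hypothetical names, not server classes.

```java
// Hypothetical model of the live/backup pair lifecycle; not server code.
enum PairState { LIVE_WITH_BACKUP, LIVE_WITHOUT_BACKUP, STOPPED }

final class PairLifecycle {
    private PairState state = PairState.LIVE_WITH_BACKUP;

    PairState state() {
        return state;
    }

    // The live node fails; the backup takes over and becomes live with no backup.
    void liveNodeFails() {
        if (state != PairState.LIVE_WITH_BACKUP) {
            throw new IllegalStateException("no backup available to fail over to");
        }
        state = PairState.LIVE_WITHOUT_BACKUP;
    }

    // Attaching a new backup to a running live node is not currently possible.
    void attachBackupWhileRunning() {
        throw new UnsupportedOperationException(
            "a backup cannot be attached while the live node is running");
    }

    // Take the node down (the downtime window).
    void stop() {
        state = PairState.STOPPED;
    }

    // Bring the node back up together with a backup.
    void startWithBackup() {
        if (state != PairState.STOPPED) {
            throw new IllegalStateException("stop the running node before restarting the pair");
        }
        state = PairState.LIVE_WITH_BACKUP;
    }
}

final class PairLifecycleDemo {
    public static void main(String[] args) {
        PairLifecycle pair = new PairLifecycle();
        pair.liveNodeFails();      // backup is now live, running without a backup
        pair.stop();               // planned downtime
        pair.startWithBackup();    // redundancy restored
        System.out.println(pair.state());
    }
}
```

The only path back to `LIVE_WITH_BACKUP` in this model goes through `STOPPED`, which is the downtime the thread is asking about.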

                • 6. Re: Backup to Primary fail-back procedure?
                  bmmc

                  Thanks for the information. Currently this is not a show stopper for us; I just wanted to fully understand my options. Since there are no requirements to use this right now, I think we will just come back to it after GA.

                  Thanks,

                  BMMc.