5 Replies Latest reply on Jun 19, 2017 1:14 PM by jbertram

    Does AMQ7.0 guarantee no message loss for HA cluster using replication.

    hellosmith420

      Hello,

      I have recently started looking into AMQ7 as an alternative to RabbitMQ.

      In my enterprise requirement, HA is very important.

      So, I am interested in using replication to achieve HA, as a shared storage configuration is time-consuming and complex in my cloud architecture.

      But I need to know whether there is any chance of message loss in any scenario of HA over replication.

      Since messages are stored only in the master broker, how are they recovered if it crashes?

      I did not find a straight answer in the HA section of the AMQ documentation.

       

      Thanks.

      John

        • 1. Re: Does AMQ7.0 guarantee no message loss for HA cluster using replication.
          martyn-taylor

          John,

           

          The TL;DR is yes, we do have guarantees around message loss.

           

          There are a couple of related questions here, and the answer is not straightforward.  I've tried to address each component of your query below.

           

          Q. Since messages are stored only in the master broker, how are they recovered if it crashes?

           

          A. When replication is enabled and a live and backup are paired, the live broker will ensure that the backup has a copy of the message before taking ownership.  In other words, the live never sends an ack back to the client until the backup has a copy of the message and has saved it to disk.  Note that only messages that are marked to be persisted are replicated, e.g. durable messages and messages sent with QoS "at least once" or "exactly once".
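
The ordering described above (persist on the live, copy to the backup, only then ack) can be illustrated with a toy model. This is a minimal sketch, not AMQ7 code: the `Broker` and `LiveBroker` classes, the `journal` list standing in for the on-disk journal, and the `send` method are all hypothetical names made up for the illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Broker:
    """Toy broker (hypothetical): 'journal' stands in for the on-disk journal."""
    name: str
    journal: list = field(default_factory=list)

    def persist(self, message: str) -> None:
        self.journal.append(message)


@dataclass
class LiveBroker(Broker):
    """Toy live broker that is optionally paired with a backup."""
    backup: Optional[Broker] = None

    def send(self, message: str, durable: bool = True) -> str:
        """Accept a message and return an ack only once it is safe."""
        if not durable:
            # Messages not marked for persistence are neither journaled
            # nor replicated in this model.
            return "ack"
        self.persist(message)
        if self.backup is not None:
            # Replicate before acknowledging: the ack is returned only
            # after the backup has stored its own copy.
            self.backup.persist(message)
        return "ack"


backup = Broker("backup")
live = LiveBroker("live", backup=backup)
live.send("order-1")                   # durable: journaled on both brokers
live.send("heartbeat", durable=False)  # non-durable: not replicated
print(backup.journal)
```

If the live broker crashed right after the first `send` returned, the backup in this model would already hold `order-1`, which is the property the answer above describes.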

           

          Q. Is there any chance of message loss in any scenario of HA over replication?

           

          A.  This may be obvious, but I want to be clear, as it sometimes causes confusion.  There are always failure scenarios in which any system could lose data integrity, i.e. lose messages.  Perhaps a better question to ask is what failures a system based on AMQ7 replication can tolerate whilst maintaining data integrity.  I've outlined some typical scenarios below:

           

          1. Single node failure (non-HA configuration, persistence enabled).  Provided the disk survives the node failure, the AMQ7 broker is able to recover all state from disk and data integrity is preserved.  Obviously availability will not be preserved, as the node is no longer able to service requests.  If the disk is corrupted then data integrity may not be preserved.

           

          2. Single node failure (HA replication, persistence enabled).  Data will be replicated across a live/backup pair.  If *one* node fails, availability and integrity will be preserved.   If both nodes fail, availability is lost, and if both of the disks are corrupted data integrity is lost.

           

          3. Multiple node failures (HA replication).  It is possible to configure a pool of backups.  If a node fails, its backup will take over and start servicing requests.  One node in the pool of backups will become the backup for the new live node and start replicating, meaning multiple node failures can be tolerated, provided they do not all happen at exactly the same time.

           

          Note: There is a small time window, between when the new live node starts and when its backup becomes active, during which data is not yet fully replicated.
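
Scenario 3 can be sketched as a small state machine. This is a toy model under stated assumptions, not how AMQ7 is implemented: the `Cluster` class and its fields are hypothetical, brokers are represented by plain strings, and a spare is treated as fully synchronized the moment it pairs with the new live, which ignores the time window mentioned in the note above.

```python
class Cluster:
    """Toy model (hypothetical): one live broker, its paired backup, and spares."""

    def __init__(self, live, backup, spares):
        self.live = live
        self.backup = backup        # currently paired with the live broker
        self.spares = list(spares)  # idle backups waiting in the pool

    def fail_live(self):
        """Simulate the current live node crashing."""
        if self.backup is None:
            # Nothing is left to take over, so availability is lost.
            self.live = None
            return
        # The paired backup takes over and becomes the new live.
        self.live = self.backup
        # A spare from the pool, if any, pairs with the new live.
        self.backup = self.spares.pop(0) if self.spares else None

    @property
    def available(self):
        return self.live is not None


cluster = Cluster(live="node-a", backup="node-b", spares=["node-c"])
cluster.fail_live()  # node-b takes over, node-c becomes its backup
cluster.fail_live()  # node-c takes over, no backup remains
print(cluster.live, cluster.backup, cluster.available)
```

In this model each further failure is survivable only while a paired backup exists, so sequential failures are tolerated until the pool is exhausted.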

           

          There are obviously many other failure scenarios, e.g. how to tolerate network failures.  It is possible to tolerate many of them; it's largely up to the deployment architecture.  I'm sure the same scenarios crop up when configuring shared store solutions.

           

          I hope this answers your question.

           

          Please check out the HA section of the user guide for more information on this.

           

           

          Regards
          Martyn

          • 2. Re: Does AMQ7.0 guarantee no message loss for HA cluster using replication.
            hellosmith420

            Hello Martyn,

            Thanks for your prompt and detailed answer.

            Please excuse any irrelevant questions and my limited knowledge.

             

            You have mentioned "When replication is enabled and a live and backup are paired, the live broker will ensure that the backup has a copy  of the message before taking ownership"

            Here, does the term "backup" mean the slave broker?

            In that case, I found the following in the AMQ broker guide:

            "All persistent data received by the master broker is synchronized to the slave when the master drops from the network. A slave broker first needs to synchronize all existing data from the master broker before becoming capable of replacing it. The time it will take for this to happen depends on the amount of data to be synchronized and the connection speed."

            So, it says that synchronization is not real-time, but happens once the master goes down.

             

            Thanks,

            John

            • 3. Re: Does AMQ7.0 guarantee no message loss for HA cluster using replication.
              jbertram

              "All persistent data received by the master broker is synchronized to the slave when the master drops from the network. A slave broker first needs to synchronize all existing data from the master broker before becoming capable of replacing it. The time it will take for this to happen depends on the amount of data to be synchronized and the connection speed."

              So, it says that synchronization is not real-time, but happens once the master goes down.

              I can somewhat understand why you might interpret this bit of documentation as saying that synchronization is not real-time, but happens once the master goes down.  However, that's not an accurate interpretation.  What the documentation is really saying is two things:

              • At the point the master drops from the network it will have already synchronized all persistent data it has received to the slave (assuming the slave has finished the initial synchronization phase - see more in the next point).
              • When a slave first connects to a master it must synchronize all the data from the master (i.e. messages already in the journal as well as messages which arrive after it connects).  How long this initial synchronization phase lasts will be dictated by the amount of data to be synchronized and the speed of the network connection.  Not until the initial synchronization phase is complete will the slave be a viable replacement for the master since it won't have all the master's data.

               

              Furthermore, it really doesn't make sense for the master to synchronize its data with the slave once it drops from the network as it relies on the network connection to perform the synchronization in the first place.
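
As a rough illustration of how the length of the initial synchronization phase scales, here is a back-of-the-envelope estimate. It is only a sketch: the function name and the figures are hypothetical, and it ignores messages that arrive during synchronization, protocol overhead, and disk speed, so it gives a lower bound at best.

```python
def initial_sync_seconds(journal_bytes: int, link_bytes_per_second: float) -> float:
    """Lower bound on initial sync time: data size divided by link throughput."""
    if link_bytes_per_second <= 0:
        raise ValueError("link throughput must be positive")
    return journal_bytes / link_bytes_per_second


# Hypothetical figures: a 10 GiB journal over a link sustaining 100 MiB/s.
journal = 10 * 1024**3
link = 100 * 1024**2
print(initial_sync_seconds(journal, link))  # 102.4
```

Until that phase completes, the slave does not hold all of the master's data and is not a viable replacement for it.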

               

              Does that clarify?

              • 4. Re: Does AMQ7.0 guarantee no message loss for HA cluster using replication.
                hellosmith420

                Sorry for the late reply.

                Justin, thanks for your detailed clarification.

                Yes, I got my answer.

                Basically, synchronization starts when the slave connects. In practice, the slave would be started together with the master in production, so there is no issue.

                My next question is not directly about this topic, but it is related.

                Does replication impact performance, compared with shared storage, because it has to copy messages?

                Basically, an apples-to-apples performance comparison between replication and shared storage in the developer guide would be really appreciated, and it would help us upgrade our existing system. Allocating shared storage is a really painful and time-consuming task on a "messaging as a service" platform.

                 

                Thanks,

                John

                • 5. Re: Does AMQ7.0 guarantee no message loss for HA cluster using replication.
                  jbertram

                  Does replication impact performance, compared with shared storage, because it has to copy messages?

                  Replication requires physically copying the data from one broker to another over the network (as the documentation describes).  Shared storage doesn't require any such work by the broker.  Whether or not the additional overhead impacts performance in a statistically significant way will depend on the environment (e.g. network speed) and your use-case. 

                   

                  Basically, an apples-to-apples performance comparison between replication and shared storage in the developer guide would be really appreciated, and it would help us upgrade our existing system.

                  What specific things would you expect to find in this comparison?  As I mentioned previously, it's impossible to quantify the performance impact of the various broker configuration choices.  The documentation could certainly discuss things in relative terms (e.g. copying data will be slower than not copying data), but there are so many factors involved in overall performance that it's impossible to know whether that relative difference will actually have a meaningful impact.