7 Replies · Latest reply on Jun 1, 2006 6:34 AM by belaban

    2PC replication and recovery

    timfox

      Hi Chaps-

      I was interested in how you've approached the problem of 2PC replication and recovery in JBoss Cache, since I am currently approaching an analogous problem and looking at ways to solve it.

      We have a master node and X replica nodes which we can fail over onto; each node can have its own, non-shared persistent store (if it's shared the problem becomes a lot easier, but we are considering the non-shared case).

      Something happens (e.g. we add a reliable JMS message) to the master node, and we want to replicate that state change to each of the replica nodes in a reliable way, so we can guarantee that either they all get it or none of them do.

      We can't tolerate a situation where one of the nodes gets the update but another crashes before it records its update, resulting in different nodes having different states when they are brought back up.

      Each node's persistent state is stored in its own persistent store.

      It seems some kind of 2PC can be used to solve this problem (I think you have done something similar?).

      So first we multicast a "prepare" to all the nodes; then, if all nodes respond "ok", we multicast a "commit", otherwise we multicast a "rollback". This is all pretty standard 2PC stuff.

      There are a couple of edge conditions though:

      1) We multicast the prepare, and at least one of the participants says "no" (or has crashed, say), so we multicast the "rollback", but the crashed node doesn't get it since it's down. So when it is brought up (maybe days later) it has some state hanging in a prepared state which needs to be rolled back. How does it know to roll back the update?

      2) We multicast the prepare, and all participants say "yes". The commit is then multicast, and some of them manage to commit, but a node crashes before its commit is processed. When the node comes back up, it needs to process the commit, but how does it know whether to commit or roll back the update?

      3) Other related edge cases, e.g. all nodes crash before commit is processed.

      So in other words, to handle all these edge cases properly, the co-ordinator needs to have a recovery protocol and a persistent transaction log.

      I.e. basically it needs to have much of the functionality of a recovering transaction manager such as JBoss Transactions.
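
      To make the recovery requirement concrete, the co-ordinator I have in mind looks roughly like the sketch below. This is only illustrative Java with made-up types (ReplicaGroup, TxLog, Vote), not real JBoss Messaging code; the key point is that the commit/rollback decision is forced to a durable log before the outcome is multicast, so a restarted co-ordinator can re-send it to any node stuck in the prepared state.

      import java.util.List;

      enum Vote { YES, NO }

      // Multicast channel to the replicas (made-up interface).
      interface ReplicaGroup {
          List<Vote> multicastPrepare(String txId, byte[] change); // blocks until all votes are in; a missing reply is returned as NO
          void multicastCommit(String txId);
          void multicastRollback(String txId);
      }

      // Durable transaction log (made-up interface); logDecision() forces to disk before returning.
      interface TxLog {
          void logDecision(String txId, String decision);
          void forget(String txId);
      }

      class Coordinator {
          private final ReplicaGroup group;
          private final TxLog log;

          Coordinator(ReplicaGroup group, TxLog log) {
              this.group = group;
              this.log = log;
          }

          void replicate(String txId, byte[] change) {
              List<Vote> votes = group.multicastPrepare(txId, change);
              boolean allYes = !votes.contains(Vote.NO);

              // The decision hits the disk *before* it is broadcast, so after a crash the
              // co-ordinator can re-read the log and re-deliver the outcome to any replica
              // still holding a prepared transaction (edge cases 1-3 above).
              log.logDecision(txId, allYes ? "COMMIT" : "ROLLBACK");

              if (allYes) {
                  group.multicastCommit(txId);
              } else {
                  group.multicastRollback(txId);
              }

              // A real implementation would only forget the entry once every replica
              // has acknowledged the outcome.
              log.forget(txId);
          }
      }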

      I'm wondering whether the JBoss Cache co-ordinator handles these edge cases, or perhaps you have found some other way around this?

      In JBoss Messaging we need to be able to provide a very strong reliability guarantee, so we need to cope with the recovery cases.

        • 1. Re: 2PC replication and recovery
          manik

          The behaviour is configurable - if you use synchronous replication, transactions use 2PC. With async comms we default to a (less reliable) 1PC protocol.

          With our 2PC setup, as you mentioned, we have issues with the edge cases. Let's look at these in turn.

          1) 'prepare' states are not persisted, so when a node is brought back after a crash it will have its last known 'good' state.

          2) This depends on whether your commit phase is synchronous (by default this is set to false). If true, we wait for all nodes to respond as expected to the commit message, and if any node fails, we broadcast a rollback. The crashed node will be rolled back to the state before the prepare (see above).

          3) This will behave as above again.

          There is another case, where the coordinator crashes after issuing the commit and is then unable to check whether all participants committed successfully. I need to check what happens in this case.
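
          To make point 1) concrete, the participant side behaves roughly like the sketch below - this is just an illustration with made-up types, not actual JBossCache code. Prepared changes live only in a transient in-memory workspace, so a crash simply loses them and the node comes back with its last committed state.

          import java.util.HashMap;
          import java.util.Map;

          // Illustration only - not JBossCache code. Nothing is written to disk at prepare time.
          class Participant {
              private final Map<String, Object> committedState = new HashMap<String, Object>();
              private final Map<String, Map<String, Object>> workspaces =
                      new HashMap<String, Map<String, Object>>();

              // prepare: stage the modifications and vote YES. A crash here simply loses the
              // workspace - on restart the node only has committedState, its last 'good' state.
              boolean prepare(String txId, Map<String, Object> modifications) {
                  workspaces.put(txId, modifications);
                  return true;
              }

              void commit(String txId) {
                  Map<String, Object> mods = workspaces.remove(txId);
                  if (mods != null) {
                      committedState.putAll(mods); // apply to the in-memory cache
                  }
              }

              void rollback(String txId) {
                  workspaces.remove(txId);         // just throw the staged changes away
              }
          }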

          • 2. Re: 2PC replication and recovery
            timfox

             

            "manik.surtani@jboss.com" wrote:


            1) 'prepare' states are not persisted, so when a node is brought back after a crash it will have its last known 'good' state.



            Assuming that the transaction writes data to its persistent store (i.e. it's not just an in-memory update), then in 2PC all the "hard work" needs to be done at the prepare stage, not the commit stage.

            This is so that you know, if the prepare succeeded, the commit is very likely to succeed, since you've already done all the persisting. That is kind of the whole point of 2PC, IMHO.

            If you don't persist at prepare, but do it at commit, then effectively you're doing 1PC, since you don't know whether the commit is going to succeed (the database may be down, etc.) until you issue the commit.
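
            For comparison, the kind of participant I have in mind is sketched below (MessageStore is a made-up persistence interface, not a real API): all of the expensive I/O happens in prepare(), commit() is just a cheap durable flag flip, and after a crash the prepared records are exactly the in-doubt ones whose outcome has to come from the co-ordinator.

            import java.util.List;

            // Sketch only - MessageStore is a made-up persistence interface.
            interface MessageStore {
                void writePrepared(String txId, byte[] message); // forced to disk before returning
                void markCommitted(String txId);                 // cheap, durable flag flip
                void deletePrepared(String txId);
                List<String> findPreparedTxIds();                // in-doubt records after a crash
            }

            class PersistentParticipant {
                private final MessageStore store;

                PersistentParticipant(MessageStore store) {
                    this.store = store;
                }

                // All the expensive, failure-prone work (the actual write) happens here,
                // so a YES vote really does mean "the commit will almost certainly succeed".
                boolean prepare(String txId, byte[] message) {
                    try {
                        store.writePrepared(txId, message);
                        return true;
                    } catch (Exception e) {
                        return false;   // vote NO; the co-ordinator will multicast a rollback
                    }
                }

                void commit(String txId)   { store.markCommitted(txId); }
                void rollback(String txId) { store.deletePrepared(txId); }

                // On restart, anything still in the prepared state is in doubt; its fate must
                // come from the co-ordinator's transaction log, not be guessed locally.
                List<String> recoverInDoubt() {
                    return store.findPreparedTxIds();
                }
            }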


            • 3. Re: 2PC replication and recovery
              manik

              In the case of JBC the important bit is committing to the in-memory cache. The cache loader (persistent store) is dealt with later.

              You're right though, this is a flaw, and moving forward we have a view to mandating that all cache loaders are XA resources so that they participate in the transaction directly.

              Right now we've rolled our own prepare/commit/rollbacks for cache loaders.

              • 4. Re: 2PC replication and recovery
                timfox

                Ok thanks Manik-

                Yes, enlisting each participating remote "destination" as a separate XAResource in the co-ordinating node's transaction manager, and relying on the transaction manager to handle recovery, is what I currently have on the sketchpad...

                The difficult bit is fudging the 2PC protocol to work over multicast, since normally the TM would call prepare/commit etc. separately on each XAResource, rather than concurrently as happens when we multicast it, but this is doable.
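
                Very roughly, one way to do the "fudge" is a single XAResource that fronts the whole replica group, so the TM still drives the standard phases but each phase goes out as one multicast. The sketch below is just that - GroupChannel is a made-up abstraction (not a JGroups API), and it glosses over vote collection, timeouts, etc.

                import javax.transaction.xa.XAException;
                import javax.transaction.xa.XAResource;
                import javax.transaction.xa.Xid;

                // Sketch only - GroupChannel is a made-up multicast abstraction.
                interface GroupChannel {
                    boolean multicastPrepare(Xid xid);  // true only if every replica votes YES
                    void multicastCommit(Xid xid);
                    void multicastRollback(Xid xid);
                    Xid[] inDoubtXids();                // prepared-but-undecided transactions
                }

                class ReplicaGroupXAResource implements XAResource {
                    private final GroupChannel channel;

                    ReplicaGroupXAResource(GroupChannel channel) { this.channel = channel; }

                    public void start(Xid xid, int flags) {}
                    public void end(Xid xid, int flags) {}

                    public int prepare(Xid xid) throws XAException {
                        if (!channel.multicastPrepare(xid)) {
                            throw new XAException(XAException.XA_RBROLLBACK); // someone voted NO
                        }
                        return XA_OK;
                    }

                    public void commit(Xid xid, boolean onePhase) throws XAException {
                        if (onePhase && !channel.multicastPrepare(xid)) {
                            throw new XAException(XAException.XA_RBROLLBACK);
                        }
                        channel.multicastCommit(xid);
                    }

                    public void rollback(Xid xid) { channel.multicastRollback(xid); }

                    // The TM's recovery scan asks for in-doubt Xids and then tells us the
                    // outcome - which is exactly what edge cases 1) and 2) need.
                    public Xid[] recover(int flag) { return channel.inDoubtXids(); }

                    public void forget(Xid xid) {}
                    public boolean isSameRM(XAResource other) { return other == this; }
                    public int getTransactionTimeout() { return 0; }
                    public boolean setTransactionTimeout(int seconds) { return false; }
                }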

                I was kind of hoping you had approached this problem before so I could "steal" some code ;)

                • 5. Re: 2PC replication and recovery

                  Tim, are you sure you want full 2PC with recovery logging? In our 2PC protocol, we don't have the "recovery" logging per se, since I am not sure it makes sense, and it would be so expensive that it would defeat the purpose of in-memory caching as well!

                  • 6. Re: 2PC replication and recovery
                    timfox

                    Well, for JBoss Messaging it's necessary, otherwise we can't give the "once and only once" reliability guarantee, which seems to be a stronger guarantee than JBoss Cache can currently provide.

                    This is fine, we can roll it ourselves (actually it's not hard with a good transaction manager, since the TM handles all the hard work). I was just curious as to how you guys have solved the problem, but I guess you haven't crossed that bridge yet. :)

                    • 7. Re: 2PC replication and recovery
                      belaban

                       This comes a little late, but 2PC in JBossCache is implemented via the usual PREPARE multicast followed by a ROLLBACK or COMMIT multicast. However, we don't have to consider crashed members, because a failed member simply acquires the state from one of the other members present when it restarts.
                       However, if there is a persistent CacheLoader (e.g. JDBCCacheLoader) involved, we have a problem, as we currently manage the CacheLoader ourselves.
                       In the future, we will make JBossCache an XA resource which participates in a TX on behalf of the cluster of JBossCache instances, so on prepare() a PREPARE will be sent out across the cluster, and the result is the set of results returned from the cluster (if one fails, all fail). When this is the case, we can simply enroll an XA-capable JDBCCacheLoader with the TX manager, rather than with JBossCache, so the TxManager will tell the XA resource to recover(). This is something we don't currently handle with JBossCache. There is a (quite old) JIRA issue open regarding this.
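
                       The intended shape, roughly, is sketched below (placeholder variables only, not current JBossCache API): both the cluster-wide cache resource and an XA-capable CacheLoader end up enlisted with the JTA TransactionManager, which then owns prepare/commit and, after a crash, drives recover().

                       import javax.transaction.Transaction;
                       import javax.transaction.TransactionManager;
                       import javax.transaction.xa.XAResource;

                       // Sketch only - placeholder variables, not current JBossCache API.
                       class FutureEnlistmentSketch {
                           void update(TransactionManager tm,
                                       XAResource clusterResource,       // sends PREPARE/COMMIT across the cluster
                                       XAResource xaCacheLoaderResource) // e.g. an XA-capable JDBCCacheLoader
                                   throws Exception {
                               tm.begin();
                               Transaction tx = tm.getTransaction();
                               tx.enlistResource(clusterResource);
                               tx.enlistResource(xaCacheLoaderResource);

                               // ... apply modifications to the cache here ...

                               tm.commit(); // the TM drives 2PC over both resources; on restart it calls recover()
                           }
                       }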