13 Replies Latest reply on May 18, 2006 3:15 PM by Tim Fox

    Reconciling 2PC with (reliable) multicast

    Tim Fox Master

      To all you transaction gurus:

      A clustered topic exists with durable subscriptions on 3 different nodes of the cluster A, B and C, named subA, subB and subC

      We send a reliable (persistent) message to the Topic.

      The JMS spec says that we must ensure the message arrives at ALL of the durable subscriptions, it is not permissible that the message only arrives at, say, subA and subB due to failure on node C.

      To do this we can use 2PC and on the prepare stage we send the message to each node, and then we commit only if prepare succeeds.

      If a node fails then the transaction manager can do it's normal recovery procedure when the node comes back up.

      This is all well and good when we're talking to each node using unicast, but we want to have the benefit of (reliable) multicast and only broadcast the message once over the LAN.

      However if we're using multicast then effectively we'll be sending the prepare concurrently to all participants, and also the commit, which I think breaks the protocol since I think (please correct me if I am wrong) that the protocol requires individual participants to be contacted individually.

      How can we resolve this without having to have point to point communication to each subscription on the cluster?

      One possibility is to broadcast the message (the heavy data) first outside the transaction, then in the transaction use point to point and reference the message via an id.

        • 1. Re: Reconciling 2PC with (reliable) multicast
          Mark Little Master

          Why do you need transactions if you're using reliable multicast? Doesn't the reliable multicast implementation support atomic delivery? Most do, and those that don't typically rely on something like a gossip-based protocol to ensure reliable delivery eventually, but they're really intended for large scale (loosely coupled) environments.

          If it's not atomic, then what definition of reliable is used? (Even if it's causal or global ordering, I'd find it strange if it doesn't have atomic delivery).

          • 2. Re: Reconciling 2PC with (reliable) multicast
            Tim Fox Master

            We're using the JGroups RPCDispatcher which basically sends a multicast message then waits for replies from all nodes.

            While waiting for replies a node could crash so we could end up with the message on some nodes but not others.

            I don't think (Bela please correct me) JGroups will retry the RPC so the crashed node gets it when it comes back up, but I don't really know enough about JGroups to answer this properly.

            • 3. Re: Reconciling 2PC with (reliable) multicast
              Bela Ban Master

              No, the RPC is not retried, it terminates with the response from the crashed member marked as SUSPECTED.
              JGroups operates in-memory only, no logs are maintained, so when a node comes back up, we cannot do a recovery by looking at logs.
              So if you want to mimic 2PC across a cluster, then you probably have to rollback the TX if you cannot deliver the message to *all* the nodes in the cluster.
              Or, if the nodes use a shared DB, then you could say that - as long as the persistent message has been stored in the DB the TX can succeed.
              Really depends on the semantics you want to have

              • 4. Re: Reconciling 2PC with (reliable) multicast
                Mark Little Master

                So how precisely is this reliable multicast? I must be missing something ;-)

                • 5. Re: Reconciling 2PC with (reliable) multicast
                  Tim Fox Master

                  I guess the advantage over normal multicast is that you know either the message definitely reached the node, or definitely didn't.

                  If you don't know that you can't even do 2PC over the top.

                  • 6. Re: Reconciling 2PC with (reliable) multicast
                    Mark Little Master

                    Hmmm. That's multicast RPC for sure, but it doesn't imply reliability. Reliable group communications techniques have been described in the literature and implementations since the mid 1980's. What definition of reliability do we use?

                    • 7. Re: Reconciling 2PC with (reliable) multicast
                      Mark Little Master

                       

                      "timfox" wrote:
                      I guess the advantage over normal multicast is that you know either the message definitely reached the node, or definitely didn't.

                      If you don't know that you can't even do 2PC over the top.


                      Actually not true. There's no need for acks if you're using presumed abort, for instance. If the commit doesn't get an ack, the recovery system will keep trying. If the prepare isn't acked, then the coordinator may retry or simply rollback (if the participant did get the prepare but it was the ack that was lost, then it will eventually time out and roll back too). Transaction systems were running over UDP for years before TCP and layering their own ack/nack protocols on that (or not, depending on the implementation).

                      Reliable (guaranteed) delivery helps reduce the number of transactions that may roll back needlessly, but it's not required.

                      • 8. Re: Reconciling 2PC with (reliable) multicast
                        Mark Little Master

                         

                        "timfox" wrote:

                        However if we're using multicast then effectively we'll be sending the prepare concurrently to all participants, and also the commit, which I think breaks the protocol since I think (please correct me if I am wrong) that the protocol requires individual participants to be contacted individually.


                        By the time prepare ends, the coordinator needs to have responses from all participants. If it hasn't got them, it can retry. Or, you could fabricate responses locally (the only response you should fabricate is "no I can't prepare"). Thus any missing responses will cause the transaction to rollback.


                        How can we resolve this without having to have point to point communication to each subscription on the cluster?


                        You could make the transaction roll back (as above). BTW, you could always make use of the asynchronous prepare and commit aspect to JBossTS if you want.


                        One possibility is to broadcast the message (the heavy data) first outside the transaction, then in the transaction use point to point and reference the message via an id.


                        Yes, that's an option too. But then how would the receiver know that the data you've sent is to be handled in the context of a specific transaction?

                        • 9. Re: Reconciling 2PC with (reliable) multicast
                          Tim Fox Master

                          So to clarify:

                          On prepare, I multicast the prepare to all nodes. If any of them answer "no" then i return "no" as the result of the prepare to the transaction manager. This causes a rollback to be multicast to all nodes.

                          What if one of the nodes was up when the prepare was issued and stored the message in prepared state, but is down when the rollback is issued? It will never get the rollback unless the TM retries the rollback until all participants have said "yes"??

                          On commit (this is the bit I am not sure about), I multicast the commit to all nodes. Again not all nodes will necessarily be up, so the TM needs to retry too.

                          I guess there's no need to enlist each node as a separate resource - we can just enlist one resource and do our own protocol?

                          • 10. Re: Reconciling 2PC with (reliable) multicast
                            Mark Little Master

                             

                            "timfox" wrote:
                            So to clarify:

                            On prepare, I multicast the prepare to all nodes. If any of them answer "no" then i return "no" as the result of the prepare to the transaction manager. This causes a rollback to be multicast to all nodes.


                            Yes, or if you get an indication that the message wasn't delivered/acked.


                            What if one of the nodes was up when the prepare was issued and stored the message in prepared state, but is down when the rollback is issued? It will never get the rollback unless the TM retries the rollback until all participants have said "yes"??


                            Your participant has to record the reference to the transaction during prepare. During recovery, it should use this to call back to the coordinator and ask for the state of the transaction. In this case, it'll be told that the transaction rolled back.


                            On commit (this is the bit I am not sure about), I multicast the commit to all nodes. Again not all nodes will necessarily be up, so the TM needs to retry too.


                            During commit, you need to somehow tell the transaction system that the transaction didn't complete, so it can know to kick in recovery and retry later. I assume as far as the TS goes, "you" look like an XAResource and it's the XAResource implementation that is hiding the multicast?


                            I guess there's no need to enlist each node as a separate resource - we can just enlist one resource and do our own protocol?


                            That's what I thought you were doing. You enlist a single XAResource and this hides the multicasting. It looks to the TM as a single (logical) participant.

                            The problem you've got though is that the TM will then use the one-phase commit optimisation in this case: it only sees a single participant, so there's no need to do prepare. Not a good idea because really there are multiple participants. So you either need to disable one-phase commit optimization (possible in JBossTS, but not in all other implementations), or register a dummy XAResource as well, and have its prepare return OK and commit to likewise.

                            • 11. Re: Reconciling 2PC with (reliable) multicast
                              Tim Fox Master

                              It seems to be taking shape :)

                              "mark.little@jboss.com" wrote:

                              Your participant has to record the reference to the transaction during prepare. During recovery, it should use this to call back to the coordinator and ask for the state of the transaction. In this case, it'll be told that the transaction rolled back.


                              In our case (since we need to work with more basic TMs than JBoss TS) we won't hold a reference to the co-ordinator, so I guess we mark it as suspect after a time, and a sysadmin can apply a heuristic as discussed in the previous thread.


                              During commit, you need to somehow tell the transaction system that the transaction didn't complete, so it can know to kick in recovery and retry later. I assume as far as the TS goes, "you" look like an XAResource and it's the XAResource implementation that is hiding the multicast?


                              Right, so if any of the nodes fail to commit, then we return failure to the TM. It can then recover later. On recovery if a node gets the commit again and it has already committed it can just ignore.

                              I'm thinking there may be a different set of nodes up when the commit is issued to when the prepare was issued, so I can only return success from commit if i get success from all the nodes that were around when the prepare was issued. But I don't know that set (and neither does the TM since we are enlisted as a single XAResource) so I need to record this info somewhere I guess. Hmm



                              • 12. Re: Reconciling 2PC with (reliable) multicast
                                Mark Little Master

                                 

                                "timfox" wrote:
                                It seems to be taking shape :)


                                Good. We don't want to let this stretch to 7 pages though ;-)



                                "mark.little@jboss.com" wrote:

                                Your participant has to record the reference to the transaction during prepare. During recovery, it should use this to call back to the coordinator and ask for the state of the transaction. In this case, it'll be told that the transaction rolled back.


                                In our case (since we need to work with more basic TMs than JBoss TS) we won't hold a reference to the co-ordinator, so I guess we mark it as suspect after a time, and a sysadmin can apply a heuristic as discussed in the previous thread.


                                That's a very bad idea. It's one thing to say you don't depend on JBossTS (or any transaction service), it's another to not use the features and capabilities it provides when you're running within it. The lowest common denominator approach is not the right one IMO, particularly in this case: you don't want to guess as to the transaction outcome and cause heuristics. I would recommend that you save the coordinator reference if you've got it and do the right thing in that case. If you don't have a coordinator reference (because the implementation of the TM doesn't give you one) then you do as good a job as you can, but then punt back to the TM implementer to fix their side!




                                During commit, you need to somehow tell the transaction system that the transaction didn't complete, so it can know to kick in recovery and retry later. I assume as far as the TS goes, "you" look like an XAResource and it's the XAResource implementation that is hiding the multicast?


                                Right, so if any of the nodes fail to commit, then we return failure to the TM. It can then recover later. On recovery if a node gets the commit again and it has already committed it can just ignore.

                                I'm thinking there may be a different set of nodes up when the commit is issued to when the prepare was issued, so I can only return success from commit if i get success from all the nodes that were around when the prepare was issued. But I don't know that set (and neither does the TM since we are enlisted as a single XAResource) so I need to record this info somewhere I guess. Hmm


                                Yes, you'll need to persist that information in your XAResource wrapper and only return commit success once all references have been ticked off your list.

                                • 13. Re: Reconciling 2PC with (reliable) multicast
                                  Tim Fox Master

                                   

                                  "mark.little@jboss.com" wrote:


                                  Good. We don't want to let this stretch to 7 pages though ;-)



                                  We should be ok. We have 5 to go ;)



                                  That's a very bad idea. It's one thing to say you don't depend on JBossTS (or any transaction service), it's another to not use the features and capabilities it provides when you're running within it. The lowest common denominator approach is not the right one IMO, particularly in this case: you don't want to guess as to the transaction outcome and cause heuristics. I would recommend that you save the coordinator reference if you've got it and do the right thing in that case. If you don't have a coordinator reference (because the implementation of the TM doesn't give you one) then you do as good a job as you can, but then punt back to the TM implementer to fix their side!



                                  Point taken. As long as we make transactionmanager specific code pluggable we should be ok.



                                  Yes, you'll need to persist that information in your XAResource wrapper and only return commit success once all references have been ticked off your list.


                                  Ok this easy with JBoss TS as long as the XAResource is serializable. But otherwise I'm going to need to log this myself.