Three-phase commit protocol

Version 1

    Since reading Mark Little’s article [1] I have been trying to find time to further understand where three phase commit can be applied. I also read a nice summary article on Wikipedia[2] but was still missing the point. After I finished my earlier article on 2PC[3] I decided to investigate a little bit more on 3PC for my benefit.

     

    This is my humble summary of my understanding.


    When hearing about three-phase commit protocol you can observe it’s an upgraded two-phase commit for one more phase. But why any upgrade of the battle tested 2PC?

     

    About commit protocol

     

    Commit protocol ensures the atomicity of decision across multiple participants. In other words protocol ensures that all participants are either all committed or all aborted.
    Coordinator - (a master) it’s the transaction manager. It drives the transaction commit protocol to end successfully. If a failure occurs it commands all participants to abort.
    Participant - (a slave) it’s driven by the manager to do its work by finishing transaction by commit. When commanded to abort is has to be capable to revert the work to get the state

     

    Blocking consensus protocols

     

    The main motivation for the 3PC algorithm is to address the fact that two-phase commit protocol is blocking in case of a site (site means coordinator or participant) failure.


    The term blocking refers to the process of reaching the consensus on transaction commit[4]. The 2PC processing is blocked when the protocol reaches the prepare phase and the coordinator fails (e.g. it crashes, is unavailable because of network failures...). There is no way to finish with commit until coordinator and all participants are available. In other words the participant can’t reach a final state in presence of a failure.

     

    It would be nice to start a new different coordinator, responsible for handling recovery - ie. with functionality of the recovery manager, and to expect it learns how to finish the transaction from the environment around - but this is not the case of two-phase commit where you need to gain the state belonging to the old coordinator for being able to finish the transaction. During the time participants are imprisoned in the prepared state and they are not expected to change the state on their own.

     

    In practice the XA Specification[5] (specification for implementation of two-phase commit protocol) defines a heuristic state where participant can spontaneously decide to change its state (not obeying the transaction manager decision even when it’s already prepared). In such case the protocol can’t continue reliable working thus is stopped, the transaction is moved to a heuristic state and manual intervention is required. This practical change of the protocol enhances the blocking nature of the 2PC - it avoids the resources being locked permanently. But it has the drawback in the potential(!) loss of consistency.

     

    Another enhancement of 2PC to eliminate some of the blocking behaviour during recovery is the presumed abort and presumed commit optimizations. These optimization predicts that if on recovery there is no evidence about commit of the transaction the recovery procedure assumes the transaction was aborted (or committed) and act base on it.

     

    A non-blocking consensus protocol


    The three-phase commit protocol is said to be non-blocking. The blocking nature of the two-phase commit protocol is alleviated by introducing one more phase [10]..

     

    3PC splits the prepare state in two[2]. The rationale behind this is that the reachable final state for the participant in prepare state in 2PC is commit and abort (both possibilities). Splitting state to waiting and pre-commit in 3PC we got to the situation that there is only one final state for each of them - abort from waiting, commit from pre-commit. This new state should avoid need of the external input - ie. from coordinator.

     

    If a participant is in pre-commit state we know that all participant acked the initial coordinator canCommit query by ack vote and they are waiting to commit. Thus the pre-commit state could be marked as “commitable”. Because of this structure participants can collectively decide the overall result of the transaction in case the coordinator fails.

     

    3PC as non-blocking does not mean that the participants are not blocked in processing. If we think about database processing as the, let say, the most intuitive, the database has to start a local transaction and put locks on appropriate places when process the transaction and when agree to commit. Thus other operations could be blocked by this effort and needs to wait till the whole 3PC ends.
    The non-blocking means that protocol can proceed despite of existence of failures.

     

    The states of the 3PC are depicted at the following state diagram. The diagram is took over from work at [11]

     

     

    The state transition depicted with the solid line is normal execution when no crash or error happens. The transition depicted with the dot-dash line defines state where the coordinator/participant is ready to work but it waits for responds. Message was sent (e.g. from coordinator to participant but there is no response yet, or participant waits for next command). Then after predefined time the timeout occurs and the site follows the diagram transition. The dash line defines occurrence when the site crashed and now comes to live. Depending the environment it decides to change the state appropriately. The other rules are discussed below.

     

    The protocol defines timeouts. If the participant is not commanded by coordinator to finish the transaction in one direction (to commit or to abort) and timeout expires it follow in decision by defined rules and it’s not blocked to wait for coordinator to command.

     

    If coordinator crashes the 3PC expects a new coordinator to be elected. Because of the protocol the new coordinator is able to finish transaction only by querying participants - without need to know the state of the original coordinator (election protocol to find a new coordinator - possible from the available participants - is up to the implementation).

     

    The three phase commit protocol is not a salvation for all failure cases that the system can moved to. The original 3PC assume synchronous networks of fail-stop model. Fail-stop means that a fail can be caused only by crashing a node. There is no network partitions[6] or asynchronous communication (the model with partitions is called fail-recover).

     

    We should consider 3PC rather a family of protocols rather than a single well defined one. There are several implementation proposed - see [7],[8],[9], discussion at [4]. For example there is a protocol enhancement E3PC (enhanced 3PC)[7] which state to eliminate the issue of the network partitions.

     

    The protocol

     

    Let’s draft the 3PC protocol.


    The 3PC defines the following states:

    • initial (the 3PC processing is starting),
    • waiting (participant is available to commit , it received canCommi’ message from the coordinator),
    • pre-commit (participant is ready to commit and received preCommit message from the coordinator),
    • committed (participant is committed, it was commanded by coordinator to commit).

     

    It also has the following phases:

     

    • Phase 0: The start of the processing is the same. The business logic sends a message to the JMS queue and inserts a record to the database table. Each of them starts its local transaction which is enlisted to the global managed by coordinator.
    • Phase 1: The three phase commit starts when application logic says to commit the work. Now the coordinator commands both participants to switch to waiting by sending canCommit message and changes its own state from initial to waiting . Here we can expect participants locks their resources (as happens in case of parepare command in 2PC).
    • Phase 2: Participants acknowledged that are available to commit and coordinator sends preCommit message to both of them. Coordinator moves to pre-commit state.
    • Phase 3: Coordinator collects the acks on preCommit message and it commands both participant to commit. Participants are committed and resources are released.

     

    The diagram above (originated at [11]) visually points to the fact why the new phase was added[13]. You can see the final transition from the waiting state is to abort which is in opposite to the final transition of the pre-commit state which is default to commit. There is no state which would contain transition to both final states - to commit and to abort simultaneously. The structure of the protocol ensures that neither the participants nor the coordinator are further than one state transition from each other. That permits participants to decide - when the coordinator is not available - the final state of the transaction. When any participant is in the pre-commit state the transactions is about to commit - participants can be sure that coordinator decided to commit before. When there is no participant in pre-commit state the transaction is about to abort - participants know it’s possible that coordinator decided to abort.

     

    Dealing with failures

     

    What happens then in case of failure? [12][13][14]

     

    The failure of participant is observed by timeouting the coordinator while waiting the participant response.

     

    • If timeout occurs for waiting state we know that there are some participants in initial state or/and waiting state. Coordinator commands to abort.
    • If timeout occurs for pre-commit state we know that there are some participants in waiting state or/and in pre-commit state. Coordinator commands to abort.
    • If there is a participant which does not receive the abort message we are fine as upon the recovery, the participant decides depending on the state of other participants.


    The failure of coordinator is observed by timeouting the processing while participant waits to be commanded for an action. The behavior of the participant is the same in waiting and pre-commit states. If timeout occurs there is need to find a new coordinator which verifies what is the state of the participants. The action of the new coordinator is depicted on the diagram[11] as the failure step.

     

    • The new coordinator is marked being in waiting state when participants are in waiting or aborted state. Then the new coordinator commands to abort.
    • The new coordinator is marked being in pre-commit state when other participants are in waiting, pre-commit or commit state. Then the new coordinator commands to commit.

     

    Notice that

     

    • Coordinator can’t forget about transaction till all participants acknowledged that they processed the required action. We can depict it as a new forgotten state connected from commit and abort states.
      Only when all acknowledgements are received then the information about transaction existence could be removed from the coordinator log store.

     

    • The participant upon recovery can’t decide to commit a transaction even if it’s in the pre-commit state. That’s because coordinator could decide to abort the whole transaction when participant changed to pre-commit state as the only one (other participants are still in waiting state) before it was capable to send acknowledgment back to the coordinator. Coordinator then commands all participants to abort.
      Thus in this case, the participant must ask the other participants to find out the state of the transaction.

     

    Summary

     

    In summary 3PC alleviates the blocking nature of the 2PC but it’s this comes at the cost of a more complicated protocol and brings the burden of another message needing to be sent even in case of sunny scenario. Furthermore it does not solve the issue with network partitions. Given these limitations it is somewhat expected that we find 3PC is not widely used in practice.

     

    [1] http://jbossts.blogspot.cz/2013/05/2pc-or-3pc.html (2PC or 3PC)
    [2] https://en.wikipedia.org/wiki/Three-phase_commit_protocol (Three-phase commit protocol)
    [3] https://developer.jboss.org/wiki/TwoPhaseCommit2PC (Two phase commit (2PC))
    [4] https://www.microsoft.com/en-us/research/publication/consensus-on-transaction-commit (Consensus on Transaction Commit)
    [5] http://pubs.opengroup.org/onlinepubs/009680699/toc.pdf (Distributed Transaction Processing: The XA Specification)
    [6] http://the-paper-trail.org/blog/consensus-protocols-three-phase-commit (Consensus Protocols: Three-phase Commit), http://the-paper-trail.org/blog/consensus-protocols-paxos (Consensus Protocols: Paxos)
    [7] http://webee.technion.ac.il/~idish/Abstracts/jcss.html (Increasing the Resilience of Distributed and Replicated Database Systems)
    [8] http://www.win.tue.nl/~atif/reports/paper4ICET.pdf (Analysis and Verification of Two-Phase Commit & Three-Phase Commit Protocols)
    [9] https://pdfs.semanticscholar.org/c7c6/e4a790582b776eca168f60ca27d4bd8d2a3a.pdf (A More Committed Quorum-Based Three Phase Commit Protocol)
    [10] http://ieeexplore.ieee.org/document/1703048 (A Formal Model of Crash Recovery in a Distributed System)
    [11] http://courses.cs.vt.edu/~cs5204/fall00/distributedDBMS/sreenu/3pc.html (Three-Phase Commit Protocol)
    [12] http://sodwana.uni-ak.ac.at/geom/pdfs/rs_mabe3pc.pdf (MaBE Agents and n-Phase-Commit)
    [13] https://www.researchgate.net/publication/275154978_Three-Phase_Commit (Three-phase commit)
    [14] http://www.inf.fu-berlin.de/lehre/SS10/DBS-TA/folien/08-10-TA-3PC-1.pdf (slides: Fault Tolerant Distributed Transactions)