In a previous wiki article we introduced a proposal for supporting ACID transactions in a REST-based environment. That protocol has availability issues (such as the need to take locks and the synchronous nature of phase one of the completion protocol) and is not appropriate for activities that may run for extended periods. Instead we propose a compensation-based approach in which participants make changes visible but register a compensatory action which is performed if something goes wrong. We call the model jfdi; it is based on OASIS drafts, WS-BA and WS-LRA.
Within the jfdi model, an activity reflects business interactions: all work performed within the scope of an activity is required to be compensatable. Therefore, an activity’s work is either performed successfully or undone. How services perform their work, and how they ensure it can be undone if compensation is required, are implementation choices that are not exposed to the jfdi model. The jfdi model simply defines the triggers for compensation actions and the conditions under which those triggers are executed, i.e. jfdi is concerned only with ensuring participants obey the protocol necessary to make an activity compensatable (the semantics of the business interactions are not part of the model). Issues such as isolation of services between potentially conflicting activities and durability of service work are assumed to be implementation decisions. The coordination protocol used to ensure an activity is completed successfully or compensated is not two-phase and is intended to better model business-to-business interactions. Although this may result in non-atomic behaviour for the overall business activity, other activities may be started by the application or service to attempt to compensate in some other manner.
In the model a jfdi (aka transaction) is tied to the scope of an activity so that when the activity terminates, the jfdi coordination protocol will be automatically performed either to accept or compensate the work. For example, when a user reserves a seat on a flight, the airline reservation centre may take an optimistic approach and actually book the seat and debit the user’s account, relying on the fact that most of their customers who reserve seats later book them; the compensation action for this activity would be to un-book the seat and credit the user’s account.
As in any business interaction, application services may or may not be compensatable. Even the ability to compensate may be a transient capability of a service. A Compensator is the jfdi participant that operates on behalf of a service to undo the work it performs within the scope of a jfdi or to compensate for the fact that the original work could not be completed. How compensation is carried out will obviously be dependent upon the service.
The model concerns compensators and a coordinator. A client starts a new jfdi via the jfdi coordinator. When a service does work that may have to be later compensated within the scope of the jfdi, it enlists a compensator participant with the jfdi coordinator. Subsequently the client closes the jfdi via the coordinator which in turn tells all enlisted compensators to either complete or compensate.
The compensator will be invoked in the following way by the jfdi coordinator when the activity terminates:
- Success: the activity has completed successfully. If the activity is nested then compensators may propagate themselves (or new compensators) to the enclosing jfdi. Otherwise the compensators are informed that the activity has terminated and they can perform any necessary cleanups.
- Fail: the activity has completed unsuccessfully. All compensators that are registered with the jfdi will be invoked to perform compensation in the reverse order. The coordinator forgets about all compensators that indicated they operated correctly. Otherwise, compensation may be attempted again (possibly after a period of time) or alternatively a compensation violation has occurred and must be logged.
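The two termination paths above can be sketched with a small in-process model. This is only an illustration under stated assumptions: the `Compensator` and `Coordinator` classes, the `max_attempts` retry bound and the `violations` list are hypothetical names, "reverse order" is taken to mean reverse order of enlistment, and a real coordinator would talk to compensators over HTTP rather than call them directly. Nesting is not modelled.

```python
from typing import List


class Compensator:
    """Hypothetical in-process stand-in for a remote compensator."""

    def __init__(self, name: str, log: List[str], fail_times: int = 0):
        self.name = name
        self.log = log
        self.fail_times = fail_times  # number of compensate attempts that fail first

    def complete(self) -> None:
        self.log.append(f"complete:{self.name}")

    def compensate(self) -> bool:
        if self.fail_times > 0:
            self.fail_times -= 1
            self.log.append(f"compensate-failed:{self.name}")
            return False
        self.log.append(f"compensate:{self.name}")
        return True


class Coordinator:
    def __init__(self, max_attempts: int = 3):
        self.compensators: List[Compensator] = []
        self.violations: List[str] = []   # compensation violations to be logged
        self.max_attempts = max_attempts  # assumed retry bound, not part of the model

    def enlist(self, compensator: Compensator) -> None:
        self.compensators.append(compensator)

    def close(self) -> None:
        """Success: tell every compensator to complete, then drop them."""
        for c in self.compensators:
            c.complete()
        self.compensators.clear()

    def cancel(self) -> None:
        """Fail: compensate in reverse enlistment order, retrying failures."""
        pending = list(reversed(self.compensators))
        for _ in range(self.max_attempts):
            # Forget every compensator that reports success; keep the rest.
            pending = [c for c in pending if not c.compensate()]
            if not pending:
                break
        self.violations.extend(c.name for c in pending)
        self.compensators = pending
```

In this sketch a compensator that keeps failing after `max_attempts` rounds stays enlisted and is recorded as a violation, which corresponds to the "must be logged" case in the text.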
Each service is required to log sufficient information in order to ensure (with best effort) that compensation is possible. Each compensator (participant) or subordinate coordinator is responsible for ensuring that sufficient data is made durable in order to undo the jfdi in the event of failures. Interposition and checkpointing of state allow the system to drive a consistent view of the outcome and of the recovery actions taken, while always allowing for the possibility that recovery is not possible and must be logged or flagged for the administrator. In a large-scale environment or in the presence of long-term failures, recovery may not be automatic, so manual intervention may be necessary to restore an application’s consistency.
Different usage patterns for jfdis are possible. For example, jfdis may be used sequentially and concurrently, where the termination of one jfdi signals the start of some other unit of work within an application. However, jfdis are units of compensatable work and an application may have as many such units of work operating simultaneously as it needs to accomplish its tasks. Furthermore, the outcome of work within jfdis may determine how other jfdis are terminated. An application can be structured so that jfdis are used to assemble units of compensatable work and are then held in the active state while the application performs other work in the scope of different (concurrent or sequential) jfdis. Only when the right subset of work (jfdis) is arrived at by the application will that subset be confirmed; all other jfdis will be told to cancel (complete in a failure state).
The jfdi coordinator is rooted at the URL /jfdi-coordinator:
- Performing a GET on /jfdi-coordinator returns a list of all transactions.
- Performing a GET on /jfdi-coordinator/recovery returns a list of recovering transactions.
- Performing a GET on /jfdi-coordinator/active returns a list of inflight transactions.
- Performing a DELETE on any of the jfdi-coordinator URLs will return a 401.
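As a rough illustration of the listing rules above, the following sketch dispatches a request against an in-memory table of transactions. The function name `handle_listing`, the state labels `"active"` and `"recovering"`, and the 404/405 responses for cases the text does not cover are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Listing path -> transaction state to filter on (None means "all").
LISTING_PATHS: Dict[str, Optional[str]] = {
    "/jfdi-coordinator": None,
    "/jfdi-coordinator/recovery": "recovering",
    "/jfdi-coordinator/active": "active",
}


def handle_listing(method: str, path: str,
                   transactions: Dict[str, str]) -> Tuple[int, List[str]]:
    """Return (HTTP status, list of TxIds) for a coordinator listing URL.

    `transactions` maps a TxId to a state label (hypothetical representation).
    """
    if path not in LISTING_PATHS:
        return 404, []          # assumption: unknown listing path
    if method == "DELETE":
        return 401, []          # as stated in the text
    if method != "GET":
        return 405, []          # assumption: the text is silent on other methods
    wanted = LISTING_PATHS[path]
    ids = sorted(tx for tx, state in transactions.items()
                 if wanted is None or state == wanted)
    return 200, ids
```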
The JFDI (Transaction) URL
Each client is expected to have a unique identity which we'll call ClientID (it can be a URL too).
- Performing a POST on /jfdi-coordinator/start?ClientID=<ClientID> will start a new jfdi with a default timeout and return a jfdi URL of the form <machine>/jfdi-coordinator/<TxId>. Adding a query parameter, timeout=<timeout>, will start a new jfdi with the specified timeout. If the jfdi is terminated because of a timeout, the jfdi URL is deleted and all further invocations on the URL will return 404. The invoker can assume this was equivalent to a compensate operation.
- Performing a GET on /jfdi-coordinator/<TxId> returns 200 if the jfdi is still active.
- Performing a GET on /jfdi-coordinator/completed/<TxId> returns 200 if the jfdi completed successfully (a 404 response means it is not present).
- Performing a GET on /jfdi-coordinator/compensated/<TxId> returns 200 if the jfdi compensated (a 404 response means it is not present).
- Performing a PUT on /jfdi-coordinator/<TxId>/close will trigger the successful completion of the jfdi and all compensators will be dropped by the coordinator (the complete message will be sent to the compensators). Upon termination, the URL is implicitly deleted. If it no longer exists, then 404 will be returned. The invoker cannot know for sure whether the jfdi completed or compensated without enlisting a participant.
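The life cycle of a jfdi URL described above can be sketched as follows. This is a minimal in-memory model, not an implementation: `JfdiRegistry`, `start_url`, the `tx<n>` identifier format and the 200 returned by a successful close are hypothetical, the clock is passed in explicitly to keep the example deterministic, and a timed-out jfdi is recorded as compensated on the assumption that a timeout is equivalent to a compensate.

```python
import itertools
from typing import Dict, Optional, Set
from urllib.parse import urlencode


def start_url(client_id: str, timeout: Optional[int] = None) -> str:
    """Build the URL a client would POST to in order to start a jfdi."""
    params = {"ClientID": client_id}
    if timeout is not None:
        params["timeout"] = str(timeout)
    return "/jfdi-coordinator/start?" + urlencode(params)


class JfdiRegistry:
    def __init__(self, default_timeout: float = 60.0):
        self.default_timeout = default_timeout
        self.active: Dict[str, float] = {}   # TxId -> deadline
        self.completed: Set[str] = set()
        self.compensated: Set[str] = set()
        self._ids = itertools.count(1)

    def start(self, now: float, timeout: Optional[float] = None) -> str:
        tx = f"tx{next(self._ids)}"
        limit = self.default_timeout if timeout is None else timeout
        self.active[tx] = now + limit
        return f"/jfdi-coordinator/{tx}"

    def _expire(self, now: float) -> None:
        for tx, deadline in list(self.active.items()):
            if now >= deadline:
                del self.active[tx]          # the jfdi URL disappears
                self.compensated.add(tx)     # assumed: timeout == compensate

    def get(self, path: str, now: float) -> int:
        self._expire(now)
        parts = path.strip("/").split("/")
        if len(parts) == 2:
            return 200 if parts[1] in self.active else 404
        if len(parts) == 3 and parts[1] == "completed":
            return 200 if parts[2] in self.completed else 404
        if len(parts) == 3 and parts[1] == "compensated":
            return 200 if parts[2] in self.compensated else 404
        return 404

    def close(self, path: str, now: float) -> int:
        """PUT on <jfdi URL>/close."""
        self._expire(now)
        tx = path.strip("/").split("/")[1]
        if tx not in self.active:
            return 404
        del self.active[tx]
        self.completed.add(tx)
        return 200                            # assumed success code
```

A client that receives 404 from `close` cannot tell from that response alone whether the jfdi completed or compensated, which is why the sketch keeps the separate `completed` and `compensated` lookups.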
Once the jfdi terminates the implementation may retain information about it for an indeterminate amount of time.
When making an invocation on a resource that needs to participate in a jfdi, the transaction context (aka the jfdi URL) needs to be transmitted to the resource. How this happens is outside the scope of this effort. It may occur as additional payload on the initial request, or it may be that the client sends the context out-of-band to the resource.
Once a resource has the jfdi URL, it can register participation in the jfdi (i.e. enlist the compensator). The compensator is free to use whatever URL structure it desires for uniquely identifying itself, with the constraint that it must be unique for the jfdi (i.e. the same compensator cannot be involved in more than one jfdi). The <compensator URL> must support the following operations:
1) Performing a GET on the compensator URL will return the current status of the compensator, or 404 if the compensator is no longer present. The following types are returned by compensators to indicate the current status:
- Compensating: the Compensator is currently compensating for the jfdi.
- Compensated: the Compensator has successfully compensated for the jfdi.
- FailedToCompensate: the Compensator was not able to compensate for the jfdi. It must maintain information about the work it was to compensate until the coordinator sends it a forget message.
- Completing: the Compensator is tidying up after being told to complete.
- Completed: the coordinator/participant has confirmed.
- FailedToComplete: the Compensator was unable to tidy-up.
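One way a coordinator might interpret the result of a GET on the compensator URL is sketched below. The enumeration mirrors the six status names listed above; the `next_step` helper, its return labels, and the assumption that the status arrives as a plain-text body are hypothetical.

```python
from enum import Enum
from typing import Optional


class CompensatorStatus(Enum):
    COMPENSATING = "Compensating"
    COMPENSATED = "Compensated"
    FAILED_TO_COMPENSATE = "FailedToCompensate"
    COMPLETING = "Completing"
    COMPLETED = "Completed"
    FAILED_TO_COMPLETE = "FailedToComplete"


def next_step(http_status: int, body: Optional[str]) -> str:
    """Classify a GET response from a compensator URL (labels are illustrative)."""
    if http_status == 404:
        return "gone"                      # compensator is no longer present
    status = CompensatorStatus(body)       # raises ValueError for unknown names
    if status in (CompensatorStatus.COMPENSATING, CompensatorStatus.COMPLETING):
        return "poll-again"                # still working, probe later
    if status in (CompensatorStatus.COMPENSATED, CompensatorStatus.COMPLETED):
        return "forget"                    # operated correctly, can be dropped
    return "report-failure"                # FailedToCompensate / FailedToComplete
```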
The compensator registers with a jfdi by performing a PUT on the jfdi URL (/jfdi-coordinator/<TxId>) with a body that contains the <compensator URL>. The PUT request returns a unique handle/resource reference (aka RecoveryCoordinator) so that it can be uniquely reasoned about later:
2) Performing a GET on this URL will return the original <compensator URL>.
3) Performing a PUT on this URL will overwrite the old <compensator URL> with the new one supplied.
4) Performing a DELETE or POST will return a 401.
5) Performing a POST on <compensator URL>/compensate will cause the participant to compensate the work that was done within the scope of the transaction. Performing a POST on <compensator URL>/complete will cause the participant to tidy up, after which it can forget this transaction. In either case the compensator will return a 200 OK code and a <status URL> which indicates the outcome; the <status URL> can be probed (via GET) and will simply return the same (implicit) information.
If the compensator is unknown (the URL is invalid) then 410 will be returned. It can be assumed by the coordinator that the service compensated.
Note: a Compensator that cannot compensate must maintain its information until it is told to forget via POST <compensator URL>/forget.
6) Performing a GET on <compensator URL>/compensate will return 400.
7) Performing a PUT on <compensator URL>/compensate will return 400.
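The compensator-side rules above (POST on compensate, complete and forget; 400 for GET and PUT; 410 for an unknown compensator) can be sketched as a small dispatcher. The `Compensator` class, the `dispatch` function, the use of the compensator URL itself as the status URL, and the 404 for an unrecognised action are assumptions for illustration; the text only specifies 400 for GET and PUT on /compensate, and applying it to every non-POST method is a simplification.

```python
from typing import Callable, Dict, Optional, Tuple


class Compensator:
    """Hypothetical compensator: two callbacks plus the last recorded status."""

    def __init__(self, undo: Callable[[], bool], tidy_up: Callable[[], bool]):
        self.undo = undo          # returns True if compensation succeeded
        self.tidy_up = tidy_up    # returns True if clean-up succeeded
        self.status: Optional[str] = None


def dispatch(compensators: Dict[str, Compensator],
             method: str, url: str) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, status URL) for a request on a compensator action URL."""
    base, _, action = url.rpartition("/")
    comp = compensators.get(base)
    if comp is None:
        return 410, None          # unknown compensator
    if action not in ("compensate", "complete", "forget"):
        return 404, None          # assumption: unrecognised action
    if method != "POST":
        return 400, None          # stated for GET/PUT on /compensate
    if action == "forget":
        del compensators[base]    # failure information may now be discarded
        return 200, None
    if action == "compensate":
        comp.status = "Compensated" if comp.undo() else "FailedToCompensate"
    else:
        comp.status = "Completed" if comp.tidy_up() else "FailedToComplete"
    return 200, base              # assumed: the compensator URL doubles as status URL
```

Note that in this sketch a compensator whose `undo` fails stays in the table, so its state remains available until a POST on /forget arrives.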
It is expected that the receipt of a cannot-compensate or cannot-complete outcome (the FailedToCompensate and FailedToComplete statuses) will be handled by the application, or logged if not.
A compensator can resign from a jfdi at any time prior to the completion of an activity by performing a PUT on /jfdi-coordinator/<TxId>/remove with the URL of the compensator.
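Enlistment, the recovery handle and resignation can be sketched together on the coordinator side. This is a minimal in-memory model under stated assumptions: the `Enlistments` class and the shape of the recovery URL (`<jfdi URL>/recovery/<n>`) are hypothetical, as are the 404 for an unknown handle or compensator and the 400 for a PUT without a body.

```python
from typing import Dict, List, Optional, Tuple


class Enlistments:
    """Compensators enlisted with one jfdi, keyed by their recovery handle."""

    def __init__(self, tx_id: str):
        self.jfdi_url = f"/jfdi-coordinator/{tx_id}"
        self._by_handle: Dict[str, str] = {}   # recovery URL -> compensator URL
        self._counter = 0

    def enlist(self, compensator_url: str) -> str:
        """PUT on the jfdi URL: register and return the recovery handle."""
        self._counter += 1
        handle = f"{self.jfdi_url}/recovery/{self._counter}"  # hypothetical shape
        self._by_handle[handle] = compensator_url
        return handle

    def recovery(self, method: str, handle: str,
                 body: Optional[str] = None) -> Tuple[int, Optional[str]]:
        """Requests on the recovery handle returned by enlist()."""
        if handle not in self._by_handle:
            return 404, None                   # assumption: unknown handle
        if method == "GET":
            return 200, self._by_handle[handle]
        if method == "PUT" and body:
            self._by_handle[handle] = body     # replace the compensator URL
            return 200, body
        if method in ("DELETE", "POST"):
            return 401, None                   # as stated in the text
        return 400, None                       # assumption: e.g. PUT without a body

    def remove(self, compensator_url: str) -> int:
        """PUT on <jfdi URL>/remove with the compensator URL as the body."""
        handles = [h for h, u in self._by_handle.items() if u == compensator_url]
        if not handles:
            return 404                         # assumption: not enlisted
        for h in handles:
            del self._by_handle[h]
        return 200

    def compensator_urls(self) -> List[str]:
        return list(self._by_handle.values())
```

The PUT on the handle is what lets a compensator that has moved (for example after a crash and restart on another address) tell the coordinator where it now lives.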
When a compensator is enrolled within a jfdi, the entity performing the enrolment can supply a number of qualifiers which may be used by the coordinator and business application to influence the overall outcome of the activity. The currently supported qualifiers are:
- TimeLimit: the time limit (in seconds) that the Compensator can guarantee that it can compensate the work performed by the service. After this time period has elapsed, it may no longer be possible to undo the work within the scope of this (or any enclosing) jfdi. It may therefore be necessary for the application or service to start other activities to explicitly try to compensate this work. The application or coordinator may use this information to control the lifecycle of a jfdi.
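As an illustration of how a coordinator or application might use the TimeLimit qualifier, the sketch below computes whether a compensator can still guarantee compensation and the earliest moment at which any enlisted compensator's guarantee lapses. The function names are hypothetical, and measuring the limit from the time of enlistment is an assumption; the text does not say when the clock starts.

```python
from typing import List, Optional, Tuple


def can_still_compensate(enlisted_at: float, time_limit: float, now: float) -> bool:
    """True while the compensator's guarantee (in seconds) has not elapsed."""
    return now - enlisted_at <= time_limit


def compensation_deadline(enlistments: List[Tuple[float, float]]) -> Optional[float]:
    """Earliest time at which some compensator's guarantee expires.

    Each entry is (enlisted_at, time_limit). Returns None when nothing is enlisted.
    """
    return min((at + limit for at, limit in enlistments), default=None)
```

A coordinator could, for instance, cap the jfdi timeout at this deadline so that a cancel is always issued while every compensator can still honour it.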
This work is still at the specification stage. However, it is sufficiently similar to the specification of the atomic RESTful transactions that a refactor of that implementation should quickly produce a prototype of the long-running transaction specification discussed in this article.