For various reasons the recovery system cannot be migrated to a different host meaning that in failure scenarios transactions cannot always be recovered in a timely manner. Initial discussion of the problem is captured in the following JIRAs:
The Narayana project is logically comprised of a transaction manager (TM) and a recovery manager (RcMgr). When the TM decides to commit a transaction it indelibly records details of each participant involved in the transaction in an Object Store (ObjStore) (we have many object store implementations but this is not particularly relevant to the problem at hand). If the TM cannot complete the transaction, due to failures in the system, responsibility for completing the transaction passes to the RcMgr which takes its input from the contents of the ObjStore.
In a JTA environment we record representations of participants in the ObjStore as "XA records" identified by XIDs which in turn embed a node id. This node id corresponds to the TM instance that enlisted the participant into the transaction. In this way the indelible record of the participant can be traced back to the initiating TM (but more importantly it provides a way to pass responsibility for progressing the participant to a particular RcMgr).
Now conventionally we run the TM and RcMgr within a JEE app server in which the default configuration is to operate on the same ObjStore. In this configuration the RcMgr will only recover ObjStore records whose node id matches the co located TM. Generally this scheme works fine (although separating them would increase availability) since the recovery system has access to the same set of XA resources that TM does so recovery will proceed as soon as the relevant XA resources become available. Problems may arise if:
- we require a different recovery system to try to recover the same node ids;
- multiple recovery systems try to recover the same node ids;
- some of XA Resources become permanently inaccessible (either due to configuration changes or because information embedded in records contains hardcoded host addresses which no longer exist);
- the node ids are changed;
- the underlying object store becomes inaccessible.
In the following we primarily discuss JTS but REST-AT is similar. We also briefly alude to XTS and largely ignore JTA since JTA should "just work" with the proposed solution. But the real question will be: will this approach work, does it contain any show stoppers and are there any issues, XTS aside, which have not been addressed?
This post proposes a solution that will overcome each of these problems by grouping a set of transaction managers into an HA cluster with each TM configured to use the same set of datasources and a single shared object store. The recovery system runs on a single node in the cluster and operates on the shared object store. Participants recover on the same node as the recovery system (which in general will be a different node from the one where the participant last failed). This approach resolves the above mentioned problems in the following manner:
- uses an HA singleton capable of recovering all node ids. A simple approach is to configure the node id list with a wild card (i.e. all XA records become eligible for recovery regardless of the node id). This means we do not need to monitor changes in cluster membership. As the singleton migrates around the cluster it is responsible for starting the recovery system on the new node (if the singleton migrates for reasons other than node failure it is responblibe for stopping the originial RcMgr);
- ensure that only one node is driving top down recovery (this will be the HA singleton);
- document the responsibilities on the administrator i.e. each node must be configured with the same set of datasources (we log the JNDI name of the resource so as long as the JNDI names don't change we should be fine). Refer to the next section for how to resolve the third issue regarding IP addresses changing;
- provided the recovery system recovers all nodes (as required in 1 above) this is not an issue;
- use high availability storage (such as a database or a Storage Area Network) for the ObjStore or use data replication (if customers really want this we should defer to a future PRD since it is non trivial).
[Note that the default algorithm for constructing UIDs embeds the IP address of the host on which the TM is running but this is fine since we only do so in order to construct unique ids]
In JTS the recovery system and resources communicate via IORs which embed endpoint information that refer to the IP address of where the objects can be contacted. This feature creates the two difficulties:
- If a participant resource migrates to a new host then the IOR of the participant held by the RcMgr (running as a HA singleton) becomes invalid;
- If the RcMgr migrates to a new host then participants will no longer be able to call back to the recovery coordinator (as occurs during bottom up recovery) since the IOR of the RcMgr will change.
The first difficulty is resolved by bottom up recovery:- when the participant is recreated on the RcMgr node by the recovery system (note that the record that was logged at prepare time contains the information needed to do this) it will inform the RcMgr hat it has moved to a new location (using the replay_completion call on the recovery coordinator). There is a window where this scheme will fail since the RcMgr may migrate after the participant has migrated but before the participant has contacted the RcMgr. In note 2 below we mention that since the ObjStore is globally accessible and since the RcMgr IOR is written to the ObjStore, participants will always be able discover the RcMgr so it should be a straight forward modification to the RcMgr to close this window.
The second difficulty becomes an issue if the RcMgr does not have a record of the participant (can happen if the TM fails before writing its prepare log but after some of the participants recorded their prepare decision). In this case top down recovery cannot proceed, nor can bottom up recovery since the IOR of the RcMgr is no longer valid. Since the participant and RcMgr are now running on the same node it should be a fairly non invasive modification to the RcMgr to close this window.
- With this proposed solution the master node becomes a bottle neck since that is where participant resources recover but this is also a benefit since now the recovery coordinator and participants are co-located (conventionally they all reside on different nodes) thus speeding up recovery;
- The IOR of the RcMgr is written to the ObjStore so each time the RcMgr migrates it must rewrite this IOR;
- Analysis of the effect on REST-AT is similar to the JTS case (which is expected since REST-AT was modeled on JTS);
- Analysis of the effect of the solution on XTS needs to be discussed certainly participant and coordinator endpoints will no longer be valid. If the solution fails for XTS we may need to fallback on demanding IP failover to support the XTS case.
- only works in an HA cluster (not a problem since this is an EAP feature)
- JNDI names of XA resources must be duplicated on all nodes in the cluster
- the ObjStore must use storage that is accessible to all nodes and be highly available
- databases cannot be shared by Application Servers not in the cluster (see next section for details)
- An HA singleton is a bean marked with the @Singleton annotation. This bean will be responsible for starting the recovery system. Since beans are started after AS subsystems we have the problem that recovery will start much later than it does at present.
- Databases/XA Resources are typically used by more than one application server and when asked for in doubt transactions they will return all of them. This will be a problem if the returned list contains transactions created by a server that is not part of the cluster since it is then possible for two different recovery systems to recover the same transaction branch. We can resolve this problem by either demanding that resources are a dedicated to the cluster or we configure the RcMgr to only recover a given set of node ids. To implement the latter option requires that we implement a cluster membership monitor which tells the RcMgr when nodes come and go (which adds complexity to the solution).