
    Transaction Recovery


    This page is devoted to the transaction recovery implementation prototype for the XA
    2 Phase Commit protocol.  The current implementation is an exploratory prototype meant
    mostly to flesh out integration requirements and to start gathering the tasks, issues,
    and requirements that need to be addressed in the implementation.  It performs quite
    well (17,000 commits/sec on a dual-CPU 2.4 GHz RH8 machine with a 10K rpm SCSI disk)
    and has been tested with dummy XA Resources as well as 2 XA-enabled Oracle 9i instances.


    We are currently looking for somebody to drive this implementation.  Please ask on our JTA development forum.


    Tasks for TransactionRecovery are also managed in JIRA.




    Where's the code?  If you browse or check out jboss-head from CVS, you can find it
    in the transaction module under org/jboss/tm/recovery.



    Before going any further, it is suggested that you read up on XA and the 2PC
    protocol.  A good article written in semi-layman's terms is listed below:



    Setting it up


    Configure core services

    The core services of transaction recovery are the RecoveryLogger and the

    RecoveryManager.  If you build from CVS, they come embedded in the

    conf/jboss-service.xml file.



       <!-- TM Recovery
       <mbean code="" name="jboss:service=RecoveryLogger">
          <attribute name="DirectoryList">${}/recovery-log</attribute>
          <attribute name="MaxLogSize">100000</attribute>
       <mbean code="" name="jboss:service=RecoveryManager">
          <depends optional-attribute-name="RecoveryLogger" 
          | The fast in-memory transaction manager.
       <mbean code=""
          <attribute name="TransactionTimeout">300</attribute>
          <!-- set to false to disable transaction demarcation over IIOP -->
          <attribute name="GlobalIdsEnabled">true</attribute>
          <depends optional-attribute-name="XidFactory">
          <!-- depends optional-attribute-name="RecoveryLogger"


    You need to uncomment the BatchRecoveryLoggerService and the RecoveryManagerService.  Also, uncomment the depends in the TransactionManagerService.


    The application server is now only partially configured to support 2PC recovery: it is configured only to log to a tx log file.  Next you must configure which XAResources are available in your application server.


    The BatchRecoveryLogger allows you to specify a dedicated logger per directory.  Each directory you specify in the comma-delimited list will have a dedicated logger created for it that does its work in that directory.  MaxLogSize is the number of records you allow per log file.


    Configuring recoverable XA Resources

    The JTA specification has no standard way to hook in the XA Resources that are needed for recovery.  Each application server implements its own mechanism for registering or discovering how to obtain a reference to an XAResource so that it can recover.  JBoss Tx Recovery delegates this to the application developer.  For each XA resource in your application, you must create a JBoss MBean service that obtains the XAResource and performs the recovery scan.  The interface you must implement is the Recoverable interface:


    import javax.transaction.xa.XAException;
    import javax.transaction.xa.XAResource;
    import javax.transaction.xa.Xid;

    public interface Recoverable
    {
       public String getId();
       public XAResource getResource();
       public Xid[] scan() throws XAException;
       public void cleanupResource();
    }


    Why this design?  Even with the advent of JCA, XA resources can differ greatly in how the XAResource.recover() method is supposed to be invoked, in which security Subject is allowed to perform recovery, and even in how to connect to an XAResource for recovery.  If you search the web, you'll see there is a lot of quirkiness out there with XA drivers.  So, JBoss delegates this to you, the user.


    There is an example in our CVS that sets up a Recoverable for an Oracle 9i XA driver. 
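
    To make the contract more concrete, here is a minimal sketch of what a Recoverable might look like when the resource is reachable through a standard javax.sql.XADataSource.  This is not the Oracle example from CVS.  The class name DataSourceRecoverable is hypothetical, the Recoverable interface is repeated locally only so the sketch compiles on its own, and the idea that cleanupResource() closes the recovery connection is an assumption.  A real driver may need its own credentials or recover() flags.

```java
import java.sql.SQLException;
import javax.sql.XAConnection;
import javax.sql.XADataSource;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Local copy of the interface shown above, so this sketch is self-contained.
interface Recoverable {
    String getId();
    XAResource getResource();
    Xid[] scan() throws XAException;
    void cleanupResource();
}

// Hypothetical Recoverable backed by any javax.sql.XADataSource.
class DataSourceRecoverable implements Recoverable {
    private final String id;
    private final XADataSource dataSource;
    private XAConnection connection; // opened lazily, closed in cleanupResource()

    DataSourceRecoverable(String id, XADataSource dataSource) {
        this.id = id;
        this.dataSource = dataSource;
    }

    public String getId() {
        return id;
    }

    public synchronized XAResource getResource() {
        try {
            if (connection == null) {
                connection = dataSource.getXAConnection();
            }
            return connection.getXAResource();
        } catch (SQLException e) {
            throw new IllegalStateException("Cannot connect to XA resource " + id, e);
        }
    }

    public Xid[] scan() throws XAException {
        // A single call that both starts and ends the scan; some drivers may
        // need a different flag sequence, which is why scan() is left to the user.
        Xid[] found = getResource().recover(XAResource.TMSTARTRSCAN | XAResource.TMENDRSCAN);
        return found != null ? found : new Xid[0];
    }

    public synchronized void cleanupResource() {
        if (connection != null) {
            try {
                connection.close();
            } catch (SQLException ignored) {
                // Nothing useful can be done if closing the recovery connection fails.
            }
            connection = null;
        }
    }
}
```

    In an MBean service, start() would typically construct such an object and register it with the RecoveryManager; how that registration call looks is not shown here because it depends on the RecoveryManager API in CVS.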



    Recovery Log design

    The current implementation is a quick one that will definitely need to be rewritten, significantly refactored, or integrated with something like HOWL.  It is designed as follows.


    • It first starts with the Xid format.  The GlobalId of each Xid generated by JBoss will have a base global id generated by the application server when it boots up.  The base global id identifies the JBoss process instance that created the Xid.  This base global id is added to the header of any tx log file created.  The idea is that at recovery time, the RecoveryManager is only allowed to recover transactions whose Xid global id has a base global id equal to the one stored in the header of the tx log file.  Note: we'll need to rethink this for when the JBoss TM is a branch of a larger global tx.  Maybe we'll use a base branch id instead of a base global id.

    • When JBoss boots up, it ALWAYS creates a new tx log file, and never tries to reuse old files.  See above for why.

    • Although this isn't currently implemented, another thing that should be stored in the header is the list of all possible Recoverable ids.  The RecoveryManager would then be allowed to recover a tx log file ONLY if all of its Recoverables are available.  Otherwise, the tx log would be recovered later.

    • One thing to avoid is both disk forcing and file writes/access.  The current log implementation ONLY writes the committing Xid to the file.  There is no TX COMPLETED record ever written to the log file.  At runtime, the RecoveryLogger keeps track in memory of all incomplete transactions.  When the log is full, the RecoveryLogger will reuse the current log file if and only if all COMMITTING transactions have been completed (AKA its incomplete transaction list is empty).  Otherwise it creates a brand new tx log file.  How do dangling logs get cleaned up?  The old log file will be removed when all in-play transactions for it have been completed.
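
    The reuse-or-roll-over rule in the last bullet can be sketched roughly as follows.  This is an in-memory model only, not the code in CVS: the class and method names are hypothetical, log files are represented by integer ids, transactions by strings, and the places where a real logger would append a record, force it to disk, or delete a file are marked with comments.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the log bookkeeping; no file I/O is performed.
class LogRolloverSketch {
    private final int maxLogSize; // records allowed per log file (MaxLogSize)
    private final Map<Integer, Set<String>> incompleteByLog = new HashMap<>();
    private final Map<String, Integer> logOfTx = new HashMap<>();
    private int currentLog = 0;
    private int nextLogId = 1;
    private int recordsInCurrent = 0;

    LogRolloverSketch(int maxLogSize) {
        this.maxLogSize = maxLogSize;
        incompleteByLog.put(currentLog, new HashSet<>());
    }

    // Called when a transaction enters COMMITTING: the only event that is logged.
    // Returns the id of the log file the record went to.
    synchronized int committing(String txId) {
        if (recordsInCurrent >= maxLogSize) {
            rollOver();
        }
        // A real logger would append the Xid to the file and force it here.
        recordsInCurrent++;
        incompleteByLog.get(currentLog).add(txId);
        logOfTx.put(txId, currentLog);
        return currentLog;
    }

    // Called when the transaction has completed: memory only, nothing is written.
    synchronized void completed(String txId) {
        Integer log = logOfTx.remove(txId);
        if (log == null) {
            return;
        }
        Set<String> open = incompleteByLog.get(log);
        open.remove(txId);
        if (open.isEmpty() && log != currentLog) {
            // Last in-play transaction of an old log: a real logger deletes the file.
            incompleteByLog.remove(log);
        }
    }

    private void rollOver() {
        if (!incompleteByLog.get(currentLog).isEmpty()) {
            // Some COMMITTING transactions are still in play: start a new file.
            currentLog = nextLogId++;
            incompleteByLog.put(currentLog, new HashSet<>());
        }
        // Otherwise the current file is reused from the start.
        recordsInCurrent = 0;
    }

    // Ids of the log files that still exist.
    synchronized Set<Integer> liveLogs() {
        return new HashSet<>(incompleteByLog.keySet());
    }
}
```

    The point of the design is that completion costs no disk access at all; the price is that an old file has to stay around until its last in-play transaction finishes.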


    RecoveryManager algorithm


    This is what happens at application server boot time:


    1. The RecoveryLogger comes up.  It searches its log directories for any pre-existing log files and stores references to them.  The RecoveryManager will attempt to recover these old log files.

    2. The RecoveryLogger then creates a new log file per directory configured for it.

    3. The RecoveryLogger is registered as a RecoveryLog service with the TM.

    4. Each Recoverable initializes and registers itself with the RecoveryManager.

    5. After the whole application server boots up, JBoss sends a JMX Notification that the server has been fully started.  The RecoveryManager is a listener for this notification and begins the recovery process.

    6. The RecoveryManager queries the RecoveryLogger for all pre-existing log files to recover.

    7. For each TX LOG FILE to recover, the log header is read to obtain the base global id.  Eventually, a list of Recoverable ids will be obtained as well to make sure that all XA Resources are available.

    8. For each Recoverable, it calls scan (which then calls XAResource.recover) and gathers all recoverable Xids for each XAResource.  Any Xids that do not have the same base global id as any of the log files being recovered in this session are thrown out and considered managed by another tx log file.

    9. For each COMMITTING Xid in the log file, the RecoveryManager looks for matching base global ids in each of the Xid-recover-lists of each XAResource(Recoverable).  Any matching global ids will be committed by calling XAResource.commit(xid, false).

    10. If a log file completes an entire walk through, then it is considered a success and the log file is removed from disk.

    11. After all log files have been searched, any remaining Xids in the Xid-recover-list of each XAResource are rolled back by calling XAResource.rollback.
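
    Steps 8 through 11 can be sketched for a single XAResource as follows.  This is an illustration under stated assumptions, not the RecoveryManager code: the names SimpleXid, TxLog and RecoveryPassSketch are hypothetical, the base global id is assumed to be a prefix of the Xid's global transaction id (the page does not specify the layout), and a COMMITTING record is matched against a scanned Xid by comparing full global transaction ids.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Hypothetical minimal Xid, used only to make the sketch runnable.
final class SimpleXid implements Xid {
    private final byte[] globalId;
    private final byte[] branch;
    SimpleXid(byte[] globalId, byte[] branch) {
        this.globalId = globalId.clone();
        this.branch = branch.clone();
    }
    public int getFormatId() { return 0x0101; }
    public byte[] getGlobalTransactionId() { return globalId.clone(); }
    public byte[] getBranchQualifier() { return branch.clone(); }
}

// Hypothetical in-memory view of one tx log file: header plus COMMITTING records.
final class TxLog {
    final byte[] baseGlobalId;
    final Set<ByteBuffer> committing = new HashSet<>();
    TxLog(byte[] baseGlobalId, byte[]... committingGlobalIds) {
        this.baseGlobalId = baseGlobalId.clone();
        for (byte[] id : committingGlobalIds) {
            committing.add(ByteBuffer.wrap(id.clone()));
        }
    }
}

final class RecoveryPassSketch {
    // Assumption: the base global id is stored as a prefix of the global id.
    static boolean hasBase(Xid xid, byte[] base) {
        byte[] gid = xid.getGlobalTransactionId();
        return gid.length >= base.length
            && Arrays.equals(Arrays.copyOf(gid, base.length), base);
    }

    static void recover(List<TxLog> logs, XAResource resource, Xid[] scanned)
            throws XAException {
        // Step 8: keep only Xids owned by one of the logs being recovered.
        List<Xid> inDoubt = new ArrayList<>();
        for (Xid xid : scanned) {
            for (TxLog log : logs) {
                if (hasBase(xid, log.baseGlobalId)) {
                    inDoubt.add(xid);
                    break;
                }
            }
        }
        // Step 9: commit every scanned Xid that a log recorded as COMMITTING.
        for (TxLog log : logs) {
            for (Iterator<Xid> it = inDoubt.iterator(); it.hasNext();) {
                Xid xid = it.next();
                ByteBuffer key = ByteBuffer.wrap(xid.getGlobalTransactionId());
                if (hasBase(xid, log.baseGlobalId) && log.committing.contains(key)) {
                    resource.commit(xid, false); // false: not a one-phase commit
                    it.remove();
                }
            }
        }
        // Step 11: whatever is left was prepared but never logged, so roll it back.
        for (Xid xid : inDoubt) {
            resource.rollback(xid);
        }
    }
}
```

    Xids filtered out in step 8 are deliberately left untouched, since they are considered managed by another tx log file.  Removal of the log file from disk (step 10) is not modeled here.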


    One thing I'm concerned about is a corrupted log.  If the log is corrupted at the end of the file, this probably means that the TM crashed in the middle of a force, and we can still roll back Xids that we don't find in the logs.  But if the log is corrupted somewhere in the middle of the file, I don't think we should roll back; rather, we should only commit those records in the log that are not corrupted and leave the leftovers alone.  The current impl doesn't handle either of the above scenarios.  All it does is ignore corrupted records.
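
    The policy proposed above, which the current impl does not implement, could be sketched like this.  It assumes the log reader can tell whether each record is intact (for example through a per-record checksum, which the current format may not have); the names LeftoverPolicy and CorruptLogPolicySketch are hypothetical.

```java
// What to do with scanned Xids that are not found in the log.
enum LeftoverPolicy { ROLLBACK_LEFTOVERS, COMMIT_ONLY }

final class CorruptLogPolicySketch {
    // recordIntact[i] tells whether the i-th record of the log file could be read.
    static LeftoverPolicy decide(boolean[] recordIntact) {
        boolean seenCorrupt = false;
        for (boolean intact : recordIntact) {
            if (!intact) {
                seenCorrupt = true;
            } else if (seenCorrupt) {
                // An intact record follows a corrupted one: the damage is in the
                // middle of the file, so a lost record may hide a COMMITTING Xid.
                return LeftoverPolicy.COMMIT_ONLY;
            }
        }
        // No corruption, or corruption only at the tail (likely a crash mid-force).
        return LeftoverPolicy.ROLLBACK_LEFTOVERS;
    }
}
```

    With COMMIT_ONLY, the leftover Xids would stay in doubt and need some other resolution, such as operator intervention.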



    Testing, Testing, Testing

    Probably the most important thing for XA Recovery is testing.  As of the first iteration, very few automated tests have been written.  What needs to be done is to list each possible failure point and write an automated test case for each.  Let's discuss this on another XARecoveryTesting page.  Jira cases should be submitted for each test case that needs to be written.