3 Replies Latest reply on Sep 16, 2011 9:02 AM by andrey.vorobiev

Question about distributed transactions recovery

andrey.vorobiev Sep 7, 2011 5:59 AM

Hi guys.

Does anybody have experience with the following use case?

I have two JBoss instances running in a cluster. Let's assume that the following situation occurs:

Node1 starts a distributed transaction which involves two datasources.
The prepare phase completes successfully and right after that node1 dies.
The transaction info is stored in the ObjectStore and is waiting for node1 to restart so that it can be committed.

Is it possible to configure the JBoss transaction manager in a way that lets node2 commit the prepared transaction started by node1? Actually I am interested in the case of n nodes (any live node in the cluster can commit the prepared transactions of dead nodes).

1. Re: Question about distributed transactions recovery

wdfink Sep 7, 2011 6:15 AM (in response to andrey.vorobiev)

I think not.
The first problem is transferring the information from one filesystem to another.
The other thing is the database connection. AFAIK there is no possibility to commit the transaction after the connection is down (I use Oracle).
The tx will be rolled back. If it is an XA transaction and some (but not all) of the commits have been processed, you may end up with an inconsistency.
2. Re: Question about distributed transactions recovery

jhalliday Sep 7, 2011 7:34 AM (in response to andrey.vorobiev)

Sorry, no magic fairy dust. If you think through the details you'll find this is a really hard problem. If you are very careful it can be made to work for a limited number of use cases, but it's normally more trouble than it's worth.

First off, the 'cluster' thing is a red herring. The transaction service is not cluster aware and does not care that other services in the container may be running clustered. The problem is the same with a group of non-clustered servers as with clustered ones.

Secondly, what do you mean by 'distributed transaction'? A transaction that merely spans multiple XA resource managers is different to one that propagates transaction context between JVMs. In the former case the identification of communication endpoints is a matter for the client libraries containing the XAResource implementations, whilst in the latter it's additionally tied to the transaction coordination transport layer i.e. either IIOP (for JTS) or Web Services (for XTS), either of which ultimately means endpoints identified by IP addresses.

Let's start with the comparatively easy case of a local JTA transaction involving two XAResource managers, dbOne and dbTwo. Let us further assume two application servers AS1 and AS2, each with its own object store, ostore1 and ostore2 respectively, on local storage. This is probably the most common configuration around, representing a typical load balanced environment.

In the case that AS1 fails due to a crash of the underlying hardware or O/S, we're out of luck - the log files are inaccessible until the O/S is restored. We can sidestep this one by putting the log storage on a network device e.g. SAN, but the performance hit is going to hurt.

In the case that the AS1 JVM process crashes whilst its host operating system remains running, we may be in with a chance of using log shipping and proxy recovery. However, why bother? Log in to the O/S and restart AS1. Better still, have a watchdog process on the O/S that will detect the failure and automatically restart AS1. This is pretty basic sysadmin stuff.

But let's assume that for some strange reason AS1 is expected to stay permanently dead, or at least dead for a minimal interval. This is important because we can't tolerate having it restart and attempt recovery itself if we're simultaneously having another node recover the logs on its behalf. So if we're going to ship the logs over to another node for recovery we first have to make sure AS1 stays dead.

AS2 has log records of its own and we can't mess with them, so we now have to merge the content of ostore1 into ostore2. Fortunately for us, the default filesystem store uses one file per tx, so we just copy the inbound records from ostore1 into the directory tree of ostore2 using scp -r and we're done. This has to be a manual step because, even if the transaction system on AS2 was clustered and aware that AS1 had failed, it would have no remaining communication endpoint to contact to get the files. File names are based on uids that should, if the environment was set up right in the first place, ensure that nothing gets overwritten. If we're using an alternative objectstore implementation it's harder, as we can't rely on filesystem level tools to do the merge. In such a case it may become necessary to shut down AS2 in order to do the merge whilst the system is quiesced. Since AS2 is our last remaining AS instance that's probably not tolerable.
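For anyone who wants to script the merge rather than run scp by hand, here is a minimal sketch of the same idea in Java. It assumes both object store trees are reachable as local paths (for example after the ostore1 files have been pulled onto the AS2 host), and the class and method names (ObjectStoreMerge, merge) are made up for illustration, not part of the transaction manager. It refuses to overwrite anything, which is the property the uid-based file names are supposed to give you anyway.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: merges one file-per-tx object store tree into another.
final class ObjectStoreMerge {

    /**
     * Copies every regular file under sourceRoot to the same relative path
     * under targetRoot and returns the relative paths that were copied.
     * Nothing is copied if any target file already exists.
     */
    static List<Path> merge(Path sourceRoot, Path targetRoot) throws IOException {
        if (!Files.isDirectory(sourceRoot)) {
            throw new IllegalArgumentException("not a directory: " + sourceRoot);
        }
        List<Path> relativeFiles;
        try (Stream<Path> walk = Files.walk(sourceRoot)) {
            relativeFiles = walk.filter(Files::isRegularFile)
                    .map(sourceRoot::relativize)
                    .collect(Collectors.toList());
        }
        // First pass: a name collision means the uids are not unique across
        // nodes, so stop before touching the target store at all.
        for (Path relative : relativeFiles) {
            Path target = targetRoot.resolve(relative);
            if (Files.exists(target)) {
                throw new FileAlreadyExistsException(target.toString());
            }
        }
        // Second pass: copy, creating intermediate directories as needed.
        for (Path relative : relativeFiles) {
            Path target = targetRoot.resolve(relative);
            Files.createDirectories(target.getParent());
            Files.copy(sourceRoot.resolve(relative), target);
        }
        return relativeFiles;
    }
}
```

A collision here is a sign that the node identifiers were not set up uniquely, which needs fixing before any proxy recovery is attempted.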

So, now we have the files from ostore1 merged into ostore2. The next recovery pass will attempt to activate and complete the transactions. To do that it will need to obtain new XAResource instances, since those are not typically serializable and the only information in the logs is therefore the Xid. So, AS2 is going to have to have a set of XA datasources that contains all those that were available in AS1. This is not as tricky as it sounds, since the odds are that all the load balanced nodes in such a setup are homogeneous with respect to such configuration. If they are not then the deployment and update steps performed by sysadmins need to encompass ensuring that any xa-datasource config files and drivers are copied as needed.
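To illustrate why only the Xid is needed, here is a rough sketch of what completing logged transactions against a freshly obtained XAResource looks like using the standard javax.transaction.xa interfaces. This is not the recovery manager's actual code. The class name ProxyRecoveryPass and the method commitLogged are invented for the example, and it assumes the resource manager returns all of its in-doubt Xids from the first scan call, which not every driver guarantees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Hypothetical sketch of one recovery pass against a single resource manager.
final class ProxyRecoveryPass {

    // Xid implementations need not override equals, so compare field by field.
    static boolean sameXid(Xid a, Xid b) {
        return a.getFormatId() == b.getFormatId()
                && Arrays.equals(a.getGlobalTransactionId(), b.getGlobalTransactionId())
                && Arrays.equals(a.getBranchQualifier(), b.getBranchQualifier());
    }

    /**
     * Asks the resource manager for its in-doubt branches and commits those
     * whose Xid also appears in the (merged) transaction log.
     * Returns the Xids that were committed.
     */
    static List<Xid> commitLogged(XAResource resource, List<Xid> loggedXids)
            throws XAException {
        // Assumption: everything in doubt comes back from the first scan call.
        Xid[] inDoubt = resource.recover(XAResource.TMSTARTRSCAN);
        resource.recover(XAResource.TMENDRSCAN);

        List<Xid> committed = new ArrayList<>();
        if (inDoubt == null) {
            return committed;
        }
        for (Xid candidate : inDoubt) {
            for (Xid logged : loggedXids) {
                if (sameXid(candidate, logged)) {
                    // Two-phase commit of an already prepared branch.
                    resource.commit(candidate, false);
                    committed.add(candidate);
                    break;
                }
            }
        }
        return committed;
    }
}
```

Branches that the resource manager reports but that have no matching log record are deliberately left untouched by this sketch.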

If our luck is holding then the next recovery pass will use the datasources in AS2 to attach to the xa resource managers and commit the transactions. But we're not done yet.

The crash of AS1 may have occurred in the time window between the resource manager preparing the tx branch and the transaction manager writing a log record for it. In such a case the RM is still holding locks on the data, but there is no tx log to ship over for AS2 to recover. So that branch is going to remain in doubt until AS1 recovers. Except it won't recover - recall we sabotaged it to stop it recovering whilst the proxy recovery was running on AS2. So now what? We need AS2 to additionally take responsibility for rolling back the orphan tx branches owned by AS1. That's what the xaRecoveryNodes list of values is for. In a normal configuration each node takes responsibility only for its own transactions, but if we append AS1's node id onto AS2's xaRecoveryNodes list then it will recover orphans on behalf of AS1. Just one snag - the list is not mutable at runtime, so we'll need to take AS2 down to make the change. Does this problem sound familiar?
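The orphan rule can be stated as a small decision table. The following is a toy model only, not the transaction manager's API: OrphanPolicy, Action and decide are hypothetical names, and passing the owning node id as a plain string is an assumption made to keep the example short. It just shows how membership in the xaRecoveryNodes list changes what a recovering node is allowed to do with an in-doubt branch.

```java
import java.util.Set;

// Hypothetical model of the decision a recovering node makes per in-doubt branch.
final class OrphanPolicy {

    enum Action { COMPLETE_FROM_LOG, ROLL_BACK_ORPHAN, LEAVE_ALONE }

    /**
     * @param branchOwnerNodeId node id of the server that created the branch
     * @param hasLogRecord      true if a tx log record for the branch exists locally
     * @param recoveryNodes     node ids this server recovers for (its xaRecoveryNodes)
     */
    static Action decide(String branchOwnerNodeId, boolean hasLogRecord,
                         Set<String> recoveryNodes) {
        if (hasLogRecord) {
            // The log says how the transaction ended, so finish it accordingly.
            return Action.COMPLETE_FROM_LOG;
        }
        if (recoveryNodes.contains(branchOwnerNodeId)) {
            // Prepared but never logged: an orphan this node is responsible for.
            return Action.ROLL_BACK_ORPHAN;
        }
        // Someone else's branch; it may belong to a live transaction elsewhere.
        return Action.LEAVE_ALONE;
    }
}
```

With AS2 configured only for its own node id, an unlogged branch created by AS1 falls into LEAVE_ALONE and stays in doubt. Once AS1's node id is appended to AS2's list, the same branch becomes ROLL_BACK_ORPHAN.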

As you'll have gathered it is technically feasible to do what you want for JTA transactions, but you really need to consider if it's worthwhile. You'll probably have to get AS1 going again anyhow, so what advantage do you gain? We've occasionally considered adding further automation to make the process somewhat easier, but never been convinced there is a widespread use case to justify the implementation work. Until we are, if you're determined to implement a proxy recovery plan, make sure you document the steps thoroughly and test it regularly to e.g. detect cases in which the necessary datasource definitions or drivers may be missing in one or more nodes. It's not something you want the ops staff to be attempting for the first time in the middle of a partial outage situation that's already stressing them.

I could describe the additional problems that arise when trying to do this with true distributed transactions (JTS/XTS), but I figure I've given you more than enough to worry about already.
3. Re: Question about distributed transactions recovery

andrey.vorobiev Sep 16, 2011 9:02 AM (in response to jhalliday)

Thanks for the reply.