> At the moment to me it looks like an error in the JBossTM
Um, no. It's more of a difference of opinion in the interpretation of the XA spec. I'll reiterate my position: This is a bug in the MS code.
Whilst the spec mandates that the data portion alone be unique, JBossTS nevertheless also uses the formatId when comparing Xids for equality. This is on the basis that Xid values are strongly typed. You can't meaningfully compare raw byte sequences for equality if e.g. one is little-endian and the other big-endian, or one is an IEEE floating point value and the other an integer, or one is 4x8-bit values and the other 2x16-bit values concatenated. Value comparison must take account of type information too. MS is assuming that it can arbitrarily alter the formatId and still expect the TM it is interacting with to consider the mutated value equal to the original. JBossTS does not see the world that way. Until the difference of opinion is settled you can't expect JBossTS recovery to work with MS.
I didn't mean the formatId part, but the fact that something is actually cancelling the transaction. Why would it do that? Or does it consider that RECOVERED transaction to be a wrong one and then roll it back as a safety precaution, or what?
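To make the two interpretations concrete, here is a minimal sketch of both comparisons. SimpleXid is a hypothetical stand-in for javax.transaction.xa.Xid, and the formatId value 0x1234 is invented for illustration; neither is taken from the JBossTS or MS sources.

```java
import java.util.Arrays;

// Hypothetical stand-in for javax.transaction.xa.Xid, kept local so the sketch is self-contained.
final class SimpleXid {
    final int formatId;
    final byte[] gtrid; // global transaction id bytes
    final byte[] bqual; // branch qualifier bytes

    SimpleXid(int formatId, byte[] gtrid, byte[] bqual) {
        this.formatId = formatId;
        this.gtrid = gtrid.clone();
        this.bqual = bqual.clone();
    }
}

final class XidEqualityDemo {
    // Data-only comparison: treats the Xid as an opaque blob and ignores formatId.
    static boolean sameBytes(SimpleXid a, SimpleXid b) {
        return Arrays.equals(a.gtrid, b.gtrid) && Arrays.equals(a.bqual, b.bqual);
    }

    // Typed comparison: the bytes only count as equal if they claim the same format.
    static boolean sameTyped(SimpleXid a, SimpleXid b) {
        return a.formatId == b.formatId && sameBytes(a, b);
    }

    public static void main(String[] args) {
        int tmFormatId = 0x1234; // made-up value, not the real JBossTS formatId
        SimpleXid written = new SimpleXid(tmFormatId, new byte[] {1, 2, 3}, new byte[] {9});
        // Same data, but with the formatId reset to 0 by the other side.
        SimpleXid returned = new SimpleXid(0, new byte[] {1, 2, 3}, new byte[] {9});

        System.out.println("bytes only: " + sameBytes(written, returned));
        System.out.println("typed:      " + sameTyped(written, returned));
    }
}
```

With the formatId reset to 0, the data-only comparison still matches while the typed comparison does not, which is the disagreement described above.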
As for JBossTS <-> SQL Server not working, it's still in your supported environments on SOA-P platform, but that's a whole different issue then (and perhaps not on this public forum) ..
Ah well, I'll file a bug report with MS "premium support" and see what they reply.
You would have to step through it in a debugger to be certain, but the expected behavior, consistent with your logs, would be for the transaction manager to consider the Xid an orphan on the basis of having a matching node name but no matching transaction record. Thus it gets rolled back under the presumed abort rule. You may be able to hack in a behavior change by substituting your own implementation of JTATransactionLogXAResourceOrphanFilter into the filter chain, but that may have side effects on other non-JTA transaction types.
(05:00:55 PM) mmusgrov: gaYak: Thanks for the extra debug log. It does seem to corroborate what Jonathan said above:
(05:01:01 PM) mmusgrov: We prepare the transaction and get OK response back from the SQL server resource manager. We write a transaction log and encode an XID into the log using a formatId that is unique to our TM.
(05:01:09 PM) mmusgrov: In the meantime there is a recovery scan where we ask the RM for its current list of xids. One of the ones it returns corresponds to the prepared transaction (but the RM has cleared the formatId field).
(05:01:13 PM) mmusgrov: We extract the formatId from the xid but it does not match the formatId that we used to encode the log record we wrote after prepare succeeded. The xid is therefore eligible for recovery, for which in this case we use presumed abort. If SQL Server had left the formatId as it was, our TM would not try to recover it and the transaction would have completed as normal without error.
(05:01:19 PM) mmusgrov: Then the commit runs, which the RM fails since it has just been instructed to roll back, hence you get the heuristic outcome.
(05:01:25 PM) mmusgrov: Unfortunately I cannot see any workaround for the issue in JBossTS_4_6_1 (since JTATransactionLogXAResourceOrphanFilter is only in a later release)
(05:02:23 PM) mmusgrov: gaYak: I'll update the forum post to reflect this analysis.
(05:27:01 PM) mmusgrov: gaYak: It is definitely an issue with the RM: the formatId field says how the xid data is encoded. JBossTS created the xid with a private encoding. If the RM sets the formatId to zero to indicate OSI CCR encoding rules then it is misleading since the data is obviously encoded using a private format - it is just plain silly.
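The sequence described in the chat can be sketched as a toy recovery scan. This is only an illustration of the reasoning, not the JBossTS recovery code: the class name, the string keys standing in for Xids, and the formatId value 0x1234 are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RecoveryScanSketch {
    static final int TM_FORMAT_ID = 0x1234; // made-up value, not the real JBossTS formatId

    // A string key standing in for a full Xid: formatId plus the data portion.
    static String key(int formatId, String branchData) {
        return formatId + ":" + branchData;
    }

    // Anything the RM reports that has no matching log record is treated as an orphan.
    static List<String> findOrphans(Set<String> transactionLog, List<String> reportedByRm) {
        List<String> orphans = new ArrayList<>();
        for (String xidKey : reportedByRm) {
            if (!transactionLog.contains(xidKey)) {
                orphans.add(xidKey); // would be rolled back under presumed abort
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        // Step 1: prepare succeeded, so the TM logs the Xid with its own formatId.
        Set<String> log = new HashSet<>();
        log.add(key(TM_FORMAT_ID, "node1-tx42"));

        // Step 2: the recovery scan asks the RM for its in-doubt Xids;
        // the RM returns the same branch with the formatId cleared to 0.
        List<String> reported = Collections.singletonList(key(0, "node1-tx42"));

        // Step 3: no log record matches, so the branch looks like an orphan.
        System.out.println("orphans: " + findOrphans(log, reported));
    }
}
```

Had the RM returned the formatId unchanged, the key would have matched the log record and the scan would have left the branch alone.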
Jonathan Halliday wrote:
It's not, nor is using 0 an option - the spec reserves that value for OSI CCR naming. This is a bug in the MS code. I've raised it with them before and been brushed off - they don't seem to regard it as a problem. Perhaps you'll have more luck convincing them to fix it.
I understand that this wouldn't suffice as a real fix. However, after talking with mmusgrov and studying the behaviour more, I wanted to ask: is this a possible workaround? The proposed class you mentioned earlier is in the later versions of JBossTS, but I'm using 4.6.1. Assuming my system has only SQL Server XA resources and Messaging, would it work as a temporary workaround to patch the FORMAT_ID to 0? Or would it break something? Since SQL Server ignores the specification anyway, it shouldn't break anything in that direction, but it would make the recovery / scanning process function properly.
Or is there a better temporary workaround in 4.6.1? Even if it's a kludge, it would still be better than a non-functioning system.
> is this a possible workaround
Short Answer? No.
Now go get some coffee and prepare for the long version...
Xids are required by the spec to be globally unique. In other words, they are guids. Generating globally unique byte sequences in a totally distributed fashion is impossible. Some degree of centralized control is required. This generally takes the form of a) agreeing some structure for the sequence and b) coordinating assignment of values for some part of that structure between the distributed parties. For example, ethernet MAC addresses are standardized to begin with a byte sequence which is a globally assigned manufacturer code. The following bytes are then at the discretion of the specific manufacturer, who can use whatever scheme they like, e.g. productline+sequencenumber, to get uniqueness within that scope. As another example, IP address space is handled by recursive division down a hierarchy of registry authorities. Either way, the overall result strikes a balance between centralized coordination and distributed concurrency.
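As a rough sketch of the "agreed structure plus coordinated prefix" idea, the following generator concatenates a centrally assigned prefix with a locally managed sequence number. The class name and the layout (prefix followed by an 8-byte counter) are invented for illustration and do not correspond to any real Xid or MAC format.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: ids are unique across generators as long as every
// generator is handed a different prefix of the same length.
final class PrefixedIdGenerator {
    private final byte[] prefix;                            // the centrally coordinated part
    private final AtomicLong sequence = new AtomicLong();   // the locally managed part

    PrefixedIdGenerator(byte[] assignedPrefix) {
        this.prefix = assignedPrefix.clone();
    }

    // Returns prefix bytes followed by an 8-byte big-endian sequence number.
    byte[] next() {
        return ByteBuffer.allocate(prefix.length + Long.BYTES)
                .put(prefix)
                .putLong(sequence.getAndIncrement())
                .array();
    }
}
```

Two parties holding different prefixes can each call next() without talking to one another and still never produce the same byte sequence; only the prefix assignment needs coordination.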
Now consider Xids specifically. It's not possible to generate a globally unique id unless you do so in a) some defined format and b) without treading on anyone else's toes. Unfortunately the XA spec does not provide for a global registry of formatIds, but it does allow for 0 to be used to describe a specific way of structuring Xids. In practice there are only a few players generating Xids and in cases where they don't use 0, they use a fixed formatId that (thus far at least, thanks to good luck and a sparsely populated space) does not match those used by any other player.
Using a formatId other than 0 allows transaction managers to embed information into the Xid in a proprietary manner. JBossTS does this to e.g. encode a node id so that where more than one transaction manager instance is talking to a resource manager, it's possible to figure out which one owns the tx branch. Certain recovery operations rely on being able to parse the known Xid format and extract this information. A TM will, by default, handle recovery only for Xids that belong to it, in order to avoid treading on the toes of other instances that may be running. If it can't recognize the Xid structure based on formatId, it can't reliably parse the Xid to extract the node id, because in some foreign formats the bytes at that location may have totally different semantics. Thus it does not know which Xids it may safely take responsibility for.
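A minimal sketch of why the formatId gates the parsing step. The byte layout here (one length byte followed by an ASCII node name) and the formatId value 0x1234 are invented purely for illustration; the real JBossTS Xid encoding is different and is not reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class NodeIdSketch {
    static final int OUR_FORMAT_ID = 0x1234; // made-up value, not the real JBossTS formatId

    // Invented layout: data[0] = length n, data[1..n] = ASCII node name.
    // The bytes are only interpreted when the formatId says they are in our format.
    static Optional<String> nodeIdOf(int formatId, byte[] data) {
        if (formatId != OUR_FORMAT_ID || data.length == 0) {
            return Optional.empty(); // foreign or unknown format: do not guess
        }
        int n = data[0] & 0xff;
        if (data.length < 1 + n) {
            return Optional.empty(); // malformed for our layout
        }
        return Optional.of(new String(data, 1, n, StandardCharsets.US_ASCII));
    }

    // A TM only takes responsibility for branches it can prove it owns.
    static boolean mayRecover(String ourNodeId, int formatId, byte[] data) {
        return nodeIdOf(formatId, data).map(ourNodeId::equals).orElse(false);
    }
}
```

For the same data bytes, mayRecover("n1", OUR_FORMAT_ID, data) can succeed while mayRecover("n1", 0, data) cannot, because with formatId 0 there is no basis for reading a node name out of the bytes at all.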
So, removing the formatId check works in cases where you're just interested in comparing the Xid data bytes as an opaque binary blob, which happens to be the case in some recovery scenarios you've come across. MS is incorrectly assuming that this is all that's ever needed. However, in other scenarios you actually need that formatId information to parse the Xid data, and MS is not supplying it correctly, so recovery won't work. Murphy's law dictates that you won't find these corner cases until the worst possible moment, but trust me, they are lurking in there.
> The proposed class you mentioned earlier is in the later versions of JBossTS, however I'm using 4.6.1
The class was created as part of a refactoring for JBTM-723. In older versions the equivalent code is present as part of XARecoveryModule.java
Ok, thanks for the good explanation.
Thanks to Halliday and Musgrove for providing a suitable workaround to fix the issues we found. If there are others who have had these problems with EAP5+SQL Server, here's a solution which I've so far tested in a few environments (AIX/Solaris/Windows + SQL Server 2008 R2 Standard/Enterprise).
The issue was twofold: one part was a bug in SQL Server's driver (what this thread was all about) and the second a bug in the AS, which causes too many transaction scans to occur if you have more than one XA datasource pointing at the same database server. The latter is easy to fix: set <no-recover>true</no-recover> in your -ds.xml on one of the XA datasources (leave the other one as it is) so there will be only one registered XA recovery source. This will not break recovery of the other resource.
The JDBC driver fix requires downloading Byteman from http://www.jboss.org/byteman and installing it on the same server as JBossAS. Then save the following script somewhere:
# Rule to fix MS SQL's JDBC driver's formatId impl to work with JBossTS
# Thanks to jhalliday & adinn from JBoss
RULE fix ms driver
And in the JBossAS start parameters (for example in run.conf) add the following (fix the paths):
# Byteman to fix MS SQL JDBC driver
There are other parameters available for Byteman too (such as listeners etc.), but they're not necessary for this patch to work.
After these changes, I have so far encountered no problems with the recovery process and XA transactions with SQL Server.