1 2 3 … Previous Next 77 Replies Latest reply on Oct 31, 2011 9:02 AM by tomjenkinson

Remote txinflow: XID changes

tomjenkinson Oct 7, 2011 5:31 AM

Just to get the ball rolling on this part of this design discussion, we need to consider how best to update XidImple to allow the recovery manager to detect transactions managed by this node as the status quo would leave transactions that flowed to remote servers un-recoverable as the recovery manager checks the node name encoded in the XidImple would not have the same node ID as the local server. Also, branches passed to the XAResources would be the same branch and this could have undesired behavior in the XAResources.

David suggests one approach:

David Lloyd wrote:

You could however use integer packing to make more room; in such a scheme, the bqual bits which identify the node would still be an integer sequence but would work like this:

For this to work you'd have to treat 0000 as a special padding sequence for the case where only one nybble of the last byte is used. Thus giving:

0000: padding

0001-0111: resources 0-6

1000 0000-1011 1111: resources 7-70

1100 0000 0000-1101 1111 1111: resources 71-582

etc.

Node names are currently recorded as variable length Strings, one option is to cat the new node name onto the end of the bqual resulting in:

Xid {

formatId

gtrid - uid + nodename

bqual - [uid (for the sequence number) + jndi name] + [nodename of the subordinate]

}

This will keep the bquals unique between TMs and will facitilitate recovery of this transaction.

To be continued...

EDIT

------

Davids original messages can be found here and here.

1. Re: Remote txinflow: XID changes

tomjenkinson Oct 5, 2011 8:31 AM (in response to tomjenkinson)

OK, the proposal is for the following structure:

XID {
     formatid =131077 // INCREMENTED
     gtrid = <uid> <int:originatornodename> // IMMUTABLE
     bqual = <uid> <int:nodename> <int:parentnodename> <eisname> //MUTABLE
}

javax.transaction.xa.Xid states MAXBQUALSIZE is 64, our proposal would fit in this space <uid> is 28 bytes, node name is 4 bytes apiece (total 8) meaning bytes 36 -> 64 are available for the eisname

The current structure is:
XID {
     formatid =131076
     gtrid = <uid> <US-ASCII:node name>
     bqual = <uid> <US-ASCII:eisname>
}

This change will pave the way for us to add a further recovery filter to the XAResourceOrphanFilter recovery chain so that we can check the bqual for whether this XID is owned by the transaction manager.

This will necessitate a change to the way that the AS is configured as currently it is expecting to confiugure node name on us as a string, therefore I propose to get this change in ASAP (i.e. put a 4.16.0 out for Monday and notify the AS team of our intentions). It will also need to be clear in the documentation that node names *MUST* be unique.

Top level XIDs would set their parent node name to \0 so we would need to make sure that node names are greater than 0. Child nodes can bump the nodename slot in the bqual up to the parentnodename slow and put their own node name in the now empty nodename slot

Comments/suggestions are certainly welcome!
Actions
2. Re: Remote txinflow: XID changes

tomjenkinson Oct 6, 2011 10:06 AM (in response to tomjenkinson)

I have created https://issues.jboss.org/browse/JBTM-916 to model this. If it passes QA I will put the release out with it in on Monday, then get to work on https://issues.jboss.org/browse/JBTM-917
Actions
3. Re: Remote txinflow: XID changes

dmlloyd Oct 6, 2011 9:33 PM (in response to tomjenkinson)

AS currently has a notion of a string nodename which extends not only to transactions but to remoting itself. Are you proposing introducing a second concept of node name that is integer-based?
Actions
4. Re: Remote txinflow: XID changes

dmlloyd Oct 7, 2011 12:50 AM (in response to dmlloyd)

Also... why is a node name required at all in the branch qualifier? If you simply used an empty bqual ID that consists only of the numeric resource # sequence like I illustrated, wouldn't you still be able to recover? The bqual node name information is not useful in identifying the transactions to recover as far as I can see, since in the end each transaction coordinator is going to rely on the transport layer to restore any connections and talk to the subordinate(s) to get their list of transactions, and ultimately the transaction ownership would be determined by the gtid (right?), so the bqual information doesn't appear to add anything at all and merely needs to be unique per branch. The worst I could imagine would be that the same node might list two branches as needing recovery... is this logic more subtle than I seem to think it is?
Actions
5. Re: Remote txinflow: XID changes

tomjenkinson Oct 7, 2011 4:46 AM (in response to dmlloyd)

Certainly we are adding this for remoting in the first instance so we want it to play nicely with that!

One issue is the diamond question, e.g.

        -> TM2
TM1          -> TM4
       -> TM3

If we don't know the parent then both TM2 and TM3 will potentially try to recover TM4 for all indoubt transactions related to transactions initiated at TM1 which will have undesirable recovery results.

The other point is the bqual must have the EIS (JNDI) name in it as we have had introduced this functionality for customers before to assist them in debugging transaction related issues so we would definitely need to maintain this. That said, you were able to get 600 resources into 3 bytes (out of the 64 available) so I dont anticipate leaving space for the name being an issue (so long as we have a delimiter).

Finally, I maybe am not understanding your proposal correctly, a worked example would help. What I can't quite understand is how the address space of the bqual will work in a pyramid.

e.g:
          -> TM2 -> RM1
TM1   -> TM3 -> RM2
          -> RM3

Can you let me know what the resource number of each of the TMs and RMs are in that picture so I can understand a little more clearly?
Actions
6. Re: Remote txinflow: XID changes

tomjenkinson Oct 7, 2011 7:25 AM (in response to tomjenkinson)

Also, even with a solution to the diamond/pyramid approach (which I guess is achieved because you pack the history into the bqual) for an example such as: 1000 0000 1000 0000, how would I know whether or not this was a sequence of two sevens, and not 712 (comprised of 128 for the binary +1 for the algorithm, +528 for the start of the range).

With our proposed approach you do not need to maintain the history of nodes in the bqual as this can be determined from the top-down recovery scanning of parent/child node names packed in the bqual resulting in a constant space use in the bqual of two ints.

Can we restrict the remoting node names down to values that can be converted uniquely to an int
Actions
7. Re: Remote txinflow: XID changes

dmlloyd Oct 7, 2011 9:27 AM (in response to tomjenkinson)
Tom Jenkinson wrote:

Certainly we are adding this for remoting in the first instance so we want it to play nicely with that!

One issue is the diamond question, e.g.

        -> TM2
TM1          -> TM4
       -> TM3

If we don't know the parent then both TM2 and TM3 will potentially try to recover TM4 for all indoubt transactions related to transactions initiated at TM1 which will have undesirable recovery results.
Okay that makes sense, though isn't the parent info in the gtid?

Tom Jenkinson wrote:

The other point is the bqual must have the EIS (JNDI) name in it as we have had introduced this functionality for customers before to assist them in debugging transaction related issues so we would definitely need to maintain this. That said, you were able to get 600 resources into 3 bytes (out of the 64 available) so I dont anticipate leaving space for the name being an issue (so long as we have a delimiter).
Ah okay. It seems odd to me not to have that info just in the gtid - I suppose it's a space issue at that point?

Tom Jenkinson wrote:

Finally, I maybe am not understanding your proposal correctly, a worked example would help. What I can't quite understand is how the address space of the bqual will work in a pyramid.

e.g:
          -> TM2 -> RM1
TM1   -> TM3 -> RM2
          -> RM3

Can you let me know what the resource number of each of the TMs and RMs are in that picture so I can understand a little more clearly?
I think it would be like this:
TM1->TM2: 0001 0000
TM2->RM1: 0001 0001
TM1->TM3: 0010 0000
TM3->RM2: 0010 0001
TM1->RM3: 0011 0000

The 0000 are there to pad it out to a byte. Though this isn't actually a diamond. Trying out a couple more just for the exercise...

In your top diamond where there are just four TMs in a diamond...
TM2's prefix is 0001
TM3's prefix is 0010
TM4's prefix depends on whether TM2 or TM3 inflowed the work in question, it could be 0001 0001 or 0010 0001.

Tom Jenkinson wrote:

Also, even with a solution to the diamond/pyramid approach (which I guess is achieved because you pack the history into the bqual) for an example such as: 1000 0000 1000 0000, how would I know whether or not this was a sequence of two sevens, and not 712 (comprised of 128 for the binary +1 for the algorithm, +528 for the start of the range).

With our proposed approach you do not need to maintain the history of nodes in the bqual as this can be determined from the top-down recovery scanning of parent/child node names packed in the bqual resulting in a constant space use in the bqual of two ints.

Can we restrict the remoting node names down to values that can be converted uniquely to an int

Two sevens would be: 1000 0000 1000 0000, whereas 712 would be: 1110 0000 1000 0001

The number of leading '1' bits tells you how many additional nybbles to read. Put another way:
0xxx
10xx xxxx
110x xxxx xxxx
1110 xxxx xxxx xxxx
1111 0xxx xxxx xxxx xxxx
1111 10xx xxxx xxxx xxxx xxxx
etc., thus every sequence is unique, and every row has 8 times more possible values than the previous row.
Actions
8. Re: Remote txinflow: XID changes

dmlloyd Oct 7, 2011 9:35 AM (in response to dmlloyd)

An alternative, but equivalent, encoding would be to mark each nybble with a "more" bit - if the bit is "1" then the next nybble has more bits, with the last nybble starting with 0:

1001 1010 1000 0001 -> 001 010 000 001

Same idea, really.
Actions
9. Re: Remote txinflow: XID changes

tomjenkinson Oct 7, 2011 9:47 AM (in response to dmlloyd)
David Lloyd wrote:

Tom Jenkinson wrote:

Certainly we are adding this for remoting in the first instance so we want it to play nicely with that!

One issue is the diamond question, e.g.

        -> TM2
TM1          -> TM4
       -> TM3

If we don't know the parent then both TM2 and TM3 will potentially try to recover TM4 for all indoubt transactions related to transactions initiated at TM1 which will have undesirable recovery results.
Okay that makes sense, though isn't the parent info in the gtid?

Only the root TM would be in the gtrid, we can't edit that as the transaction flows.

David Lloyd wrote:

Tom Jenkinson wrote:

The other point is the bqual must have the EIS (JNDI) name in it as we have had introduced this functionality for customers before to assist them in debugging transaction related issues so we would definitely need to maintain this. That said, you were able to get 600 resources into 3 bytes (out of the 64 available) so I dont anticipate leaving space for the name being an issue (so long as we have a delimiter).
Ah okay. It seems odd to me not to have that info just in the gtid - I suppose it's a space issue at that point?
Again, each RM will need to put its own name in the XID so again we can't cat this into the gtrid

David Lloyd wrote:

Tom Jenkinson wrote:

Finally, I maybe am not understanding your proposal correctly, a worked example would help. What I can't quite understand is how the address space of the bqual will work in a pyramid.

e.g:
          -> TM2 -> RM1
TM1   -> TM3 -> RM2
          -> RM3

Can you let me know what the resource number of each of the TMs and RMs are in that picture so I can understand a little more clearly?
I think it would be like this:
TM1->TM2: 0001 0000
TM2->RM1: 0001 0001
TM1->TM3: 0010 0000
TM3->RM2: 0010 0001
TM1->RM3: 0011 0000

The 0000 are there to pad it out to a byte. Though this isn't actually a diamond. Trying out a couple more just for the exercise...

In your top diamond where there are just four TMs in a diamond...
TM2's prefix is 0001
TM3's prefix is 0010
TM4's prefix depends on whether TM2 or TM3 inflowed the work in question, it could be 0001 0001 or 0010 0001.

Understood - thanks!
David Lloyd wrote:

Tom Jenkinson wrote:

Also, even with a solution to the diamond/pyramid approach (which I guess is achieved because you pack the history into the bqual) for an example such as: 1000 0000 1000 0000, how would I know whether or not this was a sequence of two sevens, and not 712 (comprised of 128 for the binary +1 for the algorithm, +528 for the start of the range).

With our proposed approach you do not need to maintain the history of nodes in the bqual as this can be determined from the top-down recovery scanning of parent/child node names packed in the bqual resulting in a constant space use in the bqual of two ints.

Can we restrict the remoting node names down to values that can be converted uniquely to an int

Two sevens would be: 1000 0000 1000 0000, whereas 712 would be: 1110 0000 1000 0001

The number of leading '1' bits tells you how many additional nybbles to read. Put another way:
0xxx
10xx xxxx
110x xxxx xxxx
1110 xxxx xxxx xxxx
1111 0xxx xxxx xxxx xxxx
1111 10xx xxxx xxxx xxxx xxxx
etc., thus every sequence is unique, and every row has 8 times more possible values than the previous row.

Ah, understood. I was thinking the first few bits were binary, hence 100 could have been either read as 1 (for 1 trailing nybble) or 100 (for 4). Now I get you, the prefix is the number of nybbles, like in roman numerals.

If it is 0 terminated as you show above though, it looks like 1111 10xx xxxx xxxx xxxx xxxx should really be 1111 10xx xxxx xxxx xxxx xxxx xxxx (five trailing nybbles AFTER the nybble where the counter prefix finishes? Or should it be read as the number of "total number of nybbles (less one) that the number is read from". E.g in your last case the number has been encoded in 5 plus 1 nybbles.

Anyway, for the purposes of the discussion I get what you are proposing, I guess the question would be when it comes to recovery, when the node recovers how does it know which XIDs it should be looking for? With my proposal where the node is configured with a node ID via the CoreEnvironmentBean it can look in the XID for a match, in your scenario how would it know?

Plus I fear we will run out of space as I think you are encoding the full path of TMs in the XID?
Actions
10. Re: Remote txinflow: XID changes

dmlloyd Oct 7, 2011 10:14 AM (in response to tomjenkinson)

Tom Jenkinson wrote:

Only the root TM would be in the gtrid, we can't edit that as the transaction flows.

But for recovery we only need the root TM right? That would eliminate the risk of a subordinate TM recovering a branch of a transaction because it would recognize that it wasn't the root?

Tom Jenkinson wrote:

Again, each RM will need to put its own name in the XID so again we can't cat this into the gtrid.
Okay, so the gtid gets the root TM's name and the bqual gets the RM name (assuming such things have names)?

Tom Jenkinson wrote:

Ah, understood. I was thinking the first few bits were binary, hence 100 could have been either read as 1 (for 1 trailing nybble) or 100 (for 4). Now I get you, the prefix is the number of nybbles, like in roman numerals.

If it is 0 terminated as you show above though, it looks like 1111 10xx xxxx xxxx xxxx xxxx should really be 1111 10xx xxxx xxxx xxxx xxxx xxxx (five trailing nybbles AFTER the nybble where the counter prefix finishes? Or should it be read as the number of "total number of nybbles (less one) that the number is read from". E.g in your last case the number has been encoded in 5 plus 1 nybbles.
Well, I meant 5 nybbles holding data however after reviewing I guess that's not really correct. It would be more correct to say "the number of nybbles in the total encoded number = the leading number of 1s plus 1". With 0000 being a special case.

Tom Jenkinson wrote:

Anyway, for the purposes of the discussion I get what you are proposing, I guess the question would be when it comes to recovery, when the node recovers how does it know which XIDs it should be looking for? With my proposal where the node is configured with a node ID via the CoreEnvironmentBean it can look in the XID for a match, in your scenario how would it know?
Assuming that transactions should only be recovered by the root node which originated it, we would just use gtid to identify the "ownership" of the transaction. Otherwise, all the known unrecovered XIDs from the node itself plus its subordinates should be propagated back upon recovery request. I'd also need some kind of loop detection at the protocol level though.

We could potentially optimize this process by using our knowledge of the XID format to filter the XID list by owner name before sending it back to the requestor.
Actions
11. Re: Remote txinflow: XID changes

dmlloyd Oct 7, 2011 10:16 AM (in response to tomjenkinson)

Tom Jenkinson wrote:

Plus I fear we will run out of space as I think you are encoding the full path of TMs in the XID?
Just the numeric path, which is why I recommended a compact notation. But yeah ultimately the tree depth and breadth is limited by the amount of bqual space using my scheme.
Actions
12. Re: Remote txinflow: XID changes

tomjenkinson Oct 7, 2011 12:59 PM (in response to dmlloyd)

I think I understand your notation approach now and can understand how it might work (assuming we can determine a delimiter to use to seperate the path from the eis name [typically the JNDI name]).

That said, and with the restriction in mind that you have outlined below, are you sure that managing node names as ints rather than strings is an overhead to the deployer that is "totally unacceptable" (my words, not yours)? I have attached a "git diff" of the transaction work which shows where I would anticipate the AS changes required, you will note that currently the node name is being hard coded as "1", I have just hard coded it as 1 instead in my patch but I would suggest removing the hardcoding of the name from the Java code and hard code it in the XML instead to make it clear that a user must change it if they have more than one node.

As of last night I have (or indeed, had, as I did it before this part of the conversation resumed) put back a fix to TS that converts the XID with the parent and child encoded in the bqual, I can easily back this out and go with your approach but I see at least the following advantage to mine:

The number of nodes in the tree is not bounded by 64 bytes as we only encode the node of the subordinate and its parent

The drawback is that an administrator would need to configure each AS with a remoting node name, plus a TS integer node name.
David Lloyd wrote:

Tom Jenkinson wrote:

Plus I fear we will run out of space as I think you are encoding the full path of TMs in the XID?
Just the numeric path, which is why I recommended a compact notation. But yeah ultimately the tree depth and breadth is limited by the amount of bqual space using my scheme.

Just trying to get some definitive clarity to prevent me having to rewrite this thing again. But if you go with my current approach "We could potentially optimize this process by using our knowledge of the XID format to filter the XID list by owner name before sending it back to the requestor." will be handled by the XID.
Actions
13. Re: Remote txinflow: XID changes

tomjenkinson Oct 7, 2011 1:01 PM (in response to tomjenkinson)
git diff attached sorry!

patch.zip 1.5 KB
Actions
14. Re: Remote txinflow: XID changes

dmlloyd Oct 7, 2011 3:49 PM (in response to tomjenkinson)

Tom Jenkinson wrote:

I think I understand your notation approach now and can understand how it might work (assuming we can determine a delimiter to use to seperate the path from the eis name [typically the JNDI name]).

That said, and with the restriction in mind that you have outlined below, are you sure that managing node names as ints rather than strings is an overhead to the deployer that is "totally unacceptable" (my words, not yours)? I have attached a "git diff" of the transaction work which shows where I would anticipate the AS changes required, you will note that currently the node name is being hard coded as "1", I have just hard coded it as 1 instead in my patch but I would suggest removing the hardcoding of the name from the Java code and hard code it in the XML instead to make it clear that a user must change it if they have more than one node.
When I say "node name" I'm generally referring to the JBossAS/Remoting concept of a name for the instance itself, which is by default derived from the host name of the system (but is modifiable in the event of, say, multiple instances running on one host). But this node name is basically of arbitrary length, so encoding it into bqual (as-is) is not an option no matter what. That said, it would be nice if TS and AS had the same notion of node name.

As far as int IDs go, forcing the user to define a unique integer ID for each possible participant is indeed probably too great an administrative/management burden, and it's not terribly user-friendly for reading purposes either; at least, I don't relish selling the idea to support or the community (nor trying to explain it for that matter). But, my approach was based on the sole requirement that any node in the hierarchy should be able to independently generate an ID which is definitely unique globally for that gtid. So it doesn't really take into account other requirements, such as puting human-readable debug info in the XID, in the first place.

Tom Jenkinson wrote:

As of last night I have (or indeed, had, as I did it before this part of the conversation resumed) put back a fix to TS that converts the XID with the parent and child encoded in the bqual, I can easily back this out and go with your approach but I see at least the following advantage to mine:

The number of nodes in the tree is not bounded by 64 bytes as we only encode the node of the subordinate and its parent

The drawback is that an administrator would need to configure each AS with a remoting node name, plus a TS integer node name.
Well... that could still be a viable approach even with string names, maybe. With 64 bytes to fill but only one or two names to fill it with, it seems somewhat more realistic than trying to store an entire chain of names (even if the name ends up being an abbreviated form of the system node name). Recovery can't really take advantage of that information as robustly though (to use that information for recovery you'd have to restore not only all the nodes but all the original paths between them, even if they are redundant), so other than debug purposes, there doesn't seem to be a lot of advantage. One other possible weakness is that if two nodes have the same identifier, this is one more place where that can cause a problem (granted the user really shouldn't do that). Also, while a node typically knows what nodes it is talking to, it doesn't know what nodes are talking to it on the inflow side (other than by whatever info we stuff in the XID).

One characteristic of my approach which might turn out to be a weakness is that the resource IDs would essentially be numbered sequentially based on the order in which they enrolled in the transaction, so the numbering isn't really relevant in any real sense outside of it. Now theoretically, it seems that a node could tell which XIDs were initiated by it based on what unrecovered XIDs the local TM has pending, and doing a prefix comparison. So it could use that information to resolve the whole transaction branch while only sending its own corresponding XID to the parent (which it would only do if all of the known immediate subordinate XIDs were known to be available for resolution). I don't know if this holds up for cyclic graphs though - if re-entered nodes don't interally combine the transactions with the same gtid then it'd probably work fine I guess (re-entered nodes could maybe optimize such transactions by internally always using the shortest, lowest, or least recent XID bqual available that it knows about for a given gtid, instead of the one that was inflowed, which could avoid a "split-brain" kind of recovery situation).

I guess I don't know enough about the subtleties of the recovery process to say one way or the other with any authority, other than to just give a brain dump of thoughts as they arrive.
Actions

1 2 3 … Previous Next

Go to original post