1 2 3 Previous Next 77 Replies Latest reply on Oct 31, 2011 9:02 AM by tomjenkinson

    Remote txinflow: XID changes

    tomjenkinson

      Just to get the ball rolling on this part of this design discussion, we need to consider how best to update XidImple to allow the recovery manager to detect transactions managed by this node as the status quo would leave transactions that flowed to remote servers un-recoverable as the recovery manager checks the node name encoded in the XidImple would not have the same node ID as the local server. Also, branches passed to the XAResources would be the same branch and this could have undesired behavior in the XAResources.

       

      David suggests one approach:

       

        David Lloyd wrote:

      You could however use integer packing to make more room; in such a scheme, the bqual bits which identify the node would still be an integer sequence but would work like this:

      For this to work you'd have to treat 0000 as a special padding sequence for the case where only one nybble of the last byte is used.  Thus giving:

      0000: padding

      0001-0111: resources 0-6

      1000 0000-1011 1111: resources 7-70

      1100 0000 0000-1101 1111 1111: resources 71-582

      etc.

       

       

       

      Node names are currently recorded as variable length Strings, one option is to cat the new node name onto the end of the bqual resulting in:

       

      Xid {

         formatId

         gtrid - uid + nodename

         bqual - [uid (for the sequence number) + jndi name] + [nodename of the subordinate]

      }

       

      This will keep the bquals unique between TMs and will facitilitate recovery of this transaction.

       

      To be continued...

       

       

      EDIT

      ------

       

      Davids original messages can be found here and here.

        • 1. Re: Remote txinflow: XID changes
          tomjenkinson

          OK, the proposal is for the following structure:

           

          XID {

               formatid =131077 // INCREMENTED

               gtrid = <uid> <int:originatornodename> // IMMUTABLE

               bqual = <uid> <int:nodename> <int:parentnodename> <eisname> //MUTABLE

          }

           

          javax.transaction.xa.Xid states MAXBQUALSIZE is 64, our proposal would fit in this space <uid> is 28 bytes, node name is 4 bytes apiece (total 8) meaning bytes 36 -> 64 are available for the eisname

           

          The current structure is:

          XID {

               formatid =131076

               gtrid = <uid> <US-ASCII:node name>

               bqual = <uid> <US-ASCII:eisname>

          }

           

          This change will pave the way for us to add a further recovery filter to the XAResourceOrphanFilter recovery chain so that we can check the bqual for whether this XID is owned by the transaction manager.

           

          This will necessitate a change to the way that the AS is configured as currently it is expecting to confiugure node name on us as a string, therefore I propose to get this change in ASAP (i.e. put a 4.16.0 out for Monday and notify the AS team of our intentions). It will also need to be clear in the documentation that node names *MUST* be unique.

           

          Top level XIDs would set their parent node name to \0 so we would need to make sure that node names are greater than 0. Child nodes can bump the nodename slot in the bqual up to the parentnodename slow and put their own node name in the now empty nodename slot

           

          Comments/suggestions are certainly welcome!

          • 2. Re: Remote txinflow: XID changes
            tomjenkinson

            I have created https://issues.jboss.org/browse/JBTM-916 to model this. If it passes QA I will put the release out with it in on Monday, then get to work on https://issues.jboss.org/browse/JBTM-917

            • 3. Re: Remote txinflow: XID changes
              dmlloyd

              AS currently has a notion of a string nodename which extends not only to transactions but to remoting itself.  Are you proposing introducing a second concept of node name that is integer-based?

              • 4. Re: Remote txinflow: XID changes
                dmlloyd

                Also... why is a node name required at all in the branch qualifier?  If you simply used an empty bqual ID that consists only of the numeric resource # sequence like I illustrated, wouldn't you still be able to recover?  The bqual node name information is not useful in identifying the transactions to recover as far as I can see, since in the end each transaction coordinator is going to rely on the transport layer to restore any connections and talk to the subordinate(s) to get their list of transactions, and ultimately the transaction ownership would be determined by the gtid (right?), so the bqual information doesn't appear to add anything at all and merely needs to be unique per branch.  The worst I could imagine would be that the same node might list two branches as needing recovery... is this logic more subtle than I seem to think it is?

                • 5. Re: Remote txinflow: XID changes
                  tomjenkinson

                  Certainly we are adding this for remoting in the first instance so we want it to play nicely with that!

                   

                  One issue is the diamond question, e.g.

                   

                          -> TM2

                  TM1          -> TM4

                         -> TM3

                   

                   

                  If we don't know the parent then both TM2 and TM3 will potentially try to recover TM4 for all indoubt transactions related to transactions initiated at TM1 which will have undesirable recovery results.

                   

                  The other point is the bqual must have the EIS (JNDI) name in it as we have had introduced this functionality for customers before to assist them in debugging transaction related issues so we would definitely need to maintain this. That said, you were able to get 600 resources into 3 bytes (out of the 64 available) so I dont anticipate leaving space for the name being an issue (so long as we have a delimiter).

                   

                  Finally, I maybe am not understanding your proposal correctly, a worked example would help. What I can't quite understand is how the address space of the bqual will work in a pyramid.

                   

                  e.g:

                            -> TM2 -> RM1

                  TM1   -> TM3  -> RM2

                            -> RM3

                           

                  Can you let me know what the resource number of each of the TMs and RMs are in that picture so I can understand a little more clearly?

                  • 6. Re: Remote txinflow: XID changes
                    tomjenkinson

                    Also, even with a solution to the diamond/pyramid approach (which I guess is achieved because you pack the history into the bqual) for an example such as: 1000 0000 1000 0000, how would I know whether or not this was a sequence of two sevens, and not 712 (comprised of 128 for the binary +1 for the algorithm, +528 for the start of the range).

                     

                    With our proposed approach you do not need to maintain the history of nodes in the bqual as this can be determined from the top-down recovery scanning of parent/child node names packed in the bqual resulting in a constant space use in the bqual of two ints.

                     

                    Can we restrict the remoting node names down to values that can be converted uniquely to an int

                    • 7. Re: Remote txinflow: XID changes
                      dmlloyd

                      Tom Jenkinson wrote:

                       

                      Certainly we are adding this for remoting in the first instance so we want it to play nicely with that!

                       

                      One issue is the diamond question, e.g.

                       

                              -> TM2

                      TM1          -> TM4

                             -> TM3

                       

                       

                      If we don't know the parent then both TM2 and TM3 will potentially try to recover TM4 for all indoubt transactions related to transactions initiated at TM1 which will have undesirable recovery results.

                      Okay that makes sense, though isn't the parent info in the gtid?

                       

                       

                      Tom Jenkinson wrote:

                       

                      The other point is the bqual must have the EIS (JNDI) name in it as we have had introduced this functionality for customers before to assist them in debugging transaction related issues so we would definitely need to maintain this. That said, you were able to get 600 resources into 3 bytes (out of the 64 available) so I dont anticipate leaving space for the name being an issue (so long as we have a delimiter).

                      Ah okay.  It seems odd to me not to have that info just in the gtid - I suppose it's a space issue at that point?

                       

                       

                      Tom Jenkinson wrote:

                       

                      Finally, I maybe am not understanding your proposal correctly, a worked example would help. What I can't quite understand is how the address space of the bqual will work in a pyramid.

                       

                      e.g:

                                -> TM2 -> RM1

                      TM1   -> TM3  -> RM2

                                -> RM3

                               

                      Can you let me know what the resource number of each of the TMs and RMs are in that picture so I can understand a little more clearly?

                      I think it would be like this:

                      • TM1->TM2: 0001 0000
                      • TM2->RM1: 0001 0001
                      • TM1->TM3: 0010 0000
                      • TM3->RM2: 0010 0001
                      • TM1->RM3: 0011 0000

                       

                      The 0000 are there to pad it out to a byte.  Though this isn't actually a diamond.  Trying out a couple more just for the exercise...

                       

                      In your top diamond where there are just four TMs in a diamond...

                      • TM2's prefix is 0001
                      • TM3's prefix is 0010
                      • TM4's prefix depends on whether TM2 or TM3 inflowed the work in question, it could be 0001 0001 or 0010 0001.

                       

                      Tom Jenkinson wrote:

                       

                      Also, even with a solution to the diamond/pyramid approach (which I guess is achieved because you pack the history into the bqual) for an example such as: 1000 0000 1000 0000, how would I know whether or not this was a sequence of two sevens, and not 712 (comprised of 128 for the binary +1 for the algorithm, +528 for the start of the range).

                       

                      With our proposed approach you do not need to maintain the history of nodes in the bqual as this can be determined from the top-down recovery scanning of parent/child node names packed in the bqual resulting in a constant space use in the bqual of two ints.

                       

                      Can we restrict the remoting node names down to values that can be converted uniquely to an int

                       

                      Two sevens would be: 1000 0000 1000 0000, whereas 712 would be: 1110 0000 1000 0001

                       

                      The number of leading '1' bits tells you how many additional nybbles to read.  Put another way:

                      • 0xxx
                      • 10xx xxxx
                      • 110x xxxx xxxx
                      • 1110 xxxx xxxx xxxx
                      • 1111 0xxx xxxx xxxx xxxx
                      • 1111 10xx xxxx xxxx xxxx xxxx

                      etc., thus every sequence is unique, and every row has 8 times more possible values than the previous row.

                      • 8. Re: Remote txinflow: XID changes
                        dmlloyd

                        An alternative, but equivalent, encoding would be to mark each nybble with a "more" bit - if the bit is "1" then the next nybble has more bits, with the last nybble starting with 0:

                         

                        1001 1010 1000 0001 -> 001 010 000 001

                         

                        Same idea, really.

                        • 9. Re: Remote txinflow: XID changes
                          tomjenkinson

                          David Lloyd wrote:

                           

                          Tom Jenkinson wrote:

                           

                          Certainly we are adding this for remoting in the first instance so we want it to play nicely with that!

                           

                          One issue is the diamond question, e.g.

                           

                                  -> TM2

                          TM1          -> TM4

                                 -> TM3

                           

                           

                          If we don't know the parent then both TM2 and TM3 will potentially try to recover TM4 for all indoubt transactions related to transactions initiated at TM1 which will have undesirable recovery results.

                          Okay that makes sense, though isn't the parent info in the gtid?

                           

                          Only the root TM would be in the gtrid, we can't edit that as the transaction flows.

                           

                           

                          David Lloyd wrote:

                           

                           

                          Tom Jenkinson wrote:

                           

                          The other point is the bqual must have the EIS (JNDI) name in it as we have had introduced this functionality for customers before to assist them in debugging transaction related issues so we would definitely need to maintain this. That said, you were able to get 600 resources into 3 bytes (out of the 64 available) so I dont anticipate leaving space for the name being an issue (so long as we have a delimiter).

                          Ah okay.  It seems odd to me not to have that info just in the gtid - I suppose it's a space issue at that point?

                          Again, each RM will need to put its own name in the XID so again we can't cat this into the gtrid

                           

                           

                          David Lloyd wrote:

                           

                           

                           

                          Tom Jenkinson wrote:

                           

                          Finally, I maybe am not understanding your proposal correctly, a worked example would help. What I can't quite understand is how the address space of the bqual will work in a pyramid.

                           

                          e.g:

                                    -> TM2 -> RM1

                          TM1   -> TM3  -> RM2

                                    -> RM3

                                   

                          Can you let me know what the resource number of each of the TMs and RMs are in that picture so I can understand a little more clearly?

                          I think it would be like this:

                          • TM1->TM2: 0001 0000
                          • TM2->RM1: 0001 0001
                          • TM1->TM3: 0010 0000
                          • TM3->RM2: 0010 0001
                          • TM1->RM3: 0011 0000

                           

                          The 0000 are there to pad it out to a byte.  Though this isn't actually a diamond.  Trying out a couple more just for the exercise...

                           

                          In your top diamond where there are just four TMs in a diamond...

                          • TM2's prefix is 0001
                          • TM3's prefix is 0010
                          • TM4's prefix depends on whether TM2 or TM3 inflowed the work in question, it could be 0001 0001 or 0010 0001.

                           

                          Understood - thanks!

                          David Lloyd wrote:

                           

                           

                          Tom Jenkinson wrote:

                           

                          Also, even with a solution to the diamond/pyramid approach (which I guess is achieved because you pack the history into the bqual) for an example such as: 1000 0000 1000 0000, how would I know whether or not this was a sequence of two sevens, and not 712 (comprised of 128 for the binary +1 for the algorithm, +528 for the start of the range).

                           

                          With our proposed approach you do not need to maintain the history of nodes in the bqual as this can be determined from the top-down recovery scanning of parent/child node names packed in the bqual resulting in a constant space use in the bqual of two ints.

                           

                          Can we restrict the remoting node names down to values that can be converted uniquely to an int

                           

                          Two sevens would be: 1000 0000 1000 0000, whereas 712 would be: 1110 0000 1000 0001

                           

                          The number of leading '1' bits tells you how many additional nybbles to read.  Put another way:

                          • 0xxx
                          • 10xx xxxx
                          • 110x xxxx xxxx
                          • 1110 xxxx xxxx xxxx
                          • 1111 0xxx xxxx xxxx xxxx
                          • 1111 10xx xxxx xxxx xxxx xxxx

                          etc., thus every sequence is unique, and every row has 8 times more possible values than the previous row.

                           

                          Ah, understood. I was thinking the first few bits were binary, hence 100 could have been either read as 1 (for 1 trailing nybble) or 100 (for 4). Now I get you, the prefix is the number of nybbles, like in roman numerals.

                           

                          If it is 0 terminated as you show above though, it looks like 1111 10xx xxxx xxxx xxxx xxxx should really be 1111 10xx xxxx xxxx xxxx xxxx xxxx (five trailing nybbles AFTER the nybble where the counter prefix finishes? Or should it be read as the number of "total number of nybbles (less one) that the number is read from". E.g in your last case the number has been encoded in 5 plus 1 nybbles.

                           

                          Anyway, for the purposes of the discussion I get what you are proposing, I guess the question would be when it comes to recovery, when the node recovers how does it know which XIDs it should be looking for? With my proposal where the node is configured with a node ID via the CoreEnvironmentBean it can look in the XID for a match, in your scenario how would it know?

                           

                          Plus I fear we will run out of space as I think you are encoding the full path of TMs in the XID?

                          • 10. Re: Remote txinflow: XID changes
                            dmlloyd


                            Tom Jenkinson wrote:

                             

                            Only the root TM would be in the gtrid, we can't edit that as the transaction flows.

                             

                            But for recovery we only need the root TM right?  That would eliminate the risk of a subordinate TM recovering a branch of a transaction because it would recognize that it wasn't the root?

                             

                            Tom Jenkinson wrote:

                             

                            Again, each RM will need to put its own name in the XID so again we can't cat this into the gtrid.

                            Okay, so the gtid gets the root TM's name and the bqual gets the RM name (assuming such things have names)?

                             

                             

                            Tom Jenkinson wrote:

                             

                            Ah, understood. I was thinking the first few bits were binary, hence 100 could have been either read as 1 (for 1 trailing nybble) or 100 (for 4). Now I get you, the prefix is the number of nybbles, like in roman numerals.

                             

                            If it is 0 terminated as you show above though, it looks like 1111 10xx xxxx xxxx xxxx xxxx should really be 1111 10xx xxxx xxxx xxxx xxxx xxxx (five trailing nybbles AFTER the nybble where the counter prefix finishes? Or should it be read as the number of "total number of nybbles (less one) that the number is read from". E.g in your last case the number has been encoded in 5 plus 1 nybbles.

                            Well, I meant 5 nybbles holding data however after reviewing I guess that's not really correct.  It would be more correct to say "the number of nybbles in the total encoded number = the leading number of 1s plus 1".  With 0000 being a special case.

                             

                            Tom Jenkinson wrote:

                             

                            Anyway, for the purposes of the discussion I get what you are proposing, I guess the question would be when it comes to recovery, when the node recovers how does it know which XIDs it should be looking for? With my proposal where the node is configured with a node ID via the CoreEnvironmentBean it can look in the XID for a match, in your scenario how would it know?

                            Assuming that transactions should only be recovered by the root node which originated it, we would just use gtid to identify the "ownership" of the transaction.  Otherwise, all the known unrecovered XIDs from the node itself plus its subordinates should be propagated back upon recovery request.  I'd also need some kind of loop detection at the protocol level though.

                             

                            We could potentially optimize this process by using our knowledge of the XID format to filter the XID list by owner name before sending it back to the requestor.

                            • 11. Re: Remote txinflow: XID changes
                              dmlloyd

                              Tom Jenkinson wrote:

                               

                              Plus I fear we will run out of space as I think you are encoding the full path of TMs in the XID?

                              Just the numeric path, which is why I recommended a compact notation.  But yeah ultimately the tree depth and breadth is limited by the amount of bqual space using my scheme.

                              • 12. Re: Remote txinflow: XID changes
                                tomjenkinson

                                I think I understand your notation approach now and can understand how it might work (assuming we can determine a delimiter to use to seperate the path from the eis name [typically the JNDI name]).

                                 

                                That said, and with the restriction in mind that you have outlined below, are you sure that managing node names as ints rather than strings is an overhead to the deployer that is "totally unacceptable" (my words, not yours)? I have attached a "git diff" of the transaction work which shows where I would anticipate the AS changes required, you will note that currently the node name is being hard coded as "1", I have just hard coded it as 1 instead in my patch but I would suggest removing the hardcoding of the name from the Java code and hard code it in the XML instead to make it clear that a user must change it if they have more than one node.

                                 

                                As of last night I have (or indeed, had, as I did it before this part of the conversation resumed) put back a fix to TS that converts the XID with the parent and child encoded in the bqual, I can easily back this out and go with your approach but I see at least the following advantage to mine:

                                 

                                The number of nodes in the tree is not bounded by 64 bytes as we only encode the node of the subordinate and its parent

                                 

                                The drawback is that an administrator would need to configure each AS with a remoting node name, plus a TS integer node name.

                                David Lloyd wrote:

                                 

                                Tom Jenkinson wrote:

                                 

                                Plus I fear we will run out of space as I think you are encoding the full path of TMs in the XID?

                                Just the numeric path, which is why I recommended a compact notation.  But yeah ultimately the tree depth and breadth is limited by the amount of bqual space using my scheme.

                                 

                                Just trying to get some definitive clarity to prevent me having to rewrite this thing again. But if you go with my current approach "We could potentially optimize this process by using our knowledge of the XID format to filter the XID list by owner name before sending it back to the requestor." will be handled by the XID.

                                • 13. Re: Remote txinflow: XID changes
                                  tomjenkinson

                                  git diff attached sorry!

                                  • 14. Re: Remote txinflow: XID changes
                                    dmlloyd

                                    Tom Jenkinson wrote:

                                     

                                    I think I understand your notation approach now and can understand how it might work (assuming we can determine a delimiter to use to seperate the path from the eis name [typically the JNDI name]).

                                     

                                    That said, and with the restriction in mind that you have outlined below, are you sure that managing node names as ints rather than strings is an overhead to the deployer that is "totally unacceptable" (my words, not yours)? I have attached a "git diff" of the transaction work which shows where I would anticipate the AS changes required, you will note that currently the node name is being hard coded as "1", I have just hard coded it as 1 instead in my patch but I would suggest removing the hardcoding of the name from the Java code and hard code it in the XML instead to make it clear that a user must change it if they have more than one node.

                                    When I say "node name" I'm generally referring to the JBossAS/Remoting concept of a name for the instance itself, which is by default derived from the host name of the system (but is modifiable in the event of, say, multiple instances running on one host).  But this node name is basically of arbitrary length, so encoding it into bqual (as-is) is not an option no matter what.  That said, it would be nice if TS and AS had the same notion of node name.

                                     

                                    As far as int IDs go, forcing the user to define a unique integer ID for each possible participant is indeed probably too great an administrative/management burden, and it's not terribly user-friendly for reading purposes either; at least, I don't relish selling the idea to support or the community (nor trying to explain it for that matter).  But, my approach was based on the sole requirement that any node in the hierarchy should be able to independently generate an ID which is definitely unique globally for that gtid.  So it doesn't really take into account other requirements, such as puting human-readable debug info in the XID, in the first place.

                                     

                                     

                                    Tom Jenkinson wrote:

                                     

                                    As of last night I have (or indeed, had, as I did it before this part of the conversation resumed) put back a fix to TS that converts the XID with the parent and child encoded in the bqual, I can easily back this out and go with your approach but I see at least the following advantage to mine:

                                     

                                    The number of nodes in the tree is not bounded by 64 bytes as we only encode the node of the subordinate and its parent

                                     

                                    The drawback is that an administrator would need to configure each AS with a remoting node name, plus a TS integer node name.

                                    Well... that could still be a viable approach even with string names, maybe.  With 64 bytes to fill but only one or two names to fill it with, it seems somewhat more realistic than trying to store an entire chain of names (even if the name ends up being an abbreviated form of the system node name).  Recovery can't really take advantage of that information as robustly though (to use that information for recovery you'd have to restore not only all the nodes but all the original paths between them, even if they are redundant), so other than debug purposes, there doesn't seem to be a lot of advantage.  One other possible weakness is that if two nodes have the same identifier, this is one more place where that can cause a problem (granted the user really shouldn't do that).  Also, while a node typically knows what nodes it is talking to, it doesn't know what nodes are talking to it on the inflow side (other than by whatever info we stuff in the XID).

                                     

                                    One characteristic of my approach which might turn out to be a weakness is that the resource IDs would essentially be numbered sequentially based on the order in which they enrolled in the transaction, so the numbering isn't really relevant in any real sense outside of it.  Now theoretically, it seems that a node could tell which XIDs were initiated by it based on what unrecovered XIDs the local TM has pending, and doing a prefix comparison.  So it could use that information to resolve the whole transaction branch while only sending its own corresponding XID to the parent (which it would only do if all of the known immediate subordinate XIDs were known to be available for resolution).  I don't know if this holds up for cyclic graphs though - if re-entered nodes don't interally combine the transactions with the same gtid then it'd probably work fine I guess (re-entered nodes could maybe optimize such transactions by internally always using the shortest, lowest, or least recent XID bqual available that it knows about for a given gtid, instead of the one that was inflowed, which could avoid a "split-brain" kind of recovery situation).

                                     

                                    I guess I don't know enough about the subtleties of the recovery process to say one way or the other with any authority, other than to just give a brain dump of thoughts as they arrive.

                                    1 2 3 Previous Next