13 Replies Latest reply on Oct 7, 2009 11:28 AM by marklittle

    Transaction timeouts

    timfox

      Continued from the JBM dev forum, posted by Jeff:

      http://www.jboss.org/index.html?module=bb&op=viewtopic&p=4257347#4257347

      'jmesnil' wrote:



      to sum up,

      We return false from setTransactionTimeout and use our own tx timeout handler
      We only roll back non-prepared tx when they time out (with a default tx timeout of 5 mins)
      A prepared tx will *not* be checked for timeout. It is up to the admin to decide a heuristic completion
      (we offer management operations to commit/rollback a prepared tx).

      Related to this, XAResource.recover() javadoc specifies to return prepared tx *and* heuristically
      completed tx.
      Currently, we return only the prepared tx.
      Once https://jira.jboss.org/jira/browse/HORNETQ-33 is fixed, we will be able to return the heuristically completed tx.
      I'm curious to know how the TM will handle the heuristically completed tx returned by recover()...


        • 1. Re: Transaction timeouts
          timfox

          A few questions:

          1) Why do we return false from setTransactionTimeout - shouldn't we use any tx mgr provided value?

          2) We only rollback non prepared transactions? This seems wrong to me. Transactions might be very long lived - they could last for days - we can't rollback non prepared txs.

          3) Prepared txs will not be timed out - I thought this was the whole point? Can we check this with Jonathan Halliday?

          • 2. Re: Transaction timeouts
            jmesnil

             

            "timfox" wrote:
            A few questions:

            1) Why do we return false from setTransactionTimeout - shouldn't we use any tx mgr provided value?


            You're right, since we accept the value from the TM, we must return true.

            "timfox" wrote:

            2) We only rollback non prepared transactions? This seems wrong to me. Transactions might be very long lived - they could last for days - we can't rollback non prepared txs.


            I had a discussion with Jonathan about it (see IRC logs below).
            We agreed that only unprepared tx must be rolled back as the spec prohibits an RM from unilaterally rolling back / timing out a prepared branch.
            As for long-lived transactions, Jonathan says that "tx can hold exclusive locks, so more than a few minutes has the potential to cripple throughput".
            JBoss AS default value for tx timeout is 5 mins.


            "timfox" wrote:

            3) Prepared txs will not be timed out - I thought this was the whole point? Can we check this with Jonathan Halliday?


            Yes, we agreed that prepared tx must not be timed out. Again, the spec prohibits an RM from unilaterally rolling back / timing out a prepared branch.
            Once the tx has been prepared, we must keep it and either the TM will complete it or the admin will have to heuristically complete it.
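
A minimal sketch of point 1, assuming a hypothetical holder class `TimeoutSettings` (this is not HornetQ's actual code). The two methods mirror the shape of `XAResource.setTransactionTimeout(int)` and `XAResource.getTransactionTimeout()`; the 300-second default comes from this thread, and treating zero as "reset to the RM default" follows the `XAResource` javadoc.

```java
// Hypothetical sketch, not HornetQ's implementation.
class TimeoutSettings {
    // Default discussed in this thread: 300 seconds, matching JBoss AS.
    static final int DEFAULT_TIMEOUT_SECONDS = 300;

    private int timeoutSeconds = DEFAULT_TIMEOUT_SECONDS;

    // Same shape as XAResource.setTransactionTimeout(int).
    // Returns true because the TM-provided value really is applied.
    boolean setTransactionTimeout(int seconds) {
        if (seconds < 0) {
            // A real XAResource could throw XAException(XAER_INVAL) here instead.
            return false;
        }
        // Zero asks the RM to fall back to its own default.
        timeoutSeconds = (seconds == 0) ? DEFAULT_TIMEOUT_SECONDS : seconds;
        return true;
    }

    // Same shape as XAResource.getTransactionTimeout().
    int getTransactionTimeout() {
        return timeoutSeconds;
    }
}
```

With this shape, a TM calling `setTransactionTimeout(60)` gets `true` back and a later `getTransactionTimeout()` reports 60.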


            jmesnil: jhalliday: do you have 5 mins to talk about tx timeout?
            [15:44] jmesnil: jhalliday: I'm figuring out how to handle tx timeout in hornetq
            [15:44] jmesnil: there is a bit of confusion as you can see from the dev forum
            [15:45] jmesnil: jhalliday: my understanding was that the TM informs the RM of a tx timeout. This value can be used by the RM to rollback transactions
            [15:45] jmesnil: jhalliday: imho, it makes sense to only rollback unprepared tx to avoid any heuristic completion. do you agree?
            [15:48] jhalliday: sure, the spec prohibits an RM from unilaterally rolling back / timing out a prepared branch.
            [15:48] jmesnil: jhalliday: ok, good
            [15:49] jmesnil: jhalliday: so we can't do much once a tx has been prepared. We need to keep it as is and either the TM will complete it or an admin will heuristically complete it
            [15:50] jhalliday: yup
            [15:50] jmesnil: jhalliday: what is a good sensible default value for the tx timeout? I thought something long (1 day?) could be good enough to allow long-lived tx
            [15:51] jmesnil: jhalliday: what's the default value for the TM? I found 5 mins in JBoss AS doc but I'm not sure if it is the correct default
            [15:52] jhalliday: 300 seconds in AS. tx can hold exclusive locks, so more than a few minutes has the potential to cripple throughput.
            [15:53] jmesnil: jhalliday: no wonder Pat Helland calls 2PC the "Anti-availability" protocol
            [15:53] jmesnil: jhalliday: ok, I'll use the default to keep it consistent
            [15:54] jmesnil: jhalliday: 1 last question and you can go back to haranguing management
            [15:54] jmesnil: jhalliday: XAResource.recover() must return prepared tx and heuristically completed tx
            [15:54] jhalliday: yup
            [15:54] jmesnil: jhalliday: in hornetq, we return only prepared tx at the moment
            [15:55] jmesnil: jhalliday: what does the TM do with the XID of the heuristically completed tx?
            [15:56] jhalliday: the information is matched against the tx logs to determine tx where the outcome is inconsistent and log warnings about them.
            [15:56] jmesnil: jhalliday: ok I see. So the RM must keep these XIDs until the TM tells it to forget() it, right?
            [15:57] jhalliday: yes. for tx where the unilateral decision of the RM actually happens to match that of the TM you should get a forget() more or less straight away. for the genuine heuristics it will need manual cleanup.
            [15:58] jmesnil: jhalliday: that makes sense
            [15:58] jmesnil: jhalliday: thanks for the help!
            [15:58] jhalliday: no problem
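
The recover()/forget() contract from the log above can be sketched roughly as follows. Everything here is an assumption for illustration: `BranchRegistry` and its method names are hypothetical, XIDs are plain strings rather than `javax.transaction.xa.Xid`, and nothing is persisted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not HornetQ's implementation.
class BranchRegistry {
    enum State { PREPARED, HEURISTIC_COMMIT, HEURISTIC_ROLLBACK }

    private final Map<String, State> branches = new LinkedHashMap<>();

    // Called once a branch has voted yes in prepare().
    void prepared(String xid) {
        branches.put(xid, State.PREPARED);
    }

    // Admin-driven completion of a prepared branch; the outcome is remembered.
    void heuristicComplete(String xid, boolean commit) {
        if (branches.get(xid) != State.PREPARED) {
            throw new IllegalStateException("not a prepared branch: " + xid);
        }
        branches.put(xid, commit ? State.HEURISTIC_COMMIT : State.HEURISTIC_ROLLBACK);
    }

    // recover() reports prepared AND heuristically completed branches.
    List<String> recover() {
        return new ArrayList<>(branches.keySet());
    }

    // forget() drops a heuristic record; prepared branches are left untouched.
    void forget(String xid) {
        State state = branches.get(xid);
        if (state == State.HEURISTIC_COMMIT || state == State.HEURISTIC_ROLLBACK) {
            branches.remove(xid);
        }
    }
}
```

The point of keeping the heuristic entries is that the TM can match them against its own logs; only its forget() call removes them.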

            • 3. Re: Transaction timeouts
              timfox

              I don't agree with the timeout.

              It's totally legal and perfectly normal for me, as a messaging user, to start a transaction then, say, send 1 message every hour for 3 days then commit it. What's wrong with that?

              We can't have transactions like that being rolled back.

              Also, what are these locks? I don't know of them.

              • 4. Re: Transaction timeouts
                jmesnil

                 

                "timfox" wrote:
                I don't agree with the timeout.

                It's totally legal and perfectly normal for me, as a messaging user, to start a transaction then, say, send 1 message every hour for 3 days then commit it. What's wrong with that?


                If you have a 3-day transaction, this means you will have to "freeze" all the tx participants for 3 days to preserve ACID. This could imply locks, snapshots, etc. from other participants, we don't know. From a DB pov (the most likely other participant), this seems ineffective.
                If you have such a timespan, maybe you'd be better off using something other than 2PC (e.g. compensation)

                "timfox" wrote:

                We can't have transactions like that being rolled back.


                If the user knows he will have such long tx, he will have to set an appropriate default tx timeout value (e.g. 1 week)




                • 5. Re: Transaction timeouts
                  timfox

                  I don't get it - why is this ineffective?

                  The user might have a database that they update once per transaction and that needs to be in the same tx as their send tx which might take 1 week - what is wrong with that?

                  I think it's very bad if we start rolling back users transactions for what seems like a perfectly normal use case.

                  • 6. Re: Transaction timeouts
                    timfox

                    There's also another problem with the transaction timeout implementation currently.

                    The transaction timeout should be measured from last update time, not creation time.

                    • 7. Re: Transaction timeouts
                      jmesnil

                       

                      "timfox" wrote:
                      I don't get it - why is this ineffective?


                      http://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf
                      http://blogs.msdn.com/pathelland/archive/2007/05/20/soa-and-newton-s-universe.aspx

                      "timfox" wrote:

                      The user might have a database that they update once per transaction and that needs to be in the same tx as their send tx which might take 1 week - what is wrong with that?


                      Freezing a DB snapshot for a week does not seem a good idea to me but ymmv.


                      "timfox" wrote:

                      I think it's very bad if we start rolling back users transactions for what seems like a perfectly normal use case.


                      So what options do we have?

                      * no tx timeout => we keep tx indefinitely until the TM completes them
                      * a timeout => we roll back any *unprepared* tx if they are idle for longer than the timeout

                      AIUI, JBoss AS' tx timeout is 5 mins as is Oracle WebLogic.



                      • 8. Re: Transaction timeouts
                        timfox

                        The semantics of XA timeouts seems incredibly poorly specced in both the XA and JTA specs.

                        In fact, I can find no discussion *at all* of the semantics of XA timeout in either of those specs!

                        • 9. Re: Transaction timeouts
                          jmesnil

                          to sum up (again) after irc conversation and meetings:

                          * only *unprepared* tx will be rolled back after timeout
                          * timeout is for tx lifetime, not idle time[1]. So the current code is correct to check Transaction.getCreateTime()
                          * default timeout will be set to the same value as AS (5 mins)

                          I'll reenable the tx timeout handler, change the default value and make sure that all our TX behavior is properly documented.
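
The three points above could be sketched like this. The names `Tx` and `TimeoutScanner` are hypothetical stand-ins, not the HornetQ classes; the `createTimeMillis` field plays the role of `Transaction.getCreateTime()`, and the scan is driven by an explicit clock value so the behaviour is easy to check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the agreed policy; names are illustrative only.
class Tx {
    final String id;
    final long createTimeMillis;
    boolean prepared;
    boolean rolledBack;

    Tx(String id, long createTimeMillis) {
        this.id = id;
        this.createTimeMillis = createTimeMillis;
    }
}

class TimeoutScanner {
    private final long timeoutMillis;

    TimeoutScanner(int timeoutSeconds) {
        this.timeoutMillis = timeoutSeconds * 1000L;
    }

    // Rolls back every unprepared tx whose lifetime (measured from creation,
    // not from last update) exceeds the timeout. Prepared tx are skipped:
    // they wait for the TM or for an admin decision.
    List<Tx> scan(List<Tx> txs, long nowMillis) {
        List<Tx> timedOut = new ArrayList<>();
        for (Tx tx : txs) {
            boolean expired = nowMillis - tx.createTimeMillis > timeoutMillis;
            if (expired && !tx.prepared && !tx.rolledBack) {
                tx.rolledBack = true;
                timedOut.add(tx);
            }
        }
        return timedOut;
    }
}
```

A scanner built with `new TimeoutScanner(300)` uses the AS default; a user with long-lived transactions would pass a larger value instead.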

                          [1] the only ref I found is in the OTS Current::set_timeout interface specification:
                          "If the parameter has a non-zero value n, then top-level transactions created by subsequent invocations of begin will be subject to being rolled back if they do not complete before n seconds after their creation."

                          • 10. Re: Transaction timeouts
                            marklittle

                             


                            "jmesnil" wrote:
                            We agreed that only unprepared tx must be rolled back as the spec prohibits an RM from unilaterally rolling back / timing out a prepared branch.


                            Where's it say that? XA allows it. So does JTA. They're called heuristics. You need to support it.

                            • 11. Re: Transaction timeouts
                              marklittle

                              By "it" I should have said "timing out after prepare". As I mentioned on the TS forum, if the RM and TM can fail independently then it's possible that a TM may never time out an RM (e.g., it crashes after calling prepare but before writing the log). In that case you either have to allow the RM to time out and do something itself (probably roll back), in which case you'll need to remember that choice as a heuristic rollback until forget is called, or provide a sys admin way of driving the RM through to completion (and it's still a possible heuristic).

                              • 12. Re: Transaction timeouts
                                jmesnil

                                We do it the 2nd way: we do not time out prepared branches, but we provide an admin way to heuristically complete a branch.
                                We persist the heuristic outcome until the TM tells us to forget it.

                                Is that a correct behaviour?

                                • 13. Re: Transaction timeouts
                                  marklittle

                                  That's fine. There are pros and cons with both approaches. Just make sure it's clearly documented as I can guarantee you users will complain either way ;-)