80 Replies. Latest reply on Jan 27, 2010 4:51 PM by marklittle
      • 45. Re: Jboss transaction recovery issue
        scarceller

        > yes, the app server code is basically a plugin for the XARecoveryModule. The other rec modules don't instantiate XAResources directly - they rely on XARecoveryModule to do it for them. Hence they don't need plugins.

         

        This now leads me to ask a few more questions; please excuse me if they are dumb:

         

        1) Earlier I asked why I never see a commit in the XARecoveryModule. That was when I questioned this code from the XARecoveryModule method xaRecovery(XAResource xares):

        ---------------------------------

        if (doRecovery)
        {
            if (jtaLogger.loggerI18N.isInfoEnabled())
            {
                jtaLogger.loggerI18N.info(
                        "com.arjuna.ats.internal.jta.recovery.info.rollingback",
                        new Object[] { XAHelper.xidToString((Xid) xids[j]) });
            }

            if (!transactionLog((Xid) xids[j]))
                xares.rollback((Xid) xids[j]);
            else
            {
                /*
                 * Ignore it as the transaction system
                 * will recovery it eventually.
                 */
            }
        }

        -------------------------

         

        I was questioning why the XARecoveryModule only seems to handle rollback of an xid, with no code that I could see for handling a commit. I also know from looking at my TRACE log that I'm going through the empty else leg that does nothing. The trace also gives me the GlobalTranID, and it matches the in-doubt transaction in the Oracle DB that must be committed.

         

        Now stay with me here: you then mentioned that the commit will be handled by the AtomicActionRecoveryModule, and in my TRACE log I do indeed see the Atomic recovery module also trying to deal with an xid. I assume it's the one that needs to replay the phase2Commit.

         

        Now for the really dumb question: if indeed the AtomicActionRecoveryModule needs to replay the phase2Commit() how will it be able to do this without a connection? Am I missing something here?

         

        Thanks

        • 46. Re: Jboss transaction recovery issue
          jhalliday

          > if indeed the AtomicActionRecoveryModule needs to replay the phase2Commit() how will it be able to do this without a connection?

           

          It's not concerned with that detail. It invokes replay on the tx, the tx invokes topLevelCommit on the resource records in turn. The XAResourceRecord is the bit that needs to rewire the actual XAResource implementation in order to perform that commit, which it does by calling into the XARecoveryModule. See XAResourceRecord.getNewXAResource()
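The call chain described above can be sketched roughly as follows. This is a simplified illustration and not the JBossTS source: `CommittableResource`, `ResourceLookup`, `RecoveredRecord` and `RecoveredTransaction` are hypothetical stand-ins for the XAResource, the XARecoveryModule, the XAResourceRecord and the recovered transaction, and the fake resource exists only so the sketch runs without a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for a live XAResource: only the call needed here. */
interface CommittableResource {
    void commit(String xid);
}

/** Hypothetical stand-in for the XARecoveryModule's role as supplier of fresh resources. */
interface ResourceLookup {
    Optional<CommittableResource> findResourceFor(String xid);
}

/** Hypothetical stand-in for an XAResourceRecord restored from the transaction log. */
final class RecoveredRecord {
    private final String xid;
    private final ResourceLookup lookup;

    RecoveredRecord(String xid, ResourceLookup lookup) {
        this.xid = xid;
        this.lookup = lookup;
    }

    /** Returns true if the commit was delivered, false if no live resource was available. */
    boolean topLevelCommit() {
        Optional<CommittableResource> resource = lookup.findResourceFor(xid);
        if (!resource.isPresent()) {
            return false; // leave the record for a later recovery pass
        }
        resource.get().commit(xid);
        return true;
    }
}

/** Hypothetical stand-in for the transaction that the recovery module replays. */
final class RecoveredTransaction {
    private final List<RecoveredRecord> records;

    RecoveredTransaction(List<RecoveredRecord> records) {
        this.records = records;
    }

    /** Replays phase two and returns how many records still need another pass. */
    int replayPhase2() {
        int pending = 0;
        for (RecoveredRecord record : records) {
            if (!record.topLevelCommit()) {
                pending++;
            }
        }
        return pending;
    }
}

final class ReplayChainSketch {
    public static void main(String[] args) {
        List<String> committed = new ArrayList<>();
        CommittableResource fake = xid -> committed.add(xid);
        // Only "xid-1" has a reachable resource in this demo.
        ResourceLookup lookup = xid -> "xid-1".equals(xid)
                ? Optional.of(fake)
                : Optional.<CommittableResource>empty();

        RecoveredTransaction tx = new RecoveredTransaction(Arrays.asList(
                new RecoveredRecord("xid-1", lookup),
                new RecoveredRecord("xid-2", lookup)));

        System.out.println("records still pending: " + tx.replayPhase2());
        System.out.println("committed: " + committed);
    }
}
```

The point of the shape is that the recovery module driving the replay never touches a connection itself; only the record asks for a resource, and it asks at the moment it needs one.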

          • 47. Re: Jboss transaction recovery issue
            marklittle

            OK let's start this by working under the assumption that the scenario Andrew outlined is the one that matches your case. If it isn't or, say, the XAResource is returning XA_RETRY, then we either need to clarify the scenario or we have a problem elsewhere.

             

            Next, let's not use the "million flies" argument. Yes it may well be the case that other application servers you've tested "get this right", but that doesn't mean it's the right thing for them to do. However, neither does it mean that there is no issue in our code, which is why we're trying to get to the bottom of this. Having information about how others behave is a good data point, but it doesn't necessarily indicate a solution.

             

            With that said, let's agree that heuristic outcomes are bad. They break the A in ACID and cannot be resolved automatically. The fact that something could try to do so is actually a bad thing: a heuristic outcome means that the RM did something that went against the true outcome of the transaction. It could have happened just now, or it could have happened hours ago. In either case it's possible that some other application could then have made decisions based on the data that this RM did (or didn't) commit, when in fact that data is erroneous. In which case we're potentially in a cascading rollback scenario, where in order to resolve the issue we (or something) have to chase down numerous applications and correct their data as well. So a coordinator opaquely hiding heuristic decisions is not doing you, your applications or your data any favours.

             

            What you'll find throughout the JBossTS codebase is that we try very hard to fail safe and avoid heuristics if at all possible. So for instance, if the first resource we tell to commit throws a heuristic rollback then we'll move the transaction into a rollback phase and try to roll back all of the other participants so that the outcome of them all is rollback, thus avoiding the heuristic outcome. But in some cases it's not possible, and where there's any doubt as to the outcome of the transaction we try to give as much information as possible and let the administrator take over where it makes sense (hopefully you'll see that resolving heuristics really needs an understanding of the semantics of the application).
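To make the "first resource rolls back" case concrete, here is a minimal sketch of that fail-safe idea. It is an illustration under stated assumptions, not the JBossTS implementation: `Participant`, `ResourceRolledBack`, `RecordingParticipant` and the `Outcome` labels are hypothetical names, and real participants would be XAResources that have already voted to commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical signal that a resource rolled back on its own instead of committing. */
final class ResourceRolledBack extends Exception {
    ResourceRolledBack(String name) {
        super(name + " rolled back heuristically");
    }
}

/** Hypothetical participant in phase two. */
interface Participant {
    void commit() throws ResourceRolledBack;
    void rollback();
}

/** Test double that records what it was asked to do. */
final class RecordingParticipant implements Participant {
    private final String name;
    private final boolean rollsBackOnCommit;
    private final List<String> log;

    RecordingParticipant(String name, boolean rollsBackOnCommit, List<String> log) {
        this.name = name;
        this.rollsBackOnCommit = rollsBackOnCommit;
        this.log = log;
    }

    @Override
    public void commit() throws ResourceRolledBack {
        if (rollsBackOnCommit) {
            throw new ResourceRolledBack(name);
        }
        log.add(name + ":commit");
    }

    @Override
    public void rollback() {
        log.add(name + ":rollback");
    }
}

final class FailSafeCommitSketch {
    enum Outcome { COMMITTED, ROLLED_BACK, HEURISTIC_MIXED }

    static Outcome commitAll(List<Participant> participants) {
        boolean anyCommitted = false;
        boolean anyRolledBack = false;
        for (int i = 0; i < participants.size(); i++) {
            try {
                participants.get(i).commit();
                anyCommitted = true;
            } catch (ResourceRolledBack e) {
                if (!anyCommitted) {
                    // Nothing has committed yet, so everyone can still agree on rollback.
                    for (Participant other : participants.subList(i + 1, participants.size())) {
                        other.rollback();
                    }
                    return Outcome.ROLLED_BACK;
                }
                // Too late to switch: at least one participant already committed.
                anyRolledBack = true;
            }
        }
        return anyRolledBack ? Outcome.HEURISTIC_MIXED : Outcome.COMMITTED;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        Outcome outcome = commitAll(Arrays.asList(
                new RecordingParticipant("rm1", true, log),
                new RecordingParticipant("rm2", false, log)));
        System.out.println(outcome + " " + log);
    }
}
```

When a later participant fails after an earlier one has committed, the sketch can only report a mixed outcome, which is the kind of case where an administrator has to step in.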

             

            So if we look at what we do when a crash occurs during the commit call on XAResource, the above may make sense. I'll go through each of the error codes that can legally be placed within the XAException and explain why we do what we do. Then you can maybe let me know what the other implementations are doing differently.

             

            • XAException.XA_HEURHAZ: well this one is easy ;-)
            • XAException.XA_HEURCOM: ditto.
            • XAException.XA_HEURRB, XAException.XA_RB*: the resource has rolled back, so we're in trouble.
            • XAException.XAER_RMERR and XAException.XAER_PROTO: the XA specification is pretty clear on these and that we have to consider them to have rolled back.
            • XAException.XA_HEURMIX: another easy one.
            • XAException.XAER_NOTA: if we get this during recovery then we assume a previous call to commit worked and ignore. Otherwise the RM is saying that it doesn't know about a transaction we know about so there's a discrepancy there and we fail safe and assume a hazard.
            • XAException.XA_RETRY: here we could try again immediately, but instead we rely on recovery to kick in periodically when it checks the log. If your RM is returning this and we are not replaying the transaction on this RM then there is an issue.
            • XAException.XAER_INVAL and XAException.XAER_RMFAIL: there's another potential discrepancy here and no guarantee that retrying will do any good. Therefore we assume a hazard and let the admin tidy this up. Now there's an argument to be had that in some situations it may be OK to retry on an XAER_RMFAIL, but if that's the case then XA_RETRY really could have been thrown in the first place.

             

            And that's it. No other error codes are permitted by the standard.
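For anyone who wants the list above in one place, it can be read as a lookup from error code to outcome. The sketch below is one reading of this post, not a copy of the JBossTS code: the `Outcome` labels and the `classify` method are hypothetical, and treating XA_HEURCOM on commit as a normal completion is an assumption, since the list only calls it easy. The `XAException` constants themselves are the standard ones from `javax.transaction.xa`.

```java
import javax.transaction.xa.XAException;

final class CommitErrorSketch {
    /** Hypothetical outcome labels for a failed commit() call on an XAResource. */
    enum Outcome { FINISH_OK, RETRY_VIA_RECOVERY, HEURISTIC_ROLLBACK, HEURISTIC_MIXED, HEURISTIC_HAZARD }

    static Outcome classify(int errorCode, boolean duringRecovery) {
        // XA_RB* is a contiguous range of rollback codes.
        if (errorCode >= XAException.XA_RBBASE && errorCode <= XAException.XA_RBEND) {
            return Outcome.HEURISTIC_ROLLBACK;
        }
        switch (errorCode) {
            case XAException.XA_HEURCOM:
                // Assumption: the resource committed, which matches the commit decision.
                return Outcome.FINISH_OK;
            case XAException.XA_HEURRB:
            case XAException.XAER_RMERR:
            case XAException.XAER_PROTO:
                // The resource has rolled back, or must be considered to have done so.
                return Outcome.HEURISTIC_ROLLBACK;
            case XAException.XA_HEURMIX:
                return Outcome.HEURISTIC_MIXED;
            case XAException.XAER_NOTA:
                // During recovery, assume an earlier commit call already worked.
                return duringRecovery ? Outcome.FINISH_OK : Outcome.HEURISTIC_HAZARD;
            case XAException.XA_RETRY:
                // No immediate retry; periodic recovery picks it up from the log.
                return Outcome.RETRY_VIA_RECOVERY;
            case XAException.XA_HEURHAZ:
            case XAException.XAER_INVAL:
            case XAException.XAER_RMFAIL:
            default:
                // Fail safe: anything unclear is a hazard for the administrator.
                return Outcome.HEURISTIC_HAZARD;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(XAException.XA_RETRY, false));
        System.out.println(classify(XAException.XAER_NOTA, true));
        System.out.println(classify(XAException.XAER_RMFAIL, false));
    }
}
```

The `default` branch falling through to a hazard is a choice made for the sketch; it follows the fail-safe approach described above, not any documented rule.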

             

            Therefore this comes down to what error code your RM is returning and in what situation. If you can let us know then maybe we can suggest a solution. Then again, as I said at the start, maybe this isn't your scenario and we're looking at the wrong area of the code.

            • 48. Re: Jboss transaction recovery issue
              marklittle

              I see that Jonathan covered a lot of the same ground (teaches me to spend a while writing the post).

               

              If you're going to instrument code then you may want to look at XAResourceRecord (two copies, one in JTA and one in JTAX - you want the former if you're using local transactions and the latter if using JTS). Specifically the topLevelCommit call which I assume is the one being hit when you kill your db and from which the RM is driven through commit. Actually if there is a problem during the commit invocation on your RM then you should see a warning of the form: commit on {0} ({1}) failed with exception {2}, where {0} is the transaction id, {1} is the RM and {2} is the XAException error code. I don't recall seeing those in the log you provided, but I may have missed them so I'll recheck myself later.

              • 49. Re: Jboss transaction recovery issue
                jhalliday

                > Actually if there is a problem during the commit invocation on your RM then you should see a warning...

                 

                you're still a page or two behind, or you missed the bit with:

                 

                   grep the logs on "com.arjuna.ats.internal.jta.resources.arjunacore.commit"

                 

                which is the key prefix for those warnings.
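If grep isn't to hand, the same filter is easy to sketch in Java. This is only a convenience sketch: the class and method names are made up for illustration, the log file path is whatever you pass on the command line, and the file is assumed to be readable as UTF-8.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

final class CommitWarningFilterSketch {
    /** Key prefix quoted in this thread for the commit warnings. */
    static final String COMMIT_WARNING_KEY =
            "com.arjuna.ats.internal.jta.resources.arjunacore.commit";

    /** Keeps only the lines that contain the given key, in their original order. */
    static List<String> matching(List<String> lines, String key) {
        return lines.stream()
                .filter(line -> line.contains(key))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java CommitWarningFilterSketch.java <log file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        matching(lines, COMMIT_WARNING_KEY).forEach(System.out::println);
    }
}
```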

                 

                >  I don't recall seeing those in the log you provided

                 

                because it's for the run after the crash/restart, not the one before where the error would occur. Try to keep up ;-)

                • 50. Re: Jboss transaction recovery issue
                  marklittle

                  Aaahhh! ;-)

                  • 51. Re: Jboss transaction recovery issue
                    scarceller

                    Mark,

                     

                    Most of my logs are from after the problem happens because during the test phase I only have INFO default logging as I want max throughput.

                     

                    Will I have enough info in the logs with just INFO logging?

                     

                    I'll retry the test case, save the entire log and try to find those log statements. I'll need some time to do this as I'm sidetracked for a while, maybe until early next week, but I'll see what I can do before EOD Friday.

                     

                    -------------------------------

                    But at the same time I can tell you that the AS and TM are simply stuck in a loop trying to recover the xid in the 2nd DB. I know this because the xid printed by the XARecoveryModule matches the ID in Oracle.

                     

                    I also see the AtomicRecovery module trying to replay phase2Commit(); it says so right in the logs I already sent you. I'm simply wondering if there is a way to debug and see what's going on in the AtomicRecovery during this replay phase. I'm only speculating here, but I think the AtomicRecovery module may be trying to commit this in-doubt, yet I have nothing in the log to indicate whether it did or didn't. Could you give this section of processing some thought and take a quick look in the logs I sent? Just look for "recoverAtomicAction.replayPhase2 recovering"; you'll also see "basicAction.phase2Commit()". It's evident AtomicRecovery is trying to recover something. Is it in any way possible that it can't get a connection to the DB to perform the recovery from within AtomicRecovery? It's just a hunch on my part.

                     

                    Also keep in mind that if the ObjectStore does not have the matching xid for the in-doubt then the XARecoveryModule does a rollback on the xid in Oracle just fine. Meaning if I stop the AS, delete the ObjectStore and then restart the AS, the XARecoveryModule takes over and does the rollback. Of course this is the wrong action to take because the 1st DB has the Order committed and the 2nd DB was instructed to roll it back. But this is still worth considering and I have tried it. The real problem is that I have NEVER seen the AtomicRecovery commit an in-doubt.

                    • 52. Re: Jboss transaction recovery issue
                      marklittle

                      No worries about getting the information: just whenever you can. Yes, if the commit call fails as I mentioned earlier, you should see warnings, so INFO will be fine.

                       

                      I looked through the last log you sent and saw:

                       

                      Could not find new XAResource to use for recovering non-serializable XAResource < 131075, 29, 27, 4945455110253555757521025898549958529853555049485658521005056455110253555757521025898549958529853555049485658521005257 >

                       

                      This means the reference to the XAResource in the log is not valid, so the XARecoveryModule needs to kick in and provide a replacement. That seems to happen because later in the log when recovery retries for those 3 transactions the warning no longer occurs. However, the important thing is that in both cases (initial when there's an invalid XAResource and subsequently when there's a new instance) the state of the transaction is that it's in a heuristic situation:

                       

                      10:48:27,781 DEBUG [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_23] - HeuristicList - Unpacked a 463 record

                      10:48:27,781 DEBUG [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_25] - Restored action status of ActionStatus.COMMITTED 7

                      10:48:27,781 DEBUG [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_26] - Restored action type Top-level

                      10:48:27,781 DEBUG [arjLoggerI18N] [com.arjuna.ats.arjuna.coordinator.BasicAction_27] - Restored heuristic decision of TwoPhaseOutcome.HEURISTIC_HAZARD

                       

                      Which means that recovery won't happen, even with a new XAResource instance, and eventually the logs will be moved if the only participant entry is related to the heuristic.

                       

                      So the question remains: what caused the transaction to get into a heuristic situation? You've seen enough discussions so far to show how and why the heuristic could occur, but it relies on getting that data from you on what value is being returned from the initial commit call on the XAResource. Once we have that we should be able to make progress.

                      • 53. Re: Jboss transaction recovery issue

                        Mark,

                        Could I consider the problem may be Jboss TS's bug or issue now? Please answer this question explicitly.

                        I think the problem is widespread in Jboss TS and not specific to a particular DB driver or AS.

                        I hope you can run the test case yourself; the source code of my test case is attached.

                        Run the test:

                        (1) You need standalone Jboss TS with JacOrb 2.2.1.

                        (2) You need two Oracle servers.

                        (3) Specify the Oracle connection information in the constructors of classes Participant1 and Participant2.

                        (4) Add a breakpoint in class quq.ots.example.resource.OracleResource at line 117 in method commit().

                        (5) Debug the class Coordinator (in Eclipse).

                        (6) The second time the breakpoint is hit, stop the corresponding Oracle server.

                        (7) Restart the Oracle server stopped in step 6; "select * from dba_2pc_pending" will then show the uncompleted transaction.

                        (8) Restart the Recovery service (because the reconnection can't be completed successfully).

                        (9) The transaction has disappeared from the view dba_2pc_pending, and in fact it was rolled back.

                         

                        Thank you!

                        • 54. Re: Jboss transaction recovery issue
                          marklittle
                          First, there was no attachment. Second, although we try to do our best with answering questions on the forums it's only ever best effort because we're usually pretty busy supporting customers with guaranteed SLAs. Third, it seems that although you raised this issue first it's been taken over more by others, so we either need to open another forum entry or you need to confirm that your situation is exactly the same as the other one: otherwise there'll be far too much confusion here.
                          • 55. Re: Jboss transaction recovery issue
                            marklittle
                            Oh and in answer to your "Could I consider the problem may be Jboss TS's bug or issue now? Please answer this question explicitly." it's pretty simple: you are always free to consider this to be a bug/issue with JBossTS but there is no evidence to support that position at this point. That's what we're trying to work through here.
                            • 56. Re: Jboss transaction recovery issue
                              scarceller

                              > XAException.XAER_INVAL and XAException.XAER_RMFAIL: there's another potential discrepancy here and no guarantee that retrying will do any good. Therefore we assume a hazard and let the admin tidy this up. Now there's an argument to be had that in some situations it may be OK to retry on an XAER_RMFAIL, but if that's the case then XA_RETRY really could have been thrown in the first place.

                               

                              Mark, in the earlier post you gave very good details on how the TM handles exceptions. Above are your details for those specific exceptions, and I understand the reasoning behind your exception processing:

                               

                              • Could you explain how the admin will tidy things up?
                              • Can a JBoss Admin easily force the TM replay of this xid?
                              • Does JBoss have some GUI or command to do this?
                              • Or do you mean the DB admin resolves this?

                               

                              Thanks.

                              • 57. Re: Jboss transaction recovery issue
                                marklittle

                                There's an ObjectStore Browser/Viewer in the distribution. The intent is that you use this (unless other tools are available) to view a transaction log and from there you can drive the recovery of a log manually.

                                 

                                However, I think we'd still be interested to know what code your XAResource implementation is returning and in what situation. So if you do get that data let us know.

                                • 58. Re: Jboss transaction recovery issue
                                  scarceller

                                  quqtalk wrote:

                                   

                                  Hi, all:

                                       My scenario:

                                  [image: gif_1.gif]

                                  Above is my scenario. As we all know, in the two-phase commit process:

                                  The first step, prepare the XAResources one by one.

                                  The second, commit the XAResources in order.

                                  In commit step, I did these:

                                  There are two XAResources in the transaction. After the commit of oracleResource_1 completed, I stopped the database server at Machine_3 before the commit of oracleResource_2 started. Then there will be an exception, and this transaction need recovery.

                                  Because oracleResource_1 is committed, its data is persisted to the file system and it is not possible to roll back. In other words, to keep the transaction's ACID properties, oracleResource_2 must be committed too in the recovery process. However in the XARecoveryModule the operation is rollback.

                                  Is this a bug, or do I understand it wrong?

                                  Above you said:

                                  > Then there will be an exception, and this transaction need recovery.

                                  Do you know what exception it is? something like XAException.XAER_???

                                   

                                  Then you also said:

                                  > However in the XARecoveryModule the operation is rollback.

                                  How do you know it's a rollback?

                                  1. Are you saying the 2nd resource gets rolled back? (This would clearly be the wrong action to take.)
                                  2. Or are you just seeing something like this in the log:

                                  13:41:20,343 DEBUG [loggerI18N] [com.arjuna.ats.internal.jta.recovery.info.rollingback] Rolling back < 131075, 29, 27, 4945455110253555757521025898549958529853555049485658521005056455110253555757521025898549958529853555049485658521005257 >
                                  but the in-doubt in the 2nd resource is NOT rolled back.

                                   

                                  I hope it's not the first case as this would not be good. In the second case that log entry is simply a misplaced log statement in XARecoveryModule.java; I actually looked at the source code and found this. See post #18: it shows where that message is being logged, and right after it is logged you see an if/else whose else leg is empty. I think you are going into the else leg that does nothing even though the log message was written indicating rollback. I found this to be the case in my scenario.
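To show what I mean about the log statement, here is a small sketch of the same decision with the message written inside each branch, so the log reflects what was actually done. This is not the XARecoveryModule code and not a proposed patch: `scan`, `Rollbackable` and the message texts are hypothetical, and the set of xids stands in for the transactionLog(xid) check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RecoveryScanLoggingSketch {
    /** Hypothetical stand-in for xares.rollback(xid). */
    interface Rollbackable {
        void rollback(String xid);
    }

    /**
     * Walks the in-doubt xids reported by a resource. An xid with no matching
     * transaction log is rolled back; one with a log is left alone so that
     * transaction recovery can complete it. Returns the messages a logger
     * would have written.
     */
    static List<String> scan(List<String> inDoubtXids,
                             Set<String> xidsWithTransactionLog,
                             Rollbackable resource) {
        List<String> messages = new ArrayList<>();
        for (String xid : inDoubtXids) {
            if (!xidsWithTransactionLog.contains(xid)) {
                messages.add("Rolling back " + xid);
                resource.rollback(xid);
            } else {
                messages.add("Leaving " + xid + " to transaction recovery");
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        List<String> rolledBack = new ArrayList<>();
        List<String> messages = scan(
                Arrays.asList("xid-a", "xid-b"),
                new HashSet<>(Arrays.asList("xid-b")),
                rolledBack::add);
        messages.forEach(System.out::println);
        System.out.println("actually rolled back: " + rolledBack);
    }
}
```

With the message inside the branch, a "Rolling back" line in the log can only appear when rollback was really called on the resource.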

                                   

                                  So it's important for you to let us know what happens to the in-doubt in the 2nd resource: does it get rolled back, or does it simply stay in an in-doubt state forever?

                                   

                                  Hope this helps.

                                  • 59. Re: Jboss transaction recovery issue
                                    marklittle
                                    I should point out that the ObjectStore Browser is something Jonathan and the team are looking to replace. It's an old Swing app from 2000/2001 and a bit clunky these days.