19 Replies Latest reply on May 29, 2019 11:06 AM by ochaloup

    XTS inbound bridge fails to commit during crash recovery

    ochaloup

      Hi,

       

      I have a question as a follow-up to the fix of the issue [JBTM-3079] InboundBridge recovery aborts live transactions (Pull Request #1387). I would like to check this particularly with jhalliday, if he has time to verify.

       

      When applying the patch I was curious why the testcase does not contain a test for the commit scenario, like this one: inbount txbridge - adding commt crashrec test · ochaloup/narayana@d4ee0b8.

      The idea of the test is the same as in the other scenarios, which in my understanding is:

      • A test client is deployed on WFLY
      • The testcase calls the client
      • The client starts a JTA transaction
      • The client invokes a web service call to WFLY
      • The web service does some work - in this case it enlists a mock TestXAResource in the business method
      • The call comes back to the client, which commits the transaction
      • The 2PC starts
      • The prepare phase passes (there are a BridgeDurableParticipant, a BridgeVolatileParticipant and a TestXAResource)
      • The commit phase starts and the JVM crashes
      • The recovery is expected to commit the transaction, but the transaction is rolled back instead
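The expectation in the last two steps can be illustrated with a minimal sketch. It assumes a toy in-memory log in place of the real object store; `SketchResource` and `CommitCrashScenario` are hypothetical names and are not the TestXAResource or the crash recovery test from the pull request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the test's mock resource; not the real TestXAResource.
class SketchResource {
    String state = "ACTIVE";
    void prepare()  { state = "PREPARED"; }
    void commit()   { state = "COMMITTED"; }
    void rollback() { state = "ROLLED_BACK"; }
}

class CommitCrashScenario {
    // Stands in for the transaction object store that survives the JVM crash.
    static final List<SketchResource> preparedLog = new ArrayList<>();

    // Phase 1 passes and is logged; the "crash" happens before phase 2 runs.
    static void runUntilCrash(SketchResource r) {
        r.prepare();
        preparedLog.add(r);
        // JVM crash simulated here: commit() is never called.
    }

    // What recovery is expected to do in this scenario: a prepared, logged
    // participant of a transaction that entered the commit phase is committed.
    static void expectedRecovery() {
        for (SketchResource r : preparedLog) r.commit();
        preparedLog.clear();
    }
}
```

The failing test corresponds to recovery calling `rollback()` where this sketch calls `commit()`.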

       

      I tried to investigate what's happening: the ATParticipantRecoveryModule runs the ParticipantEngine during recovery

      Thread [Periodic Recovery] (Suspended)
      ParticipantEngine.recovery() line: 315
      ATParticipantRecoveryRecord.activate() line: 68
      XTSATRecoveryManagerImple.recoverParticipants() line: 219
      ATParticipantRecoveryModule.processParticipantsStatus() line: 265
      ATParticipantRecoveryModule.periodicWorkSecondPass() line: 137
      PeriodicRecovery.doWorkInternal() line: 816
      PeriodicRecovery.run() line: 382

      As the state is prepared, the WS call sendPrepared() is executed (narayana/ParticipantEngine.java at 5.9.0.Final · jbosstm/narayana · GitHub -> narayana/ParticipantEngine.java at 5.9.0.Final · jbosstm/narayana · GitHub), which ends at CoordinatorProcessorImpl#prepared. As there was no "activation" of the txn id for the CoordinatorProcessor, no coordinator is found (narayana/CoordinatorProcessorImpl.java at 5.9.0.Final), which ends in a rollback (narayana/CoordinatorProcessorImpl.java at 5.9.0.Final).

       

      It sounds like some activation is missing for the txn bridge subordinate participant. For example, the AT participant is "reactivated" for the CoordinatorProcessor in restoreState with the id here - narayana/ParticipantStub.java at 5.9.0.Final. Or should the reactivation be done differently, e.g. at the place where the participant is loaded? Or have I misunderstood something in the test?
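The effect of a missing activation can be shown with a simplified model. This is a sketch under assumptions: `CoordinatorRegistrySketch` and its methods are hypothetical and do not mirror the CoordinatorProcessorImpl API, and the reply for a known id assumes the coordinator had already decided to commit.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of a coordinator-side lookup table.
class CoordinatorRegistrySketch {
    private final Map<String, String> active = new ConcurrentHashMap<>();

    // "Activation": make the coordinator reachable under the participant id.
    void activate(String participantId, String coordinatorName) {
        active.put(participantId, coordinatorName);
    }

    // Reply to a participant that re-sends "prepared" during recovery.
    String onPrepared(String participantId) {
        // Unknown id: no coordinator is found, so the participant is told to roll back.
        // Known id: assumed here to have a commit decision already.
        return active.containsKey(participantId) ? "COMMIT" : "ROLLBACK";
    }
}
```

In this model a recovered participant whose id was never re-registered always receives a rollback, which matches the observed outcome.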

       

      Thanks

      Ondra

        • 1. Re: XTS inbound bridge fails to commit during crash recovery
          ochaloup

          I created a pull request to Narayana which contains the new commit test that fails:

          Adding inbound txbridge commit crashrec test by ochaloup · Pull Request #1391 · jbosstm/narayana · GitHub

          • 2. Re: XTS inbound bridge fails to commit during crash recovery
            ochaloup

            I investigated this further and I think there is an issue in txbridge which harms data consistency. The trouble occurs when 1PC is used.

            When a JTA transaction is started, the inbound bridge is used to pass the transaction to the WS call. The WS call runs some business processing where XA resources are used. The initial call from the tx bridge sees the WS call as a single resource, so 1PC is used. But the WS call can work with multiple resources on its behalf.

            JTA started -> inbound bridge -> WS call -> business method -> using several XAResources

            The trouble is that com.arjuna.mwlabs.wscf.model.twophase.arjunacore.ParticipantRecord#topLevelOnePhaseCommit invokes com.arjuna.mwlabs.wst.at.participants.DurableTwoPhaseCommitParticipant#confirmOnePhase, which does not just call "confirm()" but runs both phases, prepare and confirm. After prepare has finished, no data is stored to the object store. That means that if the "confirm" method crashes, there is no data that could be used during recovery.
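Why the missing log entry matters can be sketched as follows. This is a toy model under stated assumptions: `OnePhaseSketch`, its in-memory `objectStore` and the simulated crash are hypothetical and do not reproduce the DurableTwoPhaseCommitParticipant code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model; names do not match the Narayana classes.
class OnePhaseSketch {
    // Stands in for the object store: ids with a persisted "prepared" record.
    final Set<String> objectStore = new HashSet<>();
    boolean crashInConfirm = false;

    // Mirrors the described behaviour: both phases run, nothing is persisted in between.
    void confirmOnePhaseUnlogged(String id) {
        prepare(id);
        confirm(id);
    }

    // Variant that persists the prepared state before confirming.
    void confirmOnePhaseLogged(String id) {
        prepare(id);
        objectStore.add(id);
        confirm(id);
        objectStore.remove(id);
    }

    private void prepare(String id) { /* participant votes prepared */ }

    private void confirm(String id) {
        if (crashInConfirm) throw new IllegalStateException("simulated crash in confirm");
    }

    // Recovery can only act on what was persisted.
    boolean recoverable(String id) { return objectStore.contains(id); }
}
```

With `crashInConfirm` set, only the logged variant leaves a record that recovery could pick up.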

             

            I don't understand whether this behaviour was intentional but overlooked the trouble with 1PC, or whether it was intentional and works this way for some reason. I dug around in the code, reading comments etc., but I can't find an answer.

             

            Would you have some pointers for me, tomjenkinson, mmusgrov, jhalliday? Thanks.

            • 3. Re: XTS inbound bridge fails to commit during crash recovery
              tomjenkinson
              • 4. Re: XTS inbound bridge fails to commit during crash recovery
                ochaloup

                Thanks tomjenkinson for the links and ideas.

                 

                It gives me some hints on how to look into the issue. I continued with the investigation, and now I can see that this is not only an issue for the txbridge but a general issue for XTS. There is already an open issue, [JBTM-396] Provide one-phase commit optimization for WS-TX - JBoss Issue Tracker, which says the 1PC optimization for XTS should be implemented. But it does not say that the current state causes data inconsistency.

                Interestingly, this is mentioned in the xtstest as well: narayana/README at 5.9.5.Final · jbosstm/narayana · GitHub

                 

                I found that the XTS tests run a test similar to this one: narayana/SingleParticipantPrepareAndCommitTest.java at 5.9.5.Final · jbosstm/narayana · GitHub

                It is run in the crash recovery tests by: narayana/TestATCrashDuringOnePhaseCommit.java at 5.9.5.Final · jbosstm/narayana · GitHub

                And the test behaves the same way - it crashes during commit and recovery rolls it back. As there are no other resources involved in the processing, the test passes. (In fact the test itself just checks that the resource was about to be committed and then recovered. It does not consider the outcome of the recovery.)

                Anyway, I think the code under DurableTwoPhaseCommitParticipant.confirmOnePhase (narayana/DurableTwoPhaseCommitParticipant.java at 5.9.5.Final · jbosstm/narayana · GitHub ) is wrong. It processes the two phases (see that the method contains a call to prepare() and then, just a few lines below, a call to commit()) but it processes them without saving state to the object store. The DurableParticipantStub then calls sendPrepare and shortly afterwards sendCommit, but without persisting the state. I think the ATCoordinator should not permit the 1PC optimization in general.

                 

                I think a solution similar to the one you used in [JBTM-2916] Disable dynamic1PC for subordinate transactions - JBoss Issue Tracker should be used: 1PC should be permitted only when pendingList.size() == 1 (narayana/BasicAction.java at 5.9.5.Final · jbosstm/narayana · GitHub ) and the participant record permits 1PC.
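The proposed condition can be written as a small guard. This is a sketch only: `RecordSketch` and `permitsOnePhase()` are hypothetical names for illustration and are not part of the BasicAction or participant record API.

```java
import java.util.List;

// Hypothetical types for illustration; not the BasicAction or ParticipantRecord API.
interface RecordSketch {
    boolean permitsOnePhase();
}

class OnePhaseGuardSketch {
    // 1PC only when there is exactly one pending record and that record opts in.
    static boolean mayUseOnePhase(List<RecordSketch> pendingList) {
        return pendingList.size() == 1 && pendingList.get(0).permitsOnePhase();
    }
}
```

Under this guard, a record that internally fans out to several resources could simply return false and force the full 2PC with logging.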

                 

                 

                I wonder whether the same problem could also be an issue for JTA EJB remote XA calls. I need to check.

                 

                Stacktrace of the call for the XTS 1PC call:

                Daemon System Thread [TaskWorker-1] (Suspended (breakpoint at line 81 in DurableTwoPhaseCommitParticipant))    
                owns: ATCoordinator  (id=1415)    
                owns: ActivityImple  (id=1416)    
                DurableTwoPhaseCommitParticipant.prepare() line: 81    
                DurableTwoPhaseCommitParticipant.confirmOnePhase() line: 263    
                ParticipantRecord.topLevelOnePhaseCommit() line: 427    
                ATCoordinator(BasicAction).onePhaseCommit(boolean) line: 2386    
                ATCoordinator(BasicAction).End(boolean) line: 1497    
                ATCoordinator(TwoPhaseCoordinator).end(boolean) line: 96    
                CoordinatorControl.complete(CompletionStatus) line: 137    
                TwoPhaseHLSImple.complete(CompletionStatus) line: 130    
                ActivityImple.end(CompletionStatus) line: 289    
                UserActivityImple.end(CompletionStatus) line: 261    
                CoordinatorServiceImple.confirm() line: 156    
                CompletionCoordinatorImple.commit() line: 41    
                CompletionCoordinatorProcessorImpl.commit(Notification, MAP, ArjunaContext) line: 84    
                CompletionCoordinatorPortTypeImpl$1.executeTask() line: 58    
                TaskWorker.run() line: 63    
                Thread.run() line: 748    
                • 5. Re: XTS inbound bridge fails to commit during crash recovery
                  tomjenkinson

                  ochaloup  wrote:

                   

                   

                  I wonder whether the same problem could also be an issue for JTA EJB remote XA calls. I need to check.

                   

                  I might be wrong but I think the intention was to blanket disable the 1PC for subordinate JTA transactions, does that match with what you are seeing?

                   

                  ochaloup  wrote:

                   

                  I think a solution similar to the one you used in [JBTM-2916] Disable dynamic1PC for subordinate transactions - JBoss Issue Tracker should be used: 1PC should be permitted only when pendingList.size() == 1 (narayana/BasicAction.java at 5.9.5.Final · jbosstm/narayana · GitHub ) and the participant record permits 1PC.

                   

                   

                  You could also reach out to Andrew, Paul, Jonathan or Mark for background on any historic XTS decisions.

                  • 6. Re: XTS inbound bridge fails to commit during crash recovery
                    ochaloup

                    Thanks Tom, I have two findings from my investigation.

                     

                    Let's summarize the scenario I work with. It's basically the same for both JTA EJB remoting and the WS txbridge. A client makes a remote call to a second server, and the transaction context is propagated along with the call. The second server runs a subordinate transaction in which two XAResources are in use. As the client knows about only one resource, it goes with 1PC. The second server runs 2PC as there are two XAResources. A failure happens while commit is being called on one of the XAResources at the second server.

                     

                    In the JTA case I tested the following failures:

                    • The JVM of the second server crashes. The JVM crash is reported as an error for the 1PC on the client. The client saves this as a heuristic outcome of the transaction in the transaction object store. When the server is restarted, it contains a record of the prepared XAResource.
                      Now it's up to the administrator to resolve the heuristic manually and decide whether the XAResource on the server should be committed. This sounds like correct behaviour.
                    • The XAResource commit throws an XAException.XAER_RMFAIL. In that case the XAResource.commit is recorded in the second server's object store as failed, but the transaction outcome is reported back to the client as success. After that there is a subordinate XAResource which is stored "for infinity" at the second server. It waits for an administrator to finish it, but the administrator receives no information that this is needed. This sounds a bit incorrect to me. But as I study the code, this behaviour is pretty much wired into Narayana. This is the stacktrace[1].
                      What do you think about this case, tomjenkinson, mmusgrov?
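The second failure can be modelled with the standard `javax.transaction.xa.XAException` class from the JDK. The surrounding class is a hypothetical sketch of the described behaviour and is not the SubordinateAtomicAction or XAResourceRecord code.

```java
import javax.transaction.xa.XAException;

// Hypothetical subordinate-side commit handling; a sketch, not Narayana code.
class RmFailSketch {
    boolean keptForRecovery = false;

    // Stand-in for a resource whose commit fails with XAER_RMFAIL.
    static void failingCommit() throws XAException {
        throw new XAException(XAException.XAER_RMFAIL);
    }

    // Returns the outcome reported back to the caller.
    String commitSubordinate() {
        try {
            failingCommit();
            return "COMMITTED";
        } catch (XAException e) {
            if (e.errorCode == XAException.XAER_RMFAIL) {
                // Resource manager unavailable: keep the record so the commit can be finished later.
                keptForRecovery = true;
                // Described behaviour: the caller still sees success.
                return "COMMITTED";
            }
            return "HEURISTIC";
        }
    }
}
```

The sketch makes the gap visible: the subordinate keeps an unfinished record while the caller has no indication that anything is left to resolve.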

                     

                    In case of the XTS it's interesting that even for the 1PC two phases are run: narayana/DurableTwoPhaseCommitParticipant.java at 5.9.5.Final · jbosstm/narayana · GitHub

                    On top of that the commit error is just logged[2] (or narayana/ParticipantProcessorImpl.java at 5.9.5.Final · jbosstm/narayana · GitHub, Throwable is just logged).

                    jhalliday adinn would you have some idea about this? Thanks.

                     

                    [1]

                    at org.jboss.as.test.jbossts.common.TestXAResource.commit(TestXAResource.java)
                    at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelCommit(XAResourceRecord.java:473)
                    at com.arjuna.ats.arjuna.coordinator.BasicAction.doCommit(BasicAction.java:2892)
                    at com.arjuna.ats.arjuna.coordinator.BasicAction.doCommit(BasicAction.java:2808)
                    at com.arjuna.ats.arjuna.coordinator.BasicAction.phase2Commit(BasicAction.java:1873)
                    at com.arjuna.ats.arjuna.coordinator.BasicAction.End(BasicAction.java:1529)
                    at com.arjuna.ats.internal.jta.transaction.arjunacore.subordinate.SubordinateAtomicAction.doOnePhaseCommit(SubordinateAtomicAction.java:244)
                    at com.arjuna.ats.internal.jta.transaction.arjunacore.subordinate.TransactionImple.doOnePhaseCommit(TransactionImple.java:259)
                    at org.wildfly.transaction.client.provider.jboss.JBossLocalTransactionProvider$Entry.commit(JBossLocalTransactionProvider.java:500)
                    at org.wildfly.transaction.client.provider.remoting.TransactionServerChannel.lambda$handleXaTxnCommit$7(TransactionServerChannel.java:624)
                    at org.wildfly.security.auth.server.SecurityIdentity.runAsConsumer(SecurityIdentity.java:361)
                    at org.wildfly.transaction.client.provider.remoting.TransactionServerChannel.handleXaTxnCommit(TransactionServerChannel.java:615)
                    at org.wildfly.transaction.client.provider.remoting.TransactionServerChannel$ReceiverImpl.handleMessage(TransactionServerChannel.java:124)
                    at org.jboss.remoting3.remote.RemoteConnectionChannel.lambda$handleMessageData$3(RemoteConnectionChannel.java:430)
                    at org.jboss.remoting3.EndpointImpl$TrackingExecutor.lambda$execute$0(EndpointImpl.java:949)
                    at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
                    at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1982)
                    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1486)
                    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1377)
                    at java.lang.Thread.run(Thread.java:748)

                     

                    [2]

                    ARJUNA016141: Error committing transaction 'TransactionImple <jca-subordinate, basicaction::ffff0a280527:-4db02b44:5cc6e3f3:23,status:actionstatus.committed="">' for xid: <131080,29,64,0000000000-1-11040539-7879-44-6892-58-29-130002749,0000000000000000000000000000000000000000000000000000000000000000>
                    at com.arjuna.ats.internal.jta.transaction.arjunacore.jca.XATerminatorImple.commit(XATerminatorImple.java:109)
                    at org.jboss.jbossts.txbridge.inbound.BridgeDurableParticipant.commit(BridgeDurableParticipant.java:205)
                    at com.arjuna.wst11.messaging.engines.ParticipantEngine.executeCommit(ParticipantEngine.java:576)
                    at com.arjuna.wst11.messaging.engines.ParticipantEngine.commit(ParticipantEngine.java:149)
                    at com.arjuna.wst11.messaging.ParticipantProcessorImpl.commit(ParticipantProcessorImpl.java:99)
                    at com.arjuna.webservices11.wsat.sei.ParticipantPortTypeImpl$2.executeTask(ParticipantPortTypeImpl.java:84)
                    at com.arjuna.services.framework.task.TaskWorker.run(TaskWorker.java:63)
                    at java.lang.Thread.run(Thread.java:748)
                    • 7. Re: XTS inbound bridge fails to commit during crash recovery
                      ochaloup

                      I might be wrong but I think the intention was to blanket disable the 1PC for subordinate JTA transactions, does that match with what you are seeing?

                      tomjenkinson No, I don't observe the 1PC being disabled. In case of a failure of the 1PC, the transaction is expected to be stored with a heuristic outcome in the Narayana object store.

                      • 8. Re: XTS inbound bridge fails to commit during crash recovery
                        tomjenkinson

                        ochaloup  wrote:

                         

                        I might be wrong but I think the intention was to blanket disable the 1PC for subordinate JTA transactions, does that match with what you are seeing?

                        tomjenkinson  No, I don't observe the 1PC being disabled. In case of a failure of the 1PC, the transaction is expected to be stored with a heuristic outcome in the Narayana object store.

                        Looking into it some more, I think this was just for the case where something returns RDONLY: the optimization is disabled in that case.

                        • 9. Re: XTS inbound bridge fails to commit during crash recovery
                          tomjenkinson

                          ochaloup  wrote:

                           

                           

                          In the JTA case I tested the following failures:

                          • The JVM of the second server crashes. The JVM crash is reported as an error for the 1PC on the client. The client saves this as a heuristic outcome of the transaction in the transaction object store. When the server is restarted, it contains a record of the prepared XAResource.
                            Now it's up to the administrator to resolve the heuristic manually and decide whether the XAResource on the server should be committed. This sounds like correct behaviour.
                          • The XAResource commit throws an XAException.XAER_RMFAIL. In that case the XAResource.commit is recorded in the second server's object store as failed, but the transaction outcome is reported back to the client as success. After that there is a subordinate XAResource which is stored "for infinity" at the second server. It waits for an administrator to finish it, but the administrator receives no information that this is needed. This sounds a bit incorrect to me. But as I study the code, this behaviour is pretty much wired into Narayana. This is the stacktrace[1].
                            What do you think about this case, tomjenkinson, mmusgrov?

                           

                           

                          Why does this happen for infinity? Won't the periodic recovery on the second server eventually commit the transaction?

                          • 10. Re: XTS inbound bridge fails to commit during crash recovery
                            ochaloup

                            tomjenkinson that's a pretty good point!

                            I thought it was obvious: the second server manages a subordinate transaction, so it won't be finished by the second server on its own. Only the client can force the subordinate transaction on the second server to finish.

                            But then I realized that's not all. The client should run the orphan detection. It should find there is an unfinished transaction on the remote server and it should invoke the rollback on it. But this does not happen.

                             

                            After some more investigation I realized that's because of how WFTC implements the registry for XA remote subordinate resources. A file system registry is used, which tells WFTC about any unfinished remote resource. Such a resource is then announced to the periodic recovery so that orphan detection can run. The client receives success on commit (as explained above). Because of that it removes the record (wildfly-transaction-client/SubordinateXAResource.java at 1.1.3.Final · wildfly/wildfly-transaction-client · GitHub) and the recovery obtains no XAResource to work with during the second phase (bottom-up recovery).
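The registry behaviour described above can be sketched with an in-memory set. This is an assumption-laden model: `RemoteRegistrySketch` and its methods are hypothetical, and the real WFTC registry is file based and has a different API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a registry of unfinished remote resources; not the WFTC API.
class RemoteRegistrySketch {
    private final Set<String> inDoubt = new HashSet<>();

    void resourceEnlisted(String xid) { inDoubt.add(xid); }

    // The record is dropped as soon as the remote commit reports success.
    void commitReturned(String xid, boolean success) {
        if (success) inDoubt.remove(xid);
    }

    // What periodic recovery would be offered for orphan detection.
    Set<String> resourcesForRecovery() { return new HashSet<>(inDoubt); }
}
```

Once `commitReturned(xid, true)` has run, `resourcesForRecovery()` is empty, so no rollback or commit is ever driven from the client side.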

                             

                            Anyway, running the rollback through orphan detection would cause data inconsistency. The current behaviour, even though I didn't expect it, seems mostly correct.

                             

                            Still, it does not seem fully correct that the client does not mark the transaction with a heuristic outcome.

                            • 11. Re: XTS inbound bridge fails to commit during crash recovery
                              tomjenkinson

                              ochaloup  wrote:

                               

                              tomjenkinson  that's a pretty good point!

                              I thought it was obvious: the second server manages a subordinate transaction, so it won't be finished by the second server on its own. Only the client can force the subordinate transaction on the second server to finish.

                              But then I realized that's not all. The client should run the orphan detection. It should find there is an unfinished transaction on the remote server and it should invoke the rollback on it. But this does not happen.

                               

                              After some more investigation I realized that's because of how WFTC implements the registry for XA remote subordinate resources. A file system registry is used, which tells WFTC about any unfinished remote resource. Such a resource is then announced to the periodic recovery so that orphan detection can run. The client receives success on commit (as explained above). Because of that it removes the record (wildfly-transaction-client/SubordinateXAResource.java at 1.1.3.Final · wildfly/wildfly-transaction-client · GitHub) and the recovery obtains no XAResource to work with during the second phase (bottom-up recovery).

                               

                              Anyway, running the rollback through orphan detection would cause data inconsistency. The current behaviour, even though I didn't expect it, seems mostly correct.

                               

                              Still, it does not seem fully correct that the client does not mark the transaction with a heuristic outcome.

                              I was thinking that the second server would commit the second resource but I think the discussion is getting a little lost now. Maybe you can provide an sdedit sequence diagram we can talk through?

                              • 12. Re: XTS inbound bridge fails to commit during crash recovery
                                ochaloup

                                I was thinking that the second server would commit the second resource but I think the discussion is getting a little lost now. Maybe you can provide an sdedit sequence diagram we can talk through?

                                 

                                tomjenkinson I struggled to create the diagram in sdedit. I decided to provide a diagram in draw.io.

                                The second resource won't be committed, as there is only one XAResource at the client server and 1PC is run. There is no record saved at the client side, so recovery has no information with which to run the commit.

                                • 13. Re: XTS inbound bridge fails to commit during crash recovery
                                  tomjenkinson

                                  Great - thanks for the diagram! Can you give me edit permissions and the original file? In the meantime, can you also show on the diagram why it is heuristic - for example, add the EJB proxy returning RMERR, or whatever code is coming back on the failure arrow?

                                   

                                  Some possible solutions are to use the tooling on the client app server to resolve the heuristic. Also, I think on the second server it should be possible to make periodic recovery commit that second resource if it knows that the XAR1 committed and the object store is prepared.

                                   

                                  Anyway, let's start with the diagram having the status codes on there so I can be sure of what we are talking about. Thanks!

                                  • 14. Re: XTS inbound bridge fails to commit during crash recovery
                                    tomjenkinson

                                    And you did explain why the first server wouldn't store the heuristic.

                                     

                                    Maybe what we should do is make it so that whenever EJB remoting is used, 1PC is always disabled (or maybe the recovery optimization would help, but as it prepares, it would need some way to know that the transaction was committed with 1PC and can be cleaned up).
