12 Replies Latest reply on Apr 1, 2015 12:54 PM by mmusgrov

    Issue using REST-AT

    ajcmartins

      Hi, I have the following scenario based on the quickstarts:

      • two applications, appA and appB, each running on its own WildFly 8.2 server (narayana-rts 5.0.0-Final) with the rts subsystem enabled
      • appA is the client application running on the server that also acts as the transaction/recovery coordinator
      • appB is the server application that contains a durable participant
      • appA starts a transaction and also enlists itself as participant of that transaction
      • appA invokes appB with the enlistURL obtained from the rest-at coordinator
      • appB creates a durable work unit and enlists it with the URL provided in the invocation
      • appA commits the transaction
      • the transaction coordinator sends the prepare state to appB
      • appB votes with Prepared
      • the transaction coordinator sends the prepare state to appA
      • at this point I force-kill the appB server
      • the transaction coordinator sends the commit state to appB, which of course fails
      • the transaction coordinator sends the commit state to appA, which succeeds
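
      For anyone following along, the sequence above can be modelled in a few lines of plain Java. This is only a toy sketch under stated assumptions: the class TwoPhaseScenario, its Participant type and the afterPrepare hook are made up for illustration and are not Narayana or REST-AT APIs. It shows how a participant that votes Prepared and then disappears leaves the coordinator with a commit it still has to replay.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the scenario; none of these types are Narayana classes.
class TwoPhaseScenario {
    enum Vote { PREPARED, ABORTED }

    /** Hypothetical participant that can be "killed" between the two phases. */
    static class Participant {
        final String name;
        boolean alive = true;
        boolean committed = false;

        Participant(String name) { this.name = name; }

        Vote prepare() { return alive ? Vote.PREPARED : Vote.ABORTED; }

        void commit() {
            if (!alive) throw new IllegalStateException(name + " is unreachable");
            committed = true;
        }
    }

    /**
     * Runs prepare on every participant, then commit on every participant.
     * Returns the names of participants whose commit failed and which
     * would therefore have to be replayed later by recovery.
     */
    static List<String> commit(List<Participant> participants, Runnable afterPrepare) {
        for (Participant p : participants) {
            if (p.prepare() != Vote.PREPARED) {
                throw new IllegalStateException(p.name + " did not vote Prepared");
            }
        }
        afterPrepare.run(); // hook used to simulate a crash after the prepare phase
        List<String> pending = new ArrayList<>();
        for (Participant p : participants) {
            try {
                p.commit();
            } catch (IllegalStateException e) {
                pending.add(p.name); // commit could not be delivered
            }
        }
        return pending;
    }
}
```

      Passing a hook that sets appB.alive = false reproduces the situation in the list: appA commits, while appB stays in the pending list until it comes back.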

       

      Now the issue is that after I restart the appB server, the transaction doesn't recover. As soon as the pending participant information is loaded from disk, it is immediately aborted with a rollback.

      I tracked this action to the method that synchronizes the participant info with the recovery coordinator at system startup. It is a PUT operation that invokes replaceParticipant on the transaction/recovery coordinator, and that call is returning a 404.

      After everything finishes loading, I can see in the logs the transaction/recovery coordinator's periodic recovery system trying to do restore_state -> commit on appB. This of course fails because the participant info was deleted during startup...
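
      To make the startup behaviour described above concrete, here is a minimal sketch in plain Java. It is an illustration only: StartupSync, RecoveryCoordinator and syncOnStartup are hypothetical names, the coordinator is reduced to a function returning an HTTP status, and the real REST-AT code may be structured quite differently.

```java
import java.util.Set;

// Toy model of the startup synchronization; not the REST-AT implementation.
class StartupSync {
    enum Outcome { KEPT, ROLLED_BACK }

    /** Hypothetical stand-in for the PUT sent to the recovery coordinator; returns an HTTP status. */
    interface RecoveryCoordinator {
        int replaceParticipant(String recoveryUrl, String participantUrl);
    }

    /**
     * Re-registers one persisted participant with the coordinator.
     * If the coordinator answers 404, the participant is treated as
     * orphaned: it is rolled back and its persisted record is removed.
     */
    static Outcome syncOnStartup(RecoveryCoordinator coordinator,
                                 String recoveryUrl,
                                 String participantUrl,
                                 Set<String> persistedParticipants) {
        int status = coordinator.replaceParticipant(recoveryUrl, participantUrl);
        if (status == 404) {
            persistedParticipants.remove(recoveryUrl); // record is gone from now on
            return Outcome.ROLLED_BACK;
        }
        return Outcome.KEPT;
    }
}
```

      Once the record has been removed this way, a later commit replayed by the coordinator has nothing left to apply to, which matches the failure seen in the logs.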


      Is anyone aware whether this is a bug that existed/exists, or is it just me missing something?


      Cheers,

        • 1. Re: Issue using REST-AT
          gytis

          Hello,

           

          do you have log files with trace level enabled for org.jboss.jbossts.star and org.jboss.narayana.rest categories? That would be really helpful.

           

          Also, which quickstart did you use?

           

          Thanks,

          Gytis

          • 2. Re: Issue using REST-AT
            ajcmartins

            Hello and thanks for your reply. Please take a look at the logs and see if that helps.

            What I am doing is based on the recovery2 quickstart.

             

            Thank you,

            • 3. Re: Issue using REST-AT
              ajcmartins

              OK Gytis, at this point I am almost sure it's a bug.

              If I add a step and restart server1 (which contains the transaction/recovery coordinator) before bringing server2 back up, then everything recovers correctly. It seems that something isn't being updated correctly after the commit fails, the same something that is updated when the coordinator restarts.

               

              I just need someone with better insight to confirm this, and you look like that person.

               

              Thanks,

              • 4. Re: Issue using REST-AT
                gytis

                Thanks for the update.

                I see that server2 fails to contact the recovery coordinator. Leave this with me and I'll try to figure it out. I'll let you know once I have something.

                 

                Thanks,

                Gytis

                • 5. Re: Issue using REST-AT
                  tomjenkinson

                  Can you produce a simple test that replicates the issue? It would really help us diagnose it if we have something to fire up.

                   

                  Thanks,

                  Tom

                  • 6. Re: Issue using REST-AT
                    ajcmartins

                    Hello Tom, I am unable at this point to provide that kind of test. Nevertheless, since I suspected that restarting was what fixed the problem, I went ahead, downloaded the code from GitHub and applied a small patch that solved the issue I was experiencing.

                     

                    I added the following code just after the line at: narayana/Coordinator.java at master · jbosstm/narayana · GitHub

                    recoveringTransactions = getRecoveringTransactions(transactions);

                     

                    I don't actually know if it's valid or how negative the impact may be in other situations. But maybe it helps shed light on what may be happening?

                     

                    Thanks,

                    • 7. Re: Issue using REST-AT
                      mmusgrov

                      I took a look at the logs and it appears that the application itself is aborting the transaction. Here is what I can glean from the logs you uploaded:

                       

                      Server B did know about the participant (appB) and wrote its state into persistent storage before the crash (see the message with timestamp 2015-03-26 10:25:18,298 in server2.log).


                      The coordinator on server 1 has logged the transaction and is now trying to replay it on appB, but is receiving a 404 because server B does not know about the participant (this is the message with timestamp 2015-03-26 10:28:53,044 in server1.log).

                       

                      Since server B did persist details about appB, something must have removed the persistent log for it. This happened at timestamp 2015-03-26 10:27:05,024 in server2.log:

                       

                      2015-03-26 10:27:05,024 INFO  [services.iap.subscriptions.service.tx.UpdateSubscriptionWork] (MSC service thread 1-3) Aborting transaction..

                       

                      This is coming from a thread inside App B itself. Can you take a look at your application code and figure out under what conditions it will abort the transaction branch? I.e., I don't think it is REST-AT framework code that is issuing the abort request.

                      • 8. Re: Issue using REST-AT
                        ajcmartins

                        Hello Michael,

                         

                        As I said in my first post, the rollback happens when server B restarts, during local recovery system startup. During this flow, the local recovery system tries to sync/update its info on the transaction/recovery coordinator (server A), which in turn answers with a 404 stating that the transaction doesn't exist. The code that does this is at:

                        The log of this invocation on server A is:

                        2015-03-26 10:27:05,001 TRACE [org.jboss.jbossts.star.service.Coordinator] (default task-17) coordinator: replace: recovery-coordinator/0_ffffc0a801f1_62c1ef98_5513de56_49?URL=http://192.168.3.227:8180/rest-at-participant/0:ffffc0a801f1:-1c33e0e9:5513de66:19

                         

                        Now the problem (and, in my understanding, the bug) is that the transaction coordinator answers this operation with a 404 despite knowing about the transaction, since it keeps trying to replay it.

                         

                        Thanks,

                         

                        P.S. The "Aborting transaction" message may be misleading. It is just a log statement in the rollback method implementation of the participant interface. It should be read as "Received rollback callback".

                        • 9. Re: Issue using REST-AT
                          mmusgrov

                          mmusgrov wrote:

                           

                          This is coming from a thread inside App B itself. Can you take a look at your application code and figure out under what conditions it will abort the transaction branch? I.e., I don't think it is REST-AT framework code that is issuing the abort request.

                          Ah wait. There is some code in recreateParticipantInformation that will do the abort. Let me look further into what's happening ...

                          • 10. Re: Issue using REST-AT
                            mmusgrov

                            ajcmartins wrote:

                             

                            As I said in my first post, the rollback happens when server B restarts, during local recovery system startup. During this flow, the local recovery system tries to sync/update its info on the transaction/recovery coordinator (server A), which in turn answers with a 404 stating that the transaction doesn't exist. The code that does this is at:

                             

                            P.S. The "Aborting transaction" message may be misleading. It is just a log statement in the rollback method implementation of the participant interface. It should be read as "Received rollback callback".

                            When server A looks for the transaction id, it first searches for it in memory, and if it isn't there it looks for it on disk. Here it fails to find it in either place, so yes, I agree it does look like a bug in the coordinator. Is there any other log output from server A when it does this check? (It should be at about timestamp 10:27:04,807, in amongst the replay requests.) I think I will probably need to put a debugger on it to figure out why it's not found. Since I will first need to recreate your issue, it may take a couple of days.


                            You also said that calling recoveringTransactions = getRecoveringTransactions(transactions); fixed the problem. This is good but puzzling, since the replaceParticipant method in the coordinator already does this if the transaction isn't in memory, which is another reason why we probably need to debug it.
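
                            The lookup order described in this reply (memory first, then disk, otherwise 404) can be sketched as follows. This is a simplified model, not the actual Coordinator code: CoordinatorLookup, the map-based stores and the loadFromDisk supplier are assumptions made purely for illustration, and the sketch does not attempt to explain why the real lookup missed.

```java
import java.util.Map;
import java.util.function.Supplier;

// Toy model of the coordinator-side lookup; not the Narayana Coordinator class.
class CoordinatorLookup {
    /**
     * Replaces the participant URL registered for a transaction.
     * Looks in the in-memory map first and only falls back to the
     * (hypothetical) disk loader when the id is not found there.
     * Returns 200 when the transaction was found and updated, else 404.
     */
    static int replaceParticipant(String txId,
                                  String newParticipantUrl,
                                  Map<String, String> inMemory,
                                  Supplier<Map<String, String>> loadFromDisk) {
        Map<String, String> target = inMemory;
        if (!target.containsKey(txId)) {
            target = loadFromDisk.get(); // fallback: transactions recovered from the log
            if (!target.containsKey(txId)) {
                return 404; // unknown both in memory and on disk
            }
        }
        target.put(txId, newParticipantUrl);
        return 200;
    }
}
```

                            In this model a 404 is only possible when both stores miss, which is why a coordinator that keeps replaying the transaction while answering 404 looks inconsistent.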

                            • 11. Re: Issue using REST-AT
                              ajcmartins

                              Great! I am glad that I could at least help point you in the right direction.

                               

                              Cheers,

                              • 12. Re: Issue using REST-AT
                                mmusgrov

                                Gytis has a fix. You can track our progress via JIRA: [JBTM-2356] REST-AT recovery failure - JBoss Issue Tracker