      • 30. failure situation using cluster-wide lock
        manik

        Also, I was working off the 4.2.x branch, which includes the latest commits for 4.2.1.  I was wondering whether any of these were relevant to your issue. https://issues.jboss.org/browse/ISPN-845 may be - it has to do with read-only transactions rolling back and only shows up with certain transaction managers.  We didn't spot it with the DummyTM either.

         

        I'll publish a snapshot of 4.2.1 that you can play with (4.2.1-SNAPSHOT in Maven); see if it helps.
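
        For reference, pulling the snapshot in via Maven would look something like this (the coordinates are the usual Infinispan core artifact; use whichever JBoss snapshot repository you normally do):

```xml
<dependency>
   <groupId>org.infinispan</groupId>
   <artifactId>infinispan-core</artifactId>
   <version>4.2.1-SNAPSHOT</version>
</dependency>
```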

        • 31. failure situation using cluster-wide lock
          yelin666

          I noticed a snapshot build published on 1/11 around noon, and your commits for the above ticket went in on 1/7, so I assume that build contains your changes. Please confirm that it's the right one to try.

          • 32. failure situation using cluster-wide lock
            manik

            Yes, that's correct.

            • 33. Re: failure situation using cluster-wide lock
              yelin666

              Manik,

               

              I tried the 4.2.1-SNAPSHOT build with the JBoss TX. It may be working better, but it's not completely fixed. I added more output to my code with information on the lock owner at different points, as in my attached code. The following is an output snippet from the failure. Basically, it always happened following some TX warnings: typically a warning triggered a rollback, and while the rollback itself completed successfully, the lock was not released afterwards. Hopefully the output can give you more hints on the problem. Meanwhile, I'll run a test with the Geronimo TX used inside our container and update you on that.
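
              For reference, the loop producing the output below is roughly the following (a minimal sketch reconstructed from that output; lockOwner() is a hypothetical helper that peeks at Infinispan's internal LockManager, and the attached code may differ):

```java
import javax.transaction.TransactionManager;
import org.infinispan.Cache;
import org.infinispan.util.concurrent.locks.LockManager;

public class CounterSketch {
   // Acquire the cluster-wide lock, bump the counter, commit; roll back on failure.
   static void increment(Cache<String, Integer> cache, TransactionManager tm) throws Exception {
      tm.begin();
      try {
         System.out.println("aquiring lock on cache test-grid key locked-counter...");
         cache.getAdvancedCache().lock("locked-counter");   // explicit cluster-wide lock
         System.out.println("lock owner : " + lockOwner(cache, "locked-counter"));
         Integer current = cache.get("locked-counter");
         System.out.println("current value : " + current);
         cache.put("locked-counter", current == null ? 1 : current + 1);
         System.out.println("commit...");
         tm.commit();
         System.out.println("done commit");
      } catch (Exception e) {
         System.out.println("rollback..." + e.getMessage());
         tm.rollback();   // may warn if the TM has already aborted the tx (see log below)
         System.out.println("done rollback");
      }
      System.out.println("lock owner : " + lockOwner(cache, "locked-counter"));
      System.out.println("tx status at the end : "
            + (tm.getTransaction() == null ? "no transaction" : tm.getTransaction()));
   }

   // Hypothetical helper: read the current lock owner from the internal LockManager.
   static Object lockOwner(Cache<String, Integer> cache, String key) {
      LockManager lockManager = cache.getAdvancedCache().getComponentRegistry()
            .getComponent(LockManager.class);
      return lockManager.getOwner(key);
   }
}
```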

               

              aquiring lock on cache test-grid key locked-counter...

              lock owner : GlobalTransaction:<T00696119-12694>:11:local

              current value : 17

              commit...

              done commit

              lock owner : null

              tx status at the end : no transaction

              aquiring lock on cache test-grid key locked-counter...

              lock owner : GlobalTransaction:<T00696119-12694>:12:local

              current value : 19

              commit...

              done commit

              lock owner : null

              tx status at the end : no transaction

              aquiring lock on cache test-grid key locked-counter...

              2011-01-12 23:04:41,812 [Thread-4] WARN  com.arjuna.ats.arjuna.logging.arjLoggerI18N - [com.arjuna.ats.arjuna.coordinator.TransactionReaper_18] - TransactionReaper::check timeout for TX 0:ffff0a2c2cb3:8fc:4d2e799d:32 in state  RUN

              2011-01-12 23:04:41,827 [Thread-5] WARN  com.arjuna.ats.arjuna.logging.arjLoggerI18N - [com.arjuna.ats.arjuna.coordinator.BasicAction_58] - Abort of action id 0:ffff0a2c2cb3:8fc:4d2e799d:32 invoked while multiple threads active within it.

              2011-01-12 23:04:41,827 [Thread-5] WARN  com.arjuna.ats.arjuna.logging.arjLoggerI18N - [com.arjuna.ats.arjuna.coordinator.CheckedAction_2] - CheckedAction::check - atomic action 0:ffff0a2c2cb3:8fc:4d2e799d:32 aborting with 1 threads active!

              2011-01-12 23:04:41,827 [Thread-5] WARN  com.arjuna.ats.jta.logging.loggerI18N - [com.arjuna.ats.internal.jta.resources.arjunacore.rollbackerror] [com.arjuna.ats.internal.jta.resources.arjunacore.rollbackerror] attempted rollback of < formatId=131076, gtrid_length=29, bqual_length=28, tx_uid=0:ffff0a2c2cb3:8fc:4d2e799d:32, node_name=1, branch_uid=0:ffff0a2c2cb3:8fc:4d2e799d:33, eis_name=unknown eis name > (TransactionXaAdapter{localTransaction=org.infinispan.transaction.xa.LocalTransaction@221e9e}) failed with exception XAException.XAER_NOTA

              javax.transaction.xa.XAException

                      at org.infinispan.transaction.xa.TransactionXaAdapter.getLocalTransactionAndValidate(TransactionXaAdapter.java:148)

                      at org.infinispan.transaction.xa.TransactionXaAdapter.rollback(TransactionXaAdapter.java:129)

                      at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:432)

                      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:3337)

                      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:3315)

                      at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1981)

                      at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:114)

                      at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:229)

                      at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:471)

                      at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:91)

              2011-01-12 23:04:41,827 [Thread-5] WARN  com.arjuna.ats.arjuna.logging.arjLoggerI18N - [com.arjuna.ats.arjuna.coordinator.TransactionReaper_7] - TransactionReaper::doCancellations worker Thread[Thread-5,5,main] successfully canceled TX 0:ffff0a2c2cb3:8fc:4d2e799d:32

              lock owner : GlobalTransaction:<T00696119-12694>:13:local

              rollback...Transaction TransactionImple < ac, BasicAction: 0:ffff0a2c2cb3:8fc:4d2e799d:32 status: ActionStatus.ABORTED > is not in a valid state to be invoking cache operations on.

              2011-01-12 23:04:42,671 [Timer-0] WARN  com.arjuna.ats.arjuna.logging.arjLoggerI18N - [com.arjuna.ats.arjuna.coordinator.BasicAction_40] - Abort called on already aborted atomic action 0:ffff0a2c2cb3:8fc:4d2e799d:32

              done rollback

              lock owner : GlobalTransaction:<T00696119-12694>:13:local

              tx status at the end : no transaction

              aquiring lock on cache test-grid key locked-counter...

              rollback...Unable to acquire lock after [10 seconds] on key [locked-counter] for requestor [GlobalTransaction:<T00696119-12694>:14:local]! Lock held by [GlobalTransaction:<T00696119-12694>:13:local]

              2011-01-12 23:04:56,640 [Timer-0] WARN  com.arjuna.ats.jta.logging.loggerI18N - [com.arjuna.ats.internal.jta.resources.arjunacore.rollbackerror] [com.arjuna.ats.internal.jta.resources.arjunacore.rollbackerror] attempted rollback of < formatId=131076, gtrid_length=29, bqual_length=28, tx_uid=0:ffff0a2c2cb3:8fc:4d2e799d:36, node_name=1, branch_uid=0:ffff0a2c2cb3:8fc:4d2e799d:37, eis_name=unknown eis name > (TransactionXaAdapter{localTransaction=org.infinispan.transaction.xa.LocalTransaction@e45b5e}) failed with exception XAException.XAER_NOTA

              javax.transaction.xa.XAException

                      at org.infinispan.transaction.xa.TransactionXaAdapter.getLocalTransactionAndValidate(TransactionXaAdapter.java:148)

                      at org.infinispan.transaction.xa.TransactionXaAdapter.rollback(TransactionXaAdapter.java:129)

                      at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:432)

                      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:3337)

                      at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:3315)

                      at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1981)

                      at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:114)

                      at com.arjuna.ats.arjuna.AtomicAction.abort(AtomicAction.java:200)

                      at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.rollbackAndDisassociate(TransactionImple.java:1470)

                      at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.rollback(BaseTransaction.java:166)

                      at CacheScheduledCounter$CounterTask.run(CacheScheduledCounter.java:115)

                      at java.util.TimerThread.mainLoop(Timer.java:512)

                      at java.util.TimerThread.run(Timer.java:462)

              done rollback

              lock owner : GlobalTransaction:<T00696119-12694>:13:local

              tx status at the end : no transaction

              • 34. Re: failure situation using cluster-wide lock
                manik

                Sorry, I still don't see this. 

                 

                If you can consistently recreate it, could you please try the following:

                 

                1. Replace the System.outs in your test with log().  I've added an implementation of log() to your test here: http://pastie.org/1457389 (this makes your messages appear in the log files, in the correct order relative to other log messages). A minimal sketch of such a helper appears after this list.
                2. Set your log4j.xml file to use the following: http://pastie.org/1457395
                3. Make sure the two processes you launch log to separate files (use separate log4j.xml files pointing to separate file appenders).
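
                A minimal sketch of the log() helper from step 1, to drop into the test class, assuming log4j (the actual pastie content may differ):

```java
import org.apache.log4j.Logger;

// Route test output through log4j instead of System.out, so that it is
// interleaved correctly with Infinispan's own log messages.
private static final Logger LOG = Logger.getLogger(CacheScheduledCounter.class);

private void log(String msg) {
   LOG.info(msg);
}
```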

                 

                And if you could send me the logs you generate, that would be great.

                 

                Thanks

                Manik

                • 35. Re: failure situation using cluster-wide lock
                  yelin666

                  Hi Manik,

                   

                  I followed your suggestions; attached is the log from the instance that failed to release the lock properly. I hope it helps you nail down the problem.

                   

                  Cheers,

                  Lin

                  • 36. Re: failure situation using cluster-wide lock
                    yelin666

                    FYI, it still locked up with the Geronimo TX running inside the ServiceMix container. So it seems that with a production-level TX that fully implements commit/rollback, the lock doesn't work properly. Please let me know if you get any update on the issue, or if you need additional information from me.

                    • 37. failure situation using cluster-wide lock
                      manik

                      Hi Lin.  JBoss TM is pretty "production-grade" too!

                       

                      I'm still trying to debug this; it is tough when I cannot reproduce it, but your logs really do help.  Is there any chance you could enable thread naming in your logs?  With that many interleaving threads, it helps.  E.g., have a look at this log4j file:

                       

                      https://github.com/maniksurtani/infinispan/blob/t_an1310_m/core/src/test/resources/log4j.xml

                       

                      Modify it to use TRACE level logging as I mentioned above, though.
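
                      For illustration, the thread name comes from the %t conversion character in log4j's PatternLayout, e.g. an appender along these lines (the file name is just an example):

```xml
<appender name="FILE" class="org.apache.log4j.FileAppender">
   <param name="File" value="node1-infinispan.log"/>
   <layout class="org.apache.log4j.PatternLayout">
      <!-- %t prints the name of the thread that logged the message -->
      <param name="ConversionPattern" value="%d{HH:mm:ss,SSS} %-5p (%t) [%c{1}] %m%n"/>
   </layout>
</appender>
```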

                       

                      Thx

                      Manik

                      • 38. failure situation using cluster-wide lock
                        manik

                        Also, please use the latest 4.2.1 snapshot (I published a new version earlier today with a better fix for transaction cleanup).

                        • 39. Re: failure situation using cluster-wide lock
                          yelin666

                          Manik,

                           

                          Per your suggestion, I used the 4.2.1 snapshot you published Jan 14 at 14:47 and enabled thread naming; the thread name appears in the second column of the logging output, inside the "()". Attached please find the log from the node that failed to release the lock.

                           

                          Thanks,

                          Lin

                          • 40. Re: failure situation using cluster-wide lock
                            manik

                            Good news - your most recent logs with the thread naming info have enabled me to track down the crux of the problem, and I can now reliably recreate it in a unit test.  Expect a fix very soon.

                            • 41. Re: failure situation using cluster-wide lock
                              manik

                              Note, though, that this is not related to ISPN-777.  So I've closed ISPN-777 and created ISPN-879 instead.

                              • 42. failure situation using cluster-wide lock
                                manik

                                Fixed.  New snapshot uploading right now.

                                • 43. Re: failure situation using cluster-wide lock
                                  yelin666

                                  Great, Manik. I just tried the new snapshot you posted yesterday, and the counter reached more than 900 without a problem. Thanks for fixing it! When do you expect 4.2.1.FINAL to be released for us to pick up?

                                   

                                  A side question regarding locking: when a cluster-wide lock is requested, it is acquired on the local node first, and the request is then sent to the remote nodes to acquire it there. A potential situation is that when multiple nodes request the same lock at the same time, none of them gets it. Basically, in our use case all nodes in the cluster receive an event, but we would like to use the cluster-wide lock to guarantee that only the one node that acquires the lock processes the event. Since all nodes get the event at around the same time and race for the lock, what strategy do you recommend so that one of them reliably gets it? (A sketch of the scenario is below.) If you would prefer that I start a new thread for this discussion, please let me know.
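
                                  To make the scenario concrete, each node would do roughly the following on receiving the event (a sketch only; processEvent() and the processed-flag are hypothetical illustrations, not existing code):

```java
import javax.transaction.TransactionManager;
import org.infinispan.Cache;
import org.infinispan.util.concurrent.TimeoutException;

public class EventRaceSketch {
   // Every node runs this when the event arrives; only the node that wins
   // the cluster-wide lock should end up processing the event.
   static boolean tryProcessEvent(Cache<String, Object> cache, TransactionManager tm,
                                  String eventKey) throws Exception {
      tm.begin();
      try {
         cache.getAdvancedCache().lock(eventKey);  // blocks up to the lock acquisition timeout
         if (Boolean.TRUE.equals(cache.get(eventKey + ".processed"))) {
            tm.commit();                           // another node won the race and handled it
            return false;
         }
         processEvent(eventKey);                   // hypothetical event handler
         cache.put(eventKey + ".processed", Boolean.TRUE);
         tm.commit();
         return true;
      } catch (TimeoutException e) {
         tm.rollback();                            // the bad case: nodes time out against
         return false;                             // each other and nobody gets the lock
      }
   }

   static void processEvent(String eventKey) {
      // placeholder for the real event processing
   }
}
```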

                                  • 44. failure situation using cluster-wide lock
                                    manik

                                    Hi Lin

                                     

                                    I should be releasing 4.2.1.CR1 this week for you to try out.  4.2.1.FINAL should follow soon after, if all is well.

                                     

                                     Regarding locks, you should start a new thread.  There are a few strategies around this that we can discuss.

                                     

                                    Cheers

                                    Manik