-
30. failure situation using cluster-wide lock
manik Jan 11, 2011 12:03 PM (in response to yelin666)Also, I was working off branch 4.2.x, which includes some latest commits for 4.2.1. I was wondering whether any of these were relevant to your issue. https://issues.jboss.org/browse/ISPN-845 may be - it has to do with read-only transactions rolling back and is only specific to certain transaction managers. We didn't spot this with the DummyTM either.
I'll publish a snapshot of 4.2.1 which you can play with (4.2.1-SNAPSHOT in Maven), see if it helps.
-
31. failure situation using cluster-wide lock
yelin666 Jan 11, 2011 12:37 PM (in response to manik)I noticed a snapshot build published 1/11 around noon, and your commits for the above ticket were in 1/7. I assume that build would contain your changes. Please confirm if that's the right one to try.
-
32. failure situation using cluster-wide lock
manik Jan 11, 2011 12:53 PM (in response to yelin666)yes thats correct
-
33. Re: failure situation using cluster-wide lock
yelin666 Jan 12, 2011 11:23 PM (in response to manik)Manik,
I tried 4.2.1-SNAPSHOT build with the JBoss TX. It may be working better, but it's not completely fixed. I added more output to my code with information on the lock owner at different points as in my attached code. The following is the output snippet with the failure. Basically, it always happened following some TX warnings. Typically, the warning caused a rollback, and while the rollback was running successfully it failed to release the lock after the rollback. Hopefully, the following output can give you more hints on the problem. Meantime, I'll do a test with the Geromino TX used inside our container, and update you on that.
aquiring lock on cache test-grid key locked-counter...
lock owner : GlobalTransaction:<T00696119-12694>:11:local
current value : 17
commit...
done commit
lock owner : null
tx status at the end : no transaction
aquiring lock on cache test-grid key locked-counter...
lock owner : GlobalTransaction:<T00696119-12694>:12:local
current value : 19
commit...
done commit
lock owner : null
tx status at the end : no transaction
aquiring lock on cache test-grid key locked-counter...
2011-01-12 23:04:41,812 [Thread-4] WARN com.arjuna.ats.arjuna.logging.arjLoggerI18N - [com.arjuna.ats.arjuna.coordinator.TransactionReaper_18] - TransactionReaper::check timeout for TX 0:ffff0a2c2cb3:8fc:4d2e799d:32 in state RUN
2011-01-12 23:04:41,827 [Thread-5] WARN com.arjuna.ats.arjuna.logging.arjLoggerI18N - [com.arjuna.ats.arjuna.coordinator.BasicAction_58] - Abort of action id 0:ffff0a2c2cb3:8fc:4d2e799d:32 invoked while multiple threads active within it.
2011-01-12 23:04:41,827 [Thread-5] WARN com.arjuna.ats.arjuna.logging.arjLoggerI18N - [com.arjuna.ats.arjuna.coordinator.CheckedAction_2] - CheckedAction::check - atomic action 0:ffff0a2c2cb3:8fc:4d2e799d:32 aborting with 1 threads active!
2011-01-12 23:04:41,827 [Thread-5] WARN com.arjuna.ats.jta.logging.loggerI18N - [com.arjuna.ats.internal.jta.resources.arjunacore.rollbackerror] [com.arjuna.ats.internal.jta.resources.arjunacore.rollbackerror] attempted rollback of < formatId=131076, gtrid_length=29, bqual_length=28, tx_uid=0:ffff0a2c2cb3:8fc:4d2e799d:32, node_name=1, branch_uid=0:ffff0a2c2cb3:8fc:4d2e799d:33, eis_name=unknown eis name > (TransactionXaAdapter{localTransaction=org.infinispan.transaction.xa.LocalTransaction@221e9e}) failed with exception XAException.XAER_NOTA
javax.transaction.xa.XAException
at org.infinispan.transaction.xa.TransactionXaAdapter.getLocalTransactionAndValidate(TransactionXaAdapter.java:148)
at org.infinispan.transaction.xa.TransactionXaAdapter.rollback(TransactionXaAdapter.java:129)
at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:432)
at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:3337)
at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:3315)
at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1981)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:114)
at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:229)
at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:471)
at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:91)
2011-01-12 23:04:41,827 [Thread-5] WARN com.arjuna.ats.arjuna.logging.arjLoggerI18N - [com.arjuna.ats.arjuna.coordinator.TransactionReaper_7] - TransactionReaper::doCancellations worker Thread[Thread-5,5,main] successfully canceled TX 0:ffff0a2c2cb3:8fc:4d2e799d:32
lock owner : GlobalTransaction:<T00696119-12694>:13:local
rollback...Transaction TransactionImple < ac, BasicAction: 0:ffff0a2c2cb3:8fc:4d2e799d:32 status: ActionStatus.ABORTED > is not in a valid state to be invoking cache operations on.
2011-01-12 23:04:42,671 [Timer-0] WARN com.arjuna.ats.arjuna.logging.arjLoggerI18N - [com.arjuna.ats.arjuna.coordinator.BasicAction_40] - Abort called on already aborted atomic action 0:ffff0a2c2cb3:8fc:4d2e799d:32
done rollback
lock owner : GlobalTransaction:<T00696119-12694>:13:local
tx status at the end : no transaction
aquiring lock on cache test-grid key locked-counter...
rollback...Unable to acquire lock after [10 seconds] on key [locked-counter] for requestor [GlobalTransaction:<T00696119-12694>:14:local]! Lock held by [GlobalTransaction:<T00696119-12694>:13:local]
2011-01-12 23:04:56,640 [Timer-0] WARN com.arjuna.ats.jta.logging.loggerI18N - [com.arjuna.ats.internal.jta.resources.arjunacore.rollbackerror] [com.arjuna.ats.internal.jta.resources.arjunacore.rollbackerror] attempted rollback of < formatId=131076, gtrid_length=29, bqual_length=28, tx_uid=0:ffff0a2c2cb3:8fc:4d2e799d:36, node_name=1, branch_uid=0:ffff0a2c2cb3:8fc:4d2e799d:37, eis_name=unknown eis name > (TransactionXaAdapter{localTransaction=org.infinispan.transaction.xa.LocalTransaction@e45b5e}) failed with exception XAException.XAER_NOTA
javax.transaction.xa.XAException
at org.infinispan.transaction.xa.TransactionXaAdapter.getLocalTransactionAndValidate(TransactionXaAdapter.java:148)
at org.infinispan.transaction.xa.TransactionXaAdapter.rollback(TransactionXaAdapter.java:129)
at com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelAbort(XAResourceRecord.java:432)
at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:3337)
at com.arjuna.ats.arjuna.coordinator.BasicAction.doAbort(BasicAction.java:3315)
at com.arjuna.ats.arjuna.coordinator.BasicAction.Abort(BasicAction.java:1981)
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:114)
at com.arjuna.ats.arjuna.AtomicAction.abort(AtomicAction.java:200)
at com.arjuna.ats.internal.jta.transaction.arjunacore.TransactionImple.rollbackAndDisassociate(TransactionImple.java:1470)
at com.arjuna.ats.internal.jta.transaction.arjunacore.BaseTransaction.rollback(BaseTransaction.java:166)
at CacheScheduledCounter$CounterTask.run(CacheScheduledCounter.java:115)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
done rollback
lock owner : GlobalTransaction:<T00696119-12694>:13:local
tx status at the end : no transaction
-
34. Re: failure situation using cluster-wide lock
manik Jan 13, 2011 2:15 PM (in response to yelin666)Sorry, I still don't see this.
If you can consistently recreate it, could you pls try with:
- Replace the System.outs in your test with log(). I've added this method to implement log() to your test: http://pastie.org/1457389 (This will help your sys out messages to properly appear in log files, in correct order with other log messages).
- Set your log4j.xml file to use the following: http://pastie.org/1457395
- Make sure the 2 processes you launch log to separate files (use separate log4j.xml files pointing to separate file appenders)
And if you could send me the logs you generate, that would be great.
Thanks
Manik
-
35. Re: failure situation using cluster-wide lock
yelin666 Jan 14, 2011 3:59 AM (in response to manik)Hi Manik,
I followed your suggestions, and attached is the log of the instance that failed to release the lock properly. I hope it could help you nail down the problem.
Cheers,
Lin
-
locktest.log.zip 16.3 KB
-
-
36. Re: failure situation using cluster-wide lock
yelin666 Jan 14, 2011 3:15 PM (in response to yelin666)FYI, it still locked up with the Geromino TX running inside the servicemix container. So it seems for a production level TX with fully implemented commit/rollback, the lock didn't work properly. Please let me if you get any update on the issue, or need additional information from me.
-
37. failure situation using cluster-wide lock
manik Jan 14, 2011 3:45 PM (in response to yelin666)Hi Lin. JBoss TM is pretty "production-grade" too!
Still trying to debug this, it is tough when I cannot reproduce this but your logs really do help. Is there any chance you could enable thread naming in your logs? With that many interleaving threads, it helps. E.g., have a look at this log4j file:
https://github.com/maniksurtani/infinispan/blob/t_an1310_m/core/src/test/resources/log4j.xml
Modify it to use TRACE level logging as I mentioned above though.
Thx
Manik
-
38. failure situation using cluster-wide lock
manik Jan 14, 2011 4:06 PM (in response to manik)Also, pls use the latest 4.2.1 snapshot (I published a new version earlier today with a better fix to transaction cleanup)
-
39. Re: failure situation using cluster-wide lock
yelin666 Jan 16, 2011 10:35 PM (in response to manik)Manik,
Per your suggestion, I used the 4.2.1 snapshot you published Jan 14 14:47, and enabled thread naming that was in the second column of the logging output inside the "()". Attached please find the log on the node that failed to release the lock.
Thanks,
Lin
-
locktest.log.zip 12.8 KB
-
-
40. Re: failure situation using cluster-wide lock
manik Jan 17, 2011 11:21 AM (in response to yelin666)Good news - your most recent logs with thread logging info has enabled me to track the crux of the problem and I can reliably recreate it in a unit test. Expect a fix very soon.
-
41. Re: failure situation using cluster-wide lock
manik Jan 17, 2011 11:35 AM (in response to manik)Note though that this is not related to ISPN-777. So I've closed ISPN-777 and have created ISPN-879 instead.
-
42. failure situation using cluster-wide lock
manik Jan 17, 2011 12:27 PM (in response to manik)Fixed. New snapshot uploading right now.
-
43. Re: failure situation using cluster-wide lock
yelin666 Jan 18, 2011 9:18 AM (in response to manik)Great, Manik. I just tried the new snapshot you posted yesterday. The counter reached more than 900 without problem. Thanks for fixing it! When do you expect the 4.2.1.FINAL to be released for us to pick up?
A side question regard locking is - when a cluster-wide lock is requested, it is acquired on the local node first and then the request is sent to remote nodes to acquire it; a potential situation is that when multiple nodes require the same lock at the same time, none of them may be able to get it. Basically, in our use case, all nodes in the cluster would get an event, but we would like to use the cluster-wide lock to guarantee only one node, which acquires the lock, processes the event. In this case, all nodes get the event around the same time, and race for the lock, what strategy do you recommend so one of them could get the lock? If you would expect me to start a new thread for this discussion, please let me know.
-
44. failure situation using cluster-wide lock
manik Jan 18, 2011 10:15 AM (in response to yelin666)Hi Lin
I should be releasing 4.2.1.CR1 this week for you to try out. 4.2.1.FINAL should follow soon after, if all is well.
Regarding locks, you should start a new thread. There are a few strategies around this we can discuss.
Cheers
Manik