-
1. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
ochaloup Dec 11, 2015 5:04 AM (in response to tomjenkinson)
Hi Tom,
out of interest: will the scenario discussed in this bz https://bugzilla.redhat.com/show_bug.cgi?id=1009981 be affected by this change?
I mean, will we still need to set different values of `com.arjuna.ats.jta.orphanSafetyInterval` for a system under heavy load, or will the clash of the titans scenario be avoided?
Thanks
Ondra
-
2. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
marklittle Dec 11, 2015 6:18 AM (in response to tomjenkinson)
Trying to make a distributed system look and work like a local system is something that can never really be accomplished and often leads to misunderstandings by developers and users who expect one thing and get something very different. If you feel that adding this feature helps users then go for it and just make sure the edge cases are well documented. Since this represents a difference in public interface/behaviour, it should also not be the default in EAP 6.x so you may want to consider an "enable" option - make it false for any TS version that goes into EAP 6.x and true for EAP 7.x so people can go back to the old behaviour if they really really want to.
-
3. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
tomjenkinson Dec 11, 2015 7:16 AM (in response to ochaloup)
Ondřej Chaloupka wrote:
out of interest: will the scenario discussed in this bz https://bugzilla.redhat.com/show_bug.cgi?id=1009981 be affected by this change?
I mean, will we still need to set different values of `com.arjuna.ats.jta.orphanSafetyInterval` for a system under heavy load, or will the clash of the titans scenario be avoided?
orphanSafetyInterval would still be required in the remote case, for situations where an orphan exists but the recovery manager cannot contact the transaction manager (or TSM is disabled, which is the default for WildFly). For the purely local case it should be possible to disable orphanSafetyInterval.
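To make the distinction above concrete, here is a minimal sketch of the decision being discussed: ask the transaction manager first, and fall back to the age-based orphanSafetyInterval check only when it cannot be reached. This is a hypothetical model, not Narayana code; the class, enum and method names are invented for illustration, and it deliberately ignores the scan/commit race that comes up later in the thread.

```java
// Hypothetical model of the discussion, NOT Narayana's implementation.
final class OrphanDecisionSketch {

    /** What the recovery manager learned when it asked the transaction manager. */
    enum TmAnswer { IN_FLIGHT, UNKNOWN_TO_TM, UNREACHABLE }

    /** What to do with the in-doubt Xid found on the XAResource. */
    enum Vote { ROLLBACK, LEAVE_ALONE }

    static Vote decide(TmAnswer answer, long xidAgeMillis, long orphanSafetyIntervalMillis) {
        switch (answer) {
            case IN_FLIGHT:
                // The TM still owns the transaction, so the Xid is not an orphan.
                return Vote.LEAVE_ALONE;
            case UNKNOWN_TO_TM:
                // The TM answered and does not know the transaction.
                return Vote.ROLLBACK;
            default:
                // TM unreachable (or status querying disabled): only the
                // age-based safety interval is left to protect in-flight work.
                return xidAgeMillis > orphanSafetyIntervalMillis
                        ? Vote.ROLLBACK
                        : Vote.LEAVE_ALONE;
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(TmAnswer.IN_FLIGHT, 60_000, 20_000));
        System.out.println(decide(TmAnswer.UNREACHABLE, 5_000, 20_000));
        System.out.println(decide(TmAnswer.UNREACHABLE, 60_000, 20_000));
    }
}
```

In this model the interval only matters on the `UNREACHABLE` branch, which is why it could in principle be disabled when the recovery manager can always reach the transaction manager.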
-
4. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
tomjenkinson Dec 11, 2015 7:17 AM (in response to marklittle)
Mark Little wrote:
Trying to make a distributed system look and work like a local system is something that can never really be accomplished and often leads to misunderstandings by developers and users who expect one thing and get something very different. If you feel that adding this feature helps users then go for it and just make sure the edge cases are well documented. Since this represents a difference in public interface/behaviour, it should also not be the default in EAP 6.x so you may want to consider an "enable" option - make it false for any TS version that goes into EAP 6.x and true for EAP 7.x so people can go back to the old behaviour if they really really want to
Agreed, I would target this for EAP 7.0 with a doc update. If the feature was requested in earlier versions it would not be the default.
-
5. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
tomjenkinson Dec 11, 2015 7:17 AM (in response to tomjenkinson)
As there looks to be consensus that it is not fundamentally broken, I will raise a Jira to record further progress.
-
6. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
tomjenkinson Dec 11, 2015 7:24 AM (in response to tomjenkinson)
I have recorded this enhancement as [JBTM-2583] Try to contact the transaction status connection manager to determine if a transaction containing XAResource… and linked to our discussion from it.
-
7. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
ochaloup Dec 11, 2015 7:54 AM (in response to tomjenkinson)
tomjenkinson: OK, I see the point now and understand that orphanSafetyInterval can't be avoided entirely. But regarding "For the purely local case orphanSafetyInterval should be able to be disabled": the EAP case is the purely local one, isn't it? The recovery manager is always expected to run in the same JVM as the TM, right?
Btw., if the option is set to true by default in EAP 7, then TSM will be enabled by default as well, right?
-
8. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
marklittle Dec 11, 2015 8:57 AM (in response to tomjenkinson)
I'm not convinced this should be the default in EAP 7 either - there's an argument to be made in either direction. However, as long as it is documented and flagged as a difference in behaviour ...
-
9. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
tomjenkinson Dec 11, 2015 9:43 AM (in response to marklittle)
I agree - let's not make it the default in any version for now.
Enabling TSM by default would be a hard change to push into EAP 7 at this stage. IMO, so long as the facility is there, it should be enough for our users to enable it with appropriate doc support.
-
10. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
ochaloup Dec 11, 2015 11:07 AM (in response to tomjenkinson)
I see, thank you for the clarification.
-
11. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
alon3392 Jan 18, 2018 2:38 AM (in response to tomjenkinson)
This issue (JBTM-2583) is marked as fixed in 5.2.17, while the latest EAP 7.1 contains 5.5.30 (as documented here). However, the latest EAP 7.1 documentation, under Limitations of the XA Recovery Process, still has an entry
"Periodic recovery can occur on committed transactions" and provides a link to this thread.
Is the EAP documentation outdated, or is this not really fixed yet in EAP 7.1?
Also, the question raised above seems to be unanswered: if this fix is included and enabled, is orphanSafetyInterval still needed in the EAP case (purely local)?
Also, this fix includes an XAResourceOrphanFilter implementation called JTAActionStatusServiceXAResourceOrphanFilter. Is it enabled by default in EAP 7.1, or does it have to be configured manually?
Thanks
-
12. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
tomjenkinson Jan 18, 2018 8:02 AM (in response to alon3392)
Hi Alon,
JBTM-2583 does further reduce the chance of this happening but does not completely eliminate it.
It is possible that the RecoveryManager scans the XAResources for Xids, then the transaction matching one of these Xids is committed (and therefore forgotten about), and when the RecoveryManager comes to decide whether the Xid is an orphan it can't find the transaction, so it tries to roll it back (which will fail).
We could, after committing each transaction, submit the transaction ID to the recovery manager (assuming it is in process) for it to check (for the current two-phase scan) during orphan detection in the second phase; that should eliminate the message.
Does it cause you a problem? If so, we can look to schedule the work, or work with you on a PR.
Thanks,
Tom
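The proposal above (telling an in-process recovery manager which transactions committed while a scan was running) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, transaction IDs are modelled as plain strings, and nothing here is taken from the Narayana code base.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the idea, NOT Narayana's implementation.
final class RecentCommitRegistrySketch {

    // Transaction IDs committed while the current two-phase scan is running.
    private final Set<String> committedDuringScan = ConcurrentHashMap.newKeySet();

    /** Called by the transaction manager right after a transaction commits. */
    void recordCommit(String txId) {
        committedDuringScan.add(txId);
    }

    /**
     * Called in the second scan phase for an Xid seen in the first phase.
     * A branch with no log entry is only treated as an orphan if its
     * transaction did not commit in the meantime.
     */
    boolean shouldRollBack(String txId, boolean foundInTransactionLog) {
        return !foundInTransactionLog && !committedDuringScan.contains(txId);
    }

    /** Called when the scan completes, so the set does not grow without bound. */
    void scanFinished() {
        committedDuringScan.clear();
    }

    public static void main(String[] args) {
        RecentCommitRegistrySketch registry = new RecentCommitRegistrySketch();
        registry.recordCommit("tx-1");
        System.out.println(registry.shouldRollBack("tx-1", false));
        System.out.println(registry.shouldRollBack("tx-2", false));
    }
}
```

A concurrent set is used because commits and the recovery scan would run on different threads; whether clearing at the end of each scan is the right lifetime is a design question this sketch does not settle.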
-
13. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
alon3392 Jan 22, 2018 1:46 AM (in response to tomjenkinson)
Hi Tom,
having these spurious warnings in the log is definitely not ideal.
Here is my understanding:
Without JTAActionStatusServiceXAResourceOrphanFilter (introduced in JBTM-2583) enabled, the recovery manager's orphan detection only checks the transaction log and can thus incorrectly roll back transactions not yet written to the log, which happens if the time it takes to prepare all XAResources is greater than orphanSafetyInterval. Since Narayana writes the transaction log entry (TxLog.write_committed) only once, after all XA resources have finished the prepare operation, the chance of a violation increases with the number of enlisted XAResources and the time each resource takes to finish the prepare operation.
Increasing orphanSafetyInterval only reduces the chance of a data integrity violation, while also increasing data unavailability after recovering from a crash (the XAResource's data is locked for a longer time).
With JTAActionStatusServiceXAResourceOrphanFilter enabled, increasing orphanSafetyInterval now only affects the chance of spurious rollback warnings (while of course still affecting lock time after a crash).
I would be glad to work on a PR for this issue.
Apart from fixing the issue with the approach you described, where the recovery manager would know that the XID doesn't exist anymore, the question is why we issue a warning in the first place. Basically we're telling the XAResource: roll back if this XID exists. So receiving XAER_NOTA is not really an error and could be logged at debug level instead of warning. Currently [XARecoveryModule.java#L849] the warning is issued for all return codes.
Then the approach you described would only be an optimization to avoid a network call and load on the RM.
Also, regarding JTAActionStatusServiceXAResourceOrphanFilter: while this filter is present by default in EAP and WildFly, there are still quick-start examples [transaction.xml#L117] and default configurations, like in Spring Boot [NarayanaProperties.java#L102], that don't use it.
Given the chance of data integrity issues in the absence of JTAActionStatusServiceXAResourceOrphanFilter, is there any reason not to include it by default?
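As an illustration of the logging suggestion above, a minimal sketch of choosing the log level from the XA error code might look like this. It uses the standard `javax.transaction.xa.XAException` class and `java.util.logging` levels (with `FINE` standing in for "debug"); the class and method names are hypothetical and this is not the code in XARecoveryModule.

```java
import java.util.logging.Level;
import javax.transaction.xa.XAException;

// Hypothetical sketch, NOT the code in Narayana's XARecoveryModule.
final class RollbackLogLevelSketch {

    /**
     * Pick a log level for an XAException thrown by an orphan rollback.
     * XAER_NOTA means the resource manager no longer knows the Xid, which
     * is the expected outcome if the branch completed in the meantime.
     */
    static Level levelFor(XAException e) {
        return e.errorCode == XAException.XAER_NOTA
                ? Level.FINE      // debug-level: nothing left to roll back
                : Level.WARNING;  // any other error code is still worth a warning
    }

    public static void main(String[] args) {
        System.out.println(levelFor(new XAException(XAException.XAER_NOTA)));
        System.out.println(levelFor(new XAException(XAException.XAER_RMERR)));
    }
}
```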
-
14. Re: Whether to attempt to query the transaction manager rather than rely solely on orphan detection
tomjenkinson Jan 22, 2018 7:00 AM (in response to alon3392)
Hi Alon,
It will not affect lock time, as the branch has been completed prior to the recovery manager issuing the spurious rollback. I think it might be best just to do as you suggested and downgrade the XAER_NOTA case to a debug message.
If you want to do that, we would welcome a JBTM issue and pull request. You should sign the CLA before your PR is merged: Contributor License Agreements - JBoss Community (JBoss Transactions)
Finally, I agree those other places should be updated too. We can't make it a default in the sense that no orphan filters are default, though it is of course very sensible to make sure our examples and default configuration use this. If you want to PR those, please do, or I can do it. Feel free to use a commit message prefixed with JBTM-2583.
Many thanks!
Tom