Design Discussion: Changing the reaper to use s...| JBoss.org Content Archive (Read Only)

60. Re: Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

tomjenkinson Aug 15, 2014 12:07 PM (in response to smarlow)

Scott Marlow wrote:

Unless it is an error in the user logic I can't see a scenario where the application thread won't call one of commit/rollback/suspend on the BaseTransactionManagerDelegate?

Yes, the application logic could have a bug in their user transaction handling logic, this is easy to get wrong. For example, the application might notice that the transaction status is STATUS_ROLLEDBACK and leave the rolled back transaction associated with the thread (causing a leak of the database connection and persistence context I think). This could be a very large application that seemed to work before with the bug but no longer functions as it used to.

I don't think this is likely as if they call begin() on the same thread again then they will get an error as it stands anyway, so its likely that if they are ignoring the status already they will be having other difficulties with their app.

61. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

smarlow Aug 15, 2014 1:21 PM (in response to tomjenkinson)

The application would only get an error when it performs another operation on the same thread again (which still has the leaked transaction associated with it). I know its not a robustly written application that would do this but it is a possible use case to consider.

Are there certain solutions (1-4 from the recent summary screen), that you are in support of? You keep returning to this one solution, as if that is the only one but there are others also.

62. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

smarlow Aug 19, 2014 3:24 PM (in response to smarlow)

I was able to simulate the race condition that could happen, without actually recreating the race (which would be more difficult). The race is between the clearing of the persistence context in the background thread (performed by persistence provider Synchronization.afterCompletion(int) callback) and the application thread adding a new entity to the persistence context after. Might be easier to recreate the actual race with ByteMan at some point.

https://gist.github.com/scottmarlow/0151c9ac776f090e3f43 shows the test output from a change I pushed to https://github.com/scottmarlow/wildfly/tree/transactiontimeout_clientut_noejb.

63. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

tomjenkinson Aug 21, 2014 2:39 PM (in response to smarlow)

Hi Scott,

As you say, I believe option 4 is the best solution for the reasons I have identified before. I.E. it should be possible for Synchronizations to be totally in control of whether they concurrently access any internal state as they will know when both the tx completes and the app thread is associated. As you say, it would be possible for a UT to delay calling commit/suspend/rollback but if they ever called begin on that thread again their app would blow up anyway so I don't think there are going to be too many apps where that would be possible as they would be crashing all the time.

Option 1: Unless we can work out a way to avoid a concurrent rollback in the remote JVM I don't see this as any more advantageous than issuing a concurrent rollback from the reaper. If we could come up with a solution for that I would be happy to reconsider it.

Option 2: Its actually quite close to option 4 but where a TSR proxy has to track thread usage so I think its too complex.

Option 3: I understand the proposal but I think option 4 would resolve this issue satisfactorily.

Can you have a think about option 4 and let me know under what scenarios you think it would fail? If its just an app thread not calling commit/rollback/suspend then I think that is an app bug and one we would detect immediately the next time the app thread called "begin" as it would blow up (we could call the tx disassociation listener too at that point anyway).

Tom

64. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

tomjenkinson Aug 21, 2014 2:41 PM (in response to smarlow)

Scott Marlow wrote:

I was able to simulate the race condition that could happen, without actually recreating the race (which would be more difficult). The race is between the clearing of the persistence context in the background thread (performed by persistence provider Synchronization.afterCompletion(int) callback) and the application thread adding a new entity to the persistence context after. Might be easier to recreate the actual race with ByteMan at some point.

https://gist.github.com/scottmarlow/0151c9ac776f090e3f43 shows the test output from a change I pushed to https://github.com/scottmarlow/wildfly/tree/transactiontimeout_clientut_noejb.

Thanks Scott, if you were using option 4 above your afterCompletion would not have fired till the app thread returned so it should have been OK. Even as it stands though I think what you have in the log is acceptable as the transaction timed out and then the app thread gets an error message because the persistence context is closed? Or maybe I misunderstood?

65. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

smarlow Aug 25, 2014 11:02 AM (in response to tomjenkinson)

Tom Jenkinson wrote:

Hi Scott,

As you say, I believe option 4 is the best solution for the reasons I have identified before. I.E. it should be possible for Synchronizations to be totally in control of whether they concurrently access any internal state as they will know when both the tx completes and the app thread is associated. As you say, it would be possible for a UT to delay calling commit/suspend/rollback but if they ever called begin on that thread again their app would blow up anyway so I don't think there are going to be too many apps where that would be possible as they would be crashing all the time.

Regarding compatibility (with existing application), think about the application that skips calling commit/suspend/rollback because they see that the transaction status is rolled back. That application will not normally "blow up". Option 4 is also not standards based, as it requires the persistence provider(s) to code to a JBoss api (not likely to happen in older versions of Hibernate and other 3rd party persistence providers).

Option 1: Unless we can work out a way to avoid a concurrent rollback in the remote JVM I don't see this as any more advantageous than issuing a concurrent rollback from the reaper. If we could come up with a solution for that I would be happy to reconsider it.
Option 2: Its actually quite close to option 4 but where a TSR proxy has to track thread usage so I think its too complex.

Option 3: I understand the proposal but I think option 4 would resolve this issue satisfactorily.

For option 1, this is a distributed problem which I think we have anyway (separate nodes joined to a transaction can crash independently). There is more than one solution, we just need to pick one approach to this problem. Or, if we need multiple approaches, we can configure which one to use.

I agree that option 2 is too complex, as the application server probably doesn't have enough information to fully implement all of the current/future Synchronization logic. The positive side of option 2 is that it is standards based. It would work with older versions of Hibernate and 3rd party persistence providers.

Regarding option 3, the issue isn't really resolved, as much as we could consider marking the application persistence context as not usable after a background transaction roll back. This would need to be discussed on the JPA expert group (to figure out what applications should expect which we didn't cover deep enough previously).

IMO, It sounds to me like the best options is still #1 (assuming the distributed crash problem could be handled). #3 would be good to consider for EE 8, as an independent path (perhaps on JPA expert group and then on JTA, or both at the same time is also fine).

66. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

tomjenkinson Aug 26, 2014 6:31 AM (in response to smarlow)

Tom Jenkinson wrote:

Hi Scott,

As you say, I believe option 4 is the best solution for the reasons I have identified before. I.E. it should be possible for Synchronizations to be totally in control of whether they concurrently access any internal state as they will know when both the tx completes and the app thread is associated. As you say, it would be possible for a UT to delay calling commit/suspend/rollback but if they ever called begin on that thread again their app would blow up anyway so I don't think there are going to be too many apps where that would be possible as they would be crashing all the time.

Regarding compatibility (with existing application), think about the application that skips calling commit/suspend/rollback because they see that the transaction status is rolled back. That application will not normally "blow up". Option 4 is also not standards based, as it requires the persistence provider(s) to code to a JBoss api (not likely to happen in older versions of Hibernate and other 3rd party persistence providers).

Hi Scott,

As soon as they began a new transaction on that thread it would blow up so for example, if you write the following app today:

for (int i = 0; i < 2; i++)

ut.begin()

// wait for timeout

if (ut.getStatus() != STATUS_NO_TRANSACTION) {

ut.complete()

}

It will fail the second loop as there is a tx on the thread. With Narayana you have to (and always have had to) disassociate the transaction with a commit/rollback/suspend.

I agree that option 2 is too complex, as the application server probably doesn't have enough information to fully implement all of the current/future Synchronization logic. The positive side of option 2 is that it is standards based. It would work with older versions of Hibernate and 3rd party persistence providers.

I can probably implement 2 using the API defined in option 4, we can then use configuration to specify which synchronization class names need the app thread disassociating in standalone.xml. Would that help?

The thing I don't understand about option 1 is if we are going to have to issue a potentially concurrent rollback, what would be the difference between that and just say configuring the reaper to issue rollback after timeout x2 for example? You say they are multiple solutions for this but I can't picture any (that don't require concurrent rollback). To be clear, lets define this as option 5.

Option 5: Always set transaction timeout on the XAResources to the value specified by the user (or default). Allow a configurable offset for the reaper to issue a concurrent rollback. Add check for real timeout during commit. Apart from the sRO this is what I think you are actually arguing for in option 1 as sRO doesn't really have much affect apart from preventing commit. Potentially we could even call sRO after the real timeout but I don't think its necessarily worth complicating the solution for that. Effectively this is just adding a delay to the concurrent rollback call (which is why I honestly don't think that the hibernate team would like this as it really is just a configurable delay before concurrently touching the EM.close, but we can ask their opinion if you agree this is what you are asking for).

Tom

67. Re: Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

smarlow Aug 26, 2014 1:07 PM (in response to tomjenkinson)

Hi Tom,

As soon as they began a new transaction on that thread it would blow up so for example, if you write the following app today:

for (int i = 0; i < 2; i++)

ut.begin()

    // wait for timeout

    if (ut.getStatus() != STATUS_NO_TRANSACTION) {

      ut.complete()

    }

}

It will fail the second loop as there is a tx on the thread. With Narayana you have to (and always have had to) disassociate the transaction with a commit/rollback/suspend.

I don't disagree with your thinking that the application needs to disassociate the transaction with a commit/rollback/suspend. I'm just pointing out that their application might of seemed to work correct correctly with:

ut.begin()
// wait for timeout
if (ut.getStatus() != STATUS_ROLLED_BACK && ut.getStatus() != STATUS_COMMITTED) {
  ut.complete()
}

The above code contains a bug but that is what I had in mind. If we dismiss this buggy code example, we are probably okay. Question is whether we should design for this case or not. I'm not sure if there are other similar cases that might be less buggy.

I can probably implement 2 using the API defined in option 4, we can then use configuration to specify which synchronization class names need the app thread disassociating in standalone.xml. Would that help?

I think that will help and could also be an interesting pattern to bring into the standards (based on our prototype).

The thing I don't understand about option 1 is if we are going to have to issue a potentially concurrent rollback, what would be the difference between that and just say configuring the reaper to issue rollback after timeout x2 for example? You say they are multiple solutions for this but I can't picture any (that don't require concurrent rollback). To be clear, lets define this as option 5.

Option 5: Always set transaction timeout on the XAResources to the value specified by the user (or default). Allow a configurable offset for the reaper to issue a concurrent rollback. Add check for real timeout during commit. Apart from the sRO this is what I think you are actually arguing for in option 1 as sRO doesn't really have much affect apart from preventing commit. Potentially we could even call sRO after the real timeout but I don't think its necessarily worth complicating the solution for that. Effectively this is just adding a delay to the concurrent rollback call (which is why I honestly don't think that the hibernate team would like this as it really is just a configurable delay before concurrently touching the EM.close, but we can ask their opinion if you agree this is what you are asking for).

The only alternative that we have identified is to do the background rollback or terminate the remaining JVM process(es). Another approach could be to log an error from the reaper thread. Yet another approach could be to invoke a user defined callback that can issue an administrative alert that the database connection should be killed from the database server side (or something like that).

For option 5, setting an XAResource level timeout sounds like another good approach (if that is supported by the resource manager). Could we do both, first call SRO after the transaction times out and also set the transaction timeout on each XAResource to a short delay? Not sure if we should ignore errors thrown when attempting to set the XAResource level timeout?

Scott

68. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

marklittle Sep 2, 2014 10:54 AM (in response to smarlow)

Has anyone done a cost/benefit analysis? For instance, are we talking about 99% of applications that would benefit from such a change, or just 1%? Of those applications that may benefit, are we talking about ones that are buggy or ones that are well defined/implemented? Before embarking on anything, we should summarise this here, especially after 67 comments

69. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

smarlow Sep 3, 2014 12:33 PM (in response to marklittle)

Has anyone done a cost/benefit analysis? For instance, are we talking about 99% of applications that would benefit from such a change, or just 1%? Of those applications that may benefit, are we talking about ones that are buggy or ones that are well defined/implemented? Before embarking on anything, we should summarise this here, especially after 67 comments

I think this partly depends on where the (JPA) EE 8 discussion goes. For JPA 2.1, we identified that the entity manager must not be concurrently called by multiple threads but didn't address the deeper ordering issue that some applications could see (e.g. application thread could add entities to the persistence context after the transaction is rolled back). From a cost perspective, we have had users run into issues with background transaction timeouts. I think that in some cases, it is likely to be a buggy application that experiences transaction time-out (due to access of resources out of order). However, even for well written applications, they could also see transaction time-outs when multiple applications share the same database tables. I'm not sure of the applications that do experience transaction time-out, what percentage would require that we handle the distributed failure case. Depending on how the distributed failure with multiple JVMs participating in the timed out transaction, is handled, the cost will vary.

Some possible associated tasks to add cost numbers for:

WildFly configuration changes to allow a choice between background timeout and set rollback only.
Adding a new TransactionSynchronizationRegistry method wasTimedOut that could be called from a Synchronization.afterCompletion(int) to check if the active transaction (application or background reaper) timed out. It is not clear yet if this is a requirement. I will defer to the JPA eg discussion about what action should be taken once it is determined that a transaction timed out.
Narayana configuration change for reaper thread to choose either background rollback or set roll-back only.
Documentation changes and test changes for the above. We need to document the advantage of both (SRO/BR options.) BR is background roll-back.

A related question might be, what is the cost to applications if we don't improve how transaction timeout is handled. I think the worst case scenarios are that unexpected Java exceptions are thrown. Some applications could see invalid state (more likely under CMT that an unexpected exception is thrown). In my local testing, I have only seen invalid application state with my badly written test that purposely adds an entity to the persistence context in a way that sees invalid state (I could easily change my test code to check the transaction state and avoid the invalid state).

70. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

mmusgrov Feb 20, 2015 10:03 AM (in response to smarlow)

There is a consensus favouring option 4 (transaction association listener), accordingly I have opened https://issues.jboss.org/browse/JBTM-2342 to track our progress on the implementation.

71. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

marklittle Feb 24, 2015 7:02 AM (in response to mmusgrov)

Will there be a separate discussion on the API?

72. Re: Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

mmusgrov Feb 24, 2015 9:11 AM (in response to marklittle)

The API proposed so far is:

Option 1)

public interface ThreadAssociationListener extends Synchronization {

public void threadAssociated();

public void threadDisassociated();

}

A somewhat improved interface exposes a simple extensible state model:

Option 2)

interface ExtendedTransactionListener {

void stateChange(TransactionEvent e);

}

enum TransactionEvent {

ASSOCIATED,

DISASSOCIATED

}

Here we shy away from adding any state changes that are already provided by javax.transaction.Synchronization.

This option is nice because:

- it is sufficient to addresses the problem in hand, and

- it is extensible in the event of future requirements

Jesper argues for a more generic approach (to avoid having to revisit it if future requirements crop up in this area):

Option 3)

Something along the lines of:

interface TransactionLifecycleListener {

void before(TransactionLifecycleEvent e);

void after(TransactionLifecycleEvent e);

}

enum TransactionLifecycleEvent {

ASSOCIATED,

START,

PREPARE, NO_PREPARE /* in case there is no prepare() call */

...,

BEFORE_COMPLETION,

AFTER_COMPLETION,

DISASSOCIATED

}

The TransactionLifecycleListener methods could even take a Object... payload parameter such that the implementation could do something like:

after:

if (ENLIST.equals(e)) {

XAResource xares = payload[0];

// do stuff

}

This option needs further discussion:

- why do we need before and after

- why do we need the extra lifecycle callbacks (such as the implied ENLIST event, what is START), some use cases would be desirable;

- should we be duplicating lifecycle callbacks that are already available via javax.transaction.synchronization.

73. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

mmusgrov Feb 24, 2015 9:53 AM (in response to mmusgrov)

Scott summarises the cases that the implementation needs to cover:

Common case, after completion + txn disassociated callback are both called from the application thread;
Common tx timeout case, after completion called from reaper thread and txn disassociated callback called on the application thread;
Multiple JVM timeout case, some JVMs may see (2) from above but some may also see after completion called from communications thread.

Tom produced the following SD to describe one (of many) possible 2 JVM scenarios with the transaction timing out on the second JVM which helps visualize what's happening in the more complex cases:

74. Re: Design Discussion: Changing the reaper to use setRollbackOnly() instead of rollback()

marklittle Feb 25, 2015 4:19 AM (in response to mmusgrov)

This thread is waaaaayyyy too long now and the discussion on the SPI should really be in a separate thread. Please move things over and let's keep this thread focussed on the original topic.