Managing BPEL invocation jobs in Switchyard
objectiser Aug 26, 2011 10:14 AM
When I initially embedded a BPEL process into switchyard the current transaction/scheduling model worked fine. When I added two BPEL processes, with one invoking the other, it also appeared to work fine - until I realised that the BPEL processes were using ODE's internal peer-to-peer communications rather than the SCA service reference to handle the invocation between the separate processes (implemented using BPEL).
Once I had disabled the peer-to-peer communications, so that the invocation was routed back through switchyard to the second service (implementing the second BPEL process), I found that there was an attempt to begin a nested transaction. The ODE engine expects to handle the outermost transaction itself, and I guess for efficiency reasons switchyard invokes the second service in the same thread as the first.
Therefore the first ODE change I made was to take into account whether the invocation was being performed in an existing transaction, and if so not attempt to begin, commit or rollback the transaction.
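As a rough illustration of that change, here is a minimal sketch of "only demarcate the transaction if nobody else has". It is not the actual ODE modification: SimpleTxManager and TxAwareInvoker are hypothetical stand-ins I made up for the example, and a real deployment would go through the container's transaction manager instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these types are ODE, RiftSaw or JTA classes.
class TxJoinSketch {

    /** Minimal in-memory stand-in for a transaction manager. */
    static class SimpleTxManager {
        private boolean active;
        final List<String> log = new ArrayList<>();

        boolean isActive() { return active; }

        void begin() {
            if (active) throw new IllegalStateException("nested transaction");
            active = true;
            log.add("begin");
        }

        void commit() { end("commit"); }

        void rollback() { end("rollback"); }

        private void end(String how) {
            if (!active) throw new IllegalStateException("no active transaction");
            active = false;
            log.add(how);
        }
    }

    /** Demarcates a transaction only when the caller has not already started one. */
    static class TxAwareInvoker {
        private final SimpleTxManager tx;

        TxAwareInvoker(SimpleTxManager tx) { this.tx = tx; }

        void invoke(Runnable work) {
            boolean owner = !tx.isActive(); // true only for the outermost caller
            if (owner) tx.begin();
            try {
                work.run();
            } catch (RuntimeException e) {
                if (owner) tx.rollback(); // an inner caller leaves rollback to the owner
                throw e;
            }
            if (owner) tx.commit();
        }
    }

    public static void main(String[] args) {
        SimpleTxManager tx = new SimpleTxManager();
        TxAwareInvoker invoker = new TxAwareInvoker(tx);
        // The first "process" invokes the second on the same thread.
        invoker.invoke(() -> invoker.invoke(() -> {}));
        System.out.println(tx.log);
    }
}
```

With the nested call in main, only the outer invoke begins and commits, so the log holds a single begin/commit pair rather than a failed attempt at a nested transaction.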
This led to the next problem - ODE does not directly invoke the process instance, it places a job onto a scheduler queue which is then handled in a separate thread (or possibly on another server if the job is persisted). However the job is only executed when the transaction in which it is scheduled gets committed. To illustrate the problem further, the ODE 'invoke' method has the following structure:
(if no outer txn) start txn
send request
(if no outer txn) commit txn
if (response expected) {
    (if no outer txn) start txn
    receive response
    (if no outer txn) commit txn
}
If an outer transaction is already active, the invoke method does not commit the transaction before waiting for the response, so the job that handles the request is never executed and the response is never received. Eventually a timeout occurs to break the deadlock.
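To make the deadlock concrete, here is a small simulation of the situation described above. It is only a model under my own assumptions: CommitScheduler and InvokeModel are invented names, jobs run inline on commit rather than in a separate thread, the second (receive) transaction is left out, and an empty result stands in for the point where the real engine would block until the timeout.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Hypothetical model of the behaviour described above; not ODE code.
class CommitGatedSketch {

    /** Jobs scheduled inside a transaction are released only when it commits. */
    static class CommitScheduler {
        private final Queue<Runnable> pending = new ArrayDeque<>();
        private boolean txActive;

        boolean isTxActive() { return txActive; }

        void begin() { txActive = true; }

        void schedule(Runnable job) { pending.add(job); }

        void commit() {
            txActive = false;
            Runnable job;
            // Run inline for simplicity; a real engine would use another thread.
            while ((job = pending.poll()) != null) job.run();
        }
    }

    static class InvokeModel {
        private final CommitScheduler scheduler;
        private String response; // set by the job that processes the request

        InvokeModel(CommitScheduler scheduler) { this.scheduler = scheduler; }

        /** Returns the response, or empty where a real engine would block until timeout. */
        Optional<String> invoke(String request) {
            response = null;
            boolean owner = !scheduler.isTxActive();
            if (owner) scheduler.begin();
            scheduler.schedule(() -> response = "reply to " + request); // send request
            if (owner) scheduler.commit(); // skipped when an outer txn is active
            return Optional.ofNullable(response); // receive response
        }
    }

    public static void main(String[] args) {
        CommitScheduler standalone = new CommitScheduler();
        System.out.println(new InvokeModel(standalone).invoke("a").isPresent());

        CommitScheduler nested = new CommitScheduler();
        nested.begin(); // outer transaction started by the caller
        System.out.println(new InvokeModel(nested).invoke("b").isPresent());
    }
}
```

In the standalone case the commit releases the job before the response is read. In the nested case the job is still sitting in the queue when the response is awaited, which is the deadlock.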
To overcome this problem I made some further ODE modifications to enable the job to be executed immediately, bypassing the scheduler. This enables the first and second BPEL process instances to be executed within the same transaction, avoiding unnecessary transaction, persistence and job scheduling overhead - so it should be more efficient.
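The idea of bypassing the scheduler can be sketched roughly as follows. Again this is an assumption-laden illustration rather than the real ODE patch: JobDispatcher, its dispatch method and the outerTxActive flag are hypothetical names chosen for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of bypassing the scheduler; names are not from ODE.
class ImmediateJobSketch {

    static class JobDispatcher {
        private final Queue<Runnable> queue = new ArrayDeque<>();

        /** Runs the job inline when an outer transaction is active, else queues it. */
        void dispatch(Runnable job, boolean outerTxActive) {
            if (outerTxActive) {
                job.run(); // same thread, same transaction, no persistence
            } else {
                queue.add(job); // normal path: wait for the commit
            }
        }

        int queued() { return queue.size(); }

        /** Called once the scheduling transaction has committed. */
        void onCommit() {
            Runnable job;
            while ((job = queue.poll()) != null) job.run();
        }
    }

    public static void main(String[] args) {
        JobDispatcher dispatcher = new JobDispatcher();
        List<String> trace = new ArrayList<>();

        dispatcher.dispatch(() -> trace.add("immediate"), true);
        dispatcher.dispatch(() -> trace.add("deferred"), false);
        System.out.println(trace + " queued=" + dispatcher.queued());

        dispatcher.onCommit();
        System.out.println(trace + " queued=" + dispatcher.queued());
    }
}
```

The trade-off is that the inline path gives up the scheduler's ability to persist a job and hand it to another thread or node.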
This approach also better fits the all-or-nothing approach that should be associated with an SCA service invocation, rather than the incremental step by step approach of a BPEL engine - which ultimately needs to respond back to a waiting client anyway.
The open question is how this fits with the clustering/failover/load balancing capabilities that would be required in the future, and that are provided with RiftSaw 2 when running in the app server directly.
Transaction boundaries
When performing a set of business activities within the same switchyard app invocation, where multiple BPEL processes and/or transactional resources (dbs, messaging, etc) are involved, we need to coordinate these activities in the scope of a single transaction. With the recent changes, ODE will only attempt to start a transaction if one does not exist, but ideally should there be a higher level way to indicate that a binding should start a transaction?
The other issue that needs to be addressed is ensuring that the BPEL component hooks into the appropriate transaction manager, especially when switchyard is deployed to jbossas.
Clustering and Failover
In RiftSaw2, the clustering and failover is based on distribution of requests to a set of servers all configured with the same set of BPEL processes, and using the same relational database under the covers. The mechanism revolves around the job scheduling mechanism, to manage jobs associated with nodes in the cluster, and relocate those jobs when notified that a node is no longer available.
In the Switchyard integration, with the modifications discussed above, the use of jobs and the scheduler is no longer relevant in this situation. Instead the BPEL process execution is simply one part of the processing in the pipeline associated with the invocation of an SCA application/service.
As long as the transaction boundary of an invocation covers the lifecycle of the invoke, and can therefore rollback all activity performed within its scope, the clustering/failover becomes an issue for switchyard to handle. If a service request fails, e.g. due to a node failure, it should be possible to re-issue the request to switchyard (running on a different node) and have that request run successfully, assuming that the BPEL component on both of the servers is configured to use the same database.
Basic Scheduler
Although the invocation approach mentioned above no longer uses the scheduler, there are still cases when a scheduler may be required, such as with wait states.
However when a BPEL process is used in the context of an SCA app, each of its request/response operations should only exist within the lifetime of a single invocation - for example, a process instance should not be able to wait for 5 hours before returning a response, and expect the switchyard invocation context in which it was called to still be available. So the individual invocations still need to complete within a reasonably short time frame. This is actually no different to the current RiftSaw2 version, as individual request/response operations need to be performed within the timeout window associated with a web service invocation.
We need to consider further test cases, of long running process instances, to see whether any issues may arise, but one case that comes to mind would be:
"A BPEL process is invoked and immediately returns a response, but then invokes another external service (possibly after a wait interval). The initial req/resp would have been handled, and using the scheduler approach the client would receive the response while the engine continued to run the remaining aspects of the process instance. In SCA (and switchyard), all activity would be expected to occur within the scope of the req/resp."