3 Replies Latest reply on Aug 31, 2011 1:32 AM by jeff.yuchang

    Managing BPEL invocation jobs in Switchyard

    objectiser

      When I initially embedded a BPEL process into switchyard the current transaction/scheduling model worked fine. When I added two BPEL processes, with one invoking the other, it also appeared to work fine - until I realised that the BPEL processes were using ODE's internal peer-to-peer communications rather than the SCA service reference to handle the invocation between the separate processes (implemented using BPEL).

       

      Once I had disabled the peer-to-peer communications, so that the invocation was routed back through switchyard to the second service (implementing the second BPEL process), I found that there was an attempt to begin a nested transaction, because the ODE engine expects to handle the outermost transaction, and I guess for efficiency reasons switchyard invokes the second service in the same thread as the first.

       

      Therefore the first ODE change I made was to take into account whether the invocation was being performed in an existing transaction, and if so not attempt to begin, commit or rollback the transaction.
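
      A minimal sketch of that check, assuming a hypothetical SimpleTxManager interface (all names here are illustrative and are not the ODE or JTA API). The work joins the caller's transaction if one is active, and only begins, commits or rolls back a transaction that it started itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a transaction manager; not the ODE or JTA API.
interface SimpleTxManager {
    boolean isActive();
    void begin();
    void commit();
    void rollback();
}

// Records the calls made on it so the behaviour can be inspected.
class RecordingTxManager implements SimpleTxManager {
    final List<String> calls = new ArrayList<>();
    private boolean active;

    RecordingTxManager(boolean alreadyActive) { this.active = alreadyActive; }

    public boolean isActive() { return active; }
    public void begin() { active = true; calls.add("begin"); }
    public void commit() { active = false; calls.add("commit"); }
    public void rollback() { active = false; calls.add("rollback"); }
}

class TxParticipation {
    // Runs the work in the caller's transaction if there is one; otherwise
    // owns a new transaction and completes it itself.
    static void runInTransaction(SimpleTxManager tm, Runnable work) {
        boolean owner = !tm.isActive();
        if (owner) {
            tm.begin();
        }
        try {
            work.run();
        } catch (RuntimeException e) {
            if (owner) {
                tm.rollback(); // only undo a transaction we started
            }
            throw e;
        }
        if (owner) {
            tm.commit();
        }
    }
}
```

      With no outer transaction the recorded calls are begin then commit; with an outer transaction already active, the manager is not touched at all and the caller stays in charge of the outcome.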

       

      This led to the next problem - ODE does not directly invoke the process instance, it places a job onto a scheduler queue which is then handled in a separate thread (or possibly on a separate server if the job is persisted). However the job is only executed when the transaction in which it is scheduled gets committed. To illustrate the problem further, the ODE 'invoke' method has the following structure:

       

      (if no outer txn) start txn

      send request

      (if no outer txn) commit txn

      if (response expected) {

          (if no outer txn) start txn

          receive response

          (if no outer txn) commit txn

      }

       

      If an outer transaction is already active, then the invoke method does not commit the transaction prior to receiving the response, so the job to handle the request is never executed and the response is never received. Eventually a timeout occurs to break the deadlock.
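
      The effect can be reproduced with a small toy model. This is only a sketch under stated assumptions: CommitTimeScheduler and InvokeModel are hypothetical names, the job runs synchronously at commit rather than in a separate thread, and a null return stands in for the point where the real engine would block until the timeout.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model: jobs queued during a transaction only run at commit.
class CommitTimeScheduler {
    private final Queue<Runnable> pending = new ArrayDeque<>();

    void schedule(Runnable job) {
        pending.add(job);
    }

    // Called when the enclosing transaction commits: releases queued jobs.
    void commit() {
        Runnable job;
        while ((job = pending.poll()) != null) {
            job.run();
        }
    }
}

class InvokeModel {
    private final CommitTimeScheduler scheduler = new CommitTimeScheduler();
    private String response; // set by the job that handles the request

    // Returns what the caller sees at the "receive response" step, or null
    // if the job has not run yet (the real engine would wait and time out).
    String invoke(String request, boolean outerTxnActive) {
        response = null;
        // "send request": the work is queued, not executed.
        scheduler.schedule(() -> response = "reply to " + request);
        if (!outerTxnActive) {
            scheduler.commit(); // "(if no outer txn) commit txn": job runs
        }
        return response;
    }
}
```

      Calling invoke("ping", false) yields a reply, while invoke("ping", true) yields null because nothing commits between the send and the receive.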

       

      To overcome this problem I did some further ODE modifications to enable the job to be executed immediately, bypassing the scheduler. This enables the first and second BPEL process instances to be executed within the same transaction, avoiding any unnecessary transaction, persistence and job scheduling - so should be more efficient.

       

      This approach also better fits the all-or-nothing approach that should be associated with an SCA service invocation, rather than the incremental step by step approach of a BPEL engine - which ultimately needs to respond back to a waiting client anyway.

       

      The issue is how this fits with the clustering/failover/load balancing capabilities that would be required in the future, and that are provided with RiftSaw 2 when running in the app server directly.

       

      Transaction boundaries

       

      When performing a set of business activities within the same switchyard app invocation, where multiple BPEL processes and/or transactional resources (databases, messaging, etc.) are involved, we need to coordinate these activities in the scope of a single transaction. With the recent changes, ODE will only attempt to start a transaction if one does not exist, but ideally there should be a higher level way to indicate that a binding should start a transaction?

       

      The other issue that needs to be addressed is ensuring that the BPEL component hooks into the appropriate transaction manager, especially when switchyard is deployed to jbossas.

       

       

      Clustering and Failover

       

      In RiftSaw2, the clustering and failover is based on distribution of requests to a set of servers all configured with the same set of BPEL processes, and using the same relational database under the covers. The mechanism revolves around the job scheduling mechanism, to manage jobs associated with nodes in the cluster, and relocate those jobs when notified that a node is no longer available.

       

      In the Switchyard integration, with the modifications discussed above, the use of jobs and the scheduler is no longer relevant in this situation. Therefore the BPEL process execution is simply one part of the processing in the pipeline associated with the invocation of an SCA application/service.

       

      As long as the transaction boundary of an invocation covers the lifecycle of the invoke, and can therefore rollback all activity performed within its scope, the clustering/failover becomes an issue for switchyard. If a service request fails, e.g. due to a node failure, it should be possible to re-issue the request to switchyard (running on a different node) and have that request run successfully, assuming that the BPEL component on both of the servers is configured to use the same database.

       

       

      Basic Scheduler

       

      Although the invocation approach mentioned above no longer utilises the scheduler, there are still cases when a scheduler may be required, such as with wait states.

       

      However when a BPEL process is used in the context of an SCA app, its individual invocations should only exist within the lifetime of a single invocation - for example, a process instance should not be able to wait for 5 hours before returning a response, and expect the switchyard invocation context in which it was called to still be available. So the individual invocations still need to exist within a reasonably short time frame. This is actually no different to the current Riftsaw2 version, as individual request/response operations need to be performed within the timeout window associated with a web service invocation.

       

      We need to consider further test cases, of long running process instances, to see whether any issues may arise, but one case that comes to mind would be:

       

      "A BPEL process is invoked and immediately returns a response, but then invokes another external service (possibly after a wait interval). The initial req/resp would have been handled, and using the scheduler approach the client would receive the response while the engine continued to run the remaining aspects of the process instance. In SCA (and switchyard), all activity would be expected to occur within the scope of the req/resp."

        • 1. Re: Managing BPEL invocation jobs in Switchyard
          jeff.yuchang

          Firstly, I have to say that the transaction usage in ODE is an ugly part. It is not really easy to understand or maintain.

           

          Back to your question. Given the way that you are updating the current transaction code, I was wondering whether it would be doable to suspend the outer transaction, start a new transaction, and then resume the outer one after the inner one has finished.
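
          As a rough illustration of that idea: JTA's TransactionManager offers suspend() and resume(Transaction) for this purpose. The sketch below does not use JTA itself; ToyTxManager and SuspendingInvoker are hypothetical stand-ins that identify transactions by name, so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical thread-bound transaction manager, loosely modelled on the
// suspend/resume style of JTA; not the ODE or JTA API.
class ToyTxManager {
    final List<String> log = new ArrayList<>();
    private String current; // name of the transaction bound to this thread

    void begin(String name) {
        if (current != null) {
            throw new IllegalStateException("nested transaction not supported");
        }
        current = name;
        log.add("begin " + name);
    }

    void commit() {
        log.add("commit " + current);
        current = null;
    }

    void rollback() {
        log.add("rollback " + current);
        current = null;
    }

    // Detaches the current transaction (if any) and returns it.
    String suspend() {
        String suspended = current;
        current = null;
        if (suspended != null) {
            log.add("suspend " + suspended);
        }
        return suspended;
    }

    void resume(String suspended) {
        current = suspended;
        log.add("resume " + suspended);
    }
}

class SuspendingInvoker {
    // Runs the work in its own transaction, parking any outer transaction
    // for the duration and restoring it afterwards.
    static void runInNewTransaction(ToyTxManager tm, Runnable work) {
        String outer = tm.suspend();
        try {
            tm.begin("inner");
            try {
                work.run();
                tm.commit();
            } catch (RuntimeException e) {
                tm.rollback();
                throw e;
            }
        } finally {
            if (outer != null) {
                tm.resume(outer);
            }
        }
    }
}
```

          One consequence of this design is that the inner transaction commits independently, so a later rollback of the outer transaction would not undo the inner work.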

           

          As you said, the Job and Scheduler are very important for the failover and clustering features. They are also necessary for activities such as the Wait activity. Below are the job types that are used within ODE.

           

              public enum JobType {

                  TIMER,

                  RESUME,

                  INVOKE_INTERNAL,

                  INVOKE_RESPONSE,

                  MATCHER,

                  INVOKE_CHECK

              }

           

          So, if we remove the INVOKE_INTERNAL type, because we bypass it to invoke the process instances directly, we'll also need to take care of the other job types. I don't have a detailed analysis of these jobs' usage or scenarios yet, but if we plan to bypass the scheduler, we'll need to spend some time checking which ones to keep and which to bypass.

           

          Gary Brown wrote:

           

          Clustering and Failover

           

          In the Switchyard integration, with the modifications discussed above, the use of jobs and the scheduler is no longer relevant in this situation. Therefore the BPEL process execution is simply one part of the processing in the pipeline associated with the invocation of an SCA application/service.

           

          As long as the transaction boundary of an invocation covers the lifecycle of the invoke, and can therefore rollback all activity performed within its scope, the clustering/failover becomes an issue for switchyard. If a service request fails, e.g. due to a node failure, it should be possible to re-issue the request to switchyard (running on a different node) and have that request run successfully, assuming that the BPEL component on both of the servers is configured to use the same database.

           

          I am afraid this approach might not work well for the async process cases, i.e. long-running processes. For example, suppose a process execution takes 3 hours to complete. The SCA container/switchyard won't wait for the result to come back, but will commit the transaction first. Once the process completes, it will use a callback in the SCA container to pass the result back. Now say this particular process has a Wait activity in its definition: if the node happens to crash, and we don't have our current failover feature (using the job and scheduler), this process execution would never finish, is that right? I am not exactly sure if this particular scenario is what you described in the following paragraph:

          Gary Brown wrote:

           

          "A BPEL process is invoked and immediately returns a response, but then invokes another external service (possibly after a wait interval). The initial req/resp would have been handled, and using the scheduler approach the client would receive the response while the engine continued to run the remaining aspects of the process instance. In SCA (and switchyard), all activity would be expected to occur within the scope of the req/resp."

           

          In my heart, I still like the Job and Scheduler idea. Although it might not look efficient, it does increase concurrency and reliability. What we need to do is think about how to deal with the transaction usage here. I'll need more time to think about this.

           

          Gary Brown wrote:

           

          However when a BPEL process is used in the context of an SCA app, its individual invocations should only exist within the lifetime of a single invocation - for example, a process instance should not be able to wait for 5 hours before returning a response, and expect the switchyard invocation context in which it was called to still be available. So the individual invocations still need to exist within a reasonably short time frame. This is actually no different to the current Riftsaw2 version, as individual request/response operations need to be performed within the timeout window associated with a web service invocation.

          I checked some SCA materials today. I only look at these specs/materials on and off, so if I have made some obvious mistakes, please correct me.

           

          The BPEL component spec for SCA has a mechanism for long-running processes. It is called bidirectional, and is simply a callback. However, I've checked the Tuscany and Fabric3 containers, and neither of them has this feature. I couldn't even find a BPEL component for Fabric3 in its documentation.

           

          Below are some limitations of the BPEL component in Tuscany 1.x.

           

          The lack of support for these extensions means that the following features aren’t supported:

          ■ Properties

          ■ Multivalued references

          The Tuscany runtime also lacks support for the following features:

          ■ interface.partnerLinkType

          ■ Local partner links

          ■ Conversations

          ■ Callbacks

          ■ Faults across references

          ■ Implementation policies

           

          We should definitely go a similar way (i.e. support a subset of the features/spec in our first integration release), but I am just thinking that the jobs and scheduler might still be needed if we need to implement the above (e.g. the advanced features).

          • 2. Re: Managing BPEL invocation jobs in Switchyard
            objectiser

            All of this will need to be verified with suitable examples, but I think the current 'direct invocation' of the INVOKE_INTERNAL job type, and the use of the Job/Scheduler for other job types, would work (and is what is implemented at the moment).

             

            For example, if a BPEL process is invoked within an existing transaction (as part of a req/resp), then this will bypass the scheduler and the response will need to be returned within the appropriate timeout period. However if the process then continues to perform a wait activity, this will schedule a job which can then be activated in a separate transaction at a later date. We just might want to discourage this style of BPEL process (within an SCA/switchyard context).
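
            A minimal sketch of that routing rule, using the job types listed earlier in this thread. JobRouter is a hypothetical name and not an ODE class, and sending INVOKE_INTERNAL to the scheduler when no outer transaction is active is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Job types as listed earlier in the thread.
enum JobType { TIMER, RESUME, INVOKE_INTERNAL, INVOKE_RESPONSE, MATCHER, INVOKE_CHECK }

// Hypothetical router; it only records where each job would go.
class JobRouter {
    final List<JobType> executedDirectly = new ArrayList<>();
    final List<JobType> scheduled = new ArrayList<>();

    // INVOKE_INTERNAL runs inline when the caller's transaction is active;
    // every other job (e.g. a TIMER for a wait activity) is scheduled and
    // would run later in its own transaction.
    void submit(JobType type, boolean outerTxnActive) {
        if (type == JobType.INVOKE_INTERNAL && outerTxnActive) {
            executedDirectly.add(type);
        } else {
            scheduled.add(type);
        }
    }
}
```

            For a process invoked inside an existing transaction that then waits, submit(JobType.INVOKE_INTERNAL, true) is recorded as a direct execution and submit(JobType.TIMER, true) as a scheduled job.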

             

            However the callback case would need to be handled separately at a later date, as it would need to integrate with the callback facility in SCA/switchyard.

            • 3. Re: Managing BPEL invocation jobs in Switchyard
              jeff.yuchang

              Agreed that we need to have more examples to verify, and I think it should work with the current approach, but as you said, we will need to have this documented and discourage users from having long-running processes with our 'direct invocation' mechanism.

               

              The other thing is that it may be worth asking the switchyard team whether it is possible to have it run in a separate thread (i.e. whether there is a configuration option that we can use).