9 Replies Latest reply on Dec 12, 2008 9:58 AM by objectiser

    Problems in CDL-M1 sample app

    bernd.koecke

      Hello,

      First, it's great to see an M1 release :).

      I played around a little with the purchasing sample app and ran into a blocked choreography. I think two things caused this. My environment is:

      - JBossAS 4.2.3 standalone and on a two node cluster
      - JBossMessaging 1.4.0.SP3
      - JBossESB 4.4
      - CDL-M1
      - MySQL 5.0

      It seems that there is a race condition in the join of the ParallelAction. Normally I see two lines in the server.log which together decrement the pathCount to "0". But sometimes there are two lines which both count down only to "1". The events/messages are consumed and the result is a dead active session in the database; the process doesn't continue and the client gets a delivery exception once the timeout is reached.

      After this, no successful calls to the process are possible. The client sets the id to 'id="5"', and every call generates a new session in the database. The creditAgency is called, but when the result is returned the purchaseApp searches for an active session with 'id="5"'. The first match is the old dead one, so it is selected. Then the process checks whether the session has the right combination of category/name. But the selected dead session has the wrong combination and the process stops with an exception. You have to delete all data of the dead session from the database before client calls succeed again.

      The search for a perfect match can cause a kind of table scan, so this might not be a good idea. I remember that the id-string should be a unique identifier. In the sample app it is always the same string, so in production the problem will not arise very often. But it can happen: consider a process which handles customer data, where the id is the customerNo. This process fails, and later the process is called again for this customer with the same customerNo as id.

      It may be that the race condition is caused by my MySQL datasource configuration, but I got dead sessions from other failures in my processes, too.

      In summary, the blocked process is caused by a dead active session with a non-unique id in combination with the session selection algorithm.
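The selection problem summarised above can be replayed with a small in-memory model. This is only a sketch under stated assumptions: the class `SessionLookupDemo`, its fields, and the category/name values are hypothetical and are not taken from the CDL runtime. It assumes the lookup picks the first active session with the given id and only afterwards checks category/name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Illustrative model only; all names and values here are hypothetical. */
final class SessionLookupDemo {

    /** Minimal stand-in for a stored session row. */
    static final class Session {
        final String id;
        final String category;
        final String name;
        final boolean active;

        Session(String id, String category, String name, boolean active) {
            this.id = id;
            this.category = category;
            this.name = name;
            this.active = active;
        }
    }

    /** Selects the first active session with the given id, ignoring category/name. */
    static Optional<Session> findFirstActive(List<Session> store, String id) {
        return store.stream()
                .filter(s -> s.active && s.id.equals(id))
                .findFirst();
    }

    /** Checks the category/name combination only after a session has been selected. */
    static boolean deliverReply(List<Session> store, String id, String category, String name) {
        Optional<Session> selected = findFirstActive(store, id);
        return selected.isPresent()
                && selected.get().category.equals(category)
                && selected.get().name.equals(name);
    }

    public static void main(String[] args) {
        Session dead = new Session("5", "categoryA", "stuckAtJoin", true);
        Session fresh = new Session("5", "categoryA", "awaitingCreditCheck", true);

        // The dead session is found first and shadows the fresh one.
        System.out.println(deliverReply(Arrays.asList(dead, fresh), "5", "categoryA", "awaitingCreditCheck"));
        // After the dead row is removed, the same reply is accepted.
        System.out.println(deliverReply(Arrays.asList(fresh), "5", "categoryA", "awaitingCreditCheck"));
    }
}
```

In this model the reply is rejected as long as the dead row is present and accepted once it has been removed, which matches the manual cleanup described above.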

      Has anybody seen similar problems?

      Regards,
      Bernd

        • 1. Re: Problems in CDL-M1 sample app
          objectiser

          Hi Bernd

          Sounds like the initial problem is caused by a db txn isolation problem. Concurrent updates to the parallel thread count are occurring, with the result that both threads save the same remaining value back to the stored session.
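The lost update described here can be replayed deterministically. The sketch below is not the CDL runtime code: `JoinCounterDemo` and `SessionRow` are hypothetical names, and the interleaving is written out by hand instead of relying on real threads, so that the outcome is the same on every run.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Deterministic replay of a lost update on a join counter (hypothetical names). */
final class JoinCounterDemo {

    /** Stand-in for the stored session; only the remaining path count matters here. */
    static final class SessionRow {
        int pathCount;

        SessionRow(int pathCount) {
            this.pathCount = pathCount;
        }
    }

    /** Both paths read before either writes, so one decrement is lost. */
    static int interleavedReadModifyWrite() {
        SessionRow row = new SessionRow(2);
        int seenByPathA = row.pathCount;   // path A reads 2
        int seenByPathB = row.pathCount;   // path B reads 2 before A has written
        row.pathCount = seenByPathA - 1;   // A writes 1
        row.pathCount = seenByPathB - 1;   // B also writes 1, overwriting A's update
        return row.pathCount;
    }

    /** Each decrement is applied atomically, so the counter reaches zero. */
    static int atomicDecrement() {
        AtomicInteger pathCount = new AtomicInteger(2);
        pathCount.decrementAndGet();
        return pathCount.decrementAndGet();
    }

    public static void main(String[] args) {
        System.out.println("interleaved: " + interleavedReadModifyWrite());
        System.out.println("atomic:      " + atomicDecrement());
    }
}
```

The interleaved variant ends at 1, the situation reported from the server.log, while the atomic variant ends at 0 and would let the join complete.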

          I know that Jeff was doing some work on the MySQL hibernate configuration - so possibly best if he investigates this part.

          However, your subsequent problems (as you have identified) are because the same identity is used. This is why it is important when building a choreography to understand what are the unique identities for each independent business transaction - so as you say in production this should not be a problem.

          In the example you gave - a customer number typically would not be unique enough as an identifier - as it is possible that a customer may want to perform multiple transactions concurrently. The best approach is not to think about what is a unique id at any particular point in time, but what id is unique over time.

          So in the case of a customer placing an order, really you need a unique order number - which can be used in conjunction with the customer number.

          Identifying the best way to tidy up sessions that have failed (e.g. due to out of sequence messages) is an issue that still needs to be dealt with. If a failure occurs as part of the business process, it can be modelled in the choreography and results in the session moving to a conclusion.

          However if the infrastructure detects that a system is not correctly executing according to the defined choreography, it can actively block the message being delivered (and possibly cause the problem to result in an error on the client side if sync invocation), but that may leave the session at the service in a limbo state.

          If you have any thoughts on how this could be dealt with then I would be interested in your views.

          Regards
          Gary

          • 2. Re: Problems in CDL-M1 sample app
            jeff.yuchang

            Hi Bernd,

            Firstly, thanks for trying it out. ;-)


            "bernd.koecke" wrote:


            It seems that there is a race condition in the join of the ParallelAction. Normally I see two lines in the server.log which together decrement the pathCount to "0". But sometimes there are two lines which both count down only to "1". The events/messages are consumed and the result is a dead active session in the database; the process doesn't continue and the client gets a delivery exception once the timeout is reached.



            Yeah, this is an issue I was trying to resolve with the database's locking, but that work wasn't finished for the M1 release. I mostly tested against the HSQL in-memory database for M1, since it is very simple, but it seems to me that it doesn't support pessimistic locking very well. Sorry for not stating this explicitly in the release notes. I have just opened a JIRA (https://jira.jboss.org/jira/browse/SOAG-73) for this.

            "bernd.koecke" wrote:


            After this, no successful calls to the process are possible. The client sets the id to 'id="5"', and every call generates a new session in the database. The creditAgency is called, but when the result is returned the purchaseApp searches for an active session with 'id="5"'. The first match is the old dead one, so it is selected. Then the process checks whether the session has the right combination of category/name. But the selected dead session has the wrong combination and the process stops with an exception. You have to delete all data of the dead session from the database before client calls succeed again.


            Yes, exactly right. Sorry for the pain.

            "bernd.koecke" wrote:


            The search for a perfect match can cause a kind of table scan, so this might not be a good idea. I remember that the id-string should be a unique identifier. In the sample app it is always the same string, so in production the problem will not arise very often. But it can happen: consider a process which handles customer data, where the id is the customerNo. This process fails, and later the process is called again for this customer with the same customerNo as id.

            It may be that the race condition is caused by my MySQL datasource configuration, but I got dead sessions from other failures in my processes, too.

            In summary, the blocked process is caused by a dead active session with a non-unique id in combination with the session selection algorithm.

            Regards,
            Bernd


            As I said above, I was thinking about using the database's pessimistic lock, which means only one thread at a time is able to update the row. If you have any other solutions in mind, I would be keen to know. ;-)

            Thanks
            Jeff




            • 3. Re: Problems in CDL-M1 sample app
              bernd.koecke

              Hi Gary,

              "objectiser" wrote:
              Hi Bernd

              Sounds like the initial problem is caused by a db txn isolation problem. Concurrent updates to the parallel thread count are occurring, with the result that both threads save the same remaining value back to the stored session.

              I know that Jeff was doing some work on the MySQL hibernate configuration - so possibly best if he investigates this part.


              I'll have a look at my database config, too. I remember that I had a similar problem with MySQL and Hibernate in the past. The cause was concurrent reads combined with exclusive writes. We solved it by telling Hibernate to use "select for update" to read the counter. This requests an exclusive write lock as part of the select.
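The effect of reading the counter under an exclusive lock can be shown with an in-memory analogy. This is a sketch only and uses neither Hibernate nor a database: the hypothetical `SessionRow` guards its counter with a `ReentrantLock`, which plays the role that the row lock taken by "select for update" would play inside a transaction.

```java
import java.util.concurrent.locks.ReentrantLock;

/** In-memory analogy for a pessimistic row lock; all names are hypothetical. */
final class LockedJoinDemo {

    static final class SessionRow {
        private final ReentrantLock rowLock = new ReentrantLock();
        private int pathCount;

        SessionRow(int pathCount) {
            this.pathCount = pathCount;
        }

        /** Read, decrement, and write back while holding the lock. */
        int completePath() {
            rowLock.lock();
            try {
                int remaining = pathCount - 1;
                pathCount = remaining;
                return remaining;
            } finally {
                rowLock.unlock();
            }
        }

        /** Reads the current count under the same lock. */
        int remaining() {
            rowLock.lock();
            try {
                return pathCount;
            } finally {
                rowLock.unlock();
            }
        }
    }

    /** Runs two parallel paths against one row and returns the final count. */
    static int runTwoPaths() {
        SessionRow row = new SessionRow(2);
        Thread pathA = new Thread(() -> row.completePath());
        Thread pathB = new Thread(() -> row.completePath());
        pathA.start();
        pathB.start();
        try {
            pathA.join();
            pathB.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for paths", e);
        }
        return row.remaining();
    }

    public static void main(String[] args) {
        System.out.println("remaining paths: " + runTwoPaths());
    }
}
```

Because the read and the write happen inside one critical section, neither path can observe a stale count. Whether a real database takes a row-level or a coarser lock for the equivalent statement depends on the vendor and storage engine.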

              "objectiser" wrote:

              However, your subsequent problems (as you have identified) are because the same identity is used. This is why it is important when building a choreography to understand what are the unique identities for each independent business transaction - so as you say in production this should not be a problem.

              In the example you gave - a customer number typically would not be unique enough as an identifier - as it is possible that a customer may want to perform multiple transactions concurrently. The best approach is not to think about what is a unique id at any particular point in time, but what id is unique over time.

              So in the case of a customer placing an order, really you need a unique order number - which can be used in conjunction with the customer number.


              I think it should be an id which identifies the process instance, because a customer's order could be used in several processes which store their data in the same database. JBossESB has a correlation id, but it would not be a good idea to rely on an id from the underlying implementation. And it must be set when the entry service is called. This means the client must/should define the id, not the ESB. Do I understand it right when I make the following definition: The developer of the choreography must define an id which is unique enough for the whole choreography system?

              "objectiser" wrote:

              Identifying the best way to tidy up sessions that have failed (e.g. due to out of sequence messages) is an issue that still needs to be dealt with. If a failure occurs as part of the business process, it can be modelled in the choreography and results in the session moving to a conclusion.

              However if the infrastructure detects that a system is not correctly executing according to the defined choreography, it can actively block the message being delivered (and possibly cause the problem to result in an error on the client side if sync invocation), but that may leave the session at the service in a limbo state.

              If you have any thoughts on how this could be dealt with then I would be interested in your views.


              I think this is very difficult. Let's say we use an id which is unique enough. Then our process works well, but the dead entries still clutter the database and the affected customer never gets a result. A first step would be to add a timestamp of the last session update to the database schema. But without external knowledge you can't say at what age a session is dead. This depends on the process itself, on the load of the underlying machine, etc. This is where Service/Business Activity Monitoring comes into play: there you can define thresholds and notifications. I think deleting entries without an admin in between is dangerous.
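The timestamp idea could look roughly like the following. It is a sketch under assumptions: `SessionRecord`, the `lastUpdated` field, and the threshold are hypothetical, and the code only reports candidates for an admin to review; it never deletes anything.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Flags sessions whose last update is older than a threshold (hypothetical names). */
final class StaleSessionReport {

    /** Minimal stand-in for a session row with a last-update timestamp. */
    static final class SessionRecord {
        final String id;
        final Instant lastUpdated;

        SessionRecord(String id, Instant lastUpdated) {
            this.id = id;
            this.lastUpdated = lastUpdated;
        }
    }

    /** Returns the ids of sessions idle for longer than the threshold; nothing is deleted. */
    static List<String> findStale(List<SessionRecord> sessions, Instant now, Duration threshold) {
        List<String> stale = new ArrayList<>();
        for (SessionRecord session : sessions) {
            Duration idle = Duration.between(session.lastUpdated, now);
            if (idle.compareTo(threshold) > 0) {
                stale.add(session.id);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2008-12-12T09:00:00Z");
        List<SessionRecord> sessions = new ArrayList<>();
        sessions.add(new SessionRecord("order-1", Instant.parse("2008-12-12T08:59:00Z")));
        sessions.add(new SessionRecord("order-2", Instant.parse("2008-12-12T07:00:00Z")));

        // The threshold is process-specific; 30 minutes is an arbitrary example value.
        System.out.println(findStale(sessions, now, Duration.ofMinutes(30)));
    }
}
```

The threshold is passed in by the caller because, as noted above, the age at which a session counts as dead depends on the process and on the load of the machine.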

              Regards,
              Bernd

              • 4. Re: Problems in CDL-M1 sample app
                bernd.koecke

                Hi Jeff,

                "jeff.yuchang" wrote:
                Hi Bernd,

                Firstly, thanks for trying it out. ;-)


                It was really fun :).

                "jeff.yuchang" wrote:

                As I said above, I was thinking about using the database's pessimistic lock, which means only one thread at a time is able to update the row. If you have any other solutions in mind, I would be keen to know. ;-)


                I did the same in the earlier project. I think I used a Criteria object and set the LockMode to update. Then an exclusive write lock was requested for the select. But I don't know what I had to do to get row-level locking instead of page- or table-level locking. This may depend on the underlying database vendor and version :(.

                Regards,
                Bernd




                • 5. Re: Problems in CDL-M1 sample app
                  jeff.yuchang

                   

                  "bernd.koecke" wrote:
                  Hi Gary,

                  "objectiser" wrote:
                  Hi Bernd

                  Sounds like the initial problem is caused by a db txn isolation problem. Concurrent updates to the parallel thread count are occurring, with the result that both threads save the same remaining value back to the stored session.

                  I know that Jeff was doing some work on the MySQL hibernate configuration - so possibly best if he investigates this part.


                  I'll have a look at my database config, too. I remember that I had a similar problem with MySQL and Hibernate in the past. The cause was concurrent reads combined with exclusive writes. We solved it by telling Hibernate to use "select for update" to read the counter. This requests an exclusive write lock as part of the select.

                  Regards,
                  Bernd


                  Yes, that's the solution I tried last time, but I had no luck getting it to work. It might be due to the HSQL database. I remember that I also tried it with MySQL, but still had no luck at the time, which is why I left it there.

                  Regarding the id uniqueness concern, we presumed that each process has its own unique id that we can rely on.

                  • 6. Re: Problems in CDL-M1 sample app
                    jeff.yuchang

                     

                    "bernd.koecke" wrote:
                    Hi Jeff,

                    "jeff.yuchang" wrote:
                    Hi Bernd,

                    Firstly, thanks for trying it out. ;-)


                    It was really fun :).

                    "jeff.yuchang" wrote:

                    As I said above, I was thinking about using the database's pessimistic lock, which means only one thread at a time is able to update the row. If you have any other solutions in mind, I would be keen to know. ;-)


                    I did the same in the earlier project. I think I used a Criteria object and set the LockMode to update. Then an exclusive write lock was requested for the select. But I don't know what I had to do to get row-level locking instead of page- or table-level locking. This may depend on the underlying database vendor and version :(.

                    Regards,
                    Bernd




                    • 7. Re: Problems in CDL-M1 sample app
                      jeff.yuchang

                      Sorry, not sure why my response wasn't shown in the latest comment? ;-)

                      Anyway, what I said was that when I referred to the 'pessimistic lock', I assumed it would normally be a table-level lock, because I think it uses 'select * for update' SQL under the covers. I will try it later.

                      Thanks
                      Jeff

                      • 8. Re: Problems in CDL-M1 sample app
                        bernd.koecke

                        Hello Jeff,

                        I think we were too fast for the system and sent the replies at the same time ;).

                        Regards
                        Bernd

                        • 9. Re: Problems in CDL-M1 sample app
                          objectiser

                          Hi Bernd

                          Do I understand it right when I make the following definition: The developer of the choreography must define an id which is unique enough for the whole choreography system?


                          The simple answer is yes, however it does not necessarily mean the same id has to be used throughout the complete choreography. CDL enables identity chains to be established - similar to primary and alternate ids in a database - as long as the association between multiple ids is established during the conversation, then they will all be related to the same choreography session instance.
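One way to picture an identity chain is a lookup table in which several ids resolve to the same session. The sketch below is an illustration only and does not reflect how CDL or its runtime implements this: `IdentityChainDemo`, its methods, and the example ids are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of identity chaining: several ids resolve to one session (hypothetical names). */
final class IdentityChainDemo {

    private final Map<String, String> idToSession = new HashMap<>();

    /** Registers the primary id when a session starts. */
    void start(String primaryId, String sessionKey) {
        idToSession.put(primaryId, sessionKey);
    }

    /** Links a new id to the session of an already known id; returns false if the known id is unmapped. */
    boolean associate(String knownId, String newId) {
        String sessionKey = idToSession.get(knownId);
        if (sessionKey == null) {
            return false;
        }
        idToSession.put(newId, sessionKey);
        return true;
    }

    /** Returns the session key for an id, or null if the id was never associated. */
    String resolve(String id) {
        return idToSession.get(id);
    }

    public static void main(String[] args) {
        IdentityChainDemo chain = new IdentityChainDemo();
        chain.start("order-42", "session-A");
        // A later message in the conversation carries both ids, establishing the association.
        chain.associate("order-42", "invoice-7");
        System.out.println(chain.resolve("invoice-7"));
    }
}
```

In this model a message that carries only the later id still resolves to the original session, provided the association was established earlier in the conversation.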

                          I agree with your point about an admin needing to be involved. I have been trying to consider automated approaches, but they depend on the implementation technology and usually still require some user input (either to tidy up the session or alternatively to fix and resubmit the erroneous message). It's possible that some of this information could be added as annotations to the choreography, but that needs more consideration.

                          Regards
                          Gary