19 Replies Latest reply on Mar 11, 2008 10:29 AM by timfox

    messages stuck in queues

    martin.wickus

      we upgraded from the following:

      JBoss Messaging 1.4.0.GA
      JBoss Remoting 2.2.2.SP1

      to:

      JBoss Messaging 1.4.0.SP3
      JBoss Remoting 2.2.2.SP4

      Suddenly, under load we are getting messages stuck in queues.

        • 1. Re: messages stuck in queues
          martin.wickus

          exact builds using:

          JBM: Specification-Version: 1.4.0.SP3
          Implementation-Version: 1.4.0.SP3 (build:CVSTag=JBossMessaging_1_4_0_ SP3_CP01 date=200712141423)

          JBR:
          Specification-Version: 2.2.2.SP4
          Implementation-Version: 4.3.0.GA (build: VNTag=JBPAPP_4_3_0_GA date=200801031548)

          I can see my consumer count is 1, so a consumer is definitely active. The queue also has an expiry queue configured, so it is curious that my messages aren't being expired after the 60 seconds I configured for them.

          The messages are all non-persistent.



          • 2. Re: messages stuck in queues
            timfox

            Hello Wickus-

            Can you ensure that you upgraded to remoting 2.2.2.SP4 on both client and server side?

            We had a few issues reported that sounded like this when remoting wasn't upgraded to 2.2.2.SP4 everywhere in the system. Remoting 2.2.2.SP4 is not compatible with earlier versions.
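
            One way to compare the client and server copies of a jar is to read the version attributes straight from each jar's manifest. The following is a minimal sketch using only the standard `java.util.jar` API; the class name `JarVersionCheck` and the idea of passing jar paths on the command line are illustrative assumptions, not part of JBM or JBoss Remoting.

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper: prints the version attributes recorded in a jar's manifest.
class JarVersionCheck {

    /** Returns "Specification-Version / Implementation-Version" for the given jar. */
    static String describe(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Manifest mf = jar.getManifest();
            if (mf == null) {
                return "no manifest";
            }
            Attributes main = mf.getMainAttributes();
            String spec = main.getValue(Attributes.Name.SPECIFICATION_VERSION);
            String impl = main.getValue(Attributes.Name.IMPLEMENTATION_VERSION);
            return spec + " / " + impl;
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass every jar you want to compare, e.g. the client-side and
        // server-side copies of jboss-remoting.jar.
        for (String path : args) {
            System.out.println(path + " -> " + describe(path));
        }
    }
}
```

            Running it against the jar on the server and the jar on each client classpath should show whether the same build is in use on both sides.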

            • 3. Re: messages stuck in queues
              martin.wickus

              Hi. The clients were using the following libraries:

              JBM: Implementation-Version: 4.3.0.GA (build: SVNTag=JBPAPP_4_3_0_GA date=200801031548). This taken from EAP 4.3\client
              JBR: Implementation-Version: 4.3.0.GA (build: SVNTag=JBPAPP_4_3_0_GA date=200801031548). This taken from EAP 4.3\client

              • 4. Re: messages stuck in queues
                martin.wickus

                We rolled back to the old libraries and config files and all is working fine. However, for those following this thread, I would like to point out that I'm not convinced this is a bug in JBM. We might have made a couple of mistakes during our deployment. Email thread posted for public interest:

                Just a bit of background.

                We've been running with EAP 4.2 and JBM 1.4.0.GA for a while. I'm quite aware this is not the environment supported by Red Hat, but this is what we have and we've been on a steady path to becoming 100% compliant: we were running JBoss 4.2.1 GA and ActiveMQ, replaced ActiveMQ with JBM and then took on Red Hat support. At this point we upgraded to JBoss EAP 4.2 (but kept JBM in place). I've been working on a branch running JBoss EAP 4.3 and the default configured JBM 1.4.0.SP3, however this is not yet deployed into production. We had a couple of issues with JBM 1.4.0.GA (or really JBoss Remoting, I should say) which were sorted out by using a slightly modified version of the remoting-bisocket-service.xml that is bundled with JBM 1.4.0.SP3. Please note that at this time we were still using the JBM 1.4.0.GA libraries. However, as JBM 1.4.0.SP3 is the stable release, we heeded a suggestion by Red Hat and decided to upgrade. That's when the problem occurred.

                As already mentioned, I'm going to run a couple of experiments today to see whether I can narrow down the reason for the problem. I'm not yet convinced it is a bug in JBM. My reasons are:

                1. We used the same JBM database schema from 1.4.0.GA. This did not give any problems when we tested in the dev environments. However, after more careful inspection of the SQL in oracle-persistence.sql I noticed a couple of changes were in place (e.g. a new index; the delete_message SQL statement switched the order of its parameters around, which could be a problem if JBM is using positional rather than named parameters; a supports_blob_on_select flag; the composite primary key for JBM_MSG_REF changed the declaration order of its columns, etc). I don't know if this could be the cause of our problem, and the old schema certainly worked fine for us in the development environment, but I'll try the new schema and see whether that makes a difference.

                2. The following customisations were dropped from connection-factories-service.xml
                Setting attribute PrefetchSize to 1000.
                Setting attribute SlowConsumers to false.

                3. Where PostOffice was marked as not clustered before, it was deployed as clustered this time round (default from bundled oracle-persistence-service.xml). I do not anticipate this to be a problem since we are running just a single node.

                4. Using EAP 4.3 build for JBR 2.2.2.SP4 --- Implementation-Version: 4.3.0.GA (build: VNTag=JBPAPP_4_3_0_GA date=200801031548) instead of the BREW library. I reckoned the EAP 4.3 one would be most stable since tested by Red Hat. However, perhaps there are modifications specific to EAP 4.3.

                PS. Clients were using the following libraries:
                JBM: Implementation-Version: 4.3.0.GA (build: SVNTag=JBPAPP_4_3_0_GA date=200801031548). This taken from EAP 4.3\client
                JBR: Implementation-Version: 4.3.0.GA (build: SVNTag=JBPAPP_4_3_0_GA date=200801031548). This taken from EAP 4.3\client


                • 5. Re: messages stuck in queues
                  martin.wickus

                  Tim Fox's reply inline:

                  >

                  >
                  > Just a bit of background.
                  >
                  > We've been running with EAP 4.2 and JBM 1.4.0.GA for a while. I'm quite
                  > aware this is not the environment supported by Red Hat, but this is what
                  > we have and we've been on a steady path to becoming 100% compliant: We
                  > were running JBoss 4.2.1 GA and ActiveMQ, replaced ActiveMQ with JBM and
                  > then took on Red Hat support. At this point we upgraded to JBoss EAP 4.2
                  > (but kept JBM in place). I've been working on a branch running JBoss EAP
                  > 4.3 and the default configured JBM 1.4.0.SP3, however this is not yet
                  > deployed into production. We had a couple of issues with JBM 1.4.0.GA (or
                  > really JBoss Remoting I should say) which were sorted out by using a
                  > slightly modified version of the remoting-bisocket-service.xml that
                  > bundles with JBM 1.4.0.SP3. Please note we were still at this time using
                  > the JBM 1.4.0.GA libraries. However, as JBM 1.4.0.SP3 is the stable
                  > release, we heeded a suggestion by Red Hat and decided to upgrade.
                  > That's when the problem occurred.
                  >
                  > As already mentioned, I'm going to run a couple of experiments today to
                  > see whether I can narrow down the reason for the problem. I'm not yet
                  > convinced it is a bug in JBM. My reasons are:

                  I don't want to speculate too much at this point, but 1.4.0.SP3 is our
                  most highly tested JBM release, having gone through rigorous load and
                  soak tests with our QA department before it was allowed to go in the EAP.
                  So a bug of this magnitude slipping through the net would surprise me,
                  although, of course, we can't rule this out.


                  >
                  > 1. We used the same JBM database schema from 1.4.0.GA.
                  > This did not give
                  > any problems when we tested in the dev environments. However, after more
                  > careful inspection of the SQL in oracle-persistence.sql I noticed a
                  > couple of changes were in place (e.g. new index, delete_message sql
                  > script switched order of parameters around ... this could be a problem if
                  > JBM is not using named parameters, but positional,
                  > supports_blob_on_select flag, composite primary key for JBM_MSG_REF
                  > changed declaration order for composite columns, etc). I don't know if
                  > this could be the cause of our problem and the old schema certainly
                  > worked fine for us in the development environment, but I'll try and use
                  > the new schema and see whether that makes a difference.

                  Yes the schema has changed between GA and SP3. It is critical that the
                  old database is dropped before installing the new version, otherwise all
                  kinds of strange problems might occur.


                  >
                  > 2. The following customisations were dropped from
                  > connection-factories-service.xml
                  > Setting attribute PrefetchSize to 1000.
                  > Setting attribute SlowConsumers to false.

                  This may cause behavioural differences w.r.t message consumption.

                  >
                  > 3. Where PostOffice was marked as not clustered before, it was deployed
                  > as clustered this time round (default from bundled
                  > oracle-persistence-service.xml). Do not anticipate this to be a problem
                  > since we are running just a single node.

                  Best to set clustered = false though if you are running a single node.

                  >
                  > 4. Using EAP 4.3 build for JBR 2.2.2.SP4 --- Implementation-Version:
                  > 4.3.0.GA (build: VNTag=JBPAPP_4_3_0_GA date=200801031548) instead of the
                  > BREW library. I reckoned the EAP 4.3 one would be most stable since
                  > tested by Red Hat. However, perhaps there are modifications specific to
                  > EAP 4.3.

                  Yes, that is a possibility. The EAP versions of a product and the
                  community version of the product can and do diverge sometimes. This is
                  mainly because we can provide bug fixes etc. on the EAP version that are
                  not available on the free version until a later date. Not sure if this
                  applies to those versions of JBR but it's a possibility. To be safe, it's
                  always wise not to mix and match the EAP and community versions.

                  If you want to run JBM 1.4.0.SP3 inside EAP 4.2, you should obtain the
                  JBM jar from the download on the labs site:

                  http://labs.jboss.org/jbossmessaging/downloads/

                  And the JBoss Remoting version should be obtained from here:

                  http://repository.jboss.com/jboss/remoting/2.2.2.SP4-brew/lib/

                  To summarise, in order to upgrade versions, you should follow the
                  following steps:

                  1) Drop the old database
                  2) Obtain the distro and jars from the above URLs.
                  3) Replace jboss-messaging.jar in the app server in the
                  server/messaging/lib directory with the one inside the distro. (assuming
                  you have named your server profile "messaging")
                  4) Replace jboss-remoting.jar in the app server in the
                  server/messaging/lib directory with the one downloaded from the above url.
                  5) Replace all *.xml files in
                  server/messaging/deploy/jboss-messaging.sar/ with their equivalents from
                  the JBM distro you downloaded.
                  6) Re-apply any custom changes (e.g. prefetchSize, slowConsumers etc)
                  that you made in your previous installation to those files. (There have
                  been config changes between GA and SP3)
                  7) If you are using ServiceBindingManager service in JBoss AS, update
                  the JBM remoting configuration section in the SBM config to exactly
                  reflect the new JBR config.
                  8) For every client that connects to JBM, make sure the
                  new jboss-messaging-client.jar and new jboss-remoting.jar that you
                  downloaded are on the client classpath *before* any other jars.

                  As you can see it's a bit fiddly, but simply replacing the jars almost
                  certainly won't work.
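
                  To check that a client really picks up the new jars first (step 8), it can help to ask the JVM where a given class was loaded from. This is only a sketch using the standard `Class.getProtectionDomain()` API; the class name `WhereLoaded` is made up for illustration, and you would pass in the name of any class you know lives in the jar you care about.

```java
import java.security.CodeSource;

// Hypothetical helper: reports the jar or directory a class was loaded from.
class WhereLoaded {

    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // JDK core classes have no code source, so report that case explicitly.
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Arguments are fully qualified class names to look up on the classpath.
        for (String name : args) {
            System.out.println(name + " <- " + locationOf(Class.forName(name)));
        }
    }
}
```

                  If the location printed for a messaging or remoting class points at an old jar, an earlier classpath entry is shadowing the new one.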

                  Alternatively, if you are willing to recreate your server profile, you
                  could just follow the automated install instructions from the user guide. But
                  I'm not sure if you're able to do that.


                  • 6. Re: messages stuck in queues
                    martin.wickus

                    I think I've found a bug in JBM 1.4.0.SP3.

                    I ran JBM 1.4.0.SP3 to reproduce my earlier problem in UAT.

                    I then went through each of the items listed as oversights before, until the problem disappeared. The order was:

                    1. Updated to latest schema. Problem still occurs.
                    2. Changed to non-EAP version of JBM and JBR libraries. Problem still occurs.
                    3. Turned clustering for PostOffice off. Problem still occurs.
                    4. Changed PrefetchSize to 1000 and SlowConsumers to false. Problem fixed.

                    I undid/repeated step 4 a few times and this is definitely my problem area.

                    I then switched to the JBM 1.4.0.GA libraries to see if I can reproduce the problem. I couldn't. It works stably whether SlowConsumers is true or false.

                    This makes me think that between JBM1.4.0.GA and JBM1.4.0.SP3, there must have been a change to the client consumer flow control.

                    I compared the source code for the revisions and noticed there was significant refactoring in org.jboss.jms.client.container.ClientConsumer.

                    Additionally, since the behavior happens only after a period and only under heavy load, this sounded like a threading problem.

                    I can't be sure by looking at the code as I'm not a JBM expert, but I do notice that most of the time when consumeCount gets modified, it is done within the mainLock. However, not always ... so this might cause contention issues. Perhaps consumeCount should be declared as volatile to prevent threads from caching stale local values of the variable.
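
                    To make the concern concrete, here is a small stand-alone sketch of the two usual ways to keep a shared counter consistent across threads. It is not the JBM code: the class `ConsumeCountSketch` and its fields are invented for illustration and only borrow the names mainLock and consumeCount from the discussion above. One caveat worth noting: `volatile` on its own makes writes visible to other threads, but `count++` is still a separate read and write, so an increment outside the lock could still lose updates. An `AtomicInteger`, or taking the lock on every access, avoids that.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only: not taken from org.jboss.jms.client.container.ClientConsumer.
class ConsumeCountSketch {
    private final ReentrantLock mainLock = new ReentrantLock();
    private int lockedCount;                                        // only touched while holding mainLock
    private final AtomicInteger atomicCount = new AtomicInteger();  // safe without any lock

    void incrementUnderLock() {
        mainLock.lock();
        try {
            lockedCount++;
        } finally {
            mainLock.unlock();
        }
    }

    void incrementAtomically() {
        atomicCount.incrementAndGet();
    }

    /** Runs the given number of threads and returns {lockedCount, atomicCount}. */
    static int[] run(int threads, int perThread) throws InterruptedException {
        ConsumeCountSketch s = new ConsumeCountSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    s.incrementUnderLock();
                    s.incrementAtomically();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join(); // join() makes the workers' writes visible to this thread
        }
        return new int[] { s.lockedCount, s.atomicCount.get() };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] result = run(4, 10_000);
        System.out.println("locked=" + result[0] + " atomic=" + result[1]);
    }
}
```

                    Both counters end up equal to threads times perThread because every update goes through the same guard. A counter that is sometimes updated under the lock and sometimes not has no such guarantee.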




                    • 7. Re: messages stuck in queues
                      timfox

                      Wickus -

                      Can you clarify what value of prefetchSize and slowConsumers you were using when you saw the problem?

                      Just to be clear - you're saying that with the default values it's OK, but when you change them to your values you get the problem?

                      • 8. Re: messages stuck in queues
                        martin.wickus

                        No. By default these values were not specified and that's when I get the problem. When I set SlowConsumers to false and PrefetchSize to 1000 the problem did not occur.

                        • 9. Re: messages stuck in queues
                          timfox

                          Sorry, I'm a bit confused.

                          So, you're saying you saw the problem with the default, out of the box values, with no changes?

                          What values do you need for your system - do you need to change the prefetchSize or slowconsumers?

                          • 10. Re: messages stuck in queues
                            martin.wickus

                            That's correct. Under item 2 of my post from Tue Mar 4, 2008 07:55 AM you'll see I mentioned that when we upgraded from JBM 1.4.0.GA to JBM 1.4.0.SP3, we did not bring across our customizations (those two attributes specifically).

                            In other words, under JBM 1.4.0.GA, we'd customized our connection-factories-service.xml to include those attributes. Then when we upgraded to JBM 1.4.0.SP3, we did not apply those customizations again. This is when we experienced the problem. However, after re-adding those customizations, everything works.



                            • 11. Re: messages stuck in queues
                              timfox

                              I want to try and replicate this on this end.

                              Can you give me more information on your setup? Size of messages, number of messages, number of consumers etc etc. so we can replicate something similar? And what I should do to replicate?

                              Looking at the code, the only time consumeCount gets modified outside the lock is on failover - when you replicate the issue do you need failover to occur?

                              (The default values for prefetchSize and slowConsumers are 150 and false respectively, so you have effectively just increased the prefetchSize to resolve the issue.)

                              • 12. Re: messages stuck in queues
                                martin.wickus

                                No, we're not running a cluster, thus no failover.

                                I'll need to come back to you wrt the requested information.

                                • 13. Re: messages stuck in queues
                                  timfox
                                  • 14. Re: messages stuck in queues
                                    martin.wickus

                                    I am busy setting up a dev environment hooked up to production data (same messaging load for feeds) to replicate this. I will also try to debug the code, assuming I can trigger the condition, and will update the results today/tomorrow.

                                    Rationale: Our UAT/PROD environments are not ideal for this due to firewall setups and difficulty gaining access in order to retrieve logs, etc.
