11 Replies Latest reply on Mar 13, 2008 11:57 AM by dave_lund

    clustered messages stuck on queues

    dave_lund

      There's a good chance that this could be related to:
      http://jira.jboss.org/jira/browse/JBMESSAGING-1245

      Messages are getting 'stuck' in the database, and I have to restart a single app server before these messages are delivered.

      However, this problem only happens sporadically (roughly every 10-14 days) in our production environment, so I'm hesitant to change the prefetchSize before asking the forum.

      We are using JBoss AS 4.2.2.GA and JBM 1.4.0.SP3, backed by a MySQL 5 database, with JRE 6 on Debian. The queues are clustered over 3 servers. We receive messages with MessageConsumer.receive(1000), called every 1-5 seconds depending on the queue. I know messages are genuinely stuck because newer messages on the same queues are being delivered correctly. The queues all have the following properties set: Clustered=true and MaxDeliveryAttempts=1.

      Below are the queue message counts; 99% of these messages are stuck on the queues.
      server 1: queueA=357,queueB=0,queueC=51,queueD=8
      server 2: queueA=353,queueB=3,queueC=941,queueD=1
      server 3: queueA=9,queueB=5,queueC=37,queueD=2

      After restarting server 1 with the JBoss shutdown.sh script, all messages on servers 2 and 3 were delivered, and after server 1 was started again, all messages on its queues were delivered too. If it weren't for this restart behaviour, I would suspect our own code of causing this batching effect.

        • 1. Re: clustered messages stuck on queues
          timfox

          Hello Dave-

          Can you provide instructions on how to replicate the issue and we'll make sure someone investigates it further?

          Thanks.

          • 2. Re: clustered messages stuck on queues
            dave_lund

            Unfortunately, this is where we're having a problem: we have been unable to replicate the issue. We put a test cluster on the same subnet (obviously changing only the partitions, multicast IPs and ports) and passed 5 million messages through it successfully. This took 1-2 days because of our application's message processing time. By comparison, our production application probably only processes around 700-800k messages between occurrences of this issue. We used the same JBoss zip (JBoss Messaging installed, config, etc.) that we use on our production system.

            I suppose the next step would be to set prefetchSize to 1000 to try to rule out JIRA issue JBMESSAGING-1245.

            • 3. Re: clustered messages stuck on queues
              timfox

              One thing that strikes me: You say you're using consumer.receive(1000) to receive messages from the queue.

              Are you saying you are polling in a loop? Any reason you don't use a MessageListener?

              What happens if the call to receive(1000) returns null? That is perfectly possible even when there are messages in the queue, e.g. if a big GC pause prevented messages from getting from the server to the client buffer in less than one second.
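To illustrate the point about receive(1000) returning null, here is a small simulation, not JBM code. A java.util.concurrent.BlockingQueue stands in for the consumer's client-side buffer, and a thread that sleeps for 500 ms stands in for a server-to-client delivery delayed by something like a long GC pause. The class name TimedReceiveDemo is hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Simulation only: a BlockingQueue stands in for the consumer's client-side
// buffer, and a sleeping thread stands in for a delayed server-to-client delivery.
final class TimedReceiveDemo {

    /** Returns the results of a short timed receive followed by a longer one. */
    static String[] receiveTwice() throws InterruptedException {
        BlockingQueue<String> clientBuffer = new LinkedBlockingQueue<>();

        // The "server" already holds a message, but pushing it into the client
        // buffer is delayed (e.g. by a long GC pause), simulated with sleep().
        Thread server = new Thread(() -> {
            try {
                Thread.sleep(500);
                clientBuffer.put("message-4");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        server.start();

        // Like consumer.receive(100): gives up after 100 ms and returns null,
        // even though a message exists and is still on its way.
        String first = clientBuffer.poll(100, TimeUnit.MILLISECONDS);

        // A later receive with a longer timeout picks the message up.
        String second = clientBuffer.poll(5, TimeUnit.SECONDS);
        server.join();
        return new String[] {first, second};
    }

    public static void main(String[] args) throws InterruptedException {
        String[] r = receiveTwice();
        System.out.println("first receive:  " + r[0]);
        System.out.println("second receive: " + r[1]);
    }
}
```

The first timed receive comes back empty even though a message exists, so a loop that treats null as "queue is empty" cannot tell the two situations apart.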

              • 4. Re: clustered messages stuck on queues
                dave_lund

                Unfortunately, our application works much faster if it can process, say, 100 messages in a batch. Below is a little snippet of example code:

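                // session must be a transacted Session, since session.commit() is called below
                List<Message> messages = new ArrayList<Message>(maxLimit);
                Message msg;
                MessageConsumer receiver = session.createConsumer(queue);
                // Receive until the batch is full or receive(1000) returns null (nothing arrived within 1 s)
                while (messages.size() < maxLimit && (msg = receiver.receive(1000)) != null) {
                    messages.add(msg);
                }
                receiver.close();
                // process batch
                session.commit();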
                


                I will make sure to double-check GC the next time we hit this issue, as we used a 'hotwired' GC configuration on JBoss 4.0.x before we upgraded to JBoss 4.2.2. But I am fairly sure we've already looked and ruled GC out.

                • 5. Re: clustered messages stuck on queues
                  dave_lund

                  Also, if it were GC, wouldn't all messages get stuck? The behaviour we're seeing is, say, messages 1-3 are processed perfectly, message 4 gets stuck, and messages 5-9 are received later and processed. My problem is that over 900 messages were stuck and only 450 of those could have been on local queues, so all servers would have had to run into the GC problem at the same time.

                  • 6. Re: clustered messages stuck on queues
                    timfox

                    That code doesn't look valid to me.

                    The call to receive(1000) could legitimately return null even though there are messages in the queue. This would cause it to break out of the loop and process the batch.

                    Is this what you want?

                    Surely you could do the same thing in a message listener?

                    • 7. Re: clustered messages stuck on queues
                      dave_lund

                      Yes. The wrapping code re-schedules this code to run again after a set amount of time.

                      So the only way this code could fail to receive messages is if receiving the next message always took longer than 1 second, for the whole of the several hours the messages have been stuck. If that is possible, we will obviously have to revisit the code.
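A minimal sketch of the polling pattern described in this thread: a batch drain that stops at the first timed-out receive, re-run on a fixed delay. It assumes a java.util.concurrent.BlockingQueue as a stand-in for the JMS queue and poll(timeout) as a stand-in for MessageConsumer.receive(timeout). The names ScheduledBatchPoller, drainBatch and start are hypothetical, not taken from the poster's code, and the processed list passed to start must be thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in: a BlockingQueue plays the JMS queue and
// poll(timeout) plays MessageConsumer.receive(timeout).
final class ScheduledBatchPoller {

    /** Drains up to maxLimit messages, stopping early at the first timeout (null). */
    static <T> List<T> drainBatch(BlockingQueue<T> queue, int maxLimit, long timeoutMs)
            throws InterruptedException {
        List<T> batch = new ArrayList<>(maxLimit);
        T msg;
        while (batch.size() < maxLimit
                && (msg = queue.poll(timeoutMs, TimeUnit.MILLISECONDS)) != null) {
            batch.add(msg);
        }
        return batch;
    }

    /** Re-runs drainBatch with a fixed delay between the end of one run and the start of the next. */
    static ScheduledExecutorService start(BlockingQueue<String> queue,
                                          List<List<String>> processed,
                                          int maxLimit, long delayMs) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                List<String> batch = drainBatch(queue, maxLimit, 50);
                if (!batch.isEmpty()) {
                    processed.add(batch);   // "process batch" and commit would happen here
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 0, delayMs, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

One design point worth checking in a real wrapper: per the ScheduledExecutorService.scheduleWithFixedDelay Javadoc, if any run of the task throws an exception, subsequent runs are suppressed, so an uncaught RuntimeException in the batch code would silently stop that queue from being polled.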

                      I was under the impression that JBM follows the Java EE spec rather than the JMS spec in stating that you cannot call receive() if a MessageListener is set. Is there a pattern we should follow to process batches using a MessageListener in JBM? I have to admit the original code was written for a competing JMS provider, but we were impressed by JBM and rather lazily 'fudged' the existing code to work on JBM.

                      • 8. Re: clustered messages stuck on queues
                        timfox

                        Dave, take a look at http://jira.jboss.org/jira/browse/JBMESSAGING-1245, maybe it explains your issue?

                        Regarding batching, have a look at how the JBoss Messaging bridge does it. It uses onMessage(), and you can configure batches to have a maximum number of messages and/or a maximum time.
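A minimal sketch of size-or-time batching driven by onMessage(), in the spirit of the bridge approach described above. It does not reproduce the Bridge source: the class name BatchingListener, the expiry-check interval, and the policy of pushing the deadline back by maxBatchTime on every new message are assumptions. To stay self-contained it uses a generic T instead of javax.jms.Message, and a Consumer<List<T>> flush action where a real listener would process the batch and call session.commit().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical sketch: a batch is sent when it reaches maxBatchSize messages,
// or when maxBatchTimeMs has passed since the last message arrived.
final class BatchingListener<T> {
    private final int maxBatchSize;
    private final long maxBatchTimeMs;
    private final Consumer<List<T>> flushAction;
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    private final List<T> batch = new ArrayList<>();
    private long batchExpiryTime;   // when a partial batch should be sent anyway

    BatchingListener(int maxBatchSize, long maxBatchTimeMs, Consumer<List<T>> flushAction) {
        this.maxBatchSize = maxBatchSize;
        this.maxBatchTimeMs = maxBatchTimeMs;
        this.flushAction = flushAction;
        // Periodically check whether a partial batch has waited long enough.
        long checkMs = Math.max(1, maxBatchTimeMs / 4);
        timer.scheduleWithFixedDelay(this::checkExpiry, checkMs, checkMs, TimeUnit.MILLISECONDS);
    }

    /** Called for every incoming message (the role MessageListener.onMessage plays). */
    synchronized void onMessage(T msg) {
        batch.add(msg);
        // Each new message pushes the deadline out by maxBatchTimeMs.
        batchExpiryTime = System.currentTimeMillis() + maxBatchTimeMs;
        if (batch.size() >= maxBatchSize) {
            sendBatch();
        }
    }

    private synchronized void checkExpiry() {
        if (!batch.isEmpty() && System.currentTimeMillis() >= batchExpiryTime) {
            sendBatch();
        }
    }

    // Called with the lock held: hands a copy of the batch to flushAction, then clears it.
    private void sendBatch() {
        flushAction.accept(new ArrayList<>(batch));
        batch.clear();
    }

    void close() {
        timer.shutdownNow();
    }
}
```

Both onMessage() and the expiry check are synchronized on the same object, so a batch is never sent while a message is being added. That matters in a JMS client because the JMS specification restricts a Session to one thread of control at a time, and in this design the flush can run on the timer thread.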

                        • 9. Re: clustered messages stuck on queues
                          dave_lund

                          Cheers Tim,

                          In production we have increased the prefetch size to test against JBMESSAGING-1245. However, it will be at least a couple of weeks before we can tell whether this fixes the problem, as the issue happens so infrequently.


                          We have mocked up something along the lines of the messaging bridge and it gave us some surprising results that may interest you:

                          Per 1000 units of work:
                          receive(1000): averaged 1:31 mins
                          Bridge style: averaged 1:17 mins
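Reading "1:31 mins" as 1 minute 31 seconds (an assumption about the notation), the difference between the two approaches works out as:

```latex
\begin{aligned}
t_{\text{receive}} &= 1{:}31 = 91\ \text{s} \;\Rightarrow\; 91\ \text{ms per unit of work} \\
t_{\text{bridge}}  &= 1{:}17 = 77\ \text{s} \;\Rightarrow\; 77\ \text{ms per unit of work} \\
\frac{t_{\text{receive}} - t_{\text{bridge}}}{t_{\text{receive}}} &= \frac{14}{91} \approx 15.4\%,
\qquad \frac{t_{\text{receive}}}{t_{\text{bridge}}} = \frac{91}{77} \approx 1.18
\end{aligned}
```

So on these figures the bridge-style listener took roughly 15% less time per 1000 units of work.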

                          • 10. Re: clustered messages stuck on queues
                            dave_lund

                            Our attempt at copying the batching in the Bridge class has hit a bit of a problem. Even though all messages were passed through and processed (and committed), some processed messages were still visible in the database afterwards and were re-processed after a JBoss restart. These lingering messages were always the first message in each transaction's batch. In the jmx-console the messages are marked as currently being delivered, but the bridge believes (rightly so) that it has delivered all messages.

                            So we have created a test based on the Bridge class itself. All we have changed is swapping the JBoss logger for an Apache logger and, instead of using the ConnectionFactoryFactory and DestinationFactory JNDI wrappers, passing the ConnectionFactory and Destinations directly. Messages are read off a test queue and then put on the dead letter queue.

                            Our test uses a queue with the following config:
                            Clustered=true, MaxDeliveryAttempts=1

                            The connection factory (we were using ClusteredConnectionFactory) has:
                            SupportsFailover=true, SupportsLoadBalancing=true, PrefetchSize=1000

                            The Bridge class is created using:
                            new Bridge(getConnectionFactory(), getConnectionFactory(), getQueue("queue/testingQueue"), getQueue("queue/DLQ"), null, null, null, null, null, 1000, 10, Bridge.QOS_ONCE_AND_ONLY_ONCE, 100, 1000, null, null, false);

                            The source and target connection factories are the same, so that local transactions are used.

                            This happens consistently, about 5 messages in every 1000, so I think misconfiguration is more likely than anything else. The environment is the same as earlier in the thread (JBoss AS 4.2.2.GA and JBM 1.4.0.SP3, backed by a MySQL 5 database, with JRE 6 on Debian).

                            • 11. Re: clustered messages stuck on queues
                              dave_lund

                              Here's some trace logging for a message that gets stuck:

                              2008-03-13 15:37:19,427 TRACE [com.test.Bridge] com.test.Bridge$BatchTimeChecker@14bfbf1 waiting for 1000
                              2008-03-13 15:37:19,427 TRACE [com.test.Bridge] com.test.Bridge$SourceListener@26c025 received message delegator->JBossMessage[1298]:PERSISTENT, deliveryId=8
                              2008-03-13 15:37:19,427 TRACE [com.test.Bridge] com.test.Bridge$SourceListener@26c025 rescheduled batchExpiryTime to 1205422640427
                              2008-03-13 15:37:20,427 TRACE [com.test.Bridge] com.test.Bridge$BatchTimeChecker@14bfbf1 woke up
                              2008-03-13 15:37:20,427 TRACE [com.test.Bridge] com.test.Bridge$BatchTimeChecker@14bfbf1 waited enough
                              2008-03-13 15:37:20,427 TRACE [com.test.Bridge] com.test.Bridge$BatchTimeChecker@14bfbf1 got some messages so sending batch
                              2008-03-13 15:37:20,427 TRACE [com.test.Bridge] Sending batch of 1 messages
                              2008-03-13 15:37:20,427 TRACE [com.test.Bridge] Sending message delegator->JBossMessage[1298]:PERSISTENT, deliveryId=8
                              2008-03-13 15:37:20,428 TRACE [com.test.Bridge] Sent message delegator->JBossMessage[1304]:PERSISTENT, deliveryId=8
                              2008-03-13 15:37:20,428 TRACE [com.test.Bridge] Committing source session
                              2008-03-13 15:37:20,438 TRACE [com.test.Bridge] Committed source session
                              2008-03-13 15:37:20,438 TRACE [com.test.Bridge] com.test.Bridge$BatchTimeChecker@14bfbf1 sent batch
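The batchExpiryTime value in this trace can be decoded to check that the batch of 1 was sent by the time checker exactly when it should have been. This sketch assumes the value is milliseconds since the Unix epoch and that the log timestamps are in UTC; under those assumptions the expiry is exactly 1000 ms after the "received message" line, consistent with the 1000 ms maxBatchTime passed to the Bridge constructor (assuming the fourteenth argument is maxBatchTime). The class name BatchExpiryDecode is hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

// Decodes the batchExpiryTime from the trace, assuming epoch milliseconds
// and UTC log timestamps.
final class BatchExpiryDecode {

    static Instant expiry() {
        return Instant.ofEpochMilli(1205422640427L);
    }

    public static void main(String[] args) {
        // Timestamp of the "received message" trace line, read as UTC.
        Instant received = Instant.parse("2008-03-13T15:37:19.427Z");
        Instant expiry = expiry();
        System.out.println(expiry);                                         // 2008-03-13T15:37:20.427Z
        System.out.println(Duration.between(received, expiry).toMillis()); // 1000
    }
}
```

That matches the "woke up" and "Sending batch of 1 messages" lines at 15:37:20,427, so the bridge's timing behaved as configured and the problem lies elsewhere.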