4 Replies Latest reply on Mar 9, 2010 4:16 AM by stevere

    2.0.0GA Causing Application Server to use 100% CPU

    stevere

      From the title this sounds drastic, but this is really what I am experiencing. I have to shut the application server down to resolve the issue.

       

      I am running JBoss 5/HornetQ 2.0.0 GA on Windows XP Pro. We have a number of remote clients (5) that send JMS messages to one queue and await a response on another queue. Each client starts a filtered consumer on the receive queue so that it only receives the response intended for it. What I am finding is that after a period of time (~1-20 mins) all of the clients begin to fail with the message:
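
      For context, each client follows roughly this request/response pattern (a minimal sketch only - the JNDI names "/ConnectionFactory", "/queue/requestQueue" and "/queue/responseQueue" and the correlation-ID selector are assumptions for illustration, not our actual code):

      import java.util.UUID;

      import javax.jms.Connection;
      import javax.jms.ConnectionFactory;
      import javax.jms.Message;
      import javax.jms.MessageConsumer;
      import javax.jms.Queue;
      import javax.jms.Session;
      import javax.jms.TextMessage;
      import javax.naming.InitialContext;

      public class RequestResponseClient
      {
         public static void main(String[] args) throws Exception
         {
            InitialContext ic = new InitialContext();
            ConnectionFactory cf = (ConnectionFactory)ic.lookup("/ConnectionFactory");   // assumed JNDI name
            Queue requestQueue = (Queue)ic.lookup("/queue/requestQueue");                // assumed JNDI name
            Queue responseQueue = (Queue)ic.lookup("/queue/responseQueue");              // assumed JNDI name

            Connection connection = cf.createConnection();
            try
            {
               Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
               connection.start();

               // Send the request, tagged so the server can echo the ID back on the response
               String correlationId = UUID.randomUUID().toString();
               TextMessage request = session.createTextMessage("do some work");
               request.setJMSCorrelationID(correlationId);
               session.createProducer(requestQueue).send(request);

               // Filtered consumer: only receive the response intended for this client
               MessageConsumer consumer = session.createConsumer(responseQueue,
                                                                 "JMSCorrelationID = '" + correlationId + "'");
               Message response = consumer.receive(30000);
               System.out.println("Got response: " + response);
            }
            finally
            {
               connection.close();
            }
         }
      }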

       

      04-Mar-2010 09:32:48 org.hornetq.core.logging.impl.JULLogDelegate warn

      WARNING: Connection failure has been detected: Did not receive data from server for org.hornetq.integration.transports.netty.NettyConnection@19f3736[local= /10.1.10.19:2262, remote=remoteserver/10.1.11.34:5445] [code=3]

       

      At the same time, CPU usage on the server jumps from ~5% to 100%, and the JBoss process is consuming most of it. Initially I put this down to machine load or a network issue, but that isn't the case; what appears to be happening is that the HornetQ async queue delivery mechanism is spinning:

       

      Thread [Thread-30 (group:HornetQ-server-threads25045315-29826735)] (Suspended (breakpoint at line 1591 in QueueImpl$FilterMessageHandler))
         QueueImpl$FilterMessageHandler.peek(Consumer) line: 1591
         QueueImpl.deliver() line: 1137
         QueueImpl.access$400(QueueImpl) line: 66
         QueueImpl$DeliverRunner.run() line: 1442
         OrderedExecutorFactory$OrderedExecutor$1.run() line: 96
         ThreadPoolExecutor$Worker.runTask(Runnable) line: 886
         ThreadPoolExecutor$Worker.run() line: 908
         Thread.run() line: 619

       

      It appears that the QueueImpl.deliver() method can find itself in a state where it never breaks out of the delivery loop. Let me explain further: the following code fragment is used to break the delivery loop in QueueImpl:

       

      if (pos == startPos)
      {
         // We've done all the consumers
         if (nullCount + busyCount == totalCount)
         {
            if (nullCount == totalCount)
            {
               // We delivered all the messages - go into direct delivery
               direct = true;
               promptDelivery = false;
            }
            break;
         }
         nullCount = busyCount = 0;
      }

       

      The first test determines when all of the handlers have had a chance at processing a message. The second test is the one that actually breaks the loop; I think it is supposed to check that there is no more work to do (either every message in the handlers' lists has been tried/handled, or the handlers report they are busy). If any handler reports that it matched or failed to match a message, the counters are reset and the loop continues.

       

      So I believe this problem will occur if handlers/consumers exist whose filters will never match a message, and one or more handler message iterators are out of step with each other (handler A is on message 5, but handler B is on message 4).

       

      How to fix it, I'm not sure. I think the key is that HandleStatus.HANDLED and HandleStatus.NO_MATCH are treated the same way as far as terminating the processing loop is concerned. However, HandleStatus.HANDLED means a message is removed from the queue, so eventually there are no more messages, each handler returns a null message reference, and the loop breaks. HandleStatus.NO_MATCH, on the other hand, removes nothing, so the loop spins forever.
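
      To make that concrete, here is a self-contained toy model of the round-robin break logic (simplified for illustration, not the real QueueImpl code): handler A has exhausted its messages and keeps returning a null reference, while handler B is stuck on a message its filter will never match and keeps reporting NO_MATCH, so nullCount + busyCount never equals totalCount and the loop never terminates.

      public class DeliveryLoopSpin
      {
         enum HandleStatus { HANDLED, NO_MATCH, BUSY }

         public static void main(String[] args)
         {
            // Handler A: no more messages (null reference); handler B: filter never matches
            HandleStatus[] handlers = { null, HandleStatus.NO_MATCH };

            int totalCount = handlers.length;
            int nullCount = 0;
            int busyCount = 0;
            int startPos = 0;
            int pos = startPos;
            long iterations = 0;

            while (true)
            {
               HandleStatus status = handlers[pos];

               if (status == null)
               {
                  nullCount++;
               }
               else if (status == HandleStatus.BUSY)
               {
                  busyCount++;
               }
               // HANDLED and NO_MATCH increment neither counter

               pos = (pos + 1) % totalCount;
               iterations++;

               if (pos == startPos)
               {
                  // We've done all the consumers
                  if (nullCount + busyCount == totalCount)
                  {
                     break; // never reached: B always reports NO_MATCH
                  }
                  nullCount = busyCount = 0;
               }

               if (iterations >= 1000000)
               {
                  System.out.println("Still looping after " + iterations + " passes - this is the spin");
                  return;
               }
            }
         }
      }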

       

      Because the QueueImpl.deliver() method is synchronized, any attempt to do anything else with the queue will block while it spins. That includes cleaning up consumers/handlers, etc.
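
      A trivial sketch of that knock-on effect (again just an illustration, not HornetQ code): once one thread is stuck inside a synchronized method of an object, every other synchronized call on that object, such as removing a consumer, queues up behind it indefinitely.

      public class SynchronizedBlockDemo
      {
         public synchronized void deliver()
         {
            while (true)
            {
               // stands in for the spinning delivery loop, which never releases the monitor
            }
         }

         public synchronized void removeConsumer()
         {
            System.out.println("removing consumer"); // never reached while deliver() is spinning
         }

         public static void main(String[] args) throws Exception
         {
            final SynchronizedBlockDemo queue = new SynchronizedBlockDemo();

            Thread deliveryThread = new Thread(new Runnable()
            {
               public void run()
               {
                  queue.deliver();
               }
            });
            deliveryThread.setDaemon(true);
            deliveryThread.start();

            Thread.sleep(500); // give deliver() time to take the lock

            System.out.println("Trying to remove a consumer...");
            queue.removeConsumer(); // blocks here for ever
         }
      }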

       

      I have a simple test program. If I run two instances of it, to simulate two clients with 6 threads, I can reproduce this issue ~50% of the time. I'll attach it to the post.

       

      I have searched the other posts and JIRA but couldn't find anything similar. Obviously I can and should raise a JIRA Issue, which I will do, but I need to identify a fix/workaround for this issue in the short term. Does anyone have any suggestions?