4 Replies Latest reply on Mar 9, 2010 4:16 AM by stevere

    2.0.0GA Causing Application Server to use 100% CPU

    stevere

      From the title this sounds drastic, but this is really what I am experiencing. I have to shut the application server down to resolve the issue.

       

      I am running JBoss 5/HornetQ 2.0.0 GA on Windows XP Pro. We have a number of remote clients (5) that send JMS messages to one queue and await a response on another queue. Each client starts a filtered consumer on the receive queue so that it only receives the response intended for it. What I am finding is that after a period of time (~1-20 mins) all of the clients begin to fail with the message:
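
      For context, each client follows roughly this request/response pattern (a minimal sketch only - the JNDI names "/ConnectionFactory", "/queue/requestQueue" and "/queue/responseQueue" and the correlation-ID selector are assumptions for illustration, not our actual code):

      import java.util.UUID;

      import javax.jms.Connection;
      import javax.jms.ConnectionFactory;
      import javax.jms.Message;
      import javax.jms.MessageConsumer;
      import javax.jms.Queue;
      import javax.jms.Session;
      import javax.jms.TextMessage;
      import javax.naming.InitialContext;

      public class RequestResponseClient
      {
         public static void main(String[] args) throws Exception
         {
            InitialContext ic = new InitialContext();
            ConnectionFactory cf = (ConnectionFactory)ic.lookup("/ConnectionFactory");   // assumed JNDI name
            Queue requestQueue = (Queue)ic.lookup("/queue/requestQueue");                // assumed JNDI name
            Queue responseQueue = (Queue)ic.lookup("/queue/responseQueue");              // assumed JNDI name

            Connection connection = cf.createConnection();
            try
            {
               Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
               connection.start();

               // Send the request, tagged so the server can echo the ID back on the response
               String correlationId = UUID.randomUUID().toString();
               TextMessage request = session.createTextMessage("do some work");
               request.setJMSCorrelationID(correlationId);
               session.createProducer(requestQueue).send(request);

               // Filtered consumer: only receive the response intended for this client
               MessageConsumer consumer = session.createConsumer(responseQueue,
                                                                 "JMSCorrelationID = '" + correlationId + "'");
               Message response = consumer.receive(30000);
               System.out.println("Got response: " + response);
            }
            finally
            {
               connection.close();
            }
         }
      }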

       

      04-Mar-2010 09:32:48 org.hornetq.core.logging.impl.JULLogDelegate warn

      WARNING: Connection failure has been detected: Did not receive data from server for org.hornetq.integration.transports.netty.NettyConnection@19f3736[local= /10.1.10.19:2262, remote=remoteserver/10.1.11.34:5445] [code=3]

       

      At the same time, CPU usage on the server jumps from ~5% to 100%, and the JBoss process is consuming most of it. Initially I put this down to machine load or a network issue, but that isn't the case; what appears to be happening is that the HornetQ async queue delivery mechanism is spinning:

       

      Thread [Thread-30 (group:HornetQ-server-threads25045315-29826735)] (Suspended (breakpoint at line 1591 in QueueImpl$FilterMessageHandler))
         QueueImpl$FilterMessageHandler.peek(Consumer) line: 1591
         QueueImpl.deliver() line: 1137
         QueueImpl.access$400(QueueImpl) line: 66
         QueueImpl$DeliverRunner.run() line: 1442
         OrderedExecutorFactory$OrderedExecutor$1.run() line: 96
         ThreadPoolExecutor$Worker.runTask(Runnable) line: 886
         ThreadPoolExecutor$Worker.run() line: 908
         Thread.run() line: 619

       

      It appears that the QueueImpl.deliver() method can find itself in a state where it never breaks out of the delivery loop. Let me explain further: the following code fragment is used to break the delivery loop in QueueImpl:

       

      if (pos == startPos)
      {
         // We've done all the consumers
         if (nullCount + busyCount == totalCount)
         {
            if (nullCount == totalCount)
            {
               // We delivered all the messages - go into direct delivery
               direct = true;
               promptDelivery = false;
            }
            break;
         }
         nullCount = busyCount = 0;
      }

       

      The first test determines when all of the handlers have had a chance at processing a message. The second test is the one that actually breaks the loop; I think it is supposed to check that there is no more work to do (either every message in the handlers' lists has been tried/handled, or the handlers report they are busy). If any handler reports that it matched or failed to match a message, the counters are reset and the loop continues.

       

      So I believe this problem will occur if handlers/consumers exist whose filters will never match a message, and one or more handler message iterators are out of step with each other (handler A is on message 5, but handler B is on message 4).

       

      How to fix it, I'm not sure. I think the key is that HandleStatus.HANDLED and HandleStatus.NO_MATCH are treated the same way as far as terminating the processing loop is concerned. However, HandleStatus.HANDLED means a message is removed from the queue, so eventually there are no more messages, each handler returns a null message reference, and the loop breaks. HandleStatus.NO_MATCH, on the other hand, removes nothing, so the loop spins forever.
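
      To make that concrete, here is a self-contained toy model of the round-robin break logic (simplified for illustration, not the real QueueImpl code): handler A has exhausted its messages and keeps returning a null reference, while handler B is stuck on a message its filter will never match and keeps reporting NO_MATCH, so nullCount + busyCount never equals totalCount and the loop never terminates.

      public class DeliveryLoopSpin
      {
         enum HandleStatus { HANDLED, NO_MATCH, BUSY }

         public static void main(String[] args)
         {
            // Handler A: no more messages (null reference); handler B: filter never matches
            HandleStatus[] handlers = { null, HandleStatus.NO_MATCH };

            int totalCount = handlers.length;
            int nullCount = 0;
            int busyCount = 0;
            int startPos = 0;
            int pos = startPos;
            long iterations = 0;

            while (true)
            {
               HandleStatus status = handlers[pos];

               if (status == null)
               {
                  nullCount++;
               }
               else if (status == HandleStatus.BUSY)
               {
                  busyCount++;
               }
               // HANDLED and NO_MATCH increment neither counter

               pos = (pos + 1) % totalCount;
               iterations++;

               if (pos == startPos)
               {
                  // We've done all the consumers
                  if (nullCount + busyCount == totalCount)
                  {
                     break; // never reached: B always reports NO_MATCH
                  }
                  nullCount = busyCount = 0;
               }

               if (iterations >= 1000000)
               {
                  System.out.println("Still looping after " + iterations + " passes - this is the spin");
                  return;
               }
            }
         }
      }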

       

      Because the QueueImpl.deliver() method is synchronized, any attempt to do anything else with the queue will block while it spins. That includes cleaning up consumers/handlers, etc.
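
      A trivial sketch of that knock-on effect (again just an illustration, not HornetQ code): once one thread is stuck inside a synchronized method of an object, every other synchronized call on that object, such as removing a consumer, queues up behind it indefinitely.

      public class SynchronizedBlockDemo
      {
         public synchronized void deliver()
         {
            while (true)
            {
               // stands in for the spinning delivery loop, which never releases the monitor
            }
         }

         public synchronized void removeConsumer()
         {
            System.out.println("removing consumer"); // never reached while deliver() is spinning
         }

         public static void main(String[] args) throws Exception
         {
            final SynchronizedBlockDemo queue = new SynchronizedBlockDemo();

            Thread deliveryThread = new Thread(new Runnable()
            {
               public void run()
               {
                  queue.deliver();
               }
            });
            deliveryThread.setDaemon(true);
            deliveryThread.start();

            Thread.sleep(500); // give deliver() time to take the lock

            System.out.println("Trying to remove a consumer...");
            queue.removeConsumer(); // blocks here for ever
         }
      }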

       

      I have a simple test program. If I run two instances of it, to simulate two clients with 6 threads, I can reproduce this issue ~50% of the time. I'll attach it to the post.

       

      I have searched the other posts and JIRA but couldn't find anything similar. Obviously I can and should raise a JIRA Issue, which I will do, but I need to identify a fix/workaround for this issue in the short term. Does anyone have any suggestions?