2.0.0GA Causing Application Server to use 100% CPU
stevere Mar 4, 2010 7:13 AM

From the title this sounds drastic, but it is really what I am experiencing: I have to shut the application server down to resolve the issue.
I am running JBoss 5 / HornetQ 2.0.0.GA on Windows XP Pro. We have a number of remote clients (5) that send JMS messages to one queue and await a response on another queue. Each client starts a filtered consumer on the receive queue so that it receives only its intended response. What I am finding is that after a period of time (~1-20 mins) all clients begin to fail with the message:
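For context, the "filtered consumer" pattern above is the standard JMS message-selector approach. This is only a sketch of what I mean; the correlation scheme (matching on JMSCorrelationID with a per-client id) is an assumption, not necessarily what my attached test program does:

```java
// Sketch of a per-client filtered consumer on the response queue.
// The per-client correlation-id scheme here is hypothetical.
public class ResponseSelector {

    // Builds a JMS message selector that matches only this client's responses.
    static String forClient(String clientId) {
        return "JMSCorrelationID = '" + clientId + "'";
    }

    public static void main(String[] args) {
        // The selector string is passed as the second argument to
        // Session.createConsumer(responseQueue, selector) in a real client.
        System.out.println(forClient("client-1"));
    }
}
```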
04-Mar-2010 09:32:48 org.hornetq.core.logging.impl.JULLogDelegate warn
WARNING: Connection failure has been detected: Did not receive data from server for org.hornetq.integration.transports.netty.NettyConnection@19f3736[local= /10.1.10.19:2262, remote=remoteserver/10.1.11.34:5445] [code=3]
At the same time, CPU usage on the server jumps from ~5% to 100%, with the JBoss process consuming most of it. Initially I put this down to machine load or a network issue, but that isn't the case. What appears to be happening is that the HornetQ async queue delivery mechanism is spinning:
Thread [Thread-30 (group:HornetQ-server-threads25045315-29826735)] (Suspended (breakpoint at line 1591 in QueueImpl$FilterMessageHandler)) QueueImpl$FilterMessageHandler.peek(Consumer) line: 1591
QueueImpl.deliver() line: 1137
QueueImpl.access$400(QueueImpl) line: 66
QueueImpl$DeliverRunner.run() line: 1442
OrderedExecutorFactory$OrderedExecutor$1.run() line: 96
ThreadPoolExecutor$Worker.runTask(Runnable) line: 886
ThreadPoolExecutor$Worker.run() line: 908
Thread.run() line: 619
It appears that the QueueImpl.deliver() method can get into a state where it never breaks out of the delivery loop. Let me explain further: the following code fragment is used to break the delivery loop in QueueImpl:
if (pos == startPos)
{
   // We've done all the consumers
   if (nullCount + busyCount == totalCount)
   {
      if (nullCount == totalCount)
      {
         // We delivered all the messages - go into direct delivery
         direct = true;
         promptDelivery = false;
      }
      break;
   }
   nullCount = busyCount = 0;
}
The first test determines when all handlers have had a chance at processing a message. The second test is the one that actually breaks the loop. I think it is supposed to verify that there is no more work to do (either all messages in the handlers' lists have been tried/handled, or the handlers report they are busy). If any handler reports that it matched, or failed to match, a message, then the counters are reset and the loop continues.
So I believe that if handlers/consumers exist with filters that will never match a message, and one or more handler message iterators are out of step with each other (handler A is on message 5 while handler B is on message 4), then this problem will occur.
How to fix it, I'm not sure. I think the key is that HandleStatus.HANDLED and HandleStatus.NO_MATCH are treated the same way as far as terminating the processing loop is concerned. However, HandleStatus.HANDLED means a message is removed from the queue, so eventually there are no more messages, each handler returns a null message reference, and the loop breaks. HandleStatus.NO_MATCH removes nothing, so it causes the process to loop forever.
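To illustrate the point, here is a toy model of the break test quoted above. This is NOT the HornetQ source; it just replays the counter logic, under the simplifying assumption that each consumer returns the same status on every round (which is exactly what a consumer whose filter never matches behaves like):

```java
// Toy model of the break condition in QueueImpl.deliver() (not HornetQ code).
public class DeliverLoopModel {

    // HANDLED is omitted: a HANDLED removes a message, so that consumer
    // eventually starts returning a null reference (NULL_REF) instead.
    enum Status { NULL_REF, BUSY, NO_MATCH }

    // Runs the round-robin loop over fixed per-consumer statuses and returns
    // how many iterations ran before the break fired (capped at maxIters).
    static int iterationsUntilBreak(Status[] consumers, int maxIters) {
        int totalCount = consumers.length;
        int startPos = 0, pos = 0;
        int nullCount = 0, busyCount = 0, iters = 0;
        while (iters < maxIters) {
            iters++;
            Status s = consumers[pos];
            if (s == Status.NULL_REF) nullCount++;
            else if (s == Status.BUSY) busyCount++;
            // NO_MATCH increments neither counter.
            pos = (pos + 1) % totalCount;
            if (pos == startPos) {
                // We've done all the consumers
                if (nullCount + busyCount == totalCount) break;
                // A NO_MATCH in the round resets the counters, so we go again.
                nullCount = busyCount = 0;
            }
        }
        return iters;
    }

    public static void main(String[] args) {
        // All consumers drained: the loop breaks after one full round.
        System.out.println(iterationsUntilBreak(
            new Status[]{Status.NULL_REF, Status.NULL_REF}, 1000));
        // One consumer whose filter never matches: the counters reset on every
        // wrap-around, the break never fires, and we hit the iteration cap.
        System.out.println(iterationsUntilBreak(
            new Status[]{Status.NULL_REF, Status.NO_MATCH}, 1000));
    }
}
```

With two drained consumers it prints 2; with one permanently non-matching consumer it prints 1000 (the cap), which is the spin I'm seeing.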
Because the QueueImpl.deliver() method is synchronized, any attempt to do anything else with the queue will block. That includes cleaning up consumers/handlers etc.
I have a simple test program. If I run two instances of it, to simulate two clients with 6 threads, I can reproduce this issue ~50% of the time. I'll attach it to this post.
I have searched the other posts and JIRA but couldn't find anything similar. Obviously I can and should raise a JIRA Issue, which I will do, but I need to identify a fix/workaround for this issue in the short term. Does anyone have any suggestions?
TestHornetQSpin.java.zip 1,020 bytes