CPU spikes and phantom messages in standalone HornetQ
jgabrielygalan · May 8, 2015 5:23 AM

Hi,
We have been using HornetQ for a while, and yesterday we ran into a problem. We run a standalone clustered 2.4.0-Final server with this address configuration:
<address-setting match="#">
   <max-size-bytes>104857600</max-size-bytes>
   <page-size-bytes>10485760</page-size-bytes>
   <address-full-policy>PAGE</address-full-policy>
   <redelivery-delay-multiplier>1.5</redelivery-delay-multiplier>
   <redelivery-delay>5000</redelivery-delay>
   <max-redelivery-delay>50000</max-redelivery-delay>
</address-setting>

<address-setting match="billing-platform-notifications">
   <dead-letter-address>dead-letter</dead-letter-address>
   <expiry-address>expired-msgs</expiry-address>
   <redelivery-delay>1000</redelivery-delay>
   <expiry-delay>259200000</expiry-delay>
</address-setting>
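(For context: with these settings each address starts paging to disk once it holds 100 MB of messages in memory, written out in 10 MB page files.)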
To explain our queue topology a bit: many producers send messages to the address above. Directly attached to that address we have 3 durable queues with different filters. We also have around 5 non-exclusive diverts that copy the messages to JMS addresses (each with a single queue).
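For reference, each filtered queue and divert is declared roughly like this in hornetq-configuration.xml (a simplified sketch; the real names and filter strings differ):

<!-- one of the 3 durable queues bound directly to the address (filter is made up) -->
<queue name="billing-notifications-a">
   <address>billing-platform-notifications</address>
   <filter string="notificationType = 'A'"/>
   <durable>true</durable>
</queue>

<!-- one of the ~5 non-exclusive diverts copying messages to a JMS address -->
<divert name="notifications-copy">
   <address>billing-platform-notifications</address>
   <forwarding-address>jms.queue.NotificationsCopy</forwarding-address>
   <exclusive>false</exclusive>
</divert>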
At some point, the server started using a lot of CPU and consumers started consuming very slowly. We killed the master to force a failover, and a similar scenario played out on the other machine. We stopped the consumer processes and forced the failover a couple more times until the situation stabilized, but with a weird effect: all queues showed a non-zero message count, yet those messages were not being consumed. Via JMX we tried to move them to another queue, expire them, and move them to the dead-letter queue, with absolutely no effect (those operations all reported 0 messages). We also tried calling the reset-message-counter operation, and nothing changed either.
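The operations we tried via JMX were roughly equivalent to the following (a minimal sketch using the HornetQ core management API; the JMX URL and queue names are placeholders for our real ones):

import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import org.hornetq.api.core.SimpleString;
import org.hornetq.api.core.management.ObjectNameBuilder;
import org.hornetq.api.core.management.QueueControl;

public class StuckQueueCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint; replace with the broker's real host/port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:3000/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Placeholder names: one of the 3 filtered core queues on the address.
            ObjectName queueName = ObjectNameBuilder.DEFAULT.getQueueObjectName(
                    new SimpleString("billing-platform-notifications"),
                    new SimpleString("billing-notifications-a"));
            QueueControl queue = JMX.newMBeanProxy(mbsc, queueName, QueueControl.class, false);

            // The counters claim there are messages on the queue...
            System.out.println("messageCount    = " + queue.getMessageCount());
            System.out.println("deliveringCount = " + queue.getDeliveringCount());

            // ...but every management operation reports 0 messages affected
            // (a null filter means "all messages").
            System.out.println("moved   = " + queue.moveMessages(null, "recovery-queue"));
            System.out.println("expired = " + queue.expireMessages(null));
        } finally {
            connector.close();
        }
    }
}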
We were trying not to lose the messages, so once the CPU usage had stabilized we decided to wait out the 3-day expiration time to see if they would be expired and moved to the expired-msgs queue, where we could process and recover them. But last night the instability started again with a CPU spike. I'm attaching a thread dump, a screenshot of the biggest heap objects, and a jvisualvm snapshot taken at the time. It looks like something is wrong with paging, since around 550 MB are held by 3 instances of org.hornetq.core.paging.cursor.impl.PageSubscriptionImpl. You can see in jvisualvm that the GC is using a lot of CPU (stable at around 50%), but the old generation is never freed; it sits at nearly 90% capacity.
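In case it helps, we can also check the paging state of the address over the same JMX connection (continuing the sketch above; AddressControl is in the same org.hornetq.api.core.management package):

// Continuing from the sketch above: inspect the paging state of the address.
ObjectName addrName = ObjectNameBuilder.DEFAULT.getAddressObjectName(
        new SimpleString("billing-platform-notifications"));
AddressControl address = JMX.newMBeanProxy(mbsc, addrName, AddressControl.class, false);

System.out.println("isPaging      = " + address.isPaging());
System.out.println("numberOfPages = " + address.getNumberOfPages());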
Is there anything we can do to understand what the problem is and how to avoid it in the future?
Is there anything we can do to recover those messages?
Any other tips for breaking out of this situation?
Thanks,
Jesus.
Attachments:
- hq-23432.stacktrace.zip (11.4 KB)
- heapdump_biggest_objects.png (86.1 KB)
- jvisualvm_hornetq.png (44.2 KB)