We are trying to swap out JBM for HornetQ, and have it running in our system test
environment at the moment. HornetQ is in standalone, non-clustered mode with most
default configuration settings still in place.
The system is set up as a number of services, each with its own queue and a
consumer pool. Consumers process messages in JMS transactions
(session transacted). When a problem occurs during processing, the JMS
transaction still completes normally, but a copy of the message is sent back to
the same queue with the _HQ_SCHED_DELIVERY property set to a time in the
future, so that the same request is retried at a later stage. We control the
maximum number of scheduled redelivery attempts ourselves, and manually back
off the redelivery time.
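Each consumer does roughly the following (a simplified sketch: the retryCount
property, the backoff numbers, and process() are placeholders for our actual
logic; _HQ_SCHED_DELIVERY is HornetQ's scheduled-delivery header):

    import javax.jms.*;

    public class RetryingConsumer {
        private static final String SCHED_DELIVERY = "_HQ_SCHED_DELIVERY";
        private static final String RETRY_COUNT = "retryCount"; // our own application property
        private static final int MAX_ATTEMPTS = 5;              // illustrative limit

        // session is transacted: connection.createSession(true, Session.SESSION_TRANSACTED)
        public void handle(Session session, Queue queue, TextMessage msg) throws JMSException {
            try {
                process(msg); // application logic; may fail on the external resource
            } catch (Exception e) {
                int attempts = msg.propertyExists(RETRY_COUNT) ? msg.getIntProperty(RETRY_COUNT) : 0;
                if (attempts < MAX_ATTEMPTS) {
                    // Send a copy back to the same queue, scheduled for later delivery.
                    TextMessage copy = session.createTextMessage(msg.getText());
                    copy.setIntProperty(RETRY_COUNT, attempts + 1);
                    long backoffMs = (1L << attempts) * 10000L; // manual exponential backoff
                    copy.setLongProperty(SCHED_DELIVERY, System.currentTimeMillis() + backoffMs);
                    session.createProducer(queue).send(copy);
                } // else: we manually send the message to the DLQ (not shown)
            }
            session.commit(); // the JMS transaction completes normally either way
        }

        private void process(Message msg) throws Exception { /* ... */ }
    }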
We had a lot of these failures today due to an external resource that was
failing. This caused many such scheduled messages to pile up in the queue.
When the resource was restored, the scheduled messages started getting
processed as their scheduled times arrived. However, at the end there were
still 8 messages left in the "DeliveringCount". This does not change even if I
restart all the consumers. The state is currently:
What could cause the messages to stay in the "delivering" state, even if all
consumers are restarted?
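For completeness, this is roughly how we read those counters over JMX,
following the HornetQ management example (the JMX URL and queue name below are
placeholders for our environment):

    import javax.management.MBeanServerConnection;
    import javax.management.MBeanServerInvocationHandler;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;
    import org.hornetq.api.core.management.ObjectNameBuilder;
    import org.hornetq.api.jms.management.JMSQueueControl;

    public class QueueState {
        public static void main(String[] args) throws Exception {
            JMXConnector c = JMXConnectorFactory.connect(
                    new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:3000/jmxrmi"));
            MBeanServerConnection mbsc = c.getMBeanServerConnection();
            ObjectName name = ObjectNameBuilder.DEFAULT.getJMSQueueObjectName("ServiceQueue");
            JMSQueueControl queue = MBeanServerInvocationHandler.newProxyInstance(
                    mbsc, name, JMSQueueControl.class, false);
            System.out.println("MessageCount    = " + queue.getMessageCount());
            System.out.println("ScheduledCount  = " + queue.getScheduledCount());
            System.out.println("DeliveringCount = " + queue.getDeliveringCount());
            c.close();
        }
    }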
Some more background:
- Initially HornetQ was configured to BLOCK instead of PAGE when an address was
full. This caused the system to hang once the dead letter queue filled up, as
well as the service queue that had problems (because we manually send messages
to the DLQ once the redelivery attempts run out). In this blocked state, many
queues had high "DeliveringCount" values.
- I then changed the policy to PAGE, which loosened up the system, let the
messages move to the DLQ, and allowed new messages to be accepted into the
service queues (the relevant address-settings now look roughly like the
snippet after this list).
- While the system was processing these messages, I stopped HornetQ, increased
the max size (max-size-bytes) and page size (page-size-bytes), and started
HornetQ again. On startup there were some warnings like this:
WARNING [org.hornetq.core.paging.cursor.impl.PageSubscriptionImpl] Couldn't locate page transaction 42734785, ignoring message on position PagePositionImpl [pageNr=1, messageNr=14, recordID=0]
Maybe this page size increase could have caused some messages to be left stuck
in the "delivering" state?
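For reference, the paging-related address-settings in
hornetq-configuration.xml now look roughly like this (the match pattern and
sizes below are placeholders, not our exact values):

    <address-settings>
       <address-setting match="jms.queue.#">
          <address-full-policy>PAGE</address-full-policy>
          <max-size-bytes>104857600</max-size-bytes>   <!-- increased during the restart -->
          <page-size-bytes>10485760</page-size-bytes>  <!-- increased during the restart -->
       </address-setting>
    </address-settings>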