We are facing an OOM issue with HornetQ in our preprod environment while running long-term tests.
Our setup in preprod:
Analyzing the memory dumps from the server shows exactly the same problem as described earlier by Jeroen:
Same class (org.hornetq.utils.concurrent.HornetQConcurrentLinkedQueue), same notification (CONSUMER_CREATED), same queues (notif.<UUID>)
We have tried other possible configurations and tips from many other posts: the latest HornetQ version, broadcast/discovery groups enabled with a small broadcast period and a larger discovery refresh timeout, static discovery, larger TCP buffers, enabling NIO, etc., all with no positive results. We are trying to isolate the problem as far as we can.
Upgrading JBoss to version 6, where HornetQ is the default messaging system, is on the table, but it would require a huge effort and there is no guarantee it would solve the issue.
It is a time-consuming problem to work on, as it takes a few days of operation to reproduce. To speed up reproduction and get an indication of whether the upgrade would solve our problem, the following test was run:
Setup - Two clean instances of JBoss 6.1 (downloaded from yesterday's nightly builds), which ship with HornetQ 2.2.5-Final by default. Both instances start fine and the HornetQ cluster comes up (see attached log).
Then a loss of connection between the nodes was simulated with the command:
> iptables -A OUTPUT -p tcp --dport 5447 -j DROP
At this stage, after a few seconds, the connection failure is correctly detected by HornetQ. The traffic was then reestablished by dropping the iptables rule:
> iptables -D OUTPUT 1
All seems to work fine after that, except that in the JMX console (node2) one extra notification queue is now active (the "old" one is still there, and the address 'hornetq.notifications' routes to both). In JMS terms, both topics receive notification messages, while only the newest one has a consumer attached. The other seems to accumulate notifications until an OOM condition is reached.
By repeating the connection-failure simulation, a new notification queue is created (with one consumer attached) and the "old" ones are not cleaned up (they remain with no consumers). The same happens when the failure is simulated on the opposite connection (in my example by blocking TCP port 5445), with notification queues piling up on the first node.
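The piling-up can also be spotted programmatically instead of by eye in the JMX console. Below is a minimal, self-contained sketch of the check: a queue whose name follows the notif.<UUID> scheme but whose consumer count is zero is an orphan that will only accumulate messages. The queue names and counts here are invented for illustration; on a live broker the real values would come from the Queue MBeans (the ConsumerCount attribute in the 2.2.x management API, if I read it right).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NotifQueueCheck {

    // A notification queue is an orphan when it follows the notif.<UUID>
    // naming scheme but no consumer is attached any more: it keeps
    // receiving notifications (CONSUMER_CREATED etc.) and never drains.
    static boolean isOrphan(String queueName, int consumerCount) {
        return queueName.startsWith("notif.") && consumerCount == 0;
    }

    public static void main(String[] args) {
        // Snapshot as node2 might look after two simulated failures:
        // two orphaned notification queues plus the live one (names invented).
        Map<String, Integer> queues = new LinkedHashMap<String, Integer>();
        queues.put("notif.aaaa1111-old", 0);
        queues.put("notif.bbbb2222-old", 0);
        queues.put("notif.cccc3333-live", 1);
        queues.put("jms.queue.ExampleQueue", 0); // ordinary queue, not affected

        for (Map.Entry<String, Integer> q : queues.entrySet()) {
            if (isOrphan(q.getKey(), q.getValue())) {
                System.out.println("orphaned notification queue: " + q.getKey());
            }
        }
    }
}
```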
On the other hand, it is possible we are facing network issues (maybe a broken switch or hub), and we have network specialists trying to track that down. The fact is that HornetQ does eventually lose the connection between nodes and reestablishes it. In any case, IMHO HornetQ, as a reliable queuing system, should cope well with this kind of issue and recover cleanly.
This is a blocking issue for quality to approve the new release of our product, so any help/comments on this are appreciated.
At the end of the day, here are my thoughts:
- When network issues happen (broken connections), old resources such as notification queues should be cleaned up.
- Alternatively, if the queues are not cleaned up, at least the address 'hornetq.notifications' should no longer route to old notification queues, to prevent them from receiving new notifications.
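Until something like that lands in HornetQ itself, a stop-gap we are considering is a small periodic task that removes orphaned notification queues through the management API. The sketch below only models the selection step with plain objects; on a live broker the queue list and consumer counts would be read over JMX, and the actual removal would go through something like HornetQServerControl's destroyQueue operation (that method name is my assumption from the 2.2.x management interface, so please verify it against your version before relying on it).

```java
import java.util.ArrayList;
import java.util.List;

public class StaleNotifCleanup {

    // Minimal stand-in for the data a Queue MBean exposes; on a live
    // broker these values would be fetched over JMX.
    static class QueueInfo {
        final String name;
        final int consumerCount;
        QueueInfo(String name, int consumerCount) {
            this.name = name;
            this.consumerCount = consumerCount;
        }
    }

    // Select notification queues that have lost their consumer and would
    // only pile up messages; these are the candidates for destruction.
    static List<String> selectForCleanup(List<QueueInfo> queues) {
        List<String> doomed = new ArrayList<String>();
        for (QueueInfo q : queues) {
            if (q.name.startsWith("notif.") && q.consumerCount == 0) {
                doomed.add(q.name);
            }
        }
        return doomed;
    }

    public static void main(String[] args) {
        List<QueueInfo> snapshot = new ArrayList<QueueInfo>();
        snapshot.add(new QueueInfo("notif.dead-1", 0)); // invented names
        snapshot.add(new QueueInfo("notif.alive", 1));
        for (String name : selectForCleanup(snapshot)) {
            // A real task would invoke the management API here to drop the queue.
            System.out.println("would destroy: " + name);
        }
    }
}
```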
Thanks in advance for your comments.