7 Replies Latest reply on Aug 8, 2011 3:49 AM by dgmatos

Excess of notification queues and consequent out of memory (OOM)

dgmatos Aug 4, 2011 11:00 AM

We are facing an OOM issue with HornetQ in our preprod environment while running long-term tests.

Our setup in preprod:

JBoss 4.2.3.GA

HornetQ 2.2.5-Final

The memory dumps from the server show exactly the same problem that Jeroen described earlier:

https://issues.jboss.org/browse/HORNETQ-661

Same class (org.hornetq.utils.concurrent.HornetQConcurrentLinkedQueue), same notification (CONSUMER_CREATED), same queues (notif.<UUID>)

We have tried other configurations and tips from many other posts: the latest HornetQ version, broadcast/discovery groups enabled with a small broadcast period and a larger discovery refresh timeout, static discovery, larger TCP buffers, enabling NIO, etc., with no positive results. We are trying to isolate the problem as far as we can.

Upgrading JBoss to version 6, where HornetQ is the default messaging system, is on the table, but it will require a huge effort and there is no guarantee it will solve the issue.

It is a time-consuming problem to solve, as it takes a few days of operation to reproduce. To speed up reproduction and get an indication of whether the upgrade would solve our problem, the following test was run:

Setup - Two clean instances of JBoss 6.1 (downloaded from the nightly builds of yesterday) having HornetQ 2.2.5-Final by default. Both instances start fine and the HornetQ cluster is up (check attached log).

Then, a loss of connection between the nodes has been simulated with the command:

> iptables -A OUTPUT -p tcp --dport 5447 -j DROP

At this stage, after a few seconds, the connection failure is correctly detected by HornetQ. Traffic was then restored by deleting the iptables rule:

> iptables -D OUTPUT 1

Everything then seems to work fine, except that the JMX console on node2 shows one extra active notification queue (the "old" one is still there, and the address 'hornetq.notifications' routes to both). In JMS terms, both queues receive notification messages, as with a topic, while only the latest one appears to have a consumer attached. The other seems to accumulate notifications until an OOM condition is reached.

Each time the connection failure simulation is repeated, a new notification queue is created (with one consumer attached) and the "old" ones are not cleaned up (they remain with no consumers). The same thing happens when the failure is simulated on the opposite connection (in my example, by blocking TCP port 5445), with notification queues piling up on the first node.
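The accumulation described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `NotificationLeakModel`, `BoundQueue`, `reconnect` and `publish` are hypothetical names invented for illustration and are not HornetQ classes or methods. The model assumes that every reconnect binds a fresh `notif.<UUID>` queue to the address, that the previous queue stays bound without a consumer, and that the address routes a copy of each notification to every bound queue.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;
import java.util.UUID;

/** Toy model only: none of these types are HornetQ classes. */
final class NotificationLeakModel {
    /** A queue bound to the notification address; it may or may not have a consumer. */
    static final class BoundQueue {
        final Queue<String> pending = new ArrayDeque<>();
        boolean hasConsumer;
    }

    // Every queue bound to the (hypothetical) notification address, keyed by name.
    final Map<String, BoundQueue> bindings = new LinkedHashMap<>();
    private String activeQueue;

    /** A reconnect binds a fresh "notif.<UUID>" queue and abandons the previous one. */
    String reconnect() {
        if (activeQueue != null) {
            bindings.get(activeQueue).hasConsumer = false; // old queue stays bound
        }
        activeQueue = "notif." + UUID.randomUUID();
        BoundQueue queue = new BoundQueue();
        queue.hasConsumer = true;
        bindings.put(activeQueue, queue);
        return activeQueue;
    }

    /** The address routes a copy to every bound queue; consumed queues drain at once. */
    void publish(String notification) {
        for (BoundQueue queue : bindings.values()) {
            if (!queue.hasConsumer) {
                queue.pending.add(notification); // nobody reads it, so it piles up
            }
        }
    }

    /** Total number of notifications held by queues that nobody reads. */
    int orphanedBacklog() {
        int total = 0;
        for (BoundQueue queue : bindings.values()) {
            if (!queue.hasConsumer) {
                total += queue.pending.size();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        NotificationLeakModel model = new NotificationLeakModel();
        model.reconnect(); // initial cluster connection
        for (int failure = 0; failure < 3; failure++) {
            model.reconnect(); // simulated connection failure and recovery
            for (int i = 0; i < 1000; i++) {
                model.publish("CONSUMER_CREATED");
            }
        }
        System.out.println("queues bound: " + model.bindings.size());
        System.out.println("orphaned backlog: " + model.orphanedBacklog());
    }
}
```

In this model the backlog grows faster after every failure, because each new notification is copied into one more orphaned queue. That matches the observation that the problem takes days to appear and then ends in an OOM.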

On the other hand, it is possible that we are facing network issues (maybe a broken switch or hub), and we have network specialists trying to track down the problem there. The fact is that HornetQ eventually does lose the connection between nodes and reestablishes it. Anyway, IMHO HornetQ, as a reliable queuing system, should cope well with this kind of issue and recover cleanly.

This is a blocking issue for quality to approve the new release of our product, so any help/comment on this is appreciated.

At the end of the day, here are my thoughts:

When network issues happen (broken connections), old resources such as notification queues should be cleaned up.
Alternatively, if the queues are not cleaned up, the address 'hornetq.notifications' should at least stop routing to old notification queues, so that they do not receive new notifications.

Thanks in advance for your comments.

node2-server.log.zip 1.5 KB
node2-hornetq-jms.xml 883 bytes
node2-hornetq-configuration.xml.zip 1.4 KB
node1-server.log.zip 1.3 KB
node1-hornetq-jms.xml 883 bytes
node1-hornetq-configuration.xml.zip 1.4 KB

1. Re: Excess of notification queues and consequent out of memory (OOM)

clebert.suconic Aug 4, 2011 11:32 AM (in response to dgmatos)

I'm doing some work on clustering as we speak...

I'm fixing an issue on the ClusterConnectionBridge:

The createQueue here:
         session.createQueue(managementNotificationAddress, notifQueueName, filter, false);

should be

         session.createTemporaryQueue(managementNotificationAddress, notifQueueName, filter);

         notifConsumer = session.createConsumer(notifQueueName);

I'm currently working on my own parallel branch at: https://svn.jboss.org/repos/hornetq/branches/Branch_2_2_EAP_cluster_clean2/

Do you think you could test this branch? I'm almost ready to merge it into Branch_2_2_EAP, Branch_2_2_AS7 and trunk.

Or if you wish, you could just modify these two lines in your own version and this problem will go away.
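The difference between the two calls can be sketched with a toy model. It assumes, as stated later in this thread, that a temporary queue goes away when the connection that created it dies, while a regular queue stays. `QueueLifetimeModel` and its nested `Connection` class are hypothetical illustrations that only reuse the method names `createQueue` and `createTemporaryQueue`; they are not the HornetQ session API.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: it mimics the lifecycle difference, not HornetQ internals. */
final class QueueLifetimeModel {
    // Names of all queues that currently exist on the (hypothetical) server.
    final Set<String> serverQueues = new TreeSet<>();

    /** A client connection that remembers which temporary queues it created. */
    final class Connection {
        private final Set<String> temporaryQueues = new HashSet<>();

        void createQueue(String name) {
            serverQueues.add(name); // outlives the connection
        }

        void createTemporaryQueue(String name) {
            serverQueues.add(name);
            temporaryQueues.add(name); // scoped to this connection
        }

        /** Called when the connection is closed or declared dead. */
        void close() {
            serverQueues.removeAll(temporaryQueues);
            temporaryQueues.clear();
        }
    }

    public static void main(String[] args) {
        QueueLifetimeModel server = new QueueLifetimeModel();
        QueueLifetimeModel.Connection bridge = server.new Connection();
        bridge.createQueue("notif.regular");
        bridge.createTemporaryQueue("notif.temporary");
        bridge.close(); // simulated connection failure
        System.out.println("queues left on server: " + server.serverQueues);
    }
}
```

Only the regular queue survives the simulated failure in this model, which is why switching to a temporary queue is expected to stop orphaned notification queues from being left behind.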
2. Re: Excess of notification queues and consequent out of memory (OOM)

dgmatos Aug 4, 2011 12:00 PM (in response to clebert.suconic)

Hi Clebert.

I have applied your fix in the source and recompiled HornetQ.

It looks very promising, as the following exception on node1 disappears:
ERROR [org.hornetq.core.protocol.core.ServerSessionPacketHandler] (New I/O server worker #1-3) Caught exception:
HornetQException[errorCode=101 message=Queue notif.97cf41fd-be9d-11e0-a39a-0017085bc077 already exists]

My first analysis is that your fix actually solves half of the issue. It does clean up a notification queue, allowing the next one to be created (with the same name and no exception thrown). The reason queues were not also piling up on node1 is that they have the same name, so only one remained anyway.

But in node2 nothing changes. Two queues are present after the failure with different names.

notif.22a7795d-beaf-11e0-9c6d-0017085bc077
notif.9b374632-beaf-11e0-9c6d-0017085bc077
3. Re: Excess of notification queues and consequent out of memory (OOM)

clebert.suconic Aug 4, 2011 12:35 PM (in response to dgmatos)

How can you get a "queue already exists" error here? A new notification queue is created every time.

It seems you have network issues?

Anyway, in the new branch I'm working on, each Bridge will have its own connection, which means the temporary queue should go away when the connection dies.

Can you try this branch, as a test?

http://anonsvn.jboss.org/repos/hornetq/branches/Branch_2_2_EAP_cluster_clean2/
4. Re: Excess of notification queues and consequent out of memory (OOM)

clebert.suconic Aug 4, 2011 12:46 PM (in response to clebert.suconic)

I mean, this is how the notification queue is created on 2.2.5:

String qName = "notif." + UUIDGenerator.getInstance().generateStringUUID();

SimpleString notifQueueName = new SimpleString(qName);

So, the queue will have a new name every time
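A minimal, self-contained sketch of the same naming idea follows. As an assumption for illustration, `java.util.UUID` stands in for HornetQ's `UUIDGenerator`, so the generated names are not produced the same way as the ones quoted in this thread. `NotifQueueNames` and `newNotifQueueName` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

final class NotifQueueNames {
    /** Stand-in for the 2.2.5 naming scheme; java.util.UUID replaces UUIDGenerator here. */
    static String newNotifQueueName() {
        return "notif." + UUID.randomUUID();
    }

    public static void main(String[] args) {
        // Generate many names and count how many are distinct.
        Set<String> names = new HashSet<>();
        for (int i = 0; i < 10_000; i++) {
            names.add(newNotifQueueName());
        }
        System.out.println(names.size() + " distinct names out of 10000");
    }
}
```

With names generated this way, a "Queue ... already exists" error would be expected only if the same name were submitted to the server more than once, for example by retrying a create request that had already succeeded.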
5. Re: Excess of notification queues and consequent out of memory (OOM)

dgmatos Aug 5, 2011 5:23 AM (in response to clebert.suconic)
I have run tests with the branch where you are currently working.

Both nodes start OK, each with one notification queue. Then I simulate packet loss from node1 to node2 for 2 minutes. The end result is two notification queues on node1 and zero on node2. Subsequent packet loss simulations seem to have no effect on the cluster anymore, which is not surprising given the netstat output below: no reconnection happens between the nodes, and it looks like the cluster gets into an inconsistent state (does it?).

Find below the exact commands executed for this scenario and the server logs in attachment:
(5445: acceptor on node1, 5447: acceptor on node 2)

# netstat -anp | grep 5445
tcp        0      0 0.0.0.0:5445                0.0.0.0:*                   LISTEN      15671/java
tcp        0      0 138.203.248.103:52219       138.203.248.103:5445        ESTABLISHED 15746/java
tcp        0      0 138.203.248.103:52218       138.203.248.103:5445        ESTABLISHED 15746/java
tcp        0      0 138.203.248.103:5445        138.203.248.103:52219       ESTABLISHED 15671/java
tcp        0      0 138.203.248.103:5445        138.203.248.103:52218       ESTABLISHED 15671/java
# netstat -anp | grep 5447
tcp        0      0 0.0.0.0:5447                0.0.0.0:*                   LISTEN      15746/java
tcp        0      0 138.203.248.103:5447        138.203.248.103:54812       ESTABLISHED 15746/java
tcp        0      0 138.203.248.103:5447        138.203.248.103:54815       ESTABLISHED 15746/java
tcp        0      0 138.203.248.103:54812       138.203.248.103:5447        ESTABLISHED 15671/java
tcp        0      0 138.203.248.103:54815       138.203.248.103:5447        ESTABLISHED 15671/java
# iptables -A OUTPUT -p tcp --dport 5447 -j DROP
… (wait 2 min) …
# iptables -D OUTPUT 1
… (wait 2 min) …
# netstat -anp | grep 5445
tcp        0      0 0.0.0.0:5445                0.0.0.0:*                   LISTEN      15671/java
tcp        0      0 138.203.248.103:5445        138.203.248.103:49723       ESTABLISHED 15671/java
tcp        0      0 138.203.248.103:49723       138.203.248.103:5445        ESTABLISHED 15671/java
tcp        0      0 138.203.248.103:52219       138.203.248.103:5445        ESTABLISHED 15746/java
tcp        0      0 138.203.248.103:52218       138.203.248.103:5445        ESTABLISHED 15746/java
tcp        0      0 138.203.248.103:5445        138.203.248.103:52219       ESTABLISHED 15671/java
tcp        0      0 138.203.248.103:5445        138.203.248.103:52218       ESTABLISHED 15671/java
# netstat -anp | grep 5447
tcp        0      0 0.0.0.0:5447                0.0.0.0:*                   LISTEN      15746/java

I hope this gives you valuable input.
I can re-test if you update the branch. Let me know.

node2-server.log.zip 1.8 KB

node1-server.log.zip 1.5 KB
6. Re: Excess of notification queues and consequent out of memory (OOM)

clebert.suconic Aug 5, 2011 9:47 AM (in response to dgmatos)

I haven't run your test yet, but can you update and test again?

I have re-enabled confirmationWindowSize for reattaches (something else I fixed).

Also: TCP was supposed to be reliable. I'm not really sure your test is valid. It's not the same as removing the cable.
7. Re: Excess of notification queues and consequent out of memory (OOM)

dgmatos Aug 8, 2011 3:49 AM (in response to clebert.suconic)

Tested with revision 11143 and the result is the same as the last test.

I understand your concern about the validity of the test, and I think it is debatable. The goal is not to test a cable disconnection, but transitory network issues (such as a hardware glitch, packet collisions, etc.) while the link stays up. I believe this is a valid real-world operational scenario, but then I am not a network specialist.