I see a lot of paging happening. Perhaps you have a dead subscription that's causing your messages to build up.
And paging is not unbounded in memory if you are using transactions when sending messages; there are still collections held in memory to control the transactions.
You should clean up the dead subscription. Use PrintData; that may help you identify it:
I have a print-data.sh on my path like this:
java -Xmx2G -cp /work/hornetq/trunk/build/jars/hornetq-logging.jar:/work/hornetq/trunk/build/jars/hornetq-core.jar:/work/hornetq/trunk/thirdparty/org/jboss/netty/lib/netty.jar org.hornetq.core.persistence.impl.journal.PrintData $1 $2
Look at the bindings journal, where you can see the bindings that have been created.
Thanks for the response.
I tried out the tool, and at the moment, when everything is working as expected, I don't see any subscriptions that need to be cleaned up. I will definitely use it when I see the failure again.
At the time of the failure, the message listener was receiving a large amount of network traffic that could have prevented messages from flowing between the listener and the HornetQ server. This could possibly have left a dead connection on the server (possibly many, if the listener disconnected/reconnected many times). The listener is not using a transacted session, but instead uses client-acknowledge mode. Would this mode also add to the collections that manage transactions? Is there a way to prevent too many items from being added to the collection?
I understand that the connection-ttl setting controls how fast dead connections are cleaned up. I could consider reducing this to something very low to clean up faster. Would you suggest this?
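For reference, one server-side way to lower the TTL is the connection-ttl-override element in hornetq-configuration.xml; the value here (30 seconds) is purely illustrative:

```xml
<!-- hornetq-configuration.xml: force a TTL for all connections, overriding
     whatever the client asked for; value is in milliseconds, -1 disables it -->
<connection-ttl-override>30000</connection-ttl-override>
```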
You could also look at the web console. I thought you had something that would get dirty even if you restarted it. You can look at the core-queues you have created.
The connections look like they are being timed out; you could lower the TTL so this happens earlier, although it would be better to make your clients better behaved. I'm not sure, but the default Spring JmsTemplate had some issues with creating a connection for every message sent. I think there is a cached version which may be better, or better still, write your own connection logic so you know exactly what is going on.
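The cached version referred to here is Spring's CachingConnectionFactory. A minimal sketch of the wiring (the bean ids and the referenced HornetQ factory bean are assumptions) might look like:

```xml
<!-- Wrap the broker's connection factory so JmsTemplate reuses a cached
     connection and sessions instead of opening new ones per send -->
<bean id="cachingConnectionFactory"
      class="org.springframework.jms.connection.CachingConnectionFactory">
  <property name="targetConnectionFactory" ref="hornetqConnectionFactory"/>
  <property name="sessionCacheSize" value="10"/>
</bean>

<bean id="jmsTemplate" class="org.springframework.jms.core.JmsTemplate">
  <property name="connectionFactory" ref="cachingConnectionFactory"/>
</bean>
```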
Also, lower your paging thresholds if you still see the OOM.
Thanks for the response.
Connections were timing out on two of the consumers, that's correct. However, I don't think a client that times out should cause a HornetQ node to crash with an OOM exception. I do use the CachingConnectionFactory along with JmsTemplate, which prevents connections from being shut down for each message. It also caches sessions, consumers, and producers, so I don't think that's the issue.
I tried reproducing the situation in a test environment in two ways:
The first attempt involved killing clients while they are receiving large amounts of traffic, then restarting them. This should simulate badly behaved clients that do not close their connections. I tried this many times in quick succession and saw the client resources get cleaned up 60 seconds later (the default TTL). Memory may have increased slightly for a short period of time, but did not even get close to the 4GB heap I had provided it.
The second attempt involved the exact same load and clients. This time I was also restarting the clients to simulate them behaving badly. Then I killed one of the two HornetQ nodes, and memory shot up, eventually causing an OOM exception. I am guessing it is related to the core-queue that is configured between the HornetQ nodes as part of clustering.
My paging threshold is set at 10485760 (10 MB) and there are ~25 queues that I have configured, which seems low enough that I shouldn't see an OOM due to that, right?
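For context, this is the kind of address-settings block being discussed; the match pattern and page size below are illustrative, not taken from my deployed config:

```xml
<!-- Once an address holds ~10 MB of messages in memory, further messages
     are written to page files on disk instead of being kept in memory -->
<address-settings>
  <address-setting match="jms.queue.#">
    <max-size-bytes>10485760</max-size-bytes>
    <page-size-bytes>2097152</page-size-bytes>
    <address-full-policy>PAGE</address-full-policy>
  </address-setting>
</address-settings>
```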
If you are using transactions while sending messages, we still keep some objects in memory. If you use transactions, we need the transaction objects to guarantee atomic commits between the journal and the page files.
^^ that message was in the context of paging.
None of my clients use transactions; they use client-acknowledge mode instead. Does this mode hold onto objects while messages are being paged?
Does the core-queue between the cluster nodes use a transacted session? Is it possible that the combination of this core-queue plus paging could lead to an OOM?
I also see lots of errors for missing bindings and cluster issues with infinite loops in cluster connections. I would go through your config and make sure it's correct, make sure each node is using its own journal, and also clear the journals themselves.
A common mistake is users copying the data between nodes, which will cause a duplicate nodeID, since the nodeID is a UUID stored in the data folder. That may cause issues.
Thanks again for the responses.
I've gone through the cluster configurations and have made sure they are correct. max-hops is set at 1 and I have one static connector on each node that points to the other. The configuration file above is a copy of both of the ones I have deployed.
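The shape of that setup, sketched as a cluster-connection fragment (hostnames, ports, and connector names here are placeholders, not my actual values):

```xml
<!-- On node A: one connector for itself and one static connector
     pointing at node B; node B mirrors this configuration -->
<connectors>
  <connector name="netty">
    <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
    <param key="host" value="node-a"/>
    <param key="port" value="5445"/>
  </connector>
  <connector name="other-node">
    <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
    <param key="host" value="node-b"/>
    <param key="port" value="5445"/>
  </connector>
</connectors>

<cluster-connections>
  <cluster-connection name="my-cluster">
    <address>jms</address>
    <connector-ref>netty</connector-ref>
    <max-hops>1</max-hops>
    <static-connectors>
      <connector-ref>other-node</connector-ref>
    </static-connectors>
  </cluster-connection>
</cluster-connections>
```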
I haven't copied the data directory between nodes and looking at the logs the node IDs are definitely different.
I've created a test client that demonstrates that paging in a clustered environment will lead to an OOM exception and paging with a standalone node will not. The setup is very simple - for the first test I have a symmetric cluster using a discovery group, then have a client send large amounts of data at it. It may take some time, but eventually the node not taking traffic will OOM. The second setup is exactly like the first, but just doesn't start the second node.
The test package includes both hornetq nodes and libraries. The scripts that you should execute will start up both hornetq nodes, then the client.
- Download the package
- tar -xzf hornetq-test-oom.tgz
- cd hornetq-test
To run the clustered test
- Wait for the OOM
- Ctrl+C to stop the sender
- ./stop.sh (shuts down the hornetq nodes)
- ./clean.sh (cleans up the data directories, etc.)
To run the standalone test
- After the same amount of time as the test above, there is no OOM
- Ctrl+C to stop the sender
hornetq-test-oom.tgz 11.6 MB
I'll take a look at this today
How long did it take to get the OOM? I've been running this for 2 hours with no issues at all.
I ran this for 5 hours with no issues, memory stayed at a constant level. I had no errors in my server logs.
What exactly was your OOM? I've looked in the log and can't see it.
Also, as I said before, there are lots of other errors in your logs; maybe the OOM is just a side effect of one of these. For instance, there are lots of
[2012-06-03 14:39:57,892] 340127 WARN [Thread-8 (HornetQ-client-global-threads-1603211900)] org.hornetq.core.server.cluster.impl.ClusterConnectionImpl - MessageFlowRecordImpl [nodeID=b611bdcc-a46a-11e1-9f01-000c29848f2f, connector=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory?port=5444&host=us0101acq001, queueName=sf.my-cluster.b611bdcc-a46a-11e1-9f01-000c29848f2f, queue=QueueImpl[name=sf.my-cluster.b611bdcc-a46a-11e1-9f01-000c29848f2f, postOffice=PostOfficeImpl [server=HornetQServerImpl::serverUUID=331a454d-ad89-11e1-8694-000c29716b68]]@53606bf5, isClosed=false, firstReset=true]::Remote queue binding jms.queue.ConnectionManagerConsumerQueuefd1ac408-ad89-11e1-b190-000c29848f2f has already been bound in the post office. Most likely cause for this is you have a loop in your cluster due to cluster max-hops being too large or you have multiple cluster connections to the same nodes using overlapping addresses
This could be caused by rogue servers, other servers on the network, or servers using the same nodeID. I also see lots of
[2012-06-03 14:39:59,148] 341383 ERROR [Thread-9 (HornetQ-client-global-threads-1603211900)] org.hornetq.core.server.cluster.impl.ClusterConnectionImpl - Failed to handle message
java.lang.IllegalStateException: Cannot find binding for jms.queue.ABTestEventsfd1ac408-ad89-11e1-b190-000c29848f2f on ClusterConnectionImpl@293359801[nodeUUID=331a454d-ad89-11e1-8694-000c29716b68, connector=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory?port=5444&host=us0101acq002, address=jms, server=HornetQServerImpl::serverUUID=331a454d-ad89-11e1-8694-000c29716b68]
All in all, it looks like your cluster isn't being created properly. This could be environmental; try a different JDK or try it on a different machine. I also see lots of dead connections in your logs, which could be an issue.