I see a lot of paging happening. Perhaps you have a dead subscription that's causing your messages to build up.
And paging is not unbounded in memory if you are using transactions when sending messages; there are still collections held in memory to control the transactions.
You should clean up the dead subscription. Use PrintData; that may help you identify it:
I have a print-data.sh on my path like this:
java -Xmx2G -cp /work/hornetq/trunk/build/jars/hornetq-logging.jar:/work/hornetq/trunk/build/jars/hornetq-core.jar:/work/hornetq/trunk/thirdparty/org/jboss/netty/lib/netty.jar org.hornetq.core.persistence.impl.journal.PrintData $1 $2
Look at the bindings journal, where you can see the bindings that have been created.
Thanks for the response.
I tried out the tool, and at the moment, when everything is working as expected, I don't see any subscriptions that need to be cleaned up. I will definitely use it when I see the failure again.
At the time of the failure, the message listener was receiving a large amount of network traffic that could have prevented messages from flowing between the listener and the HornetQ server. This could possibly have left a dead connection on the server (possibly many, if the listener disconnected/reconnected many times). The listener is not using a transacted session, but instead uses client-acknowledge mode. Would this mode also add to the collections that manage transactions? Is there a way to prevent too many items from being added to the collection?
I understand that the connection-ttl setting controls how fast dead connections are cleaned up. I could consider reducing this to something very low to clean up faster. Would you suggest this?
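For reference, one server-side way to lower the TTL is the connection-ttl-override element in hornetq-configuration.xml; the value here (30 seconds) is purely illustrative:

```xml
<!-- hornetq-configuration.xml: force a TTL for all connections, overriding
     whatever the client asked for; value is in milliseconds, -1 disables it -->
<connection-ttl-override>30000</connection-ttl-override>
```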
You could also look at the web console. I thought you had something that would get dirty even if you restarted it. You can look at the core-queues you have created.
The connections look like they are being timed out; you could lower the TTL so this happens earlier, although it would be better to make your clients better behaved. I'm not sure, but the default Spring JmsTemplate had some issues with creating a connection for every message sent. I think there is a cached version which may be better, or better still, write your own connection logic so you know exactly what is going on.
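The cached version referred to here is Spring's CachingConnectionFactory. A minimal sketch of the wiring (the bean ids and the referenced HornetQ factory bean are assumptions) might look like:

```xml
<!-- Wrap the broker's connection factory so JmsTemplate reuses a cached
     connection and sessions instead of opening new ones per send -->
<bean id="cachingConnectionFactory"
      class="org.springframework.jms.connection.CachingConnectionFactory">
  <property name="targetConnectionFactory" ref="hornetqConnectionFactory"/>
  <property name="sessionCacheSize" value="10"/>
</bean>

<bean id="jmsTemplate" class="org.springframework.jms.core.JmsTemplate">
  <property name="connectionFactory" ref="cachingConnectionFactory"/>
</bean>
```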
Also, lower your paging thresholds if you still see the OOM.
Thanks for the response.
Connections were timing out on two of the consumers, that's correct. However, I don't think a client that times out should cause a HornetQ node to crash with an OOM exception. I do use the CachingConnectionFactory along with JmsTemplate, which prevents connections from being shut down for each message. It also caches sessions, consumers, and producers, so I don't think that's the issue.
I tried reproducing the situation in a test environment in two ways:
The first attempt involved killing clients while they are receiving large amounts of traffic, then restarting them. This should simulate badly behaved clients that do not close their connections. I tried this many times in quick succession and saw the client resources get cleaned up 60 seconds later (the default TTL). Memory may have increased slightly for a short period of time, but did not even get close to the 4GB heap I had provided it.
The second attempt involved the exact same load and clients. This time I was also restarting the clients to simulate them behaving badly. Then I killed one of the two HornetQ nodes, and memory shot up, eventually causing an OOM exception. I am guessing it is related to the core-queue that is configured between the HornetQ nodes as part of clustering.
My paging threshold is set at 10485760 (10 MB) and there are ~25 queues that I have configured, which seems low enough that I shouldn't see an OOM due to that, right?
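For context, this is the kind of address-settings block being discussed; the match pattern and page size below are illustrative, not taken from my deployed config:

```xml
<!-- Once an address holds ~10 MB of messages in memory, further messages
     are written to page files on disk instead of being kept in memory -->
<address-settings>
  <address-setting match="jms.queue.#">
    <max-size-bytes>10485760</max-size-bytes>
    <page-size-bytes>2097152</page-size-bytes>
    <address-full-policy>PAGE</address-full-policy>
  </address-setting>
</address-settings>
```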
If you are using transactions while sending messages, we still keep some objects in memory. If you use transactions, we need the transaction objects to guarantee atomic commits between the journal and the page files.
^^ that message was in the context of paging.
None of my clients use transactions; they use client-acknowledge mode instead. Does this mode hold onto objects while messages are being paged?
Does the core-queue between the cluster nodes use a transacted session? Is it possible that the combination of this core-queue plus paging could lead to an OOM?
I also see lots of errors for missing bindings and cluster issues with infinite loops in cluster connections. I would go through your config and make sure it's correct, make sure each node is using its own journal, and also clear the journals themselves.
A common mistake is users copying the data between nodes, which will cause a duplicate nodeID, since the nodeID is a UUID stored in the data folder. That may cause issues.
Thanks again for the responses.
I've gone through the cluster configurations and have made sure they are correct. max-hops is set at 1 and I have one static connector on each node that points to the other. The configuration file above is a copy of both of the ones I have deployed.
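The shape of that setup, sketched as a cluster-connection fragment (hostnames, ports, and connector names here are placeholders, not my actual values):

```xml
<!-- On node A: one connector for itself and one static connector
     pointing at node B; node B mirrors this configuration -->
<connectors>
  <connector name="netty">
    <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
    <param key="host" value="node-a"/>
    <param key="port" value="5445"/>
  </connector>
  <connector name="other-node">
    <factory-class>org.hornetq.core.remoting.impl.netty.NettyConnectorFactory</factory-class>
    <param key="host" value="node-b"/>
    <param key="port" value="5445"/>
  </connector>
</connectors>

<cluster-connections>
  <cluster-connection name="my-cluster">
    <address>jms</address>
    <connector-ref>netty</connector-ref>
    <max-hops>1</max-hops>
    <static-connectors>
      <connector-ref>other-node</connector-ref>
    </static-connectors>
  </cluster-connection>
</cluster-connections>
```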
I haven't copied the data directory between nodes and looking at the logs the node IDs are definitely different.
I've created a test client that demonstrates that paging in a clustered environment will lead to an OOM exception and paging with a standalone node will not. The setup is very simple - for the first test I have a symmetric cluster using a discovery group, then have a client send large amounts of data at it. It may take some time, but eventually the node not taking traffic will OOM. The second setup is exactly like the first, but just doesn't start the second node.
The test package includes both hornetq nodes and libraries. The scripts that you should execute will start up both hornetq nodes, then the client.
- Download the package
- tar -xzf hornetq-test-oom.tgz
- cd hornetq-test
To run the clustered test
- Wait for the OOM
- Ctrl+C to stop the sender
- ./stop.sh (shuts down the hornetq nodes)
- ./clean.sh (cleans up the data directories, etc.)
To run the standalone test
- After the same amount of time as the test above, there is no OOM
- Ctrl+C to stop the sender
hornetq-test-oom.tgz 11.6 MB
I'll take a look at this today
How long did it take to get the OOM? I've been running this for 2 hours with no issues at all.
I ran this for 5 hours with no issues, memory stayed at a constant level. I had no errors in my server logs.
What exactly was your OOM? I've looked in the log and can't see it.
Also, as I said before, there are lots of other errors in your logs; maybe the OOM is just a side effect of one of these. For instance, there are lots of
[2012-06-03 14:39:57,892] 340127 WARN [Thread-8 (HornetQ-client-global-threads-1603211900)] org.hornetq.core.server.cluster.impl.ClusterConnectionImpl - MessageFlowRecordImpl [nodeID=b611bdcc-a46a-11e1-9f01-000c29848f2f, connector=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory?port=5444&host=us0101acq001, queueName=sf.my-cluster.b611bdcc-a46a-11e1-9f01-000c29848f2f, queue=QueueImpl[name=sf.my-cluster.b611bdcc-a46a-11e1-9f01-000c29848f2f, postOffice=PostOfficeImpl [server=HornetQServerImpl::serverUUID=331a454d-ad89-11e1-8694-000c29716b68]]@53606bf5, isClosed=false, firstReset=true]::Remote queue binding jms.queue.ConnectionManagerConsumerQueuefd1ac408-ad89-11e1-b190-000c29848f2f has already been bound in the post office. Most likely cause for this is you have a loop in your cluster due to cluster max-hops being too large or you have multiple cluster connections to the same nodes using overlapping addresses
This could be caused by rogue servers, other servers on the network, or servers using the same nodeID. I also see lots of
[2012-06-03 14:39:59,148] 341383 ERROR [Thread-9 (HornetQ-client-global-threads-1603211900)] org.hornetq.core.server.cluster.impl.ClusterConnectionImpl - Failed to handle message
java.lang.IllegalStateException: Cannot find binding for jms.queue.ABTestEventsfd1ac408-ad89-11e1-b190-000c29848f2f on ClusterConnectionImpl@293359801[nodeUUID=331a454d-ad89-11e1-8694-000c29716b68, connector=org-hornetq-core-remoting-impl-netty-NettyConnectorFactory?port=5444&host=us0101acq002, address=jms, server=HornetQServerImpl::serverUUID=331a454d-ad89-11e1-8694-000c29716b68]
All in all, it looks like your cluster isn't being created properly. This could be environmental; try a different JDK or try it on a different machine. I also see lots of dead connections in your logs, which could be an issue.