I have tried various Sun JVM versions, including u18, with the same results. One interesting thing to note is that this is related to the JBossWS stack: when I turn off all web service access to my SLSBs and hit the cluster only via RMI, the leak goes away. If anyone has any ideas as to what may cause a native (non-Java-heap, non-PermGen) leak in this scenario, please pass them on, even if it is just brainstorming.
I will try to create a simple dummy application and web service client that reproduce the problem. Unfortunately, I was unable to get the JVM running under valgrind, even after following this FAQ: http://valgrind.org/docs/manual/faq.html#faq.java
VisualVM, which comes with JDK 6, has some decent tools for tracking down memory leaks. It might help you pinpoint where the leak is, and if it really is related to web services, you could file a JIRA issue.
Oh, the other thing you should do is check the latest JBossWS releases to see if any of them fixes a memory leak.
I'm not 100% sure offhand what is involved in the code path for JBossWS, but maybe check and see if direct buffer space is being exhausted?
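One way to test the direct-buffer theory is to poll the JVM's "direct" buffer pool and see whether it grows in step with the leak. The sketch below assumes Java 7 or later (the `BufferPoolMXBean` interface does not exist on JDK 6); the class and method names are hypothetical. If direct memory turns out to be the culprit, HotSpot's `-XX:MaxDirectMemorySize` flag caps it.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class DirectBufferMonitor {

    /** Bytes currently used by the JVM's "direct" buffer pool, or -1 if no such pool is reported. */
    public static long directMemoryUsed() {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        // Allocate 1 MB of native memory outside the Java heap.
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024);
        long after = directMemoryUsed();
        System.out.printf("direct pool before: %d bytes, after: %d bytes%n", before, after);
        // Keep the buffer reachable until after the second measurement.
        System.out.println("buffer capacity: " + buf.capacity());
    }
}
```

Logging `directMemoryUsed()` periodically under web service load, and comparing it with the RMI-only case, would show whether the native growth lives in direct buffers or somewhere else.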
I have finally found a workaround, using the standard OpenJDK that comes with CentOS 5: java-1.6.0-openjdk-188.8.131.52-1.7.b09.el5. Using this JDK fixes the leak completely. VIRT size stabilizes at 1833M, which is exactly what I would expect. So I have no idea whether this version of OpenJDK contains a fix or the newer commercial JDKs from Sun contain a regression, as I still see the problem with JDK 6 update 19.
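To track the VIRT figure from inside the JVM rather than by watching `top`, one option on Linux is to poll `/proc/self/status`, where `VmSize` corresponds to top's VIRT and `VmRSS` to RES. This is a minimal sketch assuming Linux and Java 8+; `ProcMemory` and `readStatusKb` are hypothetical names. A steady `VmSize` climb while heap usage stays flat hints at native growth but does not prove where it comes from.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ProcMemory {

    /** Value in kB of a field such as "VmSize" or "VmRSS" in /proc/self/status, or -1 if absent. Linux only. */
    public static long readStatusKb(String field) {
        try {
            for (String line : Files.readAllLines(Paths.get("/proc/self/status"), StandardCharsets.UTF_8)) {
                if (line.startsWith(field + ":")) {
                    // Lines look like "VmSize:  1876992 kB"
                    String[] parts = line.substring(field.length() + 1).trim().split("\\s+");
                    return Long.parseLong(parts[0]);
                }
            }
            return -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Take a few samples one second apart; in production you would log these over hours.
        for (int i = 0; i < 3; i++) {
            System.out.printf("VmSize=%d kB VmRSS=%d kB%n",
                    readStatusKb("VmSize"), readStatusKb("VmRSS"));
            Thread.sleep(1000);
        }
    }
}
```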
The crazy part is that no one else seems to have had such a problem. My application is rather busy, up to 50 transactions/second, but it still confounds me that no one else seems to see this native leak. Anyway, thanks for your replies. I would still like to figure out what is going on, but for now I will just be happy I have found a way to stabilize my cluster.
I'm wondering if you ever got to the bottom of this. I'm seeing the same thing in 1.6.0_26 (I know, that's pretty old), and only with an application that receives lots of GETs on our Netty server. I checked lsof on a long-running vs. a new instance of the application, and it doesn't look like we're leaking file handles (connections). So it feels like a JVM-level problem.
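As a complement to running lsof by hand, the process can log its own descriptor count by listing `/proc/self/fd`. This sketch assumes Linux and Java 7+, and `FdCounter` is a hypothetical name. Listing the directory itself briefly opens one descriptor, so the count may include it; what matters is whether the number keeps rising over time.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class FdCounter {

    /** Number of entries in /proc/self/fd (Linux only), or -1 if the directory cannot be listed. */
    public static int openFdCount() {
        String[] entries = new File("/proc/self/fd").list();
        return entries == null ? -1 : entries.length;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("open fds at start: " + openFdCount());
        try (FileInputStream in = new FileInputStream("/proc/self/status")) {
            // One extra descriptor is held while this stream is open.
            System.out.println("open fds with a stream open: " + openFdCount());
        }
        System.out.println("open fds after close: " + openFdCount());
    }
}
```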
Did you upgrade to JDK7 or use OpenJDK to fix this?
I would analyze the heap first; if that is fruitless, look into the latest JDK 1.6 update, and if that still doesn't help, move to JDK 1.7.
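For the heap-analysis step, a heap dump can be taken from outside with `jmap -dump:live,format=b,file=heap.hprof <pid>`, or from inside the application through HotSpot's diagnostic MXBean, as sketched below. This assumes a HotSpot JVM and Java 8+; `HeapDumper` is a hypothetical name. Newer JDKs require the output file to end in `.hprof` and not already exist. Keep in mind that a heap dump only covers the Java heap, so it can rule heap growth in or out but will not show native allocations directly.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {

    /** Writes an HPROF dump to path; live=true keeps only reachable objects (triggers a full GC first). */
    public static void dumpHeap(String path, boolean live) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(path, live);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unique name in the temp directory, since the target file must not already exist.
        File out = new File(System.getProperty("java.io.tmpdir"),
                "heap-" + System.currentTimeMillis() + ".hprof");
        dumpHeap(out.getAbsolutePath(), true);
        System.out.println("wrote " + out + " (" + out.length() + " bytes)");
    }
}
```

The resulting file can be opened in VisualVM or another HPROF viewer.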
I recently found this issue at:
Approximately a third of the way down, the page states:
Do not Use Hypersonic in Production
Although Hypersonic configuration is used as the default persistence configuration, Hypersonic is not suitable or supported in production due to the following known issues:
- no transaction isolation
- thread and socket leaks (connection.close() does not tidy up resources)
- persistence quality (logs commonly become corrupted after a failure, preventing automatic recovery)
- database corruption
- stability under load (database processes cease when dealing with too much data)
- not viable in clustered environments
The Hypersonic database is intended for developing and testing purposes and should not be used in a production environment. For more information about recommended databases, refer to the Using Other Databases chapter in the Getting Started Guide.
So, if this is a production server, you may want to switch the JMS persistence store to a different SQL database.
"The crazy part is that no one else seems to have had such a problem."
I have a desktop app (not a JBoss server app, but a very complicated multithreaded industrial-automation app) running on CentOS, and I experienced something like Jon's problem: the allocated memory kept climbing and bogging down the app, while used memory stayed flat at a reasonable level.

I was about to try Jon's fix of reverting to the CentOS OpenJDK instead of the Oracle JVM I'm using when I noticed some objects in the heap spiking fairly high in count before being released and garbage collected. I'm talking about spikes of a few thousand small objects in a Hashtable. If I had collected more data, perhaps I would have found large spikes of several hundred MB; I don't know. I changed the algorithm in my code so that it holds no more than ten or twenty such objects at a time instead of thousands, and the allocated-memory issue disappeared.

Is the JVM getting spooked by spikes in used memory, allocating too much memory, not shrinking the allocation after the spike disappears, and then bogging down the app? Does the CentOS OpenJDK that Jon used handle such spikes more gracefully?
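One way to probe that question is to compare the heap's "used" and "committed" sizes before and after a deliberate spike. Committed memory is what the JVM has reserved from the OS, and it need not shrink when used memory drops. This is a minimal sketch under assumptions: Java 7+, the names are hypothetical, and mapping "used"/"allocated" in the post onto the MXBean's used/committed figures is a guess. Whether committed memory is handed back depends on the collector and on HotSpot flags such as `-XX:MinHeapFreeRatio` and `-XX:MaxHeapFreeRatio`, and it says nothing about native (non-heap) memory.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Hashtable;

public class HeapSpikeDemo {

    /** Current heap usage: bytes used by objects and bytes committed by the JVM. */
    public static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    /** Fills a Hashtable with count 1 KB arrays, then lets it go out of scope: a short-lived spike. */
    public static int spike(int count) {
        Hashtable<Integer, byte[]> table = new Hashtable<>();
        for (int i = 0; i < count; i++) {
            table.put(i, new byte[1024]);
        }
        return table.size();
    }

    private static void print(String label, MemoryUsage u) {
        System.out.printf("%-13s used=%8d kB committed=%8d kB%n",
                label, u.getUsed() / 1024, u.getCommitted() / 1024);
    }

    public static void main(String[] args) {
        print("before spike", heap());
        int n = spike(20_000); // roughly 20 MB of short-lived objects
        print("after spike", heap());
        System.gc(); // only a request; the JVM may ignore it
        print("after gc", heap());
        System.out.println("objects created in spike: " + n);
    }
}
```

If "committed" stays high after the GC while "used" falls back, the JVM is holding on to memory it grew for the spike; bounding the spike, as described above, avoids that growth in the first place.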