JBoss 5.1.0 Community version - JGroups holding and not releasing half of the available Old Gen memory

    I have been pulled into a customer site that is running a 4-node JBoss 5.1.0 cluster for a NetIQ User Application server. It lets end users register accounts and update passwords, and lets applications update user and group information via SOAP. One of the nodes runs fine for a few days, and then Old Gen usage suddenly jumps. The server keeps running because these are large 4 GB heaps, but no matter how many times the Old Gen garbage collection runs, it does not release the memory. This is not a slow leak with small increases in Old Gen usage: Old Gen goes all at once from 12% used after a successful GC to 70% used after GC. Then the server gets busy and doesn't have enough memory left to work.
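
    The "used after GC" figures above can also be read from inside the JVM. The following is a minimal sketch using the standard java.lang.management API; the class name OldGenAfterGc and the helper percentUsed are made up for illustration. With the parallel collector the old generation pool is normally reported as "PS Old Gen", but the sketch prints every heap pool rather than relying on that name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper class, not part of JBoss or the User Application.
public class OldGenAfterGc {

    /** Percentage of max in use, or -1 if the pool does not define a max. */
    static double percentUsed(long used, long max) {
        if (max <= 0) {
            return -1.0;
        }
        return 100.0 * used / max;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the permanent generation
            }
            // Usage measured right after the most recent collection of this pool;
            // null when the pool does not support this measurement.
            MemoryUsage afterGc = pool.getCollectionUsage();
            if (afterGc == null) {
                continue;
            }
            System.out.printf("%s: %d bytes used after last GC (%.1f%% of max)%n",
                    pool.getName(), afterGc.getUsed(),
                    percentUsed(afterGc.getUsed(), afterGc.getMax()));
        }
    }
}
```

    Run standalone, this reports on its own JVM only. To watch the affected node, the same MemoryPoolMXBean attributes would have to be read from inside the server or over a remote JMX connection.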

     

    This only happens on 1 of the 4 nodes, and it is always the same node. I expected it to move to another node once this one was no longer the primary, but every time they have the problem it is on the same node.

     

    A heap dump shows the suspect thread to be:

    The thread java.lang.Thread @ 0x77a8ba7b0 ConnectionTable.Connection.Receiver [10.35.107.10:7805 - 10.35.107.10:7805],prod-idm-partition,10.35.107.10:7803 keeps local variables with total size 1,650,814,856 (78.67%) bytes.

     

    ConnectionTable.Connection.Receiver [10.35.107.10:7805 - 10.35.107.10:7805],prod-idm-partition,10.35.107.10:7803
      at java.net.SocketInputStream.socketRead0(Ljava/io/FileDescriptor;[BIII)I (Native Method)
      at java.net.SocketInputStream.read([BII)I (Unknown Source)
      at java.io.BufferedInputStream.read1([BII)I (Unknown Source)
      at java.io.BufferedInputStream.read([BII)I (Unknown Source)
      at java.io.DataInputStream.readFully([BII)V (Unknown Source)
      at org.jgroups.blocks.BasicConnectionTable$Connection.run()V (BasicConnectionTable.java:662)
      at java.lang.Thread.run()V (Unknown Source)
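
    One pattern that could explain a receiver thread keeping a very large local variable alive while it sits blocked in readFully is a length-prefixed read loop that reuses one buffer and only ever grows it. Whether BasicConnectionTable.java:662 actually works this way is an assumption that has not been checked against the JGroups source. The sketch below is not JGroups code; FrameReaderSketch and all of its members are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical illustration only; this is not the JGroups source.
public class FrameReaderSketch {
    // Reused receive buffer: it grows to fit the largest frame and is never shrunk.
    private byte[] buf = new byte[256];

    /** Reads one frame (4-byte length prefix, then payload) and returns the payload length. */
    int readFrame(DataInputStream in) throws IOException {
        int len = in.readInt();
        if (len > buf.length) {
            buf = new byte[len]; // the old, smaller buffer becomes garbage
        }
        in.readFully(buf, 0, len);
        return len;
    }

    /** Feeds frames of the given payload sizes to one reader; returns buffer capacity after each. */
    static int[] capacitiesAfter(int... payloadLengths) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (int n : payloadLengths) {
                out.writeInt(n);
                out.write(new byte[n]);
            }
            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            FrameReaderSketch reader = new FrameReaderSketch();
            int[] capacities = new int[payloadLengths.length];
            for (int i = 0; i < payloadLengths.length; i++) {
                reader.readFrame(in);
                capacities[i] = reader.buf.length;
            }
            return capacities;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One large frame followed by a small one.
        for (int capacity : capacitiesAfter(100000, 10)) {
            System.out.println("buffer capacity: " + capacity);
        }
    }
}
```

    In this sketch the capacity stays at 100000 after the small frame, because nothing ever replaces the buffer with a smaller one. If the real receiver behaves similarly, a single very large message (or a corrupt length prefix) would leave that thread holding a buffer that no amount of garbage collection can reclaim while the connection stays open.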

     

    End result in the log files: this server runs out of memory and the application stops working, which forces them to kill JBoss and restart it.

    2013-10-15 11:12:13,524 INFO  [STDOUT] (ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:7803 - 10.35.107.11:46845],prod-idm-partition,10.35.107.10:7803) 1385192049 [ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:7803 - 10.35.107.11:46845],prod-idm-partition,10.35.107.10:7803] ERROR org.jgroups.blocks.ConnectionTable  - failed sending data to 10.35.107.11:7803: java.net.SocketException: Socket closed

    2013-10-15 11:12:13,530 INFO  [STDOUT] (ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:7803 - 10.53.55.11:47309],prod-idm-partition,10.35.107.10:7803) 1385192055 [ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:7803 - 10.53.55.11:47309],prod-idm-partition,10.35.107.10:7803] ERROR org.jgroups.blocks.ConnectionTable  - failed sending data to 10.53.55.11:7803: java.net.SocketException: Socket closed

    2013-10-15 11:12:19,559 INFO  [STDOUT] (OOB-1,prod-idm-partition,10.35.107.10:7803) 1385198084 [OOB-1,prod-idm-partition,10.35.107.10:7803] WARN org.jgroups.protocols.FD  - I was suspected by 10.53.55.10:7803; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

    2013-10-15 11:12:26,188 INFO  [STDOUT] (ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:24985 - 10.53.55.11:7803],prod-idm-partition,10.35.107.10:7803) 1385204713 [ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:24985 - 10.53.55.11:7803],prod-idm-partition,10.35.107.10:7803] ERROR org.jgroups.blocks.ConnectionTable  - failed sending data to 10.53.55.11:7803: java.net.SocketException: Socket closed

     

    2013-10-15 11:13:42,162 INFO  [STDOUT] (ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:7803 - 10.53.55.10:44342],prod-idm-partition,10.35.107.10:7803) 1385280686 [ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:7803 - 10.53.55.10:44342],prod-idm-partition,10.35.107.10:7803] ERROR org.jgroups.blocks.ConnectionTable  - failed sending data to 10.53.55.10:7803: java.net.SocketException: Socket closed

    2013-10-15 11:14:42,209 INFO  [STDOUT] (ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:46442 - 10.53.55.11:7803],prod-idm-partition,10.35.107.10:7803) 1385340734 [ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:46442 - 10.53.55.11:7803],prod-idm-partition,10.35.107.10:7803] ERROR org.jgroups.blocks.ConnectionTable  - failed sending data to 10.53.55.11:7803: java.net.SocketException: Broken pipe

     

    2013-10-15 11:22:14,399 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[localhost].[/IDM].[UIQuery]] (http-0.0.0.0-8180-39) Servlet.service() for servlet UIQuery threw exception

    java.lang.OutOfMemoryError: Java heap space

    2013-10-15 11:22:35,685 INFO  [STDOUT] (Thread-47) INFO  [RBPM] [com.novell.soa.af.impl.core.EngineImpl:setState] Workflow Engine setState: [STOPPING]

    2013-10-15 11:22:35,685 INFO  [STDOUT] (Thread-47) 1385814209 [Thread-47] INFO com.novell.soa.af.impl.core.EngineImpl  - Workflow Engine setState: [STOPPING]

    2013-10-15 11:22:35,679 ERROR [org.apache.catalina.core.ContainerBase.[jboss.web].[localhost].[/IDM].[UIQuery]] (http-0.0.0.0-8180-23) Servlet.service() for servlet UIQuery threw exception

    java.lang.OutOfMemoryError: Java heap space

    2013-10-15 11:22:55,487 INFO  [STDOUT] (ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:31766 - 10.53.55.10:7803],prod-idm-partition,10.35.107.10:7803) 1385834012 [ConnectionTable.Connection.Sender local_addr=10.35.107.10:7803 [10.35.107.10:31766 - 10.53.55.10:7803],prod-idm-partition,10.35.107.10:7803] ERROR org.jgroups.blocks.ConnectionTable  - failed sending data to 10.53.55.10:7803: java.net.SocketException: Socket closed

    2013-10-15 11:23:41,996 WARN  [org.jgroups.protocols.FC] (Incoming-11,10.35.107.10:7600) Received two credit requests from 10.35.107.11:7600 without any intervening messages; sending 1998924 credits

    2013-10-15 11:23:41,996 WARN  [org.jgroups.protocols.FC] (Incoming-11,10.35.107.10:7600) Received two credit requests from 10.35.107.11:7600 without any intervening messages; sending 1998924 credits

    2013-10-15 11:23:41,997 WARN  [org.jgroups.protocols.FC] (Incoming-11,10.35.107.10:7600) Received two credit requests from 10.35.107.11:7600 without any intervening messages; sending 1998924 credits

    2013-10-15 11:23:41,997 WARN  [org.jgroups.protocols.FC] (Incoming-11,10.35.107.10:7600) Received two credit requests from 10.35.107.11:7600 without any intervening messages; sending 1998924 credits

    2013-10-15 11:23:48,711 WARN  [org.jgroups.protocols.FC] (Incoming-11,10.35.107.10:7600) Received two credit requests from 10.35.107.11:7600 without any intervening messages; sending 1998924 credits

     

    Startup options:

     

    JAVA_OPTS="-server \
    -Xms4096m \
    -Xmx4096m \
    -XX:MaxPermSize=256m \
    -XX:+UseParallelGC \
    -XX:+UseParallelOldGC \
    -Djava.awt.headless=true \
    -Dfile.encoding=UTF-8 \
    -Dsun.jnu.encoding=UTF-8 \
    -Dsun.rmi.dgc.client.gcInterval=3600000 \
    -Dsun.rmi.dgc.server.gcInterval=3600000 \
    -Dcom.sun.management.jmxremote \
    -Dcom.sun.management.jmxremote.port=6002 \
    -Dcom.sun.management.jmxremote.authenticate=false \
    -Dcom.sun.management.jmxremote.ssl=false \
    -Djava.rmi.server.hostname=10.35.107.10 \
    -Djboss.default.jgroups.stack=tcp \
    -Djgroups.bind_addr=10.35.107.10 \
    -Dnovell.jgroups.tcp.tcpping.initial_hosts=10.35.107.10[7803],10.35.107.11[7803],10.53.55.10[7803],10.53.55.11[7803] \
    -Dcom.novell.afw.wf.engine-id=rbpmdbedc1"
    export JAVA_OPTS

     

    Any ideas?

     

    Thanks