JGroups OOME
matthewlowe Jul 24, 2013 3:46 AM
Hi all,
I would like to ask for your help in analyzing an OutOfMemoryError (OOME) that I encountered in JGroups. At least I think it's in JGroups. One of our customers hit an OOME in production, on Infinispan in a 5-node cluster. When analyzing the memory dump, I noticed that about 470 MB of retained heap is held by org.jgroups.blocks.TCPConnectionMap$TCPConnection$Sender.
When I list JVM objects by count, the top of the list looks like this:
This leads me to suspect that messages are not being sent and are piling up in the sender, which eventually causes the OOME. But I can't figure out why this would happen.
Could this issue be caused by some of the cluster nodes being down? Or is there any "natural" scenario in which an OOME can happen in JGroups?
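To make the suspicion concrete, here is a minimal Java sketch of the kind of pile-up I have in mind. It is not JGroups code and I have not checked how the real Sender queue is sized: the class name SenderQueueSketch, the enqueue helper, the 1 KB payload and the capacity of 500 are all made up for illustration. It only shows that a queue nobody drains retains every message when it is unbounded, while a bounded queue stops accepting messages once it is full.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration only; this is NOT the JGroups implementation.
class SenderQueueSketch {

    // Offers `messages` payloads to a queue that no consumer drains
    // (simulating a sender whose peer is unreachable) and returns how
    // many payloads the queue accepted and therefore keeps on the heap.
    public static int enqueue(BlockingQueue<byte[]> queue, int messages, int payloadSize) {
        int accepted = 0;
        for (int i = 0; i < messages; i++) {
            // offer() never blocks: it returns false when a bounded queue is full
            if (queue.offer(new byte[payloadSize])) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        int messages = 10_000;   // assumed number of pending messages
        int payloadSize = 1024;  // assumed payload size in bytes

        // No capacity given: the queue grows until the heap is exhausted
        BlockingQueue<byte[]> unbounded = new LinkedBlockingQueue<>();
        // Assumed capacity of 500: the queue rejects anything beyond that
        BlockingQueue<byte[]> bounded = new LinkedBlockingQueue<>(500);

        System.out.println("unbounded queue retains " + enqueue(unbounded, messages, payloadSize) + " messages");
        System.out.println("bounded queue retains " + enqueue(bounded, messages, payloadSize) + " messages");
    }
}
```

If the real sender behaves like the unbounded case while a peer is unreachable, the retained heap would simply grow with the number of undelivered messages, which would match what I see in the dump. Whether that is what actually happens in JGroups is exactly what I am unsure about.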
This is the exception that, according to the YourKit memory profiler, caused the OOME:
Timer-19,_threadNameOmmitted_32726 tid=188 [RUNNABLE] [DAEMON] <--- OutOfMemoryError happened in this thread
java.lang.OutOfMemoryError.<init>()
org.jgroups.blocks.TCPConnectionMap$TCPConnection.send(byte[], int, int)
org.jgroups.blocks.TCPConnectionMap$TCPConnection.access$100(TCPConnectionMap$TCPConnection, byte[], int, int)
org.jgroups.blocks.TCPConnectionMap.send(Address, byte[], int, int)
org.jgroups.protocols.TCP.send(Address, byte[], int, int)
org.jgroups.protocols.BasicTCP.sendUnicast(PhysicalAddress, byte[], int, int)
org.jgroups.protocols.TP.sendToSingleMember(Address, byte[], int, int)
org.jgroups.protocols.TP.doSend(Buffer, Address, boolean)
org.jgroups.protocols.TP.send(Message, Address, boolean)
org.jgroups.protocols.TP.down(Event)
org.jgroups.protocols.Discovery.down(Event)
org.jgroups.protocols.TCPPING.down(Event)
org.jgroups.protocols.MERGE2.down(Event)
org.jgroups.protocols.FD_SOCK.down(Event)
org.jgroups.protocols.FD.down(Event)
org.jgroups.protocols.VERIFY_SUSPECT.down(Event)
org.jgroups.protocols.pbcast.NAKACK.down(Event)
org.jgroups.protocols.UNICAST.retransmit(long, Message)
org.jgroups.stack.AckSenderWindow.retransmit(long, long, Address)
org.jgroups.stack.DefaultRetransmitter$SeqnoTask.callRetransmissionCommand()
org.jgroups.stack.Retransmitter$Task.run()
org.jgroups.util.TimeScheduler2$MyTask.run()
org.jgroups.util.TimeScheduler2$Entry.execute()
org.jgroups.util.TimeScheduler2$1.run()
java.lang.Thread.run()
Any help is much appreciated.
Matthew