4 Replies Latest reply on Jul 26, 2013 4:12 AM by matthewlowe

JGroups OOME

matthewlowe Jul 24, 2013 3:46 AM

Hi all,

I would like to ask for help analyzing an OOME that I encountered in JGroups. At least I think it's JGroups. One of our customers hit an OOME in production on Infinispan in a 5-node cluster. When analyzing the memory dump, I noticed that ~470MB of retained heap is held by org.jgroups.blocks.TCPConnectionMap$TCPConnection$Sender.

When I list JVM objects by count, the top of the list looks like this:

This leads me to suspect that messages are not being sent and are piling up on the sender, which eventually causes the OOME. But I can't figure out why this would happen.

Could this issue be caused by some of the cluster nodes being down? Or is there any "natural" scenario in which an OOME in JGroups can happen?

This is the exception that, according to the YourKit memory profiler, caused the OOME:

Timer-19,_threadNameOmmitted_32726 tid=188 [RUNNABLE] [DAEMON] <--- OutOfMemoryError happened in this thread
java.lang.OutOfMemoryError.<init>()
org.jgroups.blocks.TCPConnectionMap$TCPConnection.send(byte[], int, int)
org.jgroups.blocks.TCPConnectionMap$TCPConnection.access$100(TCPConnectionMap$TCPConnection, byte[], int, int)
org.jgroups.blocks.TCPConnectionMap.send(Address, byte[], int, int)
org.jgroups.protocols.TCP.send(Address, byte[], int, int)
org.jgroups.protocols.BasicTCP.sendUnicast(PhysicalAddress, byte[], int, int)
org.jgroups.protocols.TP.sendToSingleMember(Address, byte[], int, int)
org.jgroups.protocols.TP.doSend(Buffer, Address, boolean)
org.jgroups.protocols.TP.send(Message, Address, boolean)
org.jgroups.protocols.TP.down(Event)
org.jgroups.protocols.Discovery.down(Event)
org.jgroups.protocols.TCPPING.down(Event)
org.jgroups.protocols.MERGE2.down(Event)
org.jgroups.protocols.FD_SOCK.down(Event)
org.jgroups.protocols.FD.down(Event)
org.jgroups.protocols.VERIFY_SUSPECT.down(Event)
org.jgroups.protocols.pbcast.NAKACK.down(Event)
org.jgroups.protocols.UNICAST.retransmit(long, Message)
org.jgroups.stack.AckSenderWindow.retransmit(long, long, Address)
org.jgroups.stack.DefaultRetransmitter$SeqnoTask.callRetransmissionCommand()
org.jgroups.stack.Retransmitter$Task.run()
org.jgroups.util.TimeScheduler2$MyTask.run()
org.jgroups.util.TimeScheduler2$Entry.execute()
org.jgroups.util.TimeScheduler2$1.run()
java.lang.Thread.run()

Any help is much appreciated.

Matthew

1. Re: JGroups OOME

mircea.markus Jul 24, 2013 12:58 PM (in response to matthewlowe)

It seems the sender node is queueing up messages. Is the receiving node overwhelmed?
2. Re: JGroups OOME

sannegrinovero Jul 25, 2013 6:07 AM (in response to mircea.markus)

Mircea Markus wrote:

It seems the sender node is queueing up messages. Is the receiving node overwhelmed?
Mircea, even if the sender were queueing up lots of messages, I would expect it to create back pressure on the application. Is this an unbounded queue? That looks like a design problem that needs fixing.
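
To make the back-pressure point concrete, here is a minimal sketch of a bounded send queue. It is a hypothetical illustration, not the JGroups sender code: the class name BoundedSendQueueSketch and its methods are made up for this example. The idea is only that a fixed capacity plus a timed offer() makes the caller wait (or fail) when the consumer falls behind, instead of letting the heap grow.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; this is NOT the JGroups sender implementation.
public class BoundedSendQueueSketch {
    private final BlockingQueue<byte[]> queue;

    public BoundedSendQueueSketch(int capacity) {
        // A fixed capacity caps how many messages can be retained at once.
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Waits up to timeoutMs for free space; returns false if the queue stayed full.
    public boolean enqueue(byte[] message, long timeoutMs) {
        try {
            return queue.offer(message, timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    // Would be called by a sender thread; returns null if nothing is queued.
    public byte[] dequeue() {
        return queue.poll();
    }

    public int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        BoundedSendQueueSketch q = new BoundedSendQueueSketch(2);
        System.out.println(q.enqueue(new byte[60_000], 10)); // true
        System.out.println(q.enqueue(new byte[60_000], 10)); // true
        System.out.println(q.enqueue(new byte[60_000], 10)); // false: queue is full
        q.dequeue();                                         // "sender" drains one message
        System.out.println(q.enqueue(new byte[60_000], 10)); // true again
    }
}
```

With this shape the application thread sees the slowdown directly (a blocked or rejected enqueue), which is the behaviour I would expect from a transport under load.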
3. Re: JGroups OOME

matthewlowe Jul 25, 2013 7:56 AM (in response to sannegrinovero)

Is there a possibility to monitor this queue via JMX? Which parameters should I keep an eye on to see whether messages are piling up? Or are there any other indicators I could watch that would point to a potential problem in the sender/receiver?
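
As a starting point for finding out what is exposed at all, something like the sketch below could dump every readable attribute of the MBeans matching a pattern. It uses only the standard javax.management API. The assumptions are mine: that the code runs inside the same JVM as the cluster node, that the JGroups MBeans are registered with the platform MBeanServer, and that they live under a domain such as "jgroups:*" (the actual domain and attribute names depend on the JGroups/Infinispan version and configuration). The class name MBeanDump is made up.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

// Hypothetical helper: lists readable attributes of all MBeans matching a pattern.
public class MBeanDump {

    public static Map<String, Object> readAttributes(String pattern) {
        ObjectName query;
        try {
            query = new ObjectName(pattern);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Bad ObjectName pattern: " + pattern, e);
        }
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        Map<String, Object> values = new TreeMap<>();
        for (ObjectName name : server.queryNames(query, null)) {
            try {
                for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
                    if (!attr.isReadable()) {
                        continue;
                    }
                    try {
                        values.put(name + "#" + attr.getName(),
                                   server.getAttribute(name, attr.getName()));
                    } catch (Exception e) {
                        // Some attributes cannot be read at runtime; skip them.
                    }
                }
            } catch (Exception e) {
                // The MBean may have been unregistered in the meantime; skip it.
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Pass e.g. "jgroups:*" (assumed domain) as the first argument;
        // the default below is a JVM MBean that always exists.
        String pattern = args.length > 0 ? args[0] : "java.lang:type=Memory";
        readAttributes(pattern).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Running it periodically and comparing the values would at least show which counters grow over time; the same names can then be watched from JConsole.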

Btw: the OOME happened before the message queue filled up. We use 60KB messages (FRAG2) and the default queue size of 10,000, which gives 600MB of space. It got up to 470MB and then crashed. The mystery for me is why there are 470MB of messages in the first place, and how this could be found in the logs or via JMX. We don't use logging at the JGroups level, but we will enable it so we can get as much information as possible. I still need to find out which indicators I should keep an eye on.
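
For reference, the arithmetic behind those numbers as a small sketch. The assumptions are that the queue size counts messages, that every queued message is a full 60KB fragment, and that KB/MB are decimal units (1KB = 1,000 bytes), which is what makes 60KB x 10,000 come out at exactly 600MB. The class and method names are made up.

```java
// Back-of-the-envelope estimate; names are hypothetical, units are decimal.
public class QueueMemoryEstimate {

    // Worst-case bytes retained by a full queue of equally sized messages.
    static long worstCaseBytes(long messageBytes, long queueCapacity) {
        return Math.multiplyExact(messageBytes, queueCapacity);
    }

    // Number of whole messages that fit into the observed retained heap.
    static long messagesHeld(long retainedBytes, long messageBytes) {
        return retainedBytes / messageBytes;
    }

    public static void main(String[] args) {
        long messageBytes = 60_000L;       // 60KB fragment size (FRAG2)
        long queueCapacity = 10_000L;      // default queue size from the post
        long retainedBytes = 470_000_000L; // ~470MB seen in the heap dump

        System.out.println(worstCaseBytes(messageBytes, queueCapacity)); // 600000000
        System.out.println(messagesHeld(retainedBytes, messageBytes));   // 7833
    }
}
```

So under these assumptions roughly 7,800 of the 10,000 slots were occupied when the heap ran out, i.e. the queue limit alone cannot protect a heap that has less than 600MB to spare.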

ANY help is very much appreciated guys.
4. Re: JGroups OOME

matthewlowe Jul 26, 2013 4:12 AM (in response to matthewlowe)

Also, could someone please explain to me how to read digests like this one?
[mergeDigest()]
existing digest:   S018-62198: [1034 : 1058 (1058)], S020-25583: [810 : 833 (833)], S005-54373: [54 : 75 (75)]
new digest:        S018-62198: [1034 : 1056 (1057)], S020-25583: [810 : 833 (833)], S005-54373: [54 : 75 (75)]
resulting digest: S018-62198: [1034 : 1058 (1058)], S020-25583: [810 : 833 (833)], S005-54373: [54 : 75 (75)]
                                                |          |        |                              |        |      |                           |      |     |
                                                ?         ?       ?                             ?       ?     ?                          ?     ?    ?

I think this could be a helpful indicator of whether some messages are not being retransmitted.

Thx