4 Replies Latest reply on Jul 26, 2013 4:12 AM by matthewlowe

    JGroups OOME

    matthewlowe

      Hi all,

       

      I would like to ask for your help in analyzing an OOME that I encountered in JGroups. At least I think it's JGroups. One of our customers hit an OOME in production on Infinispan in a 5-node cluster. When analyzing the memory dump, I noticed that there's ~470MB of retained heap held by org.jgroups.blocks.TCPConnectionMap$TCPConnection$Sender.

       

      When I list JVM objects by their count, the top of the list looks like this:

      [Image: histogram_top.PNG]

      This leads me to suspect that messages are not being sent and are piling up in the sender, which eventually causes the OOME. But I can't figure out why this would happen.

      Could this issue be caused by some of the cluster nodes being down? Or is there any "natural" scenario in which an OOME in JGroups can happen?

       

      This is the exception which, according to the YourKit memory profiler, caused the OOME:

      Timer-19,_threadNameOmmitted_32726 tid=188 [RUNNABLE] [DAEMON] <--- OutOfMemoryError happened in this thread

      java.lang.OutOfMemoryError.<init>()

      org.jgroups.blocks.TCPConnectionMap$TCPConnection.send(byte[], int, int)

      org.jgroups.blocks.TCPConnectionMap$TCPConnection.access$100(TCPConnectionMap$TCPConnection, byte[], int, int)

      org.jgroups.blocks.TCPConnectionMap.send(Address, byte[], int, int)

      org.jgroups.protocols.TCP.send(Address, byte[], int, int)

      org.jgroups.protocols.BasicTCP.sendUnicast(PhysicalAddress, byte[], int, int)

      org.jgroups.protocols.TP.sendToSingleMember(Address, byte[], int, int)

      org.jgroups.protocols.TP.doSend(Buffer, Address, boolean)

      org.jgroups.protocols.TP.send(Message, Address, boolean)

      org.jgroups.protocols.TP.down(Event)

      org.jgroups.protocols.Discovery.down(Event)

      org.jgroups.protocols.TCPPING.down(Event)

      org.jgroups.protocols.MERGE2.down(Event)

      org.jgroups.protocols.FD_SOCK.down(Event)

      org.jgroups.protocols.FD.down(Event)

      org.jgroups.protocols.VERIFY_SUSPECT.down(Event)

      org.jgroups.protocols.pbcast.NAKACK.down(Event)

      org.jgroups.protocols.UNICAST.retransmit(long, Message)

      org.jgroups.stack.AckSenderWindow.retransmit(long, long, Address)

      org.jgroups.stack.DefaultRetransmitter$SeqnoTask.callRetransmissionCommand()

      org.jgroups.stack.Retransmitter$Task.run()

      org.jgroups.util.TimeScheduler2$MyTask.run()

      org.jgroups.util.TimeScheduler2$Entry.execute()

      org.jgroups.util.TimeScheduler2$1.run()

      java.lang.Thread.run()

       

      Any help is much appreciated.

       

      Matthew

        • 1. Re: JGroups OOME
          mircea.markus

          Seems like the sender node is queueing up messages. Is the receiving node overwhelmed?

          • 2. Re: JGroups OOME
            sannegrinovero

            Mircea Markus wrote:

             

            Seems like the sender node is queueing up messages. Is the receiving node overwhelmed?

            Mircea, even if the sender were queueing up lots of messages, I would expect it to apply back pressure to the application. Is this an unbounded queue? That looks like a design problem that needs fixing.
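
            To make the back pressure point concrete, here is a minimal sketch using plain java.util.concurrent. It is not the JGroups sender; the class name BackPressureSketch and the capacity are hypothetical. A bounded queue rejects new entries once it is full, so the producer notices the congestion instead of the heap filling up.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration only: NOT JGroups code. A bounded queue that refuses new
// entries when full, so the caller can slow down, retry, or drop.
final class BackPressureSketch {

    private final BlockingQueue<byte[]> queue;

    BackPressureSketch(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Non-blocking enqueue: returns false when the queue is already full.
    boolean tryEnqueue(byte[] message) {
        return queue.offer(message);
    }

    // Number of messages currently waiting to be sent.
    int queued() {
        return queue.size();
    }

    public static void main(String[] args) {
        BackPressureSketch sender = new BackPressureSketch(2);
        for (int i = 0; i < 3; i++) {
            // Nothing drains the queue here, so the last offer is rejected.
            System.out.println("message " + i + " accepted: " + sender.tryEnqueue(new byte[16]));
        }
    }
}
```

            A blocking variant would use put() or offer() with a timeout instead, which stalls the caller rather than returning false. Either way the memory held by the queue stays bounded by capacity times message size.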

            • 3. Re: JGroups OOME
              matthewlowe

              Is there a possibility to monitor this queue via JMX? What kind of parameters should I keep an eye on if I want to see whether messages are piling up? Or are there any other indicators I could watch that would point to a potential problem in the sender/receiver?
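
              As a starting point for finding out what is exposed at all, the following sketch lists every readable attribute of the MBeans registered in the local JVM whose domain contains a given fragment. It uses only the standard javax.management API. The class name MBeanDump and the default fragment "jgroups" are assumptions; the actual domain and attribute names depend on how JGroups and Infinispan register their MBeans in a given setup, and JMX has to be enabled for them.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper: dumps readable attributes of MBeans by domain fragment.
final class MBeanDump {

    // Returns "objectName attribute=value" lines for every readable attribute
    // of every MBean whose domain contains domainFragment (case-insensitive).
    static List<String> dump(MBeanServer server, String domainFragment) {
        List<String> lines = new ArrayList<>();
        String needle = domainFragment.toLowerCase(Locale.ROOT);
        // queryNames(null, null) selects all registered MBeans.
        for (ObjectName name : server.queryNames(null, null)) {
            if (!name.getDomain().toLowerCase(Locale.ROOT).contains(needle)) {
                continue;
            }
            try {
                for (MBeanAttributeInfo attr : server.getMBeanInfo(name).getAttributes()) {
                    if (!attr.isReadable()) {
                        continue;
                    }
                    try {
                        Object value = server.getAttribute(name, attr.getName());
                        lines.add(name + " " + attr.getName() + "=" + value);
                    } catch (Exception e) {
                        // Some attributes fail on read; skip them.
                    }
                }
            } catch (Exception e) {
                // The MBean may have been unregistered in the meantime.
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // "jgroups" is an assumed default; pass the real domain fragment as an argument.
        String fragment = args.length > 0 ? args[0] : "jgroups";
        dump(ManagementFactory.getPlatformMBeanServer(), fragment).forEach(System.out::println);
    }
}
```

              This only sees MBeans in the JVM it runs in, so for a production node it would have to run inside that node, or the same attributes could be browsed remotely with a tool such as jconsole.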

               

              Btw, the OOME happened before the message queue filled up. We use 60KB messages (FRAG2) and the default queue size of 10,000, which comes to 600MB of space. It got up to 470MB and then crashed. The mystery for me is why there are 470MB of messages in the first place, and how this could be detected in logs or via JMX. We don't use logging at the JGroups level, but we will enable it so we can get as much information as possible. I still need to find out which indicators I should keep an eye on, though.
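
              The arithmetic above can be sketched as follows, assuming decimal units (60KB = 60,000 bytes) and ignoring per-object overhead. The class name SendQueueEstimate is hypothetical, and whether every connection has its own queue is an assumption suggested only by Sender being an inner class of TCPConnection.

```java
// Rough estimate only: decimal units, no per-object overhead.
final class SendQueueEstimate {

    // Worst-case bytes held by one completely full queue.
    static long worstCaseBytes(long messageBytes, long queueCapacity) {
        return messageBytes * queueCapacity;
    }

    // Number of whole messages of the given size that fit into retainedBytes.
    static long messagesFor(long retainedBytes, long messageBytes) {
        return retainedBytes / messageBytes;
    }

    public static void main(String[] args) {
        long messageBytes = 60_000L;      // 60KB fragments (FRAG2)
        long queueCapacity = 10_000L;     // default queue size from the post
        long retainedBytes = 470_000_000L; // retained heap seen in the dump

        long worstCase = worstCaseBytes(messageBytes, queueCapacity);
        long queuedMessages = messagesFor(retainedBytes, messageBytes);

        System.out.println("worst case per queue (bytes): " + worstCase);
        System.out.println("messages in 470MB: " + queuedMessages);
        System.out.println("queue fill (%): " + (queuedMessages * 100 / queueCapacity));

        // ASSUMPTION: one queue per peer connection; 4 peers in a 5-node cluster.
        System.out.println("worst case for 4 peers (bytes): " + (4 * worstCase));
    }
}
```

              Under these assumptions the queue was roughly three quarters full when the heap ran out, so the bound of 10,000 entries never came into play.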

               

              ANY help is very much appreciated, guys.

              • 4. Re: JGroups OOME
                matthewlowe

                Also, could someone please explain to me how to read digests like this one?

                [mergeDigest()]

                existing digest:   S018-62198: [1034 : 1058 (1058)], S020-25583: [810 : 833 (833)], S005-54373: [54 : 75 (75)]

                new digest:        S018-62198: [1034 : 1056 (1057)], S020-25583: [810 : 833 (833)], S005-54373: [54 : 75 (75)]

                resulting digest:  S018-62198: [1034 : 1058 (1058)], S020-25583: [810 : 833 (833)], S005-54373: [54 : 75 (75)]

                (What does each of the three numbers in an entry such as [1034 : 1058 (1058)] stand for?)

                 

                I think this could be a helpful indicator of whether some messages are not being retransmitted.
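
                For comparing such lines programmatically, here is a small parsing sketch. It assumes each entry has the form sender: [low : highest delivered (highest received)], which is my understanding of how JGroups 2.x prints digest entries; treat those field names as an assumption, not as a confirmed answer. The class name DigestLine is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for log lines like the mergeDigest() output above.
final class DigestLine {

    // One entry, e.g. "S018-62198: [1034 : 1058 (1058)]".
    private static final Pattern ENTRY =
            Pattern.compile("(\\S+): \\[(\\d+) : (\\d+) \\((\\d+)\\)\\]");

    // Maps each sender to its three numbers, in the order they are printed.
    // ASSUMED meaning: {low, highest delivered, highest received}.
    static Map<String, long[]> parse(String line) {
        Map<String, long[]> entries = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(line);
        while (m.find()) {
            entries.put(m.group(1), new long[] {
                    Long.parseLong(m.group(2)),
                    Long.parseLong(m.group(3)),
                    Long.parseLong(m.group(4)) });
        }
        return entries;
    }

    // Difference between the third and second number; under the assumption
    // above, messages that were received but not yet delivered.
    static long gap(long[] entry) {
        return entry[2] - entry[1];
    }

    public static void main(String[] args) {
        String line = "new digest:        S018-62198: [1034 : 1056 (1057)], "
                + "S020-25583: [810 : 833 (833)], S005-54373: [54 : 75 (75)]";
        parse(line).forEach((sender, e) ->
                System.out.println(sender + " gap=" + gap(e)));
    }
}
```

                If the assumed field meaning holds, a gap that stays above zero across successive digests for the same sender would be the thing to watch.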

                 

                Thx