14 Replies Latest reply on Aug 4, 2008 2:31 AM by dilipreddy

    Two-Node Cluster UDP OutOfMemoryError

    jowizzle

      Hello,

      First, and perhaps completely unrelated: Is it normal to see messages such as "additional data: 19 bytes" throughout the logs?

      Moving on...

      I have a two-node cluster of stock 4.0.5GA servers. After roughly 4 hours of operation one node will fail with an OutOfMemoryError stemming from org.jgroups.protocols.UDP. Both servers have two eth interfaces, so I set bind_addr on the UDP element accordingly in cluster-service.xml and jboss-service.xml in the tc5-cluster sar.
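
      On a multi-homed host it can help to confirm which address each interface actually carries before putting it into bind_addr. A minimal sketch using only the standard java.net API; the class name ListBindCandidates is hypothetical and not part of JBoss or JGroups:

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper for illustration; not part of JBoss or JGroups.
class ListBindCandidates {

    // Returns one "interface -> address" line per address found on this host.
    static List<String> candidates() {
        List<String> lines = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return lines; // some JVMs return null when no interface is found
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    lines.add(nic.getName() + " -> " + addr.getHostAddress());
                }
            }
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : candidates()) {
            System.out.println(line);
        }
    }
}
```

      The address printed next to eth0 (or whichever interface the cluster should use) is the value to compare against bind_addr.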

      I enabled DEBUG logging for JGroups, and it gets pretty messy. First, node2 stops acking are-you-alive messages. Then node1 gets suspected, for no apparent reason. If I understand correctly, node1 is the coordinator, so node2 can't remove it, and node1 will refuse to remove itself from the view. It may, however, opt to leave and rejoin.

      Below is an excerpt from the cluster log file from around the time things begin to go awry. Any hints are greatly appreciated.

      2007-04-25 15:33:07,237 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32802 (own address=node1:32839)
      2007-04-25 15:33:07,269 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      node2:32802: 1 msgs
      
      2007-04-25 15:33:07,284 DEBUG [org.jgroups.protocols.FD] received ack from node2:32802
      2007-04-25 15:33:07,316 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      node2:32802: 1 msgs
      
      2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32802 (own address=node1:32839)
      2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node2:32802 (number=0)
      2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32805 (additional data: 19 bytes) (own address=node1:32842 (additional data: 19 bytes))
      2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node2:32805 (additional data: 19 bytes) (number=0)
      2007-04-25 15:34:51,767 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node1:32842 (additional data: 19 bytes)], from=node2:32805 (additional data: 19 bytes))]
      2007-04-25 15:34:51,767 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
      2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
      2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]
      2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node1:32842 (additional data: 19 bytes)] view is [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]
      2007-04-25 15:34:51,780 WARN [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, node1:32842 (additional data: 19 bytes) is not a member of view [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]; discarding view
      2007-04-25 15:34:51,781 WARN [org.jgroups.protocols.pbcast.GMS] I (node1:32842 (additional data: 19 bytes)) am being shunned, will leave and rejoin group (prev_members are [node1:32842 (additional data: 19 bytes) node2:32805 (additional data: 19 bytes) ])
      2007-04-25 15:34:51,781 INFO [org.jgroups.JChannel] received an EXIT event, will leave the channel
      2007-04-25 15:34:51,783 INFO [org.jgroups.JChannel] closing the channel
      2007-04-25 15:34:51,786 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
       at java.net.PlainDatagramSocketImpl.receive0(Native Method)
       at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
       at java.net.DatagramSocket.receive(DatagramSocket.java:724)
       at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
       at java.lang.Thread.run(Thread.java:799)
      
      2007-04-25 15:34:51,790 ERROR [org.jgroups.protocols.UDP] [node1:32839] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
       at java.net.PlainDatagramSocketImpl.receive0(Native Method)
       at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
       at java.net.DatagramSocket.receive(DatagramSocket.java:724)
       at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
       at java.lang.Thread.run(Thread.java:799)
      
      2007-04-25 15:34:51,795 DEBUG [org.jgroups.protocols.pbcast.NAKACK] contents for node1:32842 (additional data: 19 bytes):
      
      sent_msgs: [6837 - 6890]
      received_msgs:
      node2:32805 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [276 - 328]
      node1:32842 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [6838 - 6890]
      
      2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] socket to node2:32805 (additional data: 19 bytes) was reset
      2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
      2007-04-25 15:34:51,825 DEBUG [org.jgroups.protocols.UDP]
      sending msgs:
      node1:32839: 1 msgs
      
      2007-04-25 15:34:52,092 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
       at java.net.PlainDatagramSocketImpl.receive0(Native Method)
       at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
       at java.net.DatagramSocket.receive(DatagramSocket.java:724)
       at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
       at java.lang.Thread.run(Thread.java:799)
      


        • 1. Re: Two-Node Cluster UDP OutOfMemoryError
          visprar

          We are also facing a similar issue. We're trying to upgrade from 2.2.9.2 to 2.4 SP2, but I'm not sure that will help. What's your JGroups version?

          • 2. Re: Two-Node Cluster UDP OutOfMemoryError
            brian.stansberry

            It's hard to tell cause from effect in this kind of situation. Your log shows node1 being suspected by node2 and properly starting the process of closing down the channel to rejoin. Then a few ms later the vm runs out of memory.

            Most likely whatever was going on that eventually led to the OOME was also making node1 unresponsive enough that node2 suspected it.

            The question is why the OOME occurred. First, it's *extremely* unlikely the process of handling the suspicion and closing the channel is itself what caused the OOME. Second, the fact that UDP is what threw the OOME doesn't really mean it or JGroups was the underlying cause -- it just means UDP was the code trying to allocate an object when the heap was finally out of space.

            JGroups 2.4.1.SP3 has an improvement to the FC (flow control) protocol that prevents an OOME condition that could occur when the channel is running under sustained overload. That may help. But, IMHO the odds are pretty low that that was the cause of your OOME. You're better off trying to profile your application to confirm you have no memory leaks.

            • 3. Re: Two-Node Cluster UDP OutOfMemoryError
              visprar

              So, is there a benchmark test that JBoss has, where 2-3 nodes in a cluster run perfectly for hours without any issues? If yes, where can I find the configuration details for it (like JVM heap and cluster config)?

              Is the default config shipped with JBoss fairly accurate, or do you suggest fine-tuning it? Here is my config, with 1.4 GB RAM on SUSE:

              UDP(down_thread=false;enable_bundling=true;ip_ttl=2;loopback=false;max_bundle_size=64000;max_bundle_timeout=30;mcast_addr=228.1.2.8;mcast_port=45503;mcast_recv_buf_size=25000000;mcast_send_buf_size=640000;ucast_recv_buf_size=20000000;ucast_send_buf_size=640000;up_thread=false;use_incoming_packet_handler=true;use_outgoing_packet_handler=true):PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):MERGE2(down_thread=false;max_interval=100000;min_interval=20000;up_thread=false):FD(down_thread=false;max_tries=5;shun=true;timeout=2500;up_thread=false):VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):pbcast.NAKACK(discard_delivered_msgs=true;down_thread=false;gc_lag=50;max_xmit_size=60000;retransmit_timeout=100,200,300,600,1200,2400,4800;up_thread=false;use_mcast_xmit=false):UNICAST(down_thread=false;timeout=300,600,1200,2400,3600;up_thread=false):pbcast.STABLE(desired_avg_gossip=50000;down_thread=false;max_bytes=2100000;stability_delay=1000;up_thread=false):pbcast.GMS(down_thread=false;join_retry_timeout=2000;join_timeout=3000;print_local_addr=true;shun=true;up_thread=false):FC(down_thread=false;max_credits=10000000;min_threshold=0.20;up_thread=false):FRAG2(down_thread=false;frag_size=60000;up_thread=false):pbcast.STATE_TRANSFER(down_thread=false;up_thread=false)
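
              A one-line stack string like the one above is hard to read. As a rough sketch, the parser below splits it into protocols and their properties so that values such as the receive buffer sizes or FC's max_credits are easy to pick out. It is plain string handling, not a JGroups API; the class name StackStringParser is hypothetical, and it assumes every protocol carries a parenthesised property list, as in this config:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; plain string parsing, not a JGroups API.
class StackStringParser {

    // Parses "PROTO(k=v;k=v):PROTO2(k=v)" into protocol -> (property -> value),
    // keeping protocols and properties in their original order.
    static Map<String, Map<String, String>> parse(String stack) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        int pos = 0;
        while (pos < stack.length()) {
            int open = stack.indexOf('(', pos);
            if (open < 0) {
                throw new IllegalArgumentException("missing '(' after position " + pos);
            }
            int close = stack.indexOf(')', open + 1);
            if (close < 0) {
                throw new IllegalArgumentException("missing ')' after position " + open);
            }
            String name = stack.substring(pos, open).trim();
            Map<String, String> props = new LinkedHashMap<>();
            for (String pair : stack.substring(open + 1, close).split(";")) {
                if (pair.trim().isEmpty()) {
                    continue; // tolerate empty property lists and trailing ';'
                }
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    throw new IllegalArgumentException("property without '=': " + pair);
                }
                props.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
            }
            result.put(name, props);
            pos = close + 1;
            if (pos < stack.length() && stack.charAt(pos) == ':') {
                pos++; // skip the protocol separator
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened excerpt of the config quoted in this post.
        String excerpt = "UDP(mcast_recv_buf_size=25000000;ucast_recv_buf_size=20000000)"
                + ":FD(max_tries=5;shun=true;timeout=2500)"
                + ":FC(max_credits=10000000;min_threshold=0.20)";
        for (Map.Entry<String, Map<String, String>> proto : parse(excerpt).entrySet()) {
            System.out.println(proto.getKey() + " " + proto.getValue());
        }
    }
}
```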

              • 4. Re: Two-Node Cluster UDP OutOfMemoryError
                jowizzle

                Thanks for the reply. The OOME occurs without any applications deployed. I fiddled with the new JGroups, but I decided to wait for 4.2 GA. I'm in the process of rolling that out now.

                • 5. Re: Two-Node Cluster UDP OutOfMemoryError
                  jowizzle

                  Checked this morning with 4.2. Same thing. Perhaps it's environmental. I'm going to switch from IBM to Sun and see where I wind up tomorrow.

                  • 6. Re: Two-Node Cluster UDP OutOfMemoryError
                    brian.stansberry

                    So you're saying that you start a cluster with no applications deployed, and then after 4 hours of operation you get an OOME?

                    • 7. Re: Two-Node Cluster UDP OutOfMemoryError
                      jowizzle

                      Correct. No applications deployed.

                      I apologize for leaving out a critical detail. I run several stand-alone JBoss instances on the IBM jvm, so I've been running the cluster on IBM as well. Yesterday I switched to Sun, and it's been up and running since (about 21 hours).

                      • 8. Re: Two-Node Cluster UDP OutOfMemoryError
                        brian.stansberry

                        Weird. I've pinged a couple of colleagues to see if they've heard about anything odd with the IBM VMs. I know we have customers running significant clusters using the IBM VMs and haven't heard about OOME issues.

                        So basically you just fire up a couple of stock AS instances and let them run (presumably doing nothing) and after a while they OOME. Very strange.

                        Anything interesting about your topology? Are the AS instances on the same machine?

                        BTW, to answer your original question, the "additional data" logging is normal.

                        • 9. Re: Two-Node Cluster UDP OutOfMemoryError
                          jowizzle

                          I have two identical physical machines, each with two network interfaces, eth0 and eth1. Each machine runs one instance of JBoss 4.0.5 GA (configured, but not modified). We bind everything to eth0.

                          I start node1 and wait for it to come online. Then I start node2. (The end result is the same if I reverse the order.)

                          Node1 and node2 are on the same switch. Nothing fancy. Using TCP results in the same OOME.

                          If you think it's necessary, I'd be happy to compile a full spec and help troubleshoot, as I've been looking to get involved in the project anyway.

                          • 10. Re: Two-Node Cluster UDP OutOfMemoryError
                            brian.stansberry

                            Sure, anything you can do to help troubleshoot would be appreciated. I'd be interested in knowing if jconsole shows the memory usage slowly growing over time or whether it suddenly rises. Either way, some heap snapshots would be good. A google search showed this tool that may be helpful: http://www-1.ibm.com/support/docview.wss?rs=180&uid=swg24005757
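
                            If attaching jconsole for several hours is not practical, the same question (slow growth or sudden jump) can be answered by logging samples from inside the JVM. A minimal sketch using the standard java.lang.management API; the class name HeapTrend and the sampling defaults are assumptions. It only reports memory the JVM itself manages, so memory allocated directly by native code would not show up here:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration; uses only the standard management API.
class HeapTrend {

    // Formats a byte count in KB; the API reports -1 when a limit is undefined.
    static String kb(long bytes) {
        return bytes < 0 ? "undefined" : (bytes / 1024) + " KB";
    }

    // Takes one timestamped sample of heap and non-heap usage.
    static String sample() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        MemoryUsage nonHeap = mem.getNonHeapMemoryUsage();
        return String.format("%tT heap used=%s committed=%s max=%s | non-heap used=%s",
                System.currentTimeMillis(),
                kb(heap.getUsed()), kb(heap.getCommitted()), kb(heap.getMax()),
                kb(nonHeap.getUsed()));
    }

    // Usage: java HeapTrend [intervalSeconds] [sampleCount]
    public static void main(String[] args) throws InterruptedException {
        long intervalSeconds = args.length > 0 ? Long.parseLong(args[0]) : 1;
        int count = args.length > 1 ? Integer.parseInt(args[1]) : 3;
        for (int i = 0; i < count; i++) {
            System.out.println(sample());
            if (i < count - 1) {
                Thread.sleep(intervalSeconds * 1000);
            }
        }
    }
}
```

                            Run standalone, this samples its own JVM. To watch the server, the sample() call would have to run inside the server's JVM, for example from a scheduled task.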

                            • 11. Re: Two-Node Cluster UDP OutOfMemoryError
                              jowizzle

                              Great link, thanks. I'll do some investigation.

                              • 12. Re: Two-Node Cluster UDP OutOfMemoryError
                                nidhingj

                                Hi,

                                Did anyone get any clue regarding this OOME? I am facing a similar issue in my application as well. I just want to confirm whether it is a JVM issue.
                                It would be very helpful if someone could reply.

                                Regards
                                Nidhin

                                • 13. Re: Two-Node Cluster UDP OutOfMemoryError
                                  nidhingj

                                   

                                  "nidhingj" wrote:
                                  Hi,

                                  Did anyone get any clue regarding this OOME? I am facing a similar issue in my application as well. I ran my application on the Sun JVM and it did not throw an exception. But when I run it on the IBM JVM, I get the OOME consistently after 4-5 hours.
                                  It would be very helpful if someone could reply.

                                  Regards
                                  Nidhin


                                  • 14. Re: Two-Node Cluster UDP OutOfMemoryError
                                    dilipreddy

                                    Hi
                                    I think this might be useful; take a look:

                                    To solve the “OutofMemoryException: Max GenSpace Exception” in JBoss server:

                                    http://www.innoq.com/blog/sp/2008/01/jboss_as_42x_and_javalangoutof.html

                                    See this link and edit run.bat on Windows or run.conf on Unix.
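
                                    After editing the startup script, it is worth checking that the JVM really picked up the new options. A minimal sketch using the standard java.lang.management API; the class name JvmSettingsCheck is hypothetical, and memory pool names differ between JVM vendors and versions, so the output has to be read accordingly:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; uses only the standard management API.
class JvmSettingsCheck {

    // Returns the -Xm*, -Xss and -XX: options this JVM was started with.
    static List<String> memoryFlags() {
        List<String> flags = new ArrayList<>();
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.startsWith("-Xm") || arg.startsWith("-Xss") || arg.startsWith("-XX:")) {
                flags.add(arg);
            }
        }
        return flags;
    }

    // Returns one "pool name: max=..." line per memory pool the JVM exposes.
    static List<String> poolLimits() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            long max = usage.getMax(); // -1 means no defined limit
            lines.add(pool.getName() + ": max="
                    + (max < 0 ? "undefined" : (max / (1024 * 1024)) + " MB"));
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println("Memory-related JVM options: " + memoryFlags());
        for (String line : poolLimits()) {
            System.out.println(line);
        }
    }
}
```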