14 Replies Latest reply on Aug 4, 2008 2:31 AM by dilipreddy

Two-Node Cluster UDP OutOfMemoryError

jowizzle Apr 26, 2007 2:40 PM

Hello,

First, and perhaps completely unrelated: Is it normal to see messages such as "additional data: 19 bytes" throughout the logs?

Moving on...

I have a two-node cluster of stock 4.0.5GA servers. After roughly 4 hours of operation one node will fail with an OutOfMemoryError stemming from org.jgroups.protocols.UDP. Both servers have two eth interfaces, so I set bind_addr on the UDP element accordingly in cluster-service.xml and jboss-service.xml in the tc5-cluster sar.
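On a dual-homed host it can help to confirm which address each interface actually carries before setting bind_addr. The following is a minimal sketch using only the standard java.net API; the class name BindAddressLister is hypothetical and not part of JBoss or JGroups.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper: lists every local interface with its addresses so the
// value chosen for bind_addr can be checked against eth0/eth1.
final class BindAddressLister {

    /** Returns one line per (interface, address) pair found on this host. */
    static List<String> describeInterfaces() throws SocketException {
        List<String> lines = new ArrayList<String>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return lines; // some platforms report no interfaces as null
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                String multicast = nic.supportsMulticast() ? " (multicast capable)" : "";
                lines.add(nic.getName() + " -> " + addr.getHostAddress() + multicast);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws SocketException {
        for (String line : describeInterfaces()) {
            System.out.println(line);
        }
    }
}
```

Running it on each node should show whether the address given as bind_addr belongs to the intended interface and whether that interface reports multicast support.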

I enabled DEBUG for jgroups. It seems to get pretty messy. First, node2 stops ack'ing are-you-alive messages. Then node1 gets suspected, but for no apparent reason. If I understand correctly, node1 is the coord, so node2 can't remove it and it will refuse to remove itself from the view. It may, however, opt to leave and rejoin.

Below is an excerpt from the cluster log file from around the time things begin to go awry. Any hints are greatly appreciated.

2007-04-25 15:33:07,237 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32802 (own address=node1:32839)
2007-04-25 15:33:07,269 DEBUG [org.jgroups.protocols.UDP]
sending msgs:
node2:32802: 1 msgs

2007-04-25 15:33:07,284 DEBUG [org.jgroups.protocols.FD] received ack from node2:32802
2007-04-25 15:33:07,316 DEBUG [org.jgroups.protocols.UDP]
sending msgs:
node2:32802: 1 msgs

2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32802 (own address=node1:32839)
2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node2:32802 (number=0)
2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32805 (additional data: 19 bytes) (own address=node1:32842 (additional data: 19 bytes))
2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node2:32805 (additional data: 19 bytes) (number=0)
2007-04-25 15:34:51,767 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node1:32842 (additional data: 19 bytes)], from=node2:32805 (additional data: 19 bytes))]
2007-04-25 15:34:51,767 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]
2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node1:32842 (additional data: 19 bytes)] view is [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]
2007-04-25 15:34:51,780 WARN [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, node1:32842 (additional data: 19 bytes) is not a member of view [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]; discarding view
2007-04-25 15:34:51,781 WARN [org.jgroups.protocols.pbcast.GMS] I (node1:32842 (additional data: 19 bytes)) am being shunned, will leave and rejoin group (prev_members are [node1:32842 (additional data: 19 bytes) node2:32805 (additional data: 19 bytes) ])
2007-04-25 15:34:51,781 INFO [org.jgroups.JChannel] received an EXIT event, will leave the channel
2007-04-25 15:34:51,783 INFO [org.jgroups.JChannel] closing the channel
2007-04-25 15:34:51,786 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
 at java.net.PlainDatagramSocketImpl.receive0(Native Method)
 at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
 at java.net.DatagramSocket.receive(DatagramSocket.java:724)
 at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
 at java.lang.Thread.run(Thread.java:799)

2007-04-25 15:34:51,790 ERROR [org.jgroups.protocols.UDP] [node1:32839] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
 at java.net.PlainDatagramSocketImpl.receive0(Native Method)
 at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
 at java.net.DatagramSocket.receive(DatagramSocket.java:724)
 at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
 at java.lang.Thread.run(Thread.java:799)

2007-04-25 15:34:51,795 DEBUG [org.jgroups.protocols.pbcast.NAKACK] contents for node1:32842 (additional data: 19 bytes):

sent_msgs: [6837 - 6890]
received_msgs:
node2:32805 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [276 - 328]
node1:32842 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [6838 - 6890]

2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] socket to node2:32805 (additional data: 19 bytes) was reset
2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
2007-04-25 15:34:51,825 DEBUG [org.jgroups.protocols.UDP]
sending msgs:
node1:32839: 1 msgs

2007-04-25 15:34:52,092 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
 at java.net.PlainDatagramSocketImpl.receive0(Native Method)
 at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
 at java.net.DatagramSocket.receive(DatagramSocket.java:724)
 at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
 at java.lang.Thread.run(Thread.java:799)

1. Re: Two-Node Cluster UDP OutOfMemoryError

visprar May 3, 2007 4:38 AM (in response to jowizzle)

We are also facing a similar issue. Trying to upgrade from 2.2.9.2 to 2.4 SP2; not sure if that will help. What's your JGroups version?
2. Re: Two-Node Cluster UDP OutOfMemoryError

brian.stansberry May 3, 2007 8:38 AM (in response to jowizzle)

It's hard to tell cause from effect in this kind of situation. Your log shows node1 being suspected by node2 and properly starting the process of closing down the channel to rejoin. Then a few ms later the vm runs out of memory.

Most likely whatever was going on that eventually led to the OOME was also making node1 unresponsive enough that node2 suspected it.

The question is why the OOME occurred. First, it's *extremely* unlikely the process of handling the suspicion and closing the channel is itself what caused the OOME. Second, the fact that UDP is what threw the OOME doesn't really mean it or JGroups was the underlying cause -- it just means UDP was the code trying to allocate an object when the heap was finally out of space.

JGroups 2.4.1.SP3 has an improvement to the FC (flow control) protocol that prevents an OOME condition that could occur when the channel is running under sustained overload. That may help. But, IMHO the odds are pretty low that that was the cause of your OOME. You're better off trying to profile your application to confirm you have no memory leaks.
3. Re: Two-Node Cluster UDP OutOfMemoryError

visprar May 3, 2007 11:50 AM (in response to jowizzle)

So, is there a benchmark test that JBoss has, where 2-3 nodes in a cluster run perfectly for hours without any issues? If yes, where can I find the configuration details for it (like JVM heap, cluster config)?

Is the default config with JBoss fairly accurate, or do you suggest fine-tuning it? Here is my config, with 1.4 GB RAM on SUSE:

UDP(down_thread=false;enable_bundling=true;ip_ttl=2;loopback=false;max_bundle_size=64000;max_bundle_timeout=30;mcast_addr=228.1.2.8;mcast_port=45503;mcast_recv_buf_size=25000000;mcast_send_buf_size=640000;ucast_recv_buf_size=20000000;ucast_send_buf_size=640000;up_thread=false;use_incoming_packet_handler=true;use_outgoing_packet_handler=true):PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):MERGE2(down_thread=false;max_interval=100000;min_interval=20000;up_thread=false):FD(down_thread=false;max_tries=5;shun=true;timeout=2500;up_thread=false):VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):pbcast.NAKACK(discard_delivered_msgs=true;down_thread=false;gc_lag=50;max_xmit_size=60000;retransmit_timeout=100,200,300,600,1200,2400,4800;up_thread=false;use_mcast_xmit=false):UNICAST(down_thread=false;timeout=300,600,1200,2400,3600;up_thread=false):pbcast.STABLE(desired_avg_gossip=50000;down_thread=false;max_bytes=2100000;stability_delay=1000;up_thread=false):pbcast.GMS(down_thread=false;join_retry_timeout=2000;join_timeout=3000;print_local_addr=true;shun=true;up_thread=false):FC(down_thread=false;max_credits=10000000;min_threshold=0.20;up_thread=false):FRAG2(down_thread=false;frag_size=60000;up_thread=false):pbcast.STATE_TRANSFER(down_thread=false;up_thread=false)
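A single-line stack string like the one above is hard to review by eye. As a rough aid, the following sketch splits such a string into per-protocol property maps. The class name StackStringParser is hypothetical (it is not a JGroups API), and the suspicion-delay estimate assumes that FD suspects a peer after roughly max_tries missed heartbeats of timeout milliseconds each, which should be checked against the JGroups documentation for the version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses NAME(key=value;key=value):NAME(...) stack strings.
final class StackStringParser {
    // Matches one protocol entry, e.g. FD(max_tries=5;timeout=2500)
    private static final Pattern PROTOCOL = Pattern.compile("([\\w.]+)\\(([^)]*)\\)");

    /** Returns protocol name -> (property -> value), preserving stack order. */
    static Map<String, Map<String, String>> parse(String stack) {
        Map<String, Map<String, String>> protocols =
                new LinkedHashMap<String, Map<String, String>>();
        Matcher m = PROTOCOL.matcher(stack);
        while (m.find()) {
            Map<String, String> props = new LinkedHashMap<String, String>();
            for (String pair : m.group(2).split(";")) {
                int eq = pair.indexOf('=');
                if (eq > 0) { // skip empty or malformed entries
                    props.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
                }
            }
            protocols.put(m.group(1), props);
        }
        return protocols;
    }

    /** Rough estimate only (assumption): timeout * max_tries, in milliseconds. */
    static long approxSuspectDelayMillis(Map<String, String> fd) {
        return Long.parseLong(fd.get("timeout")) * Long.parseLong(fd.get("max_tries"));
    }

    public static void main(String[] args) {
        // FD and VERIFY_SUSPECT entries copied from the config posted above.
        String stack = "FD(down_thread=false;max_tries=5;shun=true;timeout=2500;up_thread=false)"
                + ":VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false)";
        Map<String, Map<String, String>> parsed = parse(stack);
        System.out.println("protocols: " + parsed.keySet());
        System.out.println("approx. FD suspicion delay: "
                + approxSuspectDelayMillis(parsed.get("FD")) + " ms");
    }
}
```

Under that assumption, the FD settings above (timeout=2500, max_tries=5) give about 12.5 seconds before a silent peer is suspected, with VERIFY_SUSPECT adding up to another 1.5 seconds.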
4. Re: Two-Node Cluster UDP OutOfMemoryError

jowizzle May 15, 2007 5:33 PM (in response to jowizzle)

Thanks for the reply. The OOME occurs without any applications deployed. I fiddled with the new JGroups, but I decided to wait for 4.2 GA. I'm in the process of rolling that out now.
5. Re: Two-Node Cluster UDP OutOfMemoryError

jowizzle May 16, 2007 11:52 AM (in response to jowizzle)

Checked this morning with 4.2. Same thing. Perhaps it's environmental. I'm going to switch from IBM to Sun and see where I wind up tomorrow.
6. Re: Two-Node Cluster UDP OutOfMemoryError

brian.stansberry May 16, 2007 12:26 PM (in response to jowizzle)

So you're saying that you start a cluster with no applications deployed, and then after 4 hours of operation you get an OOME?
7. Re: Two-Node Cluster UDP OutOfMemoryError

jowizzle May 17, 2007 10:37 AM (in response to jowizzle)

Correct. No applications deployed.

I apologize for leaving out a critical detail. I run several stand-alone JBoss instances on the IBM jvm, so I've been running the cluster on IBM as well. Yesterday I switched to Sun, and it's been up and running since (about 21 hours).
8. Re: Two-Node Cluster UDP OutOfMemoryError

brian.stansberry May 17, 2007 11:52 AM (in response to jowizzle)

Weird. I've pinged a couple of colleagues to see if they've heard about anything odd with the IBM VMs. I know we have customers running significant clusters using the IBM VMs and haven't heard about OOME issues.

So basically you just fire up a couple of stock AS instances and let them run (presumably doing nothing) and after a while they OOME. Very strange.

Anything interesting about your topology? Are the AS instances on the same machine?

BTW, to answer your original question, the "additional data" logging is normal.
9. Re: Two-Node Cluster UDP OutOfMemoryError

jowizzle May 17, 2007 12:33 PM (in response to jowizzle)

I have two identical physical machines, each with two network interfaces, eth0 and eth1. Each machine runs one instance of JBoss 4.0.5 GA (configured, not modified). We bind everything to eth0.

I start node1 and wait for it to come online. Then I start node2. (The end result is the same if I reverse the order.)

Node1 and node2 are on the same switch. Nothing fancy. Using TCP results in the same OOME.

If you think it's necessary, I'd be happy to compile a full spec and help troubleshoot, as I've been looking to get involved in the project anyway.
10. Re: Two-Node Cluster UDP OutOfMemoryError

brian.stansberry May 17, 2007 1:00 PM (in response to jowizzle)

Sure, anything you can do to help troubleshoot would be appreciated. I'd be interested in knowing if jconsole shows the memory usage slowly growing over time or whether it suddenly rises. Either way, some heap snapshots would be good. A google search showed this tool that may be helpful: http://www-1.ibm.com/support/docview.wss?rs=180&uid=swg24005757
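To tell slow growth from a sudden spike without keeping jconsole open for hours, heap usage can also be logged at a fixed interval. This is a minimal sketch using the standard java.lang.management API (Java 5 and later); the class name HeapWatcher and its defaults are illustrative. It samples the JVM it runs in, so to observe a JBoss node it would have to run inside that JVM, or the same MemoryMXBean would have to be read remotely over JMX, as jconsole does.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not part of JBoss or JGroups): logs heap usage of the
// JVM it runs in, so growth over hours can be compared with the OOME time.
final class HeapWatcher {
    private static final long MB = 1024L * 1024L;

    /** Formats one heap sample; max is reported as -1 when the JVM leaves it undefined. */
    static String sample(MemoryMXBean memory, long nowMillis) {
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long max = heap.getMax() < 0 ? -1 : heap.getMax() / MB;
        return String.format("%tT used=%dMB committed=%dMB max=%dMB",
                nowMillis, heap.getUsed() / MB, heap.getCommitted() / MB, max);
    }

    public static void main(String[] args) throws InterruptedException {
        // Defaults are short so the sketch terminates quickly;
        // pass e.g. "60000 480" to sample once a minute for 8 hours.
        long intervalMillis = args.length > 0 ? Long.parseLong(args[0]) : 1000L;
        int samples = args.length > 1 ? Integer.parseInt(args[1]) : 3;
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        for (int i = 0; i < samples; i++) {
            System.out.println(sample(memory, System.currentTimeMillis()));
            if (i < samples - 1) {
                Thread.sleep(intervalMillis);
            }
        }
    }
}
```

A steadily climbing "used" figure across samples would point toward a leak, while a flat line followed by a jump just before the failure would suggest a burst allocation.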
11. Re: Two-Node Cluster UDP OutOfMemoryError

jowizzle May 17, 2007 2:03 PM (in response to jowizzle)

Great link, thanks. I'll do some investigation.
12. Re: Two-Node Cluster UDP OutOfMemoryError

nidhingj Mar 10, 2008 6:55 PM (in response to jowizzle)

Hi,

Did anyone get any clue regarding this OOME? I am facing a similar issue in my application also. I just want to confirm whether it is a JVM issue.
It would be very helpful if someone could reply.

Regards
Nidhin
13. Re: Two-Node Cluster UDP OutOfMemoryError

nidhingj Mar 10, 2008 7:02 PM (in response to jowizzle)

"nidhingj" wrote:
Hi,

Did anyone get any clue regarding this OOME? I am facing a similar issue in my application also. I ran my application on the Sun JVM and it did not throw an exception. But when I run on the IBM JVM, I get the OOME consistently after 4-5 hours.
It would be very helpful if someone could reply.

Regards
Nidhin
14. Re: Two-Node Cluster UDP OutOfMemoryError

dilipreddy Aug 4, 2008 2:31 AM (in response to jowizzle)

Hi
I think this might be useful; let's take a glance at it.

To solve "OutofMemoryException:Max GenSpace Exception" in JBoss server:

http://www.innoq.com/blog/sp/2008/01/jboss_as_42x_and_javalangoutof.html

See this link and edit run.bat on Windows or run.conf on Unix.
