Two-Node Cluster UDP OutOfMemoryError
jowizzle Apr 26, 2007 2:40 PMHello,
First, and perhaps completely unrelated: Is it normal to see messages such as "additional data: 19 bytes" throughout the logs?
Moving on...
I have a two-node cluster of stock 4.0.5GA servers. After roughly 4 hours of operation one node will fail with an OutOfMemoryError stemming from org.jgroups.protocols.UDP. Both servers have two eth interfaces, so I set bind_addr on the UDP element accordingly in cluster-service.xml and jboss-service.xml in the tc5-cluster sar.
I enabled DEBUG for jgroups. It seems to get pretty messy. First, node2 stops ack'ing on are-you-alive messages. Then node1 gets susptected, but for no apparent reason. If I understand correctly, node1 is the coord, so node2 can't remove it and it will refuse to remove itself from the view. It may, however, opt to leave and rejoin.
Below is an excerpt from the cluster log file from around the time things begin to go awry. Any hints are greatly appreciated.
2007-04-25 15:33:07,237 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32802 (own address=node1:32839) 2007-04-25 15:33:07,269 DEBUG [org.jgroups.protocols.UDP] sending msgs: node2:32802: 1 msgs 2007-04-25 15:33:07,284 DEBUG [org.jgroups.protocols.FD] received ack from node2:32802 2007-04-25 15:33:07,316 DEBUG [org.jgroups.protocols.UDP] sending msgs: node2:32802: 1 msgs 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32802 (own address=node1:32839) 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node2:32802 (number=0) 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32805 (additional data: 19 bytes) (own address=node1:32842 (addit ional data: 19 bytes)) 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node2:32805 (additional data: 19 bytes) (number=0) 2007-04-25 15:34:51,767 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node1:32842 (additional data: 19 bytes)], fro m=node2:32805 (additional data: 19 bytes))] 2007-04-25 15:34:51,767 WARN [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message) 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)] 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node1:32842 (additional data: 19 bytes)] view is [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)] 2007-04-25 15:34:51,780 WARN [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, node1:32842 (additional data: 19 bytes) is not a member of view [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]; discarding view 2007-04-25 15:34:51,781 WARN [org.jgroups.protocols.pbcast.GMS] I (node1:32842 (additional data: 19 bytes)) am being shunned, will leave and rejoin group (p rev_members are [node1:32842 (additional data: 19 bytes) node2:32805 (additional data: 19 bytes) ]) 2007-04-25 15:34:51,781 INFO [org.jgroups.JChannel] received an EXIT event, will leave the channel 2007-04-25 15:34:51,783 INFO [org.jgroups.JChannel] closing the channel 2007-04-25 15:34:51,786 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation fai led, stack trace=java.lang.OutOfMemoryError: heap allocation failed at java.net.PlainDatagramSocketImpl.receive0(Native Method) at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181) at java.net.DatagramSocket.receive(DatagramSocket.java:724) at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264) at java.lang.Thread.run(Thread.java:799) 2007-04-25 15:34:51,790 ERROR [org.jgroups.protocols.UDP] [node1:32839] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.O utOfMemoryError: heap allocation failed at java.net.PlainDatagramSocketImpl.receive0(Native Method) at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181) at java.net.DatagramSocket.receive(DatagramSocket.java:724) at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264) at java.lang.Thread.run(Thread.java:799) 2007-04-25 15:34:51,795 DEBUG [org.jgroups.protocols.pbcast.NAKACK] contents for node1:32842 (additional data: 19 bytes): sent_msgs: [6837 - 6890] received_msgs: node2:32805 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [276 - 328] node1:32842 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [6838 - 6890] 2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] socket to node2:32805 (additional data: 19 bytes) was reset 2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated 2007-04-25 15:34:51,825 DEBUG [org.jgroups.protocols.UDP] sending msgs: node1:32839: 1 msgs 2007-04-25 15:34:52,092 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation fai led, stack trace=java.lang.OutOfMemoryError: heap allocation failed at java.net.PlainDatagramSocketImpl.receive0(Native Method) at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181) at java.net.DatagramSocket.receive(DatagramSocket.java:724) at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264) at java.lang.Thread.run(Thread.java:799)