
    Weird issue with clustered nodes

    mohitanchlia

      I've posted the issue here: http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4138599#4138599

      I am also pasting it here, in case the other one is not being looked at:



      I am seeing some weird behavior. We are running into a serious issue where nodes join the cluster and then soon disappear from it. For example, if I have 5 nodes, they all initially join the cluster, and then after some time we see any one of the following:

      1. A dead-member message for one of the nodes, even though the "dead" node is up and running. I can run a JGroups multicast sender/receiver test from the dead node to the other nodes with no problems (a sketch of that check is shown after the config below). I would assume the dead member would try to rejoin after some time if the problem was only temporary, but that doesn't seem to be happening.

      2. As noted in the discussion above, I get: 2008-02-21 08:35:11,784 WARN [org.jgroups.protocols.pbcast.GMS] failed to collect all ACKs (3) for view [172.17.65.39:40883|5] [172.17.65.39:40883, 172.17.66.39:35267, 172.17.67.39:39896, 172.17.64.39:52927] after 5000ms, missing ACKs from [172.17.65.39:40883, 172.17.66.39:35267, 172.17.6 .....

      I am not sure why that's happening or what it really means. All I can guess is that the node is not receiving the datagram. Also, I am assuming this node is the coordinator.

      3. In a cluster of 5, all 5 nodes initially join the cluster, and after some time nodes 1, 2 and 3 become part of one cluster while nodes 4 and 5 form another. All of them have the same UDP group, name and port, so I don't understand how they can split, or why they don't get merged back together if the problem was only temporary.


      Overall I am not able to understand this weirdness. I am planning to run some JGroups load tests. We've spoken to our network team and they don't see any issues on the switch. I've looked at the NICs and don't see any problems, and IGMP is enabled on all the routers. Also, how can I tell which node is currently the coordinator?
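      From what I understand, the coordinator is simply the first member of the current view (I believe the CurrentView attribute of the DefaultPartition MBean in the JMX console shows that view). Below is a rough sketch of that rule, assuming a standalone JChannel with the same stack config and a throwaway group name rather than the running HAPartition channel; please correct me if this is not the right way to check:

      import org.jgroups.JChannel;
      import org.jgroups.View;

      public class CoordinatorCheck {
          public static void main(String[] args) throws Exception {
              // "udp.xml" is an assumed file name holding the same protocol stack as below
              JChannel channel = new JChannel("udp.xml");
              channel.connect("coord-check"); // throwaway group name, not the real partition

              // The coordinator is always the first member of the view; with several
              // instances connected to the same group, member 0 is the coordinator.
              View view = channel.getView();
              System.out.println("View:        " + view);
              System.out.println("Coordinator: " + view.getMembers().get(0));

              channel.close();
          }
      }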

      I also did a traceroute to make sure TTL is not the problem.

      It would really be helpful if you could let me know how I can debug this issue. Below is the UDP JGroups config:


      <!-- The JGroups protocol configuration -->

      <!--
      The default UDP stack:
      - If you have a multihomed machine, set the UDP protocol's bind_addr attribute to the
      appropriate NIC IP address, e.g. bind_addr="192.168.0.2".
      - On Windows machines, because of the media sense feature being broken with multicast
      (even after disabling media sense) set the UDP protocol's loopback attribute to true
      -->

      <UDP mcast_addr="${efe.partition.udpGroup:228.1.2.3}"
           mcast_port="${jboss.hapartition.mcast_port:45566}"
           tos="8"
           ucast_recv_buf_size="20000000"
           ucast_send_buf_size="640000"
           mcast_recv_buf_size="25000000"
           mcast_send_buf_size="640000"
           loopback="false"
           discard_incompatible_packets="true"
           enable_bundling="false"
           max_bundle_size="64000"
           max_bundle_timeout="30"
           use_incoming_packet_handler="true"
           use_outgoing_packet_handler="false"
           ip_ttl="${jgroups.udp.ip_ttl:2}"
           down_thread="false" up_thread="false"/>
      <PING timeout="2000"
            down_thread="false" up_thread="false" num_initial_members="3"/>
      <MERGE2 max_interval="100000"
              down_thread="false" up_thread="false" min_interval="20000"/>
      <FD_SOCK down_thread="false" up_thread="false"/>
      <FD timeout="10000" max_tries="5" down_thread="false" up_thread="false" shun="true"/>
      <VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
      <pbcast.NAKACK max_xmit_size="60000"
                     use_mcast_xmit="false" gc_lag="0"
                     retransmit_timeout="300,600,1200,2400,4800"
                     down_thread="false" up_thread="false"
                     discard_delivered_msgs="true"/>
      <UNICAST timeout="300,600,1200,2400,3600"
               down_thread="false" up_thread="false"/>
      <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                     down_thread="false" up_thread="false"
                     max_bytes="400000"/>
      <pbcast.GMS print_local_addr="true" join_timeout="3000"
                  down_thread="false" up_thread="false"
                  join_retry_timeout="2000" shun="true"
                  view_bundling="true"/>
      <FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
      <pbcast.STATE_TRANSFER down_thread="false" up_thread="false" use_flush="false"/>
      ---------
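      For reference, the sender/receiver test mentioned in point 1 was along these lines: a minimal sketch using a plain MulticastSocket with the same group, port and TTL as the config above (JGroups also ships org.jgroups.tests.McastSenderTest / McastReceiverTest for the same purpose). It is run with "receive" on one node and "send" on another; running both on the same box also checks that the OS loops the multicast back to local listeners.

      import java.net.DatagramPacket;
      import java.net.InetAddress;
      import java.net.MulticastSocket;

      public class McastCheck {
          static final String GROUP = "228.1.2.3"; // same group/port as the stack config above
          static final int PORT = 45566;

          public static void main(String[] args) throws Exception {
              InetAddress group = InetAddress.getByName(GROUP);
              MulticastSocket sock = new MulticastSocket(PORT);
              sock.joinGroup(group);

              if (args.length > 0 && args[0].equals("send")) {
                  sock.setTimeToLive(2); // matches ip_ttl in the config
                  byte[] payload = "hello from sender".getBytes();
                  sock.send(new DatagramPacket(payload, payload.length, group, PORT));
                  System.out.println("sent one datagram to " + GROUP + ":" + PORT);
              } else {
                  byte[] buf = new byte[1024];
                  DatagramPacket p = new DatagramPacket(buf, buf.length);
                  System.out.println("waiting for datagrams on " + GROUP + ":" + PORT + " ...");
                  while (true) {
                      sock.receive(p);
                      System.out.println("received '" + new String(p.getData(), 0, p.getLength())
                              + "' from " + p.getAddress());
                  }
              }
          }
      }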

        • 1. Re: Weird issue with clustered nodes
          brian.stansberry

          This sounds like some sort of network problem. In particular:

          2008-02-21 08:35:11,784 WARN [org.jgroups.protocols.pbcast.GMS] failed to collect all ACKs (3) for view [172.17.65.39:40883|5] [172.17.65.39:40883, 172.17.66.39:35267, 172.17.67.39:39896, 172.17.64.39:52927] after 5000ms, missing ACKs from [172.17.65.39:40883, 172.17.66.39:35267, 172.17.6 .....

          sounds like 172.17.65.39:40883 is not able to get a response from itself to its own message. The OS should copy a multicast back to an in-machine listener without it ever going out on the wire, so not receiving your own message is a sign that either:

          1) There's a problem with the network interface being used.
          2) The thread that delivers messages has gotten stuck. Getting a stack trace can help show if that has occurred.
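          The easiest way to get one is a full thread dump of the JBoss process (kill -QUIT <pid> on Unix, or jstack <pid>), then check whether the JGroups threads, e.g. the incoming packet handler, are blocked. As a rough sketch, assuming you only want to log the stacks from inside the VM, the same information is available programmatically:

          import java.util.Map;

          public class ThreadDumper {
              // Dumps every live thread's stack so a blocked JGroups thread
              // (e.g. the incoming packet handler) can be spotted. A jstack /
              // kill -QUIT thread dump gives the same information without code.
              public static void dump() {
                  for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                      Thread t = e.getKey();
                      System.out.println("\"" + t.getName() + "\" state=" + t.getState());
                      for (StackTraceElement frame : e.getValue()) {
                          System.out.println("    at " + frame);
                      }
                  }
              }

              public static void main(String[] args) {
                  dump();
              }
          }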