3 Replies Latest reply on Feb 9, 2010 12:16 PM by galder.zamarreno

Master-slave issue

shenaz Jan 15, 2010 10:40 AM

Hi All,

We are facing a strange issue. We have two nodes, serverA and serverB, in a cluster (on Solaris). serverA is started first and it starts the cluster. Then serverB is started. While starting, serverB does not recognise the existing cluster and starts its own cluster. But after a while (serverB is still starting), serverA recognises the serverB cluster and joins it as a slave.

Ideally serverB should have joined the serverA cluster as a slave. Why is serverB not able to recognise the serverA cluster? What could be the possible reasons? This is happening at our client's location and we don't have access to the servers. All we have is the server logs.

Kindly help us.

Regards

Shenaz

1. Re: Master-slave issue

brian.stansberry Jan 15, 2010 8:36 PM (in response to shenaz)

JBoss AS clustering isn't master-slave based, it's peer-to-peer. I'm going to assume by "master" you're referring to the JGroups coordinator, which is just a peer that's doing a few extra administrative chores. If you mean something else, please clarify so we're on the same page.

What you describe sounds like a problem in initial server discovery, with the JGroups MERGE2 protocol (http://community.jboss.org/wiki/JGroupsMERGE2) eventually finding the other server. I suspect something with the servers' network configuration or with other network hardware is leading to a high rate of UDP packet loss. When serverB starts, it sends out a UDP multicast discovery packet (http://community.jboss.org/wiki/JGroupsPING); if that packet is not received and responded to by serverA, serverB will think it's alone and will form a group by itself until another packet sent by MERGE2 is eventually received.

Have your client's network admins check for UDP multicast packet loss.
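
One rough way to check this without involving JGroups at all is a small send/receive probe built on the JDK's `MulticastSocket`. The sketch below is only an illustration: the class name `McastProbe`, the payload format, and the group address and port are made-up placeholders, so substitute the multicast address and port from the cluster's actual JGroups configuration. It also assumes a reasonably recent JDK.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

class McastProbe {
    // Placeholder values (assumptions): use the cluster's real multicast address and port.
    static final String GROUP = "228.1.2.3";
    static final int PORT = 45566;

    /** Builds the probe payload "<sender>#<sequence>". */
    static byte[] encode(String sender, int seq) {
        return (sender + "#" + seq).getBytes(StandardCharsets.UTF_8);
    }

    /** Reads the payload back out of a packet. */
    static String decode(DatagramPacket packet) {
        return new String(packet.getData(), packet.getOffset(),
                packet.getLength(), StandardCharsets.UTF_8);
    }

    /** Sends ten numbered probes, one per second. */
    static void send(String name) throws Exception {
        InetAddress group = InetAddress.getByName(GROUP);
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.setTimeToLive(2); // a TTL that is too low can stop packets at the first router
            for (int seq = 1; seq <= 10; seq++) {
                byte[] data = encode(name, seq);
                socket.send(new DatagramPacket(data, data.length, group, PORT));
                Thread.sleep(1000);
            }
        }
    }

    /** Prints every probe received until 15 seconds pass without one. */
    static void receive() throws Exception {
        InetAddress group = InetAddress.getByName(GROUP);
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            // A null interface defers to the socket's default multicast interface.
            socket.joinGroup(new InetSocketAddress(group, PORT), null);
            socket.setSoTimeout(15000);
            byte[] buf = new byte[256];
            try {
                while (true) {
                    DatagramPacket packet = new DatagramPacket(buf, buf.length);
                    socket.receive(packet);
                    System.out.println(packet.getAddress().getHostAddress()
                            + " -> " + decode(packet));
                }
            } catch (SocketTimeoutException e) {
                System.out.println("No probe received for 15 seconds, giving up.");
            }
        }
    }

    // Usage: java McastProbe receive   |   java McastProbe send <name>
    public static void main(String[] args) throws Exception {
        if (args.length > 0 && args[0].equals("send")) {
            send(args.length > 1 ? args[1] : "probe");
        } else {
            receive();
        }
    }
}
```

Running `receive` on serverA and then `send serverB` on serverB (and the reverse) should print ten consecutive sequence numbers. Gaps, or nothing at all, would point towards packet loss or blocked multicast between the hosts. This probe uses the default network interface, which may differ from the one JGroups is bound to.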

If you turn on TRACE level logging for the org.jgroups.protocols.UDP category you'll get logging as messages are sent and received, so you can trace what happens to the discovery packets (i.e. confirm they aren't received.) But that doesn't tell you why; just confirms my diagnosis. Note that the logging can be quite voluminous once a group forms if the servers are doing work that leads to a lot of message traffic.
2. Re: Master-slave issue

shenaz Jan 19, 2010 3:32 AM (in response to brian.stansberry)

Hi Brian,

Thanks a ton for giving us the lead. We also feel that one of the servers is acting funny. If you take that server out of the equation, everything is fine. We have asked for server logs with TRACE level for UDP. Once we get them, hopefully we can conclude it.

Regards
Shenaz
3. Re: Master-slave issue

galder.zamarreno Feb 9, 2010 12:16 PM (in response to shenaz)

Here's a list of the most common reasons I've seen for flaky or intermittent multicast traffic, with some tips to fix them:
Wrong bind address (set either in the XML configuration or via a system property).
Low TTL configuration.
On Linux/Unix, a misconfigured routing table.
Firewall at the nodes and/or the router blocking multicast traffic.
Multicast disabled in the router/switch.
Switch temporarily discarding packets on a port due to IGMP snooping.
Potentially an IPv6 issue; try running with -Djava.net.preferIPv4Stack=true
Try connecting all machines in the cluster to the same hub rather than through a router or switch. This helps rule out some obscure setting in the router/switch.
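
For the bind address and IPv6 items above, it can help to see what the JVM itself reports about each network interface. The following is a minimal sketch using only `java.net.NetworkInterface`. The class name `InterfaceReport` and the output layout are illustrative, not part of JBoss or JGroups.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

class InterfaceReport {
    /** Labels an address by protocol family. */
    static String family(InetAddress address) {
        return address instanceof Inet4Address ? "IPv4" : "IPv6";
    }

    /** Returns one line per (interface, address) pair visible to this JVM. */
    static List<String> report() throws SocketException {
        List<String> lines = new ArrayList<String>();
        Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
        if (all == null) {
            return lines; // some JDKs return null when no interface is found
        }
        for (NetworkInterface nic : Collections.list(all)) {
            for (InetAddress address : Collections.list(nic.getInetAddresses())) {
                lines.add(String.format("%-10s %-4s %-40s up=%b multicast=%b loopback=%b",
                        nic.getName(), family(address), address.getHostAddress(),
                        nic.isUp(), nic.supportsMulticast(), nic.isLoopback()));
            }
        }
        return lines;
    }

    public static void main(String[] args) throws SocketException {
        // Shows whether the JVM was started with -Djava.net.preferIPv4Stack=true.
        System.out.println("java.net.preferIPv4Stack="
                + System.getProperty("java.net.preferIPv4Stack"));
        for (String line : report()) {
            System.out.println(line);
        }
    }
}
```

If the address the cluster is configured to bind to does not appear in the output, or appears on an interface reporting `multicast=false` or `up=false`, that is a reasonable place to start looking. Running it once with and once without `-Djava.net.preferIPv4Stack=true` shows which addresses the JVM sees in each mode.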

In general, running Cluster Formation Tests and JGroups Performance Tests, and using a network sniffer such as Wireshark, should help you pinpoint where the issue is coming from.