JBoss AS clustering is peer-to-peer, not master-slave. I'm going to assume that by "master" you mean the JGroups coordinator, which is just a peer that does a few extra administrative chores. If you mean something else, please clarify so we're on the same page.
What you describe sounds like a problem in initial server discovery, with the JGroups MERGE2 protocol (http://community.jboss.org/wiki/JGroupsMERGE2) eventually finding the other server. I suspect something with the servers' network configuration or with other network hardware is leading to a high rate of UDP packet loss. When serverB starts, it sends out a UDP multicast discovery packet (http://community.jboss.org/wiki/JGroupsPING); if that packet is not received and responded to by serverA, serverB will think it's alone and will form a group by itself until another packet sent by MERGE2 is eventually received.
Have your client's network admins check for UDP multicast packet loss.
If you turn on TRACE level logging for the org.jgroups.protocols.UDP category, you'll get a log entry for each message sent and received, so you can trace what happens to the discovery packets (i.e. confirm they aren't received). That won't tell you why they are lost, though; it only confirms my diagnosis. Note that the logging can become very voluminous once a group forms if the servers are doing work that generates a lot of message traffic.
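To check for multicast loss independently of JBoss, a small standalone probe can help. What follows is only a minimal sketch using the plain JDK MulticastSocket API, not anything taken from JGroups itself. The class name McastLossProbe, the group address 239.255.10.10 and the port 45999 are assumptions chosen for illustration; pick values that don't collide with the cluster's own. Start the receiver on serverA first, then run the sender on serverB.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.util.BitSet;

public class McastLossProbe {
    // Hypothetical test group and port (assumptions, not the cluster's real settings).
    static final String GROUP = "239.255.10.10";
    static final int PORT = 45999;

    // Each probe packet carries only a 4-byte sequence number.
    static byte[] encode(int seq) {
        return ByteBuffer.allocate(4).putInt(seq).array();
    }

    static int decode(byte[] data) {
        return ByteBuffer.wrap(data, 0, 4).getInt();
    }

    // Percentage of sequence numbers 0..sent-1 that were never seen.
    static double lossPercent(int sent, BitSet received) {
        if (sent <= 0) return 0.0;
        int got = received.get(0, sent).cardinality();
        return 100.0 * (sent - got) / sent;
    }

    static void send(int count) throws Exception {
        InetAddress group = InetAddress.getByName(GROUP);
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.setTimeToLive(2); // assumption: nodes are at most one router hop apart
            for (int seq = 0; seq < count; seq++) {
                byte[] payload = encode(seq);
                socket.send(new DatagramPacket(payload, payload.length, group, PORT));
                Thread.sleep(100); // pace the packets so bursts don't mask the real loss rate
            }
        }
    }

    static void receive(int expected, int idleTimeoutMs) throws Exception {
        BitSet seen = new BitSet(expected);
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            socket.joinGroup(InetAddress.getByName(GROUP)); // deprecated on newer JDKs, still available
            socket.setSoTimeout(idleTimeoutMs);
            byte[] buf = new byte[4];
            try {
                while (seen.cardinality() < expected) {
                    DatagramPacket packet = new DatagramPacket(buf, buf.length);
                    socket.receive(packet);
                    if (packet.getLength() < 4) continue; // not one of our probe packets
                    int seq = decode(packet.getData());
                    if (seq >= 0 && seq < expected) seen.set(seq);
                }
            } catch (SocketTimeoutException idle) {
                // No packet arrived within the idle timeout; report what we have so far.
            }
        }
        System.out.printf("received %d of %d, loss %.1f%%%n",
                seen.cardinality(), expected, lossPercent(expected, seen));
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: McastLossProbe send|receive <count>");
            return;
        }
        int count = Integer.parseInt(args[1]);
        if (args[0].equals("send")) send(count);
        else receive(count, 30000); // give up after 30 s without any packet
    }
}
```

For example, run `java McastLossProbe receive 100` on serverA and then `java McastLossProbe send 100` on serverB, and repeat in the other direction. A loss figure well above zero on an otherwise quiet network would support the packet-loss theory. JGroups also ships its own McastSenderTest and McastReceiverTest classes in org.jgroups.tests, which serve a similar purpose.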
Thanks a ton for giving us the lead. We also feel that one of the servers is acting funny. If you take that server out of the equation, everything is fine. We have asked for server logs with TRACE level logging enabled for UDP. Once we get them, hopefully we can confirm the cause.
Here's a list of the most common reasons I've seen for flaky or intermittent multicast traffic, with some tips for fixing them:
- wrong bind address (set either in the xml configuration or via a system property).
- TTL configured too low, so packets are dropped before they reach the other nodes.
- on linux/unix, a misconfigured routing table.
- a firewall on the nodes and/or the router blocking multicast traffic.
- multicast disabled in the router/switch.
- the switch temporarily discarding packets on a port due to IGMP snooping.
- potentially an IPv6 issue; try running with -Djava.net.preferIPv4Stack=true
- try connecting all machines in the cluster to the same hub rather than through a router or switch. This helps rule out some obscure setting in the router/switch.
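A quick way to work through the first and the IPv6 items in the list above is to see what the JVM itself reports about each network interface. The following is only a rough sketch built on the JDK's NetworkInterface class; the class name McastInterfaceCheck and the usable() helper are made up for illustration. It lists every address on each interface along with whether that interface is up and multicast-capable, so you can compare the output against the bind address you configured.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

public class McastInterfaceCheck {
    // An interface can carry cluster multicast only if it is up and multicast-capable.
    static boolean usable(boolean up, boolean supportsMulticast) {
        return up && supportsMulticast;
    }

    // One report line per address on each interface.
    static List<String> report() throws Exception {
        List<String> lines = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) return lines; // some JDKs return null when no interface exists
        for (NetworkInterface nic : Collections.list(nics)) {
            boolean ok = usable(nic.isUp(), nic.supportsMulticast());
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                lines.add(String.format("%-10s %-40s loopback=%-5b usable=%b",
                        nic.getName(), addr.getHostAddress(), nic.isLoopback(), ok));
            }
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // Shows whether the IPv4 preference flag was actually passed to this JVM.
        System.out.println("java.net.preferIPv4Stack="
                + System.getProperty("java.net.preferIPv4Stack"));
        for (String line : report()) {
            System.out.println(line);
        }
    }
}
```

Run it on each node with the same JVM and the same -D flags that the server uses. If the address you bind to appears on an interface reported as usable=false, or only appears on the loopback interface, that node is a likely suspect. This only checks the local host; it says nothing about firewalls, routers or switches in between.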