Testsuite issues with multicast routing (JBAS-4939)
brian.stansberry Nov 6, 2007 1:54 AM

Continuing discussion from http://jira.jboss.com/jira/browse/JBAS-4939
"Alexander Kostadinov" wrote:
NOTE: JBoss 4.2 cluster tests pass even when one node is bound to localhost
On dev90 when a server is bound to localhost? We already established that the problem does not exist on qa01. This is an important question -- if clustering tests work on dev90 with 4.2 but not with AS 5, there is something more we need to understand.
Solutions:
1. Use only IPs on the same interface that multicast route goes through
Dimitris in the JBAS-4939 description says "Eventually, we should get rid of the 'noip' run." Sounds like you guys are saying the same thing -- don't use -b localhost. I have no objection, as long as we can confirm that there is no change between JGroups 2.4.1.x (AS 4.2) and JGroups 2.5+ (AS 5) that we're sweeping under a rug.
I don't see any major QA benefit to running tests bound to localhost; real servers are not likely to run that way. Seems to me the default testsuite behavior of binding one node to localhost and one to the machine name is just a convenience for developers; no reason QA servers should use that default. (And it's not even a real convenience if the dev uses Windows or, it seems, RHEL 4U5+.)
2. Disable the default behavior of binding the socket to a specific interface and let the OS choose the default one. If the user forces the socket to be bound to a specific interface, then bind to it. IMHO that is how things should work, and it avoids the need for the user to mess with multicast configuration and extra IPs just to run the testsuite.
Please open a thread on Clustering Dev forum to discuss. What you're saying makes sense with respect to running the testsuite, but there are other issues to consider that go well beyond the scope of this JIRA.
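To make the two binding strategies in option 2 concrete, here is a minimal plain-JDK sketch (not the actual JGroups code, just an illustration of the idea): a wildcard bind lets the OS choose the interface per destination, while an explicit bind pins the socket to one interface.

```java
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.InetSocketAddress;

// Illustration only: wildcard bind vs. user-forced bind to a specific interface.
public class BindModes {
    public static void main(String[] args) throws Exception {
        // Default: bind to the wildcard address (0.0.0.0) on an ephemeral port;
        // the OS picks the outgoing interface per destination.
        DatagramSocket any = new DatagramSocket(0);
        System.out.println("wildcard: " + any.getLocalAddress().isAnyLocalAddress());
        any.close();

        // User-forced: bind to a specific interface address (loopback here,
        // purely as an example).
        DatagramSocket pinned = new DatagramSocket(
                new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 0));
        System.out.println("pinned: " + pinned.getLocalAddress().getHostAddress());
        pinned.close();
    }
}
```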
The other issue, server start-up being very slow because messages sent are not seen by the sender, is I think related to the issue above and to the change in behavior after RHEL 4U4. So hopefully one of the solutions above fixes that as well:
Yes, this is absolutely correct. When the JGroups channel cannot receive its own multicast messages, its FLUSH protocol blocks waiting for them for 15-30 seconds. This happens several times during the course of startup and shutdown.
An issue I need to explore with the JGroups team is whether we can just make the channel connection (and hence server start) fail altogether in this situation.
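The failure mode can be reproduced outside JGroups with a small standalone probe (a throwaway diagnostic I'm sketching here, not JGroups code; group address and port are arbitrary): send a multicast packet and see whether we receive our own copy, with a short timeout instead of FLUSH's long block.

```java
import java.net.DatagramPacket;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical probe: can this host receive its own multicast packets?
// If the receive below times out, a JGroups channel on the same host
// would hit the FLUSH blocking described above.
public class McastSelfProbe {
    public static void main(String[] args) {
        InetSocketAddress group = new InetSocketAddress("230.1.2.7", 45577);
        try {
            MulticastSocket sock = new MulticastSocket(45577);
            sock.joinGroup(group.getAddress());
            sock.setSoTimeout(2000); // fail fast, unlike FLUSH's 15-30s wait
            byte[] msg = "ping".getBytes(StandardCharsets.UTF_8);
            sock.send(new DatagramPacket(msg, msg.length, group));
            DatagramPacket in = new DatagramPacket(new byte[64], 64);
            sock.receive(in);
            System.out.println("OK: received own multicast message");
            sock.close();
        } catch (SocketTimeoutException e) {
            System.out.println("FAIL: own multicast not looped back");
        } catch (Exception e) {
            System.out.println("ERROR: " + e.getMessage());
        }
    }
}
```

On a RHEL 4U5 box exhibiting the behavior described in the JIRA, I'd expect this to print the FAIL line when bound the same way the failing node is.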
set the socket option IP_MULTICAST_LOOP if not already set
Per the java.net.MulticastSocket javadocs, this is only a hint to the underlying platform, so it's not reliable. I'll let Bela comment beyond that. In any case it only solves part of the problem -- multicast communication has to work to the other nodes, not just receipt of one's own messages.
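For reference, this is what setting that hint looks like through java.net.MulticastSocket (note the inverted naming in the API: setLoopbackMode(true) asks to *disable* loopback):

```java
import java.net.MulticastSocket;

// Minimal sketch of the IP_MULTICAST_LOOP "hint" via MulticastSocket.
public class LoopbackHint {
    public static void main(String[] args) throws Exception {
        MulticastSocket sock = new MulticastSocket(0);
        // Request that we receive our own multicast traffic
        // (i.e. IP_MULTICAST_LOOP on). The javadoc warns this is only
        // a hint that the platform may ignore.
        sock.setLoopbackMode(false);
        // getLoopbackMode() returns true if loopback has been disabled.
        System.out.println("loopback disabled? " + sock.getLoopbackMode());
        sock.close();
    }
}
```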
And no, setting multicast route through lo is not a solution because that way mcast messages will be seen on the local loopback listener but not by the listener on eth#
Fair enough. But this, together with your description of the RHEL 4U5+ behavior, implies that on RHEL 4U5 multicast can only work on a single interface per machine, which seems strange.
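When debugging this sort of thing, a quick plain-JDK sanity check (a throwaway diagnostic, not part of JGroups or the testsuite) is to list each interface and whether it reports multicast support, so you can see which interfaces are even candidates:

```java
import java.net.NetworkInterface;
import java.util.Collections;

// List each interface with its up/multicast status and addresses.
public class IfaceCheck {
    public static void main(String[] args) throws Exception {
        for (NetworkInterface ni
                : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            System.out.println(ni.getName()
                    + " up=" + ni.isUp()
                    + " multicast=" + ni.supportsMulticast()
                    + " addrs=" + Collections.list(ni.getInetAddresses()));
        }
    }
}
```

Note that supportsMulticast() only reflects what the interface claims; it says nothing about where the kernel's multicast route actually points.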
Bottom line, IMHO this testsuite should be run with node0 bound to $MYTESTIP_1 and node1 to $MYTESTIP_2.
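Concretely, the QA wiring might look something like the following in the testsuite's property overrides. The property names here are my assumption based on the node0/node1 naming in this thread, so check the actual build files:

```properties
# Hypothetical testsuite overrides -- property names are an assumption.
# Bind each node to a distinct, non-loopback alias on the interface
# that the multicast route goes through.
node0=${MYTESTIP_1}
node1=${MYTESTIP_2}
```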