5 Replies Latest reply on Nov 13, 2007 3:49 AM by akostadinov

    Testsuite issues with multicast routing (JBAS-4939)

    brian.stansberry

      Continuing discussion from http://jira.jboss.com/jira/browse/JBAS-4939

      "Alexander Kostadinov" wrote:

      NOTE: JBoss 4.2 cluster tests pass even when one node is bound to localhost


      On dev90 when a server is bound to localhost? We already established that the problem does not exist on qa01. This is an important question -- if clustering tests work on dev90 with 4.2 but not with AS 5, there is something more we need to understand.


      Solutions:
      1. Use only IPs on the same interface that the multicast route goes through


      Dimitris in the JBAS-4939 description says "Eventually, we should get rid of the 'noip' run." Sounds like you guys are saying the same thing -- don't use -b localhost. I have no objection, as long as we can confirm that there is no change between JGroups 2.4.1.x (AS 4.2) and JGroups 2.5+ (AS 5) that we're sweeping under a rug.

      I don't see any major QA benefit to running tests bound to localhost; real servers are not likely to run that way. Seems to me that the default testsuite behavior of binding one node to localhost and one to the machine name is just a convenience for developers; no reason QA servers should use that default. (And it's not even a real convenience for devs if the dev uses Windows or, it seems, RHEL 4U5+.)

      2. Disable the default behavior of binding the socket to an interface and let the OS choose the default one. If the user forces the socket to be bound to a specific interface, then bind to it. IMHO that is how things should work. And we will avoid the need for the user to mess with multicast configuration and add IPs just to be able to run the test suite.


      Please open a thread on the Clustering Dev forum to discuss. What you're saying makes sense with respect to running the testsuite, but there are other issues to consider that go well beyond the scope of this JIRA.
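      To make option 2 a bit more concrete, here is a rough java.net-level sketch of the two binding strategies being discussed. This is a hypothetical illustration, not the actual JGroups UDP transport code, and the 10.0.0.5 address is made up:

        import java.net.InetAddress;
        import java.net.MulticastSocket;

        public class BindChoice {
            public static void main(String[] args) throws Exception {
                InetAddress group = InetAddress.getByName("224.10.10.10"); // arbitrary test group

                // Option 2 above: let the OS pick the outgoing interface,
                // i.e. follow whatever the kernel's multicast route says.
                MulticastSocket osChoice = new MulticastSocket(5555);
                osChoice.joinGroup(group);

                // Current behavior being questioned: force a specific interface,
                // e.g. when the user passes -b / bind_addr explicitly.
                MulticastSocket forced = new MulticastSocket(5556);
                forced.setInterface(InetAddress.getByName("10.0.0.5")); // hypothetical address
                forced.joinGroup(group);

                osChoice.close();
                forced.close();
            }
        }

      The point of option 2 is simply that the second form should only be used when the user explicitly asks for it.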


      The other issue, of server start-up being very slow because sent messages are not seen by the sender, is I think somehow related to the issue above and to the change in behavior after RHEL 4U4. So one of the above will hopefully fix that:


      Yes, this is absolutely correct. When the JGroups channel cannot receive its own multicast messages, its FLUSH protocol blocks waiting for them for 15 to 30 seconds at a time. This happens several times during the course of startup and shutdown.

      An issue I need to explore with the JGroups team is whether we can just make the channel connection (and hence server start) fail altogether in this situation.
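      For illustration, here is a rough sketch of the kind of connect-time self-check I have in mind -- send one multicast message and see whether it comes back within a short timeout, instead of letting FLUSH block repeatedly. This is not JGroups code; the group, port, and timeout are made-up values:

        import java.net.DatagramPacket;
        import java.net.InetAddress;
        import java.net.MulticastSocket;
        import java.net.SocketTimeoutException;

        public class McastSelfCheck {
            public static void main(String[] args) throws Exception {
                InetAddress group = InetAddress.getByName("224.10.10.10"); // test group
                int port = 5555;
                MulticastSocket sock = new MulticastSocket(port);
                // A real check would bind to the configured bind_addr, like the channel does.
                sock.joinGroup(group);
                sock.setSoTimeout(3000); // fail within seconds, not the 15+ seconds FLUSH waits

                byte[] probe = "self-check".getBytes();
                sock.send(new DatagramPacket(probe, probe.length, group, port));

                DatagramPacket in = new DatagramPacket(new byte[64], 64);
                try {
                    sock.receive(in);
                    System.out.println("OK: this host sees its own multicast traffic");
                } catch (SocketTimeoutException e) {
                    // This is the situation described above: the channel cannot see its own
                    // messages, so startup would hang in FLUSH. Fail fast instead.
                    System.err.println("FAIL: own multicast messages are not received");
                } finally {
                    sock.leaveGroup(group);
                    sock.close();
                }
            }
        }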

      set the socket option IP_MULTICAST_LOOP if not already set


      Per the java.net.MulticastSocket javadocs this is only a hint to the underlying platform, so it's not reliable. I'll let Bela comment beyond that. In any case it only solves part of the problem -- multicast communication has to work to the other nodes, not just receipt of one's own messages.
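      For reference, setting the hint from Java looks like this (a minimal sketch; port 5555 is arbitrary). MulticastSocket.setLoopbackMode() takes a "disable" flag, so passing false requests that loopback be enabled:

        import java.net.MulticastSocket;

        public class LoopbackHint {
            public static void main(String[] args) throws Exception {
                MulticastSocket sock = new MulticastSocket(5555);
                // Request that our own multicast traffic be looped back to us.
                // Per the javadoc this is only a hint; the platform may ignore it.
                sock.setLoopbackMode(false);          // false = enable loopback
                System.out.println("loopback disabled? " + sock.getLoopbackMode());
                sock.close();
            }
        }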

      And no, setting the multicast route through lo is not a solution, because that way mcast messages will be seen by the local loopback listener but not by the listener on eth#.


      Fair enough. But this and your description of the RHEL 4U5+ behavior imply that in RHEL 4U5, multicast can only work on a single interface in a machine, which seems strange.


      Bottom line, IMHO the testsuite should be run with node0 bound to $MYTESTIP_1 and node1 bound to $MYTESTIP_2.

        • 1. Re: Testsuite issues with multicast routing (JBAS-4939)
          akostadinov

          Let me summarize the issue again. On RHEL 4U4 and below, it seems locally sent multicast messages are received as long as the listener is bound to the same interface as the sender, no matter whether the multicast route points to another interface.

          On RHEL 4U5 and above that is not the case: if the sender and receiver are bound to the same interface, but it is not the one the multicast route goes through, the receiver can't see anything. That hardly seems sane.

          I think that is exactly what causes the server bound to localhost to be unable to see its own sent messages. So even if IP_MULTICAST_LOOP is just a hint, we should set it anyway; as long as we rely on that behavior, we have to at least request it. I suggest we try that first to see if it fixes that particular issue.

          To make things work now, both servers have to bind their multicast sockets (only the multicast ones) to the same interface the mcast route goes through. For me that is sometimes inconvenient and error prone.

          I don't know how much sense it makes for JGroups itself to bind or not bind multicast sockets to a particular interface, but specifying an interface is bad for JBoss AS manageability. Imagine a server administrator managing the server's interfaces and configuring traffic to go through another one; that way the AS server will stop working properly.
          On the other hand, if the JBoss multicast socket is not bound to a specific interface, it is highly unlikely to produce any issues. And even if it does in very specific cases, the user can specify an interface manually.

          IMHO the safe choice is not to specify any interface for the mcast socket; that will produce the "just works" feeling when one tries to run the test suite. I know some clients like to run the test suite locally, and not having issues with it will make them more confident in JBoss AS quality. My feeling is that we'll have fewer issues with hudson runs as well ;)

          So there are two things we can do:
          1. Ignore and disable the noip run
          or
          2. See whether IP_MULTICAST_LOOP fixes one of the issues *and* disable binding mcast sockets to a specific interface unless the user requests it.

          P.S.


          On dev90 when a server is bound to localhost? We already established that the problem does not exist on qa01. This is an important question -- if clustering tests work on dev90 with 4.2 but not with AS 5, there is something more we need to understand.

          The difference between dev90 and qa01 is that one is RHEL 4U5 and the other 4U4. I will try the AS 4.2 clustering tests on dev90 with node0=localhost.


          • 2. Re: Testsuite issues with multicast routing (JBAS-4939)
            akostadinov

            I've started a topic in the clustering forum
            http://www.jboss.com/index.html?module=bb&op=viewtopic&t=123075

            And I verified that JBoss AS 4.2 doesn't have any issues with the clustering test suite on RHEL 4U5. That run was done on dev03 with the command "./build.sh tests-clustering -Dnode0=localhost -Dnode1=`hostname` -DudpGroup=$MCAST_ADDR".

            • 3. Re: Testsuite issues with multicast routing (JBAS-4939)
              akostadinov

              I think we have no more test suite issues. I'll comment on the JIRA tomorrow, once the test suite run is completed.

              • 4. Re: Testsuite issues with multicast routing (JBAS-4939)
                brian.stansberry

                I believe Adrian found the problem -- IPv6: see http://www.jboss.com/index.html?module=bb&op=viewtopic&t=123463 .

                I've updated testsuite/imports/server-configs.xml to set java.net.preferIPv4Stack=true on all configs.

                If I run the JGroups McastSenderTest and McastReceiverTest I can send messages from a process bound to localhost to a process bound to $MYTESTIP_1 on a number of QA lab machines. But a process bound to localhost cannot receive the messages sent from localhost. Specifying -Djava.net.preferIPv4Stack=true solves the problem.
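                 As a quick way to see which stack a given JVM run picked, something like the following (a made-up diagnostic, not part of the testsuite or JGroups) can be run once with and once without the flag; on an IPv6-capable box the wildcard address it prints typically switches from an IPv6 one to 0.0.0.0:

                   import java.net.Inet6Address;
                   import java.net.InetSocketAddress;
                   import java.net.MulticastSocket;

                   public class StackCheck {
                       public static void main(String[] args) throws Exception {
                           MulticastSocket sock = new MulticastSocket(5555); // arbitrary port
                           InetSocketAddress local = (InetSocketAddress) sock.getLocalSocketAddress();
                           // Shows whether the JVM created an IPv4 or an IPv6 socket underneath.
                           System.out.println("java.net.preferIPv4Stack=" + System.getProperty("java.net.preferIPv4Stack"));
                           System.out.println("wildcard bind address: " + local.getAddress()
                                   + " (" + (local.getAddress() instanceof Inet6Address ? "IPv6" : "IPv4") + ")");
                           sock.close();
                       }
                   }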

                • 5. Re: Testsuite issues with multicast routing (JBAS-4939)
                  akostadinov

                  Hmm, on which host did you try that? On dev90 I ran two processes:
                  java -Djava.net.preferIPv4Stack=true -cp jgroups.jar org.jgroups.tests.McastSenderTest -mcast_addr 224.10.10.10 -bind_addr localhost
                  and
                  java -Djava.net.preferIPv4Stack=true -cp jgroups.jar org.jgroups.tests.McastReceiverTest -mcast_addr 224.10.10.10 -port 5555 -bind_addr localhost

                  And the receiver doesn't see anything. I'm not saying there are no issues with IPv6, but there still seems to be a change in behavior with RHEL 4U5 and above. I've opened a bugzilla about that: https://bugzilla.redhat.com/show_bug.cgi?id=369591