10 Replies Latest reply on Jul 16, 2002 11:27 AM by derry

    Clustering problem on Linux

      Hi Folks,

      I've got a strange issue here. I am sure it is not a JBoss issue, but it has to do with JBoss clustering on Linux.

      When I start JBoss 3.0.0 on my two Linux servers, they just run standalone. Apparently they don't see each other, although they are on the same LAN segment and both have a multicast entry in the routing table (224.0.0.0 mask 240.0.0.0).
      I also run Novell eDirectory on the same machines, which uses SLP over UDP to get information about other directory trees - this works.
      The same happens when I start JBoss on one of the Linux servers and on a Win2k server - no clustering.
      However, when starting a second JBoss instance on my "old" Win98 machine, the two JBoss instances happily join the default cluster.

      Can anybody explain what I am missing here? I guess it is just some small trivial thing but I already wasted an entire day to get it working.

      Thanks,
      -Uli

        • 1. Re: Clustering problem on Linux

          Okay, I did some further research into this.

          I set up a sniffer in my LAN segment to see what's going on - I first suspected that the UDP packets weren't getting through.

          However, the sniffer tells me that the UDP packets from both JBoss servers go through the network and also reach the interface of each machine. I also set up a sniffer on my Win2k machine, with the same result.

          So, here is the question: why does JBoss running on Win2k and Win98 join the cluster, while the JBoss instances running on two Linux boxes see each other's multicast packets but simply ignore them and don't join the cluster?

          I am clueless. Is there anything else besides the entries in the routing table that I have to watch out for?

          Any help is more than welcome,

          -Uli

          • 2. Re: Clustering problem on Linux
            derry

            Do you have IP multicasting enabled on your router and machines? JBoss clustering uses IP multicasting by default. You can modify this behavior in cluster-service.xml.

            CU
            Thomas

            • 3. Re: Clustering problem on Linux

              Thomas,

              thanks for your answer.
              My answer to your question is yes: I do have multicasting enabled on the router and on all machines.

              I have one question though: you mentioned that the multicast behavior should be somehow controllable in cluster-service.xml? Where and how exactly? The default file doesn't show anything like this. I am about to buy the clustering book, but haven't received it yet. Maybe you can give me a quickstart on this. :)

              Also, as I tried to explain earlier, I have Novell eDirectory (an LDAP-compliant directory) running on the Linux boxes, and it has a component called SLP that sends out IP packets to determine whether there are other eDirectories in the "neighborhood". This works just fine. Also, according to my sniffer exercise, the multicast packets actually reach the interface of the Linux boxes. Somehow, JBoss just ignores them.

              I spent more than a full day to find out what's wrong, but I am stuck now.
              I definitely would prefer to have the environment on Linux, but if I don't get it to work soon, I might have to change plans and implement our solution on Win2k. :(

              -Uli

              • 4. Re: Clustering problem on Linux

                Just to add on...

                I did the sniffing exercise again. After I started my JBoss instance on my Win2k box, I could see its multicast packets appearing on eth0 on my Linux box. Once I started the JBoss instance on my Linux box, I also saw the multicast packets from this box. The multicast gets sent out about every 10 seconds. Although both sets of multicast packets can be seen on my Linux interface, JBoss doesn't seem to recognize them and refuses to join the cluster - both instances operate in isolation.
                I see the same behavior if I start JBoss on Linux first and then on Win2k.

                However, if I start a JBoss instance on any other Windows box (I've tested WinXP, WinME and Win2k so far), it'll happily join the cluster with the instance started on the first Win2k box.

                I also have the issue if I start two JBoss instances on two Linux boxes - they simply refuse to join the cluster.

                Just to make it clear: I have another application running on the Linux boxes that uses multicast, and this application works just fine. (BTW, JBoss clustering also doesn't work if I stop the other app on Linux, so no interference here...)

                I am clueless... especially since it seems to work for other people on other Linux distributions.

                I remember that I had an early JBoss 3 beta running on the Linux boxes and, if I recall correctly, clustering worked there.

                To give you some complete information: I am running SuSE Linux 7.3 with the 2.4.4 Kernel.

                Can anyone help...?

                Thanks,

                -Uli

                • 5. Re: Clustering problem on Linux
                  derry

                  Maybe this is a bug in the linux kernel you are using (2.4.4 is quite old). We are running RedHat Linux 7.2 with a 2.4.9 kernel. Everything works.

                  At home I have SuSE 7.4 with a 2.4.16. It works as expected.

                  Another question: Which JDK are you using? Maybe the JDK has problems with multicasting. You mentioned that other multicast applications work well.

                  We are using Sun's JDK 1.4.0_01. JDK 1.3.1_03 worked well, too.

                  • 6. Re: Clustering problem on Linux
                    derry

                    You can change the way the cluster communicates in cluster-service.xml:

                    <attribute name="PartitionProperties">UDP(mcast_addr=228.1.2.3;mcast_port=45566):PING:FD(timeout=5000):VERIFY_SUSPECT(timeout=1500):MERGE:NAKACK:UNICAST(timeout=5000;min_wait_time=2000):FRAG:FLUSH:GMS:STATE_TRANSFER:QUEUE</attribute>



                    See http://www.javagroups.com for details of the parameters.
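Judging only from the example above, the property string looks like a colon-separated list of protocols, each optionally followed by semicolon-separated key=value parameters in parentheses. The following is a minimal sketch under that assumption, not JavaGroups' own parser: the class name ProtocolStackParser and its methods are made up for illustration. It can be handy for sanity-checking a string before pasting it into cluster-service.xml, since a missing or extra parenthesis is rejected.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JBoss or JavaGroups.
class ProtocolStackParser {

    /** Maps each protocol name to its parameters, preserving the order of the stack. */
    static Map<String, Map<String, String>> parse(String stack) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String spec : splitTopLevel(stack.trim())) {
            int open = spec.indexOf('(');
            String name = (open < 0 ? spec : spec.substring(0, open)).trim();
            Map<String, String> params = new LinkedHashMap<>();
            if (open >= 0) {
                if (!spec.endsWith(")")) {
                    throw new IllegalArgumentException("Expected ')' at end of: " + spec);
                }
                String body = spec.substring(open + 1, spec.length() - 1);
                for (String pair : body.split(";")) {
                    if (pair.isEmpty()) continue;
                    int eq = pair.indexOf('=');
                    if (eq < 0) {
                        throw new IllegalArgumentException("Expected key=value, got: " + pair);
                    }
                    params.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
                }
            }
            result.put(name, params);
        }
        return result;
    }

    /** Splits on ':' only outside parentheses and rejects unbalanced parentheses. */
    static List<String> splitTopLevel(String s) {
        List<String> parts = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    throw new IllegalArgumentException("Unexpected ')' at index " + i);
                }
            } else if (c == ':' && depth == 0) {
                parts.add(s.substring(start, i));
                start = i + 1;
            }
        }
        if (depth != 0) {
            throw new IllegalArgumentException("Missing ')' in: " + s);
        }
        parts.add(s.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        String stack = "UDP(mcast_addr=228.1.2.3;mcast_port=45566):PING:FD(timeout=5000)";
        // Prints one line per protocol with its parsed parameters.
        parse(stack).forEach((name, params) -> System.out.println(name + " " + params));
    }
}
```

This sketch keeps only the last occurrence if a protocol name appears twice, which is fine for a quick check but would need a list-based structure for anything more serious.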

                    • 7. Re: Clustering problem on Linux

                      Thanks a lot - that's at least a starting point!

                      We are using JDK 1.3.1_03 as well and I got the Kernel wrong - SuSE 7.3 comes with 2.4.10.4GB which is newer than your 2.4.9.

                      I am going to explore this a bit further.

                      Cheers,
                      -Uli

                      • 8. Re: Clustering problem on Linux
                        belaban

                        Uli,

                        can you try out the Draw demo?

                        java org.javagroups.demos.Draw

                        on both machines, and they should see each other.

                        If this doesn't work, enable tracing for JavaGroups: put the following in your javagroups.properties (which has to be in your home dir or in the CLASSPATH):

                        trace=true
                        timestamp_format=HH:mm:ss[SSS]
                        default_output=DEBUG /tmp/trace.all
                        Send me (belaban@yahoo.com) the 2 trace files (from the 2 different machines), so I can have a look.
                        Cheers,

                        Bela
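Separately from the Draw demo, raw multicast delivery between two hosts can be checked with plain java.net, which takes JBoss and JavaGroups out of the picture. This is only a sketch under assumptions: the class name MulticastCheck is made up, the group and port are the ones from the cluster-service.xml example earlier in the thread, and the API calls assume a reasonably modern JDK rather than the 1.3/1.4 versions discussed here. The local IP argument selects the interface explicitly, so the test does not depend on which interface the JVM picks by default.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.NetworkInterface;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic, not part of JBoss or JavaGroups.
class MulticastCheck {
    static final String GROUP = "228.1.2.3"; // group from the thread's example config
    static final int PORT = 45566;           // port from the thread's example config

    /** Resolves an IP literal and insists that it is a multicast address. */
    static InetAddress group(String literal) throws Exception {
        InetAddress addr = InetAddress.getByName(literal);
        if (!addr.isMulticastAddress()) {
            throw new IllegalArgumentException(literal + " is not a multicast address");
        }
        return addr;
    }

    /** Looks up the interface that owns the given local IP address. */
    static NetworkInterface interfaceFor(String localIp) throws Exception {
        NetworkInterface nif = NetworkInterface.getByInetAddress(InetAddress.getByName(localIp));
        if (nif == null) {
            throw new IllegalArgumentException("No local interface has address " + localIp);
        }
        return nif;
    }

    static void send(String localIp, String message) throws Exception {
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.setNetworkInterface(interfaceFor(localIp)); // outgoing interface
            socket.setTimeToLive(32);
            byte[] data = message.getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(data, data.length, group(GROUP), PORT));
        }
    }

    /** Waits for one datagram; throws SocketTimeoutException if none arrives in time. */
    static String receive(String localIp, int timeoutMillis) throws Exception {
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            socket.joinGroup(new InetSocketAddress(group(GROUP), PORT), interfaceFor(localIp));
            socket.setSoTimeout(timeoutMillis);
            DatagramPacket packet = new DatagramPacket(new byte[1024], 1024);
            socket.receive(packet);
            String text = new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8);
            return text + " (from " + packet.getAddress().getHostAddress() + ")";
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2 && args[0].equals("send")) {
            send(args[1], "hello from " + args[1]);
            System.out.println("sent");
        } else if (args.length == 2 && args[0].equals("receive")) {
            System.out.println(receive(args[1], 30_000));
        } else {
            System.out.println("usage: MulticastCheck send|receive <local-ip>");
        }
    }
}
```

Start it with "receive" and the local IP on one machine, then with "send" and the local IP on the other. It is probably best to stop JBoss first so that its own traffic on the same group does not show up as the received datagram.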

                        • 9. Re: Clustering problem on Linux

                          Okay folks,

                          after quite some effort, hours of investigating and debugging, and with the active support of Bela (BIG THANKS!), I think I have discovered an issue/bug with the current JBoss clustering.

                          The issue manifested itself as described in my earlier postings - JBoss on Win2k or Linux won't join the default cluster partition of an already running JBoss instance on Linux. Just to reiterate, I verified that my network supports multicast.

                          So, here is what I found:
                          1. There must be an issue with the javagroups-2.0.jar that is bundled with the current JBoss 3.0.0 download package from May 31st. If I get the latest javagroups source from CVS, compile it and replace the 'old' jar with the newly compiled one, everything seems to be OK and working in any configuration/combination ... except,
                          2. when running on Linux, the parameter "bind_addr" should be set so that, in my case, 'eth0' is used instead of 'lo'. When I initially installed Linux, I didn't have an active network connection. In this case, only 'lo' got activated on Linux, and this became the first network interface. Now, javagroups seems to always try to bind to this 'first' interface, 'lo' in my case. As a result, no multicasts were sent to the network :(.
                          So, I added the following parameters to my 'cluster-service.xml' to make it work:




                          UDP(mcast_addr=228.8.8.8;mcast_port=45566;bind_addr=<ip-addr-of-your-if>;ip_ttl=32;mcast_send_buf_size=32000;mcast_recv_buf_size=64000):PING(timeout=2000;num_initial_members=3):MERGE2(min_interval=5000;max_interval=10000):FD:VERIFY_SUSPECT(timeout=1500):pbcast.STABLE(desired_avg_gossip=20000):pbcast.NAKACK(gc_lag=50;retransmit_timeout=300,600,1200,2400,4800):UNICAST(timeout=1200):FRAG(down_thread=false;up_thread=false):pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=false;print_local_addr=true)




                          Please note that these are the exact default protocol parameters as given in JBoss' "ClusterPartition.java", except for the additional parameter 'bind_addr='.

                          Everything seems to work for me, at least for now.
                          I hope I could help some other folks within the JBoss community with this report.

                          Cheers,
                          -Uli
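To find a value for bind_addr on a given box, it can help to list the interfaces that are up, are not loopback and report multicast support, together with their IPv4 addresses. A minimal sketch, assuming a modern JDK (the isUp, isLoopback and supportsMulticast methods on java.net.NetworkInterface are not available in the JDK 1.3.1 mentioned in this thread); the class name BindAddrCandidates is made up for illustration.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of JBoss or JavaGroups.
class BindAddrCandidates {

    /** Returns "interfaceName ipv4Address" for every interface that could carry cluster multicast. */
    static List<String> candidates() throws SocketException {
        List<String> result = new ArrayList<>();
        Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
        if (all == null) {
            return result; // some JDKs return null when no interface is configured
        }
        for (NetworkInterface nif : Collections.list(all)) {
            // Skip 'lo' and anything that is down or cannot do multicast.
            if (!nif.isUp() || nif.isLoopback() || !nif.supportsMulticast()) {
                continue;
            }
            for (InetAddress addr : Collections.list(nif.getInetAddresses())) {
                if (addr instanceof Inet4Address) {
                    result.add(nif.getName() + " " + addr.getHostAddress());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws SocketException {
        List<String> found = candidates();
        if (found.isEmpty()) {
            System.out.println("No non-loopback, multicast-capable IPv4 interface found");
        }
        found.forEach(System.out::println);
    }
}
```

Any address it prints is a candidate for bind_addr; on a machine with several NICs, pick the one on the LAN segment shared with the other cluster members.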

                          • 10. Re: Clustering problem on Linux
                            derry

                            Please file a bug report at http://sourceforge.net/projects/jboss/. This is a bug we must work on...

                            Thanks for your analysis!

                            CU
                            Thomas