Okay, I did some further research into this.
I set up a sniffer in my LAN segment to see what's going on - I first suspected that the UDP packets weren't getting through.
However, the sniffer tells me that the UDP packets from both JBoss servers go through the network and also reach the interface of each machine. I also set up a sniffer on my Win2k machine, with the same result.
So, here is the question: why do JBoss instances running on Win2k and Win98 join the cluster, while JBoss instances running on two Linux boxes see each other's broadcast packets, but simply ignore them and don't join the cluster?
I am clueless. Is there anything else besides the entries in the routing table that I have to watch out for?
Any help is more than welcome,
Do you have IP multicasting enabled in your router and machines? JBoss clustering uses IP multicasting as default. You can modify this behavior in cluster-service.xml.
Thanks for your answer.
My answer to your question is: yes, I do have multicasting enabled on the router and on all machines.
I have one question though: you mentioned that the multicast behavior should be somehow controllable in cluster-service.xml? Where and how exactly? The default file doesn't show anything like this. I am about to buy the clustering book, but haven't received it yet. Maybe you can give me a quickstart on this. :)
Also, as I tried to explain earlier, I have Novell eDirectory (an LDAP-compliant directory) running on the Linux boxes, and it has a component called SLP that sends out IP packets to determine whether there are other eDirectories in the "neighborhood". This works just fine. Also, according to my sniffer exercise, the multicast packets actually reach the interface of the Linux boxes. Somehow, JBoss just ignores them.
I spent more than a full day trying to find out what's wrong, but I am stuck now.
I definitely would prefer to have the environment on Linux, but if I don't get it to work soon, I might have to change plans and implement our solution on Win2k. :(
Just to add on...
I did the sniffing exercise again. After I start the JBoss instance on my Win2k box, I can see its multicast packets appearing on eth0 on my Linux box. Once I start the JBoss instance on my Linux box, I see the multicast packets from that box as well. The multicast gets sent out about every 10 seconds. Although both sets of multicast packets can be seen on my Linux interface, JBoss doesn't seem to recognize them and refuses to join the cluster - both instances operate in isolation.
I get the same behavior if I start JBoss on Linux first and then on Win2k.
However, if I start a JBoss instance on any other Windows box (I've tested WinXP, WinME and Win2k so far), it'll happily join the cluster with the instance started on the first Win2k box.
I also have the issue if I start two JBoss instances on two Linux boxes - they simply refuse to join the cluster.
Just to make it clear: I have another application running on the Linux boxes that uses multicast and this application just works fine. (BTW, JBoss clustering also doesn't work, if I stop the other app on Linux, so no interference here...)
I am clueless... especially since it seems to work for other people on other Linuxes.
I remember that I had an early JBoss 3 Beta running on the Linux boxes and, if I recall correctly, clustering worked there.
To give you some complete information: I am running SuSE Linux 7.3 with the 2.4.4 Kernel.
Can anyone help...?
Maybe this is a bug in the Linux kernel you are using (2.4.4 is quite old). We are running RedHat Linux 7.2 with a 2.4.9 kernel. Everything works.
At home I have SuSE 7.4 with a 2.4.16. It works as expected.
Another question: Which JDK are you using? Maybe the JDK has problems with multicasting. You mentioned that other multicast applications work well.
We are using Sun's JDK 1.4.0_01. JDK 1.3.1_03 worked well, too.
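If you want to rule out the JDK itself, a small stand-alone check along these lines might help. It is only a sketch: the class name MulticastCheck, the group 228.1.2.3 and the 15 second timeout are my own assumptions (the port 45566 is the one from the cluster configuration), and it is written for a current JDK rather than 1.3. It joins a multicast group, sends one packet and prints whatever arrives.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

public class MulticastCheck {

    /** True if the literal is an IP multicast address (224.0.0.0 to 239.255.255.255 for IPv4). */
    static boolean isMulticastGroup(String literal) {
        try {
            return InetAddress.getByName(literal).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed defaults: pass your own group and port as arguments.
        String groupName = args.length > 0 ? args[0] : "228.1.2.3";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 45566;
        if (!isMulticastGroup(groupName)) {
            System.err.println(groupName + " is not a multicast address");
            return;
        }
        InetAddress group = InetAddress.getByName(groupName);
        try (MulticastSocket socket = new MulticastSocket(port)) {
            // null lets the JVM/OS pick the interface, which is exactly what we want to observe.
            socket.joinGroup(new InetSocketAddress(group, port), null);
            socket.setSoTimeout(15000);
            byte[] hello = ("hello " + System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(hello, hello.length, group, port));
            byte[] buf = new byte[1024];
            try {
                while (true) {
                    DatagramPacket in = new DatagramPacket(buf, buf.length);
                    socket.receive(in);
                    String text = new String(in.getData(), 0, in.getLength(), StandardCharsets.UTF_8);
                    System.out.println(in.getAddress().getHostAddress() + ": " + text);
                }
            } catch (SocketTimeoutException e) {
                System.out.println("No packets received for 15 seconds, giving up.");
            }
        }
    }
}
```

Run it on two machines at roughly the same time, with JBoss stopped or on a different port so the output isn't mixed with JBoss traffic. If each box only ever prints its own "hello", the JVM's multicast is not reaching the other machine.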
You can change the way the cluster communicates in cluster-service.xml:
<attribute name="PartitionProperties">UDP(mcast_addr=126.96.36.199;mcast_port=45566):PING:FD(timeout=5000):VERIFY_SUSPECT(timeout=1500):MERGE:NAKACK:UNICAST(timeout=5000;min_wait_time=2000):FRAG:FLUSH:GMS:STATE_TRANSFER:QUEUE</attribute>
See http://www.javagroups.com for details of the parameters.
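In case the format of that string isn't obvious: it is a list of protocol layers separated by ':', and each layer can carry 'name=value' parameters separated by ';' inside parentheses. Here is a rough sketch that splits such a string apart so you can see which layer owns which parameter. The class StackStringReader is a made-up helper for illustration, not part of JavaGroups or JBoss, and the address in the example is just a placeholder.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StackStringReader {

    /** Splits "UDP(a=1;b=2):PING:FD(timeout=5000)" into layer name -> parameters, keeping the order. */
    static Map<String, Map<String, String>> parse(String stack) {
        // First pass: cut at ':' characters that are outside parentheses.
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        for (char c : stack.trim().toCharArray()) {
            if (c == '(') depth++;
            if (c == ')') depth--;
            if (c == ':' && depth == 0) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());

        // Second pass: separate each layer's name from its parameter list.
        Map<String, Map<String, String>> layers = new LinkedHashMap<>();
        for (String part : parts) {
            String p = part.trim();
            if (p.isEmpty()) continue;
            Map<String, String> params = new LinkedHashMap<>();
            String name = p;
            int open = p.indexOf('(');
            if (open >= 0) {
                name = p.substring(0, open).trim();
                int close = p.lastIndexOf(')');
                String body = p.substring(open + 1, close > open ? close : p.length());
                for (String kv : body.split(";")) {
                    int eq = kv.indexOf('=');
                    if (eq > 0) {
                        params.put(kv.substring(0, eq).trim(), kv.substring(eq + 1).trim());
                    }
                }
            }
            layers.put(name, params);
        }
        return layers;
    }

    public static void main(String[] args) {
        // Placeholder values; paste your own PartitionProperties string as the first argument.
        String stack = args.length > 0 ? args[0]
                : "UDP(mcast_addr=228.1.2.3;mcast_port=45566):PING:FD(timeout=5000)";
        for (Map.Entry<String, Map<String, String>> layer : parse(stack).entrySet()) {
            System.out.println(layer.getKey() + " " + layer.getValue());
        }
    }
}
```

The multicast group and port both live in the UDP layer, so that is the part to change if you want the cluster to communicate differently.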
Thanks a lot - that's at least a starting point!
We are using JDK 1.3.1_03 as well and I got the Kernel wrong - SuSE 7.3 comes with 188.8.131.52GB which is newer than your 2.4.9.
I am going to explore this a bit further.
Can you try out the Draw demo? Run it on both machines, and they should see each other.
If this doesn't work, enable tracing for JavaGroups: put the following in your javagroups.properties (which has to be in your home dir or in the CLASSPATH):
Send me (email@example.com) the 2 trace files (from 2 different machines), so I can have a look.
After quite some effort and hours of investigating and debugging, and with the active support of Bela (BIG THANKS!), I think I have discovered an issue/bug in the current JBoss clustering.
The issue manifested itself as described in my posting before - JBoss on Win2k or Linux won't join the default cluster partition of an already running JBoss instance on Linux. Just to re-iterate, I verified that my network supported multicasts.
So, here is what I found:
1. There must be an issue with the javagroups-2.0.jar that is bundled with the current JBoss 3.0.0 download package from May 31st. If I get the latest javagroups source from CVS, compile it and replace the 'old' jar with the newly compiled one, everything seems to be OK and working in any configuration/combination ... except,
2. when running on Linux, a parameter "bind_addr" should be set so that 'eth0' is used instead of 'lo' in my case. When I initially installed Linux, I didn't have an active network connection. In this case, only 'lo' got activated on Linux, and it became the first network interface. Now, javagroups seems to always try to bind to this 'first' interface, 'lo' in my case. As a result, no multicasts were sent to the network :(.
So, I added the following parameters to my 'cluster-service.xml' to make it work:
Please note that these are the exact default protocol parameters as given in JBoss' "ClusterPartition.java", except for the additional parameter 'bind_addr='.
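For anyone who wants to check which interfaces their own JVM sees, and in what order, something like the following might be useful. It is just a sketch for a current JDK: BindAddrPicker and its methods are names I made up, and "first non-loopback IPv4 address" is only my guess at a sensible candidate for 'bind_addr', so verify the result against your own network setup.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BindAddrPicker {

    /** Returns the first address in the list that is not a loopback address, or null if there is none. */
    static String chooseBindAddr(List<String> candidates) {
        for (String ip : candidates) {
            try {
                if (!InetAddress.getByName(ip).isLoopbackAddress()) {
                    return ip;
                }
            } catch (UnknownHostException e) {
                // Not a usable address, try the next one.
            }
        }
        return null;
    }

    /** IPv4 addresses of all interfaces that are up, in the order the JVM reports them. */
    static List<String> localIpv4Addresses() throws SocketException {
        List<String> result = new ArrayList<>();
        for (NetworkInterface nif : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            if (!nif.isUp()) continue;
            System.out.println(nif.getName()
                    + " loopback=" + nif.isLoopback()
                    + " multicast=" + nif.supportsMulticast());
            for (InetAddress addr : Collections.list(nif.getInetAddresses())) {
                if (addr instanceof Inet4Address) {
                    result.add(addr.getHostAddress());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws SocketException {
        List<String> addresses = localIpv4Addresses();
        System.out.println("Addresses in JVM order: " + addresses);
        String candidate = chooseBindAddr(addresses);
        System.out.println(candidate == null
                ? "Only loopback addresses found."
                : "Candidate: bind_addr=" + candidate);
    }
}
```

If 'lo' shows up first in the output and a non-loopback address only appears further down, you are probably in the same situation I was in.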
Everything seems to work for me now, at least for now.
I hope I could help some other folks within the JBoss community with this report.