clustering problems - nodes fail to cluster
spambob Jul 27, 2005 2:47 PM
I have the following situation: two machines (called bs-laptop and bs-desktop),
both of which are trying to cluster together, using JBoss 3.2.5. Both machines
have "bs.playsecond.com" as their virtual host.
The /etc/hosts files read as follows. On bs-desktop:
192.168.253.47 bs.playsecond.com bs-desktop.playsecond.com bs-desktop
192.168.253.46 bs-laptop.playsecond.com bs-laptop
On bs-laptop:
192.168.253.47 bs-desktop.playsecond.com bs-desktop
192.168.253.46 bs.playsecond.com bs-laptop.playsecond.com bs-laptop
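Note that the shared name bs.playsecond.com maps to a different address in each file. To make that kind of difference easy to spot, here is a minimal sketch that compares two hosts files and lists every name that resolves to different addresses. It is written for a current JDK, and the class and method names (HostsDiff, parse, conflicts) are made up for illustration. It only reports the difference; it does not show that this is what stops the cluster from forming.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class HostsDiff {
    // Parse hosts-file text into a name -> address map.
    // Comments (#...) and blank lines are skipped; the first mapping for a name wins.
    public static Map<String, String> parse(String hosts) {
        Map<String, String> byName = new LinkedHashMap<>();
        for (String line : hosts.split("\n")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) {
                continue; // no address/name pair on this line
            }
            for (int i = 1; i < fields.length; i++) {
                byName.putIfAbsent(fields[i], fields[0]);
            }
        }
        return byName;
    }

    // Names present in both files that map to different addresses.
    public static Map<String, String> conflicts(String hostsA, String hostsB) {
        Map<String, String> a = parse(hostsA);
        Map<String, String> b = parse(hostsB);
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : a.entrySet()) {
            String other = b.get(e.getKey());
            if (other != null && !other.equals(e.getValue())) {
                out.put(e.getKey(), e.getValue() + " vs " + other);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The two /etc/hosts files quoted above.
        String desktop =
              "192.168.253.47 bs.playsecond.com bs-desktop.playsecond.com bs-desktop\n"
            + "192.168.253.46 bs-laptop.playsecond.com bs-laptop\n";
        String laptop =
              "192.168.253.47 bs-desktop.playsecond.com bs-desktop\n"
            + "192.168.253.46 bs.playsecond.com bs-laptop.playsecond.com bs-laptop\n";
        System.out.println(conflicts(desktop, laptop));
    }
}
```

For the two files above, the only name reported is bs.playsecond.com (192.168.253.47 on bs-desktop, 192.168.253.46 on bs-laptop).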
netstat -nr on both machines:
bs-desktop% netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
192.168.253.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
127.0.0.0 127.0.0.1 255.0.0.0 UG 0 0 0 lo
224.0.0.0 0.0.0.0 240.0.0.0 U 0 0 0 eth1
0.0.0.0 192.168.253.1 0.0.0.0 UG 0 0 0 eth1
bs-laptop% netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
192.168.253.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
127.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 lo
224.0.0.0 0.0.0.0 240.0.0.0 U 0 0 0 eth0
0.0.0.0 192.168.253.1 0.0.0.0 UG 0 0 0 eth0
When I run the ViewDemo application, both machines connect and I get the following results:
bs-desktop% java -cp ".:../server/all/lib/jgroups.jar:../server/all/lib/commons-logging.jar" org.jgroups.demos.ViewDemo
-------------------------------------------------------
GMS: address is bs:34891
-------------------------------------------------------
** New view: [bs:34891|0] [bs:34891]
** New view: [bs:34891|1] [bs:34891, bs-laptop:32786]
bs-laptop% java -cp ".:../server/all/lib/jgroups.jar:../server/all/lib/commons-logging.jar" org.jgroups.demos.ViewDemo
-------------------------------------------------------
GMS: address is bs:32786
-------------------------------------------------------
** New view: [bs-desktop:34891|1] [bs-desktop:34891, bs:32786]
So far, so good. However, when I start JBoss, the machines do not find each other. The logs read:
on bs-desktop:
2005-07-27 10:49:52,315 INFO [org.jgroups.conf.ConfiguratorFactory] properties are neither a URL nor a file
2005-07-27 10:49:52,588 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Initializing
2005-07-27 10:49:52,718 INFO [org.jgroups.protocols.UDP] unicast sockets will use interface 192.168.253.47
2005-07-27 10:49:52,722 INFO [org.jgroups.protocols.UDP] socket information:
local_addr=bs:34892 (additional data: 19 bytes), mcast_addr=228.1.2.3:45566, bind_addr=/192.168.253.47, ttl=32
socket: bound to 192.168.253.47:34892, receive buffer size=131071, send buffer size=131071
multicast socket: bound to 192.168.253.47:45566, send buffer size=131071, receive buffer size=131071
2005-07-27 10:49:52,724 INFO [STDOUT]
-------------------------------------------------------
GMS: address is bs:34892 (additional data: 19 bytes)
-------------------------------------------------------
2005-07-27 10:49:54,771 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Number of cluster members: 1
2005-07-27 10:49:54,771 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Other members: 0
2005-07-27 10:49:54,771 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Fetching state (will wait for 60000 milliseconds):
2005-07-27 10:49:54,773 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view (id: 0, delta: 0) : [192.168.253.47:1099]
2005-07-27 10:49:54,774 INFO [DefaultPartition:ReplicantManager] Dead members: 0
2005-07-27 10:49:57,070 INFO [org.jboss.ha.jndi.HANamingService] Listening on /0.0.0.0:1100
2005-07-27 10:49:57,074 INFO [org.jboss.ha.jndi.DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=230.0.0.4, HA-JNDI address=192.168.253.47:1100
2005-07-27 10:49:57,432 INFO [org.apache.catalina.startup.Embedded] Catalina naming disabled
on bs-laptop:
2005-07-27 10:49:46,656 INFO [org.jgroups.conf.ConfiguratorFactory] properties are neither a URL nor a file
2005-07-27 10:49:46,887 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Initializing
2005-07-27 10:49:47,039 INFO [org.jgroups.protocols.UDP] unicast sockets will use interface 192.168.253.46
2005-07-27 10:49:47,043 INFO [org.jgroups.protocols.UDP] socket information:
local_addr=bs:32787 (additional data: 19 bytes), mcast_addr=228.1.2.3:45566, bind_addr=/192.168.253.46, ttl=32
socket: bound to 192.168.253.46:32787, receive buffer size=131071, send buffer size=131071
multicast socket: bound to 192.168.253.46:45566, send buffer size=131071, receive buffer size=131071
2005-07-27 10:49:47,046 INFO [STDOUT]
-------------------------------------------------------
GMS: address is bs:32787 (additional data: 19 bytes)
-------------------------------------------------------
2005-07-27 10:49:49,076 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Number of cluster members: 1
2005-07-27 10:49:49,077 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Other members: 0
2005-07-27 10:49:49,077 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Fetching state (will wait for 60000 milliseconds):
2005-07-27 10:49:49,077 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view (id: 0, delta: 0) : [192.168.253.46:1099]
2005-07-27 10:49:49,083 INFO [DefaultPartition:ReplicantManager] Dead members: 0
2005-07-27 10:49:49,778 INFO [org.jboss.ha.jndi.HANamingService] Listening on /0.0.0.0:1100
2005-07-27 10:49:49,783 INFO [org.jboss.ha.jndi.DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=230.0.0.4, HA-JNDI address=192.168.253.46:1100
2005-07-27 10:49:50,080 INFO [org.apache.catalina.startup.Embedded] Catalina naming disabled
I know the multicast routes are correct, because ViewDemo works, and the
bind info in the logs seems right... why aren't the servers connecting?
Neither machine is dual-homed; one has eth0 and lo, the other has eth1
and lo. I start JBoss with "run.sh -b 0.0.0.0 -c all".
After startup, I see this in the logs quite a bit:
2005-07-27 10:54:03,058 WARN [org.jgroups.protocols.UDP] discarded message from different group (TreeCache-Cluster). Sender was bs:34898
2005-07-27 10:54:04,016 WARN [org.jgroups.protocols.UDP] discarded message from different group (TreeCache-Cluster). Sender was bs:34898
2005-07-27 10:54:04,031 WARN [org.jgroups.protocols.UDP] discarded message from different group (DefaultPartition). Sender was bs:34896 (additional data: 19 bytes)
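Since these warnings repeat a lot, it may help to summarize which group names and senders they mention. A minimal sketch, assuming the message format is exactly as in the lines above; the class name DiscardTally and its tally method are hypothetical, not part of JGroups or JBoss:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DiscardTally {
    // Matches the UDP warning text quoted above and captures
    // the group name (group 1) and the sender address (group 2).
    private static final Pattern DISCARD = Pattern.compile(
        "discarded message from different group \\(([^)]+)\\)\\. Sender was (\\S+)");

    // Count warnings per "group from sender" pair, sorted by key.
    public static Map<String, Integer> tally(String log) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = DISCARD.matcher(log);
        while (m.find()) {
            counts.merge(m.group(1) + " from " + m.group(2), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The three warnings quoted above.
        String log =
              "2005-07-27 10:54:03,058 WARN [org.jgroups.protocols.UDP] discarded message from different group (TreeCache-Cluster). Sender was bs:34898\n"
            + "2005-07-27 10:54:04,016 WARN [org.jgroups.protocols.UDP] discarded message from different group (TreeCache-Cluster). Sender was bs:34898\n"
            + "2005-07-27 10:54:04,031 WARN [org.jgroups.protocols.UDP] discarded message from different group (DefaultPartition). Sender was bs:34896 (additional data: 19 bytes)\n";
        for (Map.Entry<String, Integer> e : tally(log).entrySet()) {
            System.out.println(e.getValue() + " x " + e.getKey());
        }
    }
}
```

On the three lines above this counts two warnings for TreeCache-Cluster from bs:34898 and one for DefaultPartition from bs:34896.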
The cluster-service.xml file is unchanged from the distribution; it reads:
<!-- The JGroups protocol configuration -->
<attribute name="PartitionConfig">
  <Config>
    <!-- UDP: if you have a multihomed machine, set the bind_addr attribute to the appropriate NIC IP address -->
    <!-- UDP: On Windows machines, because of the media sense feature being broken with multicast (even after disabling media sense) set the loopback attribute to true -->
    <UDP mcast_addr="228.1.2.3" mcast_port="45566" ip_ttl="32" ip_mcast="true" mcast_send_buf_size="800000" mcast_recv_buf_size="150000" ucast_send_buf_size="800000" ucast_recv_buf_size="150000" loopback="false" />
    <PING timeout="2000" num_initial_members="3" up_thread="true" down_thread="true" />
    <MERGE2 min_interval="10000" max_interval="20000" />
    <FD shun="true" up_thread="true" down_thread="true" timeout="2500" max_tries="5" />
    <VERIFY_SUSPECT timeout="3000" num_msgs="3" up_thread="true" down_thread="true" />
    <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800" max_xmit_size="8192" up_thread="true" down_thread="true" />
    <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10" down_thread="true" />
    <pbcast.STABLE desired_avg_gossip="20000" up_thread="true" down_thread="true" />
    <FRAG frag_size="8192" down_thread="true" up_thread="true" />
    <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true" />
    <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />
  </Config>
</attribute>