4 Replies Latest reply on Aug 1, 2005 6:50 PM by spambob

    clustering problems - nodes fail to cluster

    spambob

      I have two machines, bs-laptop and bs-desktop, that I am trying to
      cluster using JBoss 3.2.5. Both machines have "bs.playsecond.com" as
      their virtual host.

      The /etc/hosts files read as follows. On bs-desktop:

      192.168.253.47 bs.playsecond.com bs-desktop.playsecond.com bs-desktop
      192.168.253.46 bs-laptop.playsecond.com bs-laptop

      On bs-laptop:

      192.168.253.47 bs-desktop.playsecond.com bs-desktop
      192.168.253.46 bs.playsecond.com bs-laptop.playsecond.com bs-laptop

      netstat -nr on both machines:

      bs-desktop% netstat -nr
      Kernel IP routing table
      Destination Gateway Genmask Flags MSS Window irtt Iface
      192.168.253.0 0.0.0.0 255.255.255.0 U 0 0 0 eth1
      127.0.0.0 127.0.0.1 255.0.0.0 UG 0 0 0 lo
      224.0.0.0 0.0.0.0 240.0.0.0 U 0 0 0 eth1
      0.0.0.0 192.168.253.1 0.0.0.0 UG 0 0 0 eth1

      bs-laptop% netstat -nr
      Kernel IP routing table
      Destination Gateway Genmask Flags MSS Window irtt Iface
      192.168.253.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
      127.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 lo
      224.0.0.0 0.0.0.0 240.0.0.0 U 0 0 0 eth0
      0.0.0.0 192.168.253.1 0.0.0.0 UG 0 0 0 eth0

      When I run the ViewDemo application, both machines connect and I get the following results:

      bs-desktop% java -cp ".:../server/all/lib/jgroups.jar:../server/all/lib/commons-logging.jar" org.jgroups.demos.ViewDemo
      -------------------------------------------------------
      GMS: address is bs:34891
      -------------------------------------------------------
      ** New view: [bs:34891|0] [bs:34891]
      ** New view: [bs:34891|1] [bs:34891, bs-laptop:32786]

      bs-laptop% java -cp ".:../server/all/lib/jgroups.jar:../server/all/lib/commons-logging.jar" org.jgroups.demos.ViewDemo
      -------------------------------------------------------
      GMS: address is bs:32786
      -------------------------------------------------------
      ** New view: [bs-desktop:34891|1] [bs-desktop:34891, bs:32786]

      So far, so good. However, when I start jboss, the machines do not find each other. The logs read:

      on bs-desktop:

      2005-07-27 10:49:52,315 INFO [org.jgroups.conf.ConfiguratorFactory] properties are neither a URL nor a file
      2005-07-27 10:49:52,588 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Initializing
      2005-07-27 10:49:52,718 INFO [org.jgroups.protocols.UDP] unicast sockets will use interface 192.168.253.47
      2005-07-27 10:49:52,722 INFO [org.jgroups.protocols.UDP] socket information:
      local_addr=bs:34892 (additional data: 19 bytes), mcast_addr=228.1.2.3:45566, bind_addr=/192.168.253.47, ttl=32
      socket: bound to 192.168.253.47:34892, receive buffer size=131071, send buffer size=131071
      multicast socket: bound to 192.168.253.47:45566, send buffer size=131071, receive buffer size=131071
      2005-07-27 10:49:52,724 INFO [STDOUT]
      -------------------------------------------------------
      GMS: address is bs:34892 (additional data: 19 bytes)
      -------------------------------------------------------
      2005-07-27 10:49:54,771 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Number of cluster members: 1
      2005-07-27 10:49:54,771 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Other members: 0
      2005-07-27 10:49:54,771 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Fetching state (will wait for 60000 milliseconds):
      2005-07-27 10:49:54,773 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view (id: 0, delta: 0) : [192.168.253.47:1099]
      2005-07-27 10:49:54,774 INFO [DefaultPartition:ReplicantManager] Dead members: 0
      2005-07-27 10:49:57,070 INFO [org.jboss.ha.jndi.HANamingService] Listening on /0.0.0.0:1100
      2005-07-27 10:49:57,074 INFO [org.jboss.ha.jndi.DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=230.0.0.4, HA-JNDI address=192.168.253.47:1100
      2005-07-27 10:49:57,432 INFO [org.apache.catalina.startup.Embedded] Catalina naming disabled

      on bs-laptop:

      2005-07-27 10:49:46,656 INFO [org.jgroups.conf.ConfiguratorFactory] properties are neither a URL nor a file
      2005-07-27 10:49:46,887 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Initializing
      2005-07-27 10:49:47,039 INFO [org.jgroups.protocols.UDP] unicast sockets will use interface 192.168.253.46
      2005-07-27 10:49:47,043 INFO [org.jgroups.protocols.UDP] socket information:
      local_addr=bs:32787 (additional data: 19 bytes), mcast_addr=228.1.2.3:45566, bind_addr=/192.168.253.46, ttl=32
      socket: bound to 192.168.253.46:32787, receive buffer size=131071, send buffer size=131071
      multicast socket: bound to 192.168.253.46:45566, send buffer size=131071, receive buffer size=131071
      2005-07-27 10:49:47,046 INFO [STDOUT]
      -------------------------------------------------------
      GMS: address is bs:32787 (additional data: 19 bytes)
      -------------------------------------------------------
      2005-07-27 10:49:49,076 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Number of cluster members: 1
      2005-07-27 10:49:49,077 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Other members: 0
      2005-07-27 10:49:49,077 INFO [org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition] Fetching state (will wait for 60000 milliseconds):
      2005-07-27 10:49:49,077 INFO [org.jboss.ha.framework.interfaces.HAPartition.lifecycle.DefaultPartition] New cluster view (id: 0, delta: 0) : [192.168.253.46:1099]
      2005-07-27 10:49:49,083 INFO [DefaultPartition:ReplicantManager] Dead members: 0
      2005-07-27 10:49:49,778 INFO [org.jboss.ha.jndi.HANamingService] Listening on /0.0.0.0:1100
      2005-07-27 10:49:49,783 INFO [org.jboss.ha.jndi.DetachedHANamingService$AutomaticDiscovery] Listening on /0.0.0.0:1102, group=230.0.0.4, HA-JNDI address=192.168.253.46:1100
      2005-07-27 10:49:50,080 INFO [org.apache.catalina.startup.Embedded] Catalina naming disabled

      I know the multicast routes are correct, because ViewDemo works, and the
      bind info in the logs seems right... why aren't the servers connecting?
      Neither machine is dual-homed; one has eth0 and lo, the other has eth1
      and lo. I start jboss with "run.sh -b 0.0.0.0 -c all".

      After startup, I see this in the logs quite a bit:

      2005-07-27 10:54:03,058 WARN [org.jgroups.protocols.UDP] discarded message from different group (TreeCache-Cluster). Sender was bs:34898
      2005-07-27 10:54:04,016 WARN [org.jgroups.protocols.UDP] discarded message from different group (TreeCache-Cluster). Sender was bs:34898
      2005-07-27 10:54:04,031 WARN [org.jgroups.protocols.UDP] discarded message from different group (DefaultPartition). Sender was bs:34896 (additional data: 19 bytes)
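      When these warnings repeat, it can help to see which group names and
      senders they involve. The following is a minimal Java sketch, not part
      of JBoss or JGroups, that tallies the WARN lines shown above per group
      and sender. The class name DiscardedGroupTally and the regular
      expression are assumptions based only on the log format in this post.
      Since both hosts report their address as "bs", the sender field alone
      does not say which machine a message came from.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts "discarded message from different group" warnings.
public class DiscardedGroupTally {
    // Group 1 = group name in parentheses, group 2 = sender address (host:port).
    // The pattern is inferred from the three log lines quoted in this post.
    private static final Pattern DISCARDED = Pattern.compile(
        "discarded message from different group \\(([^)]+)\\)\\. Sender was (\\S+)");

    /** Returns a count per "group <- sender" key, in order of first appearance. */
    public static Map<String, Integer> tally(Iterable<String> logLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = DISCARDED.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1) + " <- " + m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input: the three warnings quoted above.
        List<String> lines = Arrays.asList(
            "2005-07-27 10:54:03,058 WARN [org.jgroups.protocols.UDP] discarded message from different group (TreeCache-Cluster). Sender was bs:34898",
            "2005-07-27 10:54:04,016 WARN [org.jgroups.protocols.UDP] discarded message from different group (TreeCache-Cluster). Sender was bs:34898",
            "2005-07-27 10:54:04,031 WARN [org.jgroups.protocols.UDP] discarded message from different group (DefaultPartition). Sender was bs:34896 (additional data: 19 bytes)");
        tally(lines).forEach((key, count) -> System.out.println(count + "  " + key));
    }
}
```

      For the three lines above, this counts two warnings for
      TreeCache-Cluster from bs:34898 and one for DefaultPartition from
      bs:34896.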

      The cluster-service.xml file is unchanged from the distribution; it reads:

      
       <!-- The JGroups protocol configuration -->
       <attribute name="PartitionConfig">
       <Config>
       <!-- UDP: if you have a multihomed machine,
       set the bind_addr attribute to the appropriate NIC IP address -->
       <!-- UDP: On Windows machines, because of the media sense feature
       being broken with multicast (even after disabling media sense)
       set the loopback attribute to true -->
       <UDP mcast_addr="228.1.2.3" mcast_port="45566"
       ip_ttl="32" ip_mcast="true"
       mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
       ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
       loopback="false" />
       <PING timeout="2000" num_initial_members="3"
       up_thread="true" down_thread="true" />
       <MERGE2 min_interval="10000" max_interval="20000" />
       <FD shun="true" up_thread="true" down_thread="true"
       timeout="2500" max_tries="5" />
       <VERIFY_SUSPECT timeout="3000" num_msgs="3"
       up_thread="true" down_thread="true" />
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
       max_xmit_size="8192"
       up_thread="true" down_thread="true" />
       <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
       down_thread="true" />
       <pbcast.STABLE desired_avg_gossip="20000"
       up_thread="true" down_thread="true" />
       <FRAG frag_size="8192"
       down_thread="true" up_thread="true" />
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
       shun="true" print_local_addr="true" />
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />
       </Config>
       </attribute>
      
      
      
      






        • 1. Re: clustering problems - nodes fail to cluster
          jiwils

          Are you starting JBoss on both boxes at close to the same time? Previous versions of JGroups could sometimes run into problems if this is the case. Try shutting down both JBoss instances, then start one, wait until it starts completely, then start the other. See if this makes any difference.

          Depending on which JGroups version you used to run ViewDemo, that version may already contain the fix for this issue.

          • 2. Re: clustering problems - nodes fail to cluster
            spambob

            I've tried starting jboss separately, and that doesn't help.

            I did just discover the following: the org.jgroups.tests.McastReceiverTest fails, even though the ViewDemo app succeeds:

            bs-desktop% java -cp ".:../server/all/lib/jgroups.jar:../server/all/lib/commons-logging.jar" org.jgroups.tests.McastReceiverTest -mcast_addr 224.1.2.3 -port 5555 -bind_addr bs-desktop
            Socket=0.0.0.0/0.0.0.0:5555, bind interface=/192.168.253.47

            bs-laptop% java -cp ".:../server/all/lib/jgroups.jar:../server/all/lib/commons-logging.jar" org.jgroups.tests.McastSenderTest -mcast_addr 224.1.2.3 -port 5555 -ttl 32 -bind_addr bs-laptop
            Socket=0.0.0.0/0.0.0.0:5555, ttl=32, bind interface=/192.168.253.46
            > asdasd
            > sadasdlk


            However, ViewDemo still connects!

            bs-desktop% java -cp ".:../server/all/lib/jgroups.jar:../server/all/lib/commons-logging.jar" org.jgroups.demos.ViewDemo

            -------------------------------------------------------
            GMS: address is bs:34936
            -------------------------------------------------------
            ** New view: [bs:34936|0] [bs:34936]
            ** New view: [bs:34936|1] [bs:34936, bs-laptop:32807]

            bs-laptop% java -cp ".:../server/all/lib/jgroups.jar:../server/all/lib/commons-logging.jar" org.jgroups.demos.ViewDemo

            -------------------------------------------------------
            GMS: address is bs:32807
            -------------------------------------------------------
            ** New view: [bs-desktop:34936|1] [bs-desktop:34936, bs:32807]



            Could this be a kernel or interface configuration issue? The interfaces read:

            bs-desktop% ifconfig
            eth1 Link encap:Ethernet HWaddr 00:11:D8:43:FB:F6
            inet addr:192.168.253.47 Bcast:192.168.253.255 Mask:255.255.255.0
            UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
            RX packets:817279 errors:0 dropped:0 overruns:0 frame:0
            TX packets:590413 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:702380963 (669.8 Mb) TX bytes:90165930 (85.9 Mb)
            Interrupt:17 Memory:fba00000-0

            bs-laptop% ifconfig
            eth0 Link encap:Ethernet HWaddr 00:0F:1F:16:26:6E
            inet addr:192.168.253.46 Bcast:192.168.253.255 Mask:255.255.255.0
            UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
            RX packets:504162 errors:0 dropped:0 overruns:0 frame:0
            TX packets:192885 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:442423866 (421.9 Mb) TX bytes:20610308 (19.6 Mb)
            Interrupt:11

            Both machines are running gentoo, with 2.6.11 kernels.
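            The ifconfig output shows the MULTICAST flag on both interfaces.
            To see the same information from the JVM's side, a small sketch
            like the one below can be run on each machine. It uses only the
            standard java.net.NetworkInterface API; the class name
            MulticastInterfaces is made up for this example, and the
            isUp/supportsMulticast calls need Java 6 or later, so it may not
            run on the JVM used with JBoss 3.2.5.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper: lists interfaces that are up, as seen by the JVM.
public class MulticastInterfaces {
    /** One line per interface that is up: name, multicast flag, loopback flag, addresses. */
    public static List<String> describe() {
        List<String> out = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nifs = NetworkInterface.getNetworkInterfaces();
            if (nifs == null) {
                return out; // no interfaces visible to the JVM
            }
            for (NetworkInterface nif : Collections.list(nifs)) {
                if (!nif.isUp()) {
                    continue;
                }
                StringBuilder sb = new StringBuilder(nif.getName())
                    .append(" multicast=").append(nif.supportsMulticast())
                    .append(" loopback=").append(nif.isLoopback());
                for (InetAddress addr : Collections.list(nif.getInetAddresses())) {
                    sb.append(' ').append(addr.getHostAddress());
                }
                out.add(sb.toString());
            }
        } catch (SocketException e) {
            // SocketException is an IOException; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```

            A multicast=true line only says the interface is capable of
            multicast. It says nothing about whether the network between the
            two hosts forwards the traffic.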




            • 3. Re: clustering problems - nodes fail to cluster
              belaban

              // I start jboss with "run.sh -b 0.0.0.0 -c all".

              Lose the -b 0.0.0.0 option

              • 4. Re: clustering problems - nodes fail to cluster
                spambob

                Turns out the problem was a switch issue. What made this especially hard to figure out is that org.jgroups.demos.ViewDemo shows the machines clustering correctly. Fortunately, org.jgroups.tests.McastReceiverTest and org.jgroups.tests.McastSenderTest failed, which led me to investigate the switch settings.

                I can now get the machines to cluster successfully in jboss 3.2.5, but they fail to cluster in 3.2.7.
                We'll launch with 3.2.5 for now. Perhaps removing "-b 0.0.0.0" will fix it for 3.2.7?
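                For anyone who wants to repeat the raw multicast check without
                the JGroups test classes, here is a minimal sketch using the
                standard java.net.MulticastSocket API. It is not the code of
                McastSenderTest or McastReceiverTest; the class name
                McastProbe, its send/recv command line, and the 10 second
                timeout are assumptions made for illustration. The group and
                port can be the ones from the test above (224.1.2.3:5555) or
                the ones from cluster-service.xml (228.1.2.3:45566).

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

// Hypothetical probe: sends or receives a single multicast datagram.
public class McastProbe {
    /** True if the literal IP address is a multicast address (224.0.0.0/4 for IPv4). */
    public static boolean isMulticast(String literalAddr) {
        try {
            return InetAddress.getByName(literalAddr).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /** Sends one datagram containing text to group:port with the given TTL. */
    public static void send(String group, int port, int ttl, String text) throws IOException {
        try (MulticastSocket sock = new MulticastSocket()) {
            sock.setTimeToLive(ttl);
            byte[] buf = text.getBytes(StandardCharsets.UTF_8);
            sock.send(new DatagramPacket(buf, buf.length, InetAddress.getByName(group), port));
        }
    }

    /** Joins group:port and waits up to timeoutMs for one datagram; returns null on timeout. */
    public static String receiveOne(String group, int port, int timeoutMs) throws IOException {
        try (MulticastSocket sock = new MulticastSocket(port)) {
            InetSocketAddress groupAddr = new InetSocketAddress(InetAddress.getByName(group), port);
            sock.joinGroup(groupAddr, null); // null = let the system pick the interface
            sock.setSoTimeout(timeoutMs);
            DatagramPacket packet = new DatagramPacket(new byte[1500], 1500);
            try {
                sock.receive(packet);
                return new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8);
            } catch (SocketTimeoutException e) {
                return null;
            } finally {
                sock.leaveGroup(groupAddr, null);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed usage: recv <group> <port>   or   send <group> <port> <text>
        if (args.length >= 3 && args[0].equals("recv")) {
            System.out.println(receiveOne(args[1], Integer.parseInt(args[2]), 10000));
        } else if (args.length >= 4 && args[0].equals("send")) {
            send(args[1], Integer.parseInt(args[2]), 32, args[3]);
        } else {
            System.out.println("228.1.2.3 multicast? " + isMulticast("228.1.2.3"));
        }
    }
}
```

                Run the recv side on one machine and the send side on the
                other within the timeout. If the receiver prints null while
                both hosts can otherwise reach each other, something between
                them, such as the switch in my case, is probably not passing
                multicast.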