4 Replies Latest reply on Apr 15, 2011 12:55 PM by wdfink

    Clustering: Master-Slave Identification Problem

    nprasanna

      Hi,

           I have JBoss EAP 5.1 installed on two Linux virtual machines, and I wanted to test clustering.

           Without modifying any configuration I started the servers with the following commands:

       

      First Machine:

      sh run.sh -c Production -g MyCluster -u 239.255.1.7 -b 192.168.41.132 -Djboss.messaging.ServerPeerID=1

          

      When the first one was up and running I started the second one:

      sh run.sh -c Production -g MyCluster -u 239.255.1.7 -b 192.168.41.134 -Djboss.messaging.ServerPeerID=2

       

      The Problem:

      Once the second server is also up and running, the first server's server.log shows the following warning over and over:

       

      2011-04-13 09:10:32,087 WARN  [org.jgroups.protocols.pbcast.NAKACK] (OOB-19,192.168.41.132:55200) 192.168.41.132:55200] discarded message from non-member 192.168.41.134:55200, my view is [192.168.41.132:55200|0] [192.168.41.132:55200]

       

      Judging by its log, the second server has apparently started its own cluster: it identifies itself as the first and only member. I think this means both servers are behaving as masters and are not recognizing each other.
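
To make the warning easier to read, here is a minimal sketch that splits it into its parts. It assumes the layout seen in the log line quoted above, "my view is [creator|id] [members]"; the class name NakackWarning and its fields are hypothetical and not part of JGroups. A view with a single member suggests the node considers itself alone in the cluster.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a JGroups class): extracts fields from a NAKACK "non-member" warning. */
final class NakackWarning {
    // Assumed layout, taken from the log line quoted in this thread:
    // "... discarded message from non-member <sender>, my view is [<creator>|<id>] [<members>]"
    private static final Pattern LINE = Pattern.compile(
        "discarded message from non-member ([^,\\s]+), my view is "
        + "\\[([^|\\]]+)\\|(\\d+)\\] \\[([^\\]]*)\\]");

    final String sender;      // node whose message was dropped
    final String coordinator; // node that created the current view
    final long viewId;        // view counter as printed in the log
    final String[] members;   // members of the current view

    private NakackWarning(String sender, String coordinator, long viewId, String[] members) {
        this.sender = sender;
        this.coordinator = coordinator;
        this.viewId = viewId;
        this.members = members;
    }

    /** Returns the parsed warning, or null if the line does not match the assumed layout. */
    static NakackWarning parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        String list = m.group(4).trim();
        String[] members = list.isEmpty() ? new String[0] : list.split("\\s*,\\s*");
        return new NakackWarning(m.group(1), m.group(2), Long.parseLong(m.group(3)), members);
    }

    /** True when the logging node sees only one member in its view. */
    boolean isSingleton() {
        return members.length == 1;
    }
}
```

Fed the warning from the first server's log, this reports 192.168.41.134:55200 as the sender and a one-member view created by 192.168.41.132:55200.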

       

      The Weird Part:

           If I start the second server first and then the first server, the situation is surprisingly not reversed! The repeated warnings mentioned above still appear only in the first server's log.

       

      The various multicast addresses that I tested with were:

      224.0.0.0 through 224.0.0.255 and also 239.255.x.y
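
The values listed above are multicast group addresses, and the ranges differ in scope: 224.0.0.0/24 is reserved for local network control traffic, while 239.0.0.0/8 is administratively scoped. As a rough aid, the sketch below classifies an address using the scope checks built into java.net.InetAddress. The class name McastScope and the wording of the labels are made up for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper: labels an IPv4 literal with the multicast scope Java reports for it. */
final class McastScope {
    static String describe(String literal) {
        InetAddress a;
        try {
            // A literal IP address is parsed directly; no DNS lookup is involved.
            a = InetAddress.getByName(literal);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("not an IP address: " + literal, e);
        }
        if (!a.isMulticastAddress()) {
            return "not multicast";
        }
        if (a.isMCLinkLocal()) {
            return "link-local (224.0.0.0/24, reserved for local network control)";
        }
        if (a.isMCSiteLocal()) {
            return "site-local (239.255.0.0/16)";
        }
        if (a.isMCOrgLocal()) {
            return "organization-local";
        }
        if (a.isMCGlobal()) {
            return "global";
        }
        return "other multicast";
    }

    public static void main(String[] args) {
        // Addresses taken from this thread.
        for (String s : new String[] {"224.0.0.1", "239.255.1.7", "224.10.10.10"}) {
            System.out.println(s + " -> " + describe(s));
        }
    }
}
```

The scope only says how far a packet is meant to travel. It does not show whether the virtual network between the two machines actually delivers it.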

       

      Kindly help and clarify.

        • 1. Clustering: Master-Slave Identification Problem
          nprasanna

          Hi,

           

          This is an urgent problem, so kindly shed some light on it.

           

          Thanks.

          • 2. Clustering: Master-Slave Identification Problem
            wdfink

            Looks like a multicast problem.

            I've answered several threads about this here (a search might help); see http://community.jboss.org/thread/165144

             

            But you should read the wiki and test the multicast functionality, see

            http://community.jboss.org/wiki/TestingJBoss

            • 3. Clustering: Master-Slave Identification Problem
              nprasanna

              Hi Wolf-Dieter Fink,

                  Thanks for the reply. I tried the JGroups test that you suggested (http://community.jboss.org/wiki/TestingJBoss) and ran it on the two Linux VMs. The result was the same as I described in my initial post:

               

              Machine A was started first. Once its JGroups test command had run and established a cluster successfully, I started Machine B. Machine A noticed Machine B, but not as a cluster member. It threw repeated warning messages similar to the one I mentioned in my first post:

               

              org.jgroups.protocols.pbcast.NAKACK handleMessage WARNING: Machine A's ip:32770] discarded message from non-member Machine B's ip:32770, my view is [Machine A's ip:32770|0] [Machine A's ip:32770]

               

              - Machine A's ip: 192.168.41.132
              - Machine B's ip: 192.168.41.134

               

              NOW, if I start Machine B first and then Machine A, the output is not as expected. The same warning messages appear in Machine A, not in Machine B. Machine B just says it has started a cluster of its own.

               

              There was another, chat-style test mentioned on this page:  http://www.jgroups.org/manual/html/ch02.html

              I ran this test too:

              Machine A was the receiver: The command was

              /usr/lib/jvm/java/bin/java -cp lib/concurrent.jar:server/testprofile/lib/jgroups.jar:common/lib/commons-logging.jar org.jgroups.tests.McastReceiverTest -mcast_addr 224.10.10.10 -port 5555 -bind_addr 192.168.41.132

               

              Machine B was the sender: The command:

              /usr/lib/jvm/java/bin/java -cp lib/concurrent.jar:server/testprofile/lib/jgroups.jar:common/lib/commons-logging.jar org.jgroups.tests.McastSenderTest -mcast_addr 224.10.10.10 -port 5555 -bind_addr 192.168.41.134

               

              Whatever I sent from B was well received by A. But when I made A the sender, B didn't receive anything at all!! Talk about well-receiving!  It's the same one-way problem as before.
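
For a cross-check that does not depend on the JGroups jars, the same one-way test can be sketched with plain java.net.MulticastSocket. This is only an illustration under assumptions: the class name McastProbe, the argument order, and the TTL of 1 (both machines sit on the same 192.168.41.x subnet) are choices made here and are not taken from the JGroups test classes.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.NetworkInterface;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

/** Hypothetical probe. Usage: java McastProbe send|recv <group> <port> <bindAddr> */
final class McastProbe {
    /** Builds a UDP packet carrying the text, addressed to the given group and port. */
    static DatagramPacket packetFor(String text, String group, int port) {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        try {
            return new DatagramPacket(data, data.length, InetAddress.getByName(group), port);
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("bad group address: " + group, e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: McastProbe send|recv <group> <port> <bindAddr>");
            return;
        }
        String group = args[1];
        int port = Integer.parseInt(args[2]);
        // Resolve the interface that owns the bind address, e.g. eth0 for 192.168.41.132.
        NetworkInterface nic = NetworkInterface.getByInetAddress(InetAddress.getByName(args[3]));
        if (nic == null) {
            System.err.println("no interface has address " + args[3]);
            return;
        }
        if (args[0].equals("recv")) {
            try (MulticastSocket sock = new MulticastSocket(port)) {
                // Join the group explicitly on the chosen interface.
                sock.joinGroup(new InetSocketAddress(InetAddress.getByName(group), port), nic);
                byte[] buf = new byte[1500];
                while (true) { // runs until the process is interrupted
                    DatagramPacket p = new DatagramPacket(buf, buf.length);
                    sock.receive(p);
                    System.out.println(p.getAddress().getHostAddress() + ": "
                        + new String(p.getData(), 0, p.getLength(), StandardCharsets.UTF_8));
                }
            }
        } else {
            try (MulticastSocket sock = new MulticastSocket()) {
                sock.setNetworkInterface(nic); // send out of the chosen interface
                sock.setTimeToLive(1);         // assumption: one hop is enough on a shared subnet
                sock.send(packetFor("hello from " + args[3], group, port));
            }
        }
    }
}
```

Running recv on one machine and send on the other, then swapping the roles, mirrors the A-to-B and B-to-A test described above. If only one direction delivers here as well, that points away from JGroups and toward the network between the two VMs.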

               

              I also tried the tests with -Djava.net.preferIPv4Stack=true and with -ttl=32 (in the sender). Still no luck.

               

              I checked the wiki (http://community.jboss.org/wiki/JGroups) and the FAQ (http://community.jboss.org/docs/DOC-9730) pages, but to no avail.

               

              I guess the problem lies in the underlying network setup, so I'm giving the ifconfig output of both machines here, hoping it might help you with the diagnosis.

               

              Machine A:

              eth0      Link encap:Ethernet  HWaddr 00:0C:29:64:91:1E
                        inet addr:192.168.41.132  Bcast:192.168.41.255  Mask:255.255.255.0
                        inet6 addr: fe80::20c:29ff:fe64:911e/64 Scope:Link
                        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                        RX packets:2079 errors:0 dropped:0 overruns:0 frame:0
                        TX packets:1784 errors:0 dropped:0 overruns:0 carrier:0
                        collisions:0 txqueuelen:1000
                        RX bytes:291105 (284.2 KiB)  TX bytes:232352 (226.9 KiB)
                        Interrupt:67 Base address:0x2024

              lo        Link encap:Local Loopback
                        inet addr:127.0.0.1  Mask:255.0.0.0
                        inet6 addr: ::1/128 Scope:Host
                        UP LOOPBACK RUNNING  MTU:16436  Metric:1
                        RX packets:260 errors:0 dropped:0 overruns:0 frame:0
                        TX packets:260 errors:0 dropped:0 overruns:0 carrier:0
                        collisions:0 txqueuelen:0
                        RX bytes:24688 (24.1 KiB)  TX bytes:24688 (24.1 KiB)

               

              Machine B:  

              eth0      Link encap:Ethernet  HWaddr 00:0C:29:0F:98:33
                        inet addr:192.168.41.134  Bcast:192.168.41.255  Mask:255.255.255.0
                        inet6 addr: fe80::20c:29ff:fe0f:9833/64 Scope:Link
                        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
                        RX packets:2058 errors:0 dropped:0 overruns:0 frame:0
                        TX packets:699 errors:0 dropped:0 overruns:0 carrier:0
                        collisions:0 txqueuelen:1000
                        RX bytes:280247 (273.6 KiB)  TX bytes:95707 (93.4 KiB)
                        Interrupt:67 Base address:0x2024

              lo        Link encap:Local Loopback
                        inet addr:127.0.0.1  Mask:255.0.0.0
                        inet6 addr: ::1/128 Scope:Host
                        UP LOOPBACK RUNNING  MTU:16436  Metric:1
                        RX packets:45 errors:0 dropped:0 overruns:0 frame:0
                        TX packets:45 errors:0 dropped:0 overruns:0 carrier:0
                        collisions:0 txqueuelen:0
                        RX bytes:2648 (2.5 KiB)  TX bytes:2648 (2.5 KiB)
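
The MULTICAST flag in the ifconfig output above can also be checked from inside the JVM. The following is a small sketch using java.net.NetworkInterface; the class name McastInterfaces is hypothetical. It only reports what the operating system claims about each interface and cannot prove that the virtual switch between the VMs forwards multicast traffic.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical helper: lists interfaces that are up and flagged as multicast-capable. */
final class McastInterfaces {
    static List<String> multicastCapable() {
        List<String> names = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
            if (all == null) {
                return names; // older JVMs may return null when no interface is found
            }
            for (NetworkInterface nic : Collections.list(all)) {
                if (nic.isUp() && nic.supportsMulticast()) {
                    names.add(nic.getName());
                }
            }
        } catch (SocketException e) {
            // Design choice for this sketch: report the failure and return what was collected so far.
            System.err.println("could not query interfaces: " + e.getMessage());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("multicast-capable interfaces: " + multicastCapable());
    }
}
```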

               

              Thanks in advance.

              • 4. Clustering: Master-Slave Identification Problem
                wdfink

                If you start the JGroups test twice on one system, does it work?

                 

                I suppose you are right and the network is the problem. But I'm not that familiar with the mcast configuration.

                What you might try is to use -b 0.0.0.0, or to change the mcast_addr to a different one, from 224.* to 239.*, I think.