13 Replies Latest reply on Nov 23, 2005 4:30 PM by smarlow

    Nodes having difficulty discovering each other

    teknokrat

      This is my setup: I am building a 4-node cluster. I have tried JBoss 4.0.3 and 4.0.3SP1. The machines are Windows 2003 servers and the JVM is Java 5. Everything has been updated. I am using the default cluster-service.xml. All the machines are on the same network segment, and no firewalls are in place. The machines can all see (ping) one another.

      The problem I am having is that the nodes are not discovering each other. Sometimes three of the nodes form a group, but that takes a while, and the fourth node is never detected. The network settings of the machines are identical. I have tried using ViewDemo, with the same results. I have set up multicast receiver/sender pairs, and reception is sporadic. I have run the Draw demo and get no communication between the instances.

      To add insult to injury, I have set up a similar cluster for another company which worked 'out of the box'. I am not a network engineer, so I have no idea what I can troubleshoot to fix this problem. Likewise, the network engineer knows nothing about JBoss clustering, so he doesn't know what settings to check.

      Can anyone offer some advice on what I can do to troubleshoot this issue? I am logging all of JGroups at trace level, but I am not seeing anything that looks like a problem. What network settings should I get the network engineer to examine that could affect the multicast traffic between the machines?
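
      For reference, the multicast receiver/sender test I mention above is essentially the following standalone program, independent of JGroups entirely (plain java.net.MulticastSocket; the group and port mirror the cluster-service.xml defaults, so adjust them if yours differ). Run it with the argument send on one box and with no arguments on the others:

      import java.net.DatagramPacket;
      import java.net.InetAddress;
      import java.net.MulticastSocket;

      // Standalone multicast check, independent of JGroups. The group and
      // port mirror the cluster-service.xml defaults.
      public class McastCheck {
          static final String GROUP = "228.1.2.3";
          static final int PORT = 45566;

          public static void main(String[] args) throws Exception {
              InetAddress group = InetAddress.getByName(GROUP);
              MulticastSocket sock = new MulticastSocket(PORT);
              sock.joinGroup(group);
              if (args.length > 0 && args[0].equals("send")) {
                  // Sender: one datagram per second; every node should print it.
                  byte[] msg = "hello from sender".getBytes();
                  while (true) {
                      sock.send(new DatagramPacket(msg, msg.length, group, PORT));
                      Thread.sleep(1000);
                  }
              } else {
                  // Receiver: print whatever arrives on the group.
                  byte[] buf = new byte[1024];
                  while (true) {
                      DatagramPacket p = new DatagramPacket(buf, buf.length);
                      sock.receive(p);
                      System.out.println("from " + p.getAddress() + ": "
                              + new String(p.getData(), 0, p.getLength()));
                  }
              }
          }
      }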

      thank you

        • 1. Re: Nodes having difficulty discovering each other
          teknokrat

          Here is some sample log output:

          [org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2418 (additional data: 17 bytes), returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
          [org.jgroups.protocols.UDP] sending message to nbp02:2418 (additional data: 17 bytes) (src=nbp02:2418 (additional data: 17 bytes)), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]], UDP=[UDP:group_addr=ladbrokes]}
          [org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2418 (additional data: 17 bytes), src: nbp02:2418 (additional data: 17 bytes) (2 headers), size = 0 bytes]
          [org.jgroups.protocols.PING] received FIND_INITAL_MBRS_RSP, rsp=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 1 rsps
          [org.jgroups.protocols.PING] initial mbrs are [[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
          [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
          [org.jgroups.protocols.PING] FIND_INITIAL_MBRS
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 0 rsps
          [org.jgroups.protocols.UDP] sending message to 230.1.2.7:45577 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[UDP:group_addr=Tomcat-Cluster]}
          [org.jgroups.protocols.UDP] looped back local message [dst: 230.1.2.7:45577, src: nbp02:2421 (2 headers), size = 0 bytes]
          [org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2421, returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
          [org.jgroups.protocols.UDP] received (mcast) 102 bytes from /10.21.201.18:2422 (size=102 bytes)
          [org.jgroups.protocols.UDP] discarded own loopback multicast packet
          [org.jgroups.protocols.UDP] sending message to nbp02:2421 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]], UDP=[UDP:group_addr=Tomcat-Cluster]}
          [org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2421, src: nbp02:2421 (2 headers), size = 0 bytes]
          [org.jgroups.protocols.PING] received FIND_INITAL_MBRS_RSP, rsp=[own_addr=nbp02:2421, coord_addr=nbp02:2421]
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 1 rsps
          [org.jgroups.protocols.PING] initial mbrs are [[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
          [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
          [org.jgroups.protocols.PING] FIND_INITIAL_MBRS
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 0 rsps
          [org.jgroups.protocols.UDP] sending message to 228.1.2.3:45566 (src=nbp02:2418 (additional data: 17 bytes)), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[UDP:group_addr=ladbrokes]}
          [org.jgroups.protocols.UDP] looped back local message [dst: 228.1.2.3:45566, src: nbp02:2418 (additional data: 17 bytes) (2 headers), size = 0 bytes]
          [org.jgroups.protocols.UDP] received (mcast) 114 bytes from /10.21.201.18:2419 (size=114 bytes)
          [org.jgroups.protocols.UDP] discarded own loopback multicast packet
          [org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2418 (additional data: 17 bytes), returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
          [org.jgroups.protocols.UDP] sending message to nbp02:2418 (additional data: 17 bytes) (src=nbp02:2418 (additional data: 17 bytes)), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]], UDP=[UDP:group_addr=ladbrokes]}
          [org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2418 (additional data: 17 bytes), src: nbp02:2418 (additional data: 17 bytes) (2 headers), size = 0 bytes]
          [org.jgroups.protocols.PING] received FIND_INITAL_MBRS_RSP, rsp=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 1 rsps
          [org.jgroups.protocols.PING] FIND_INITIAL_MBRS
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 0 rsps
          [org.jgroups.protocols.UDP] sending message to 230.1.2.7:45577 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[UDP:group_addr=Tomcat-Cluster]}
          [org.jgroups.protocols.UDP] looped back local message [dst: 230.1.2.7:45577, src: nbp02:2421 (2 headers), size = 0 bytes]
          [org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2421, returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
          [org.jgroups.protocols.UDP] received (mcast) 102 bytes from /10.21.201.18:2422 (size=102 bytes)
          [org.jgroups.protocols.UDP] discarded own loopback multicast packet
          [org.jgroups.protocols.UDP] sending message to nbp02:2421 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]], UDP=[UDP:group_addr=Tomcat-Cluster]}
          [org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2421, src: nbp02:2421 (2 headers), size = 0 bytes]
          
          


          • 2. Re: Nodes having difficulty discovering each other
            belaban

            What's your config? Do you have an ip_ttl that is higher than 1?
            What's your JGroups version?

            • 3. Re: Nodes having difficulty discovering each other
              teknokrat

              I am using the default config. I have tried different settings, but to no avail. This is the config:

              <Config>
                <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45566"
                     ip_ttl="8" ip_mcast="true"
                     mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
                     ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
                     loopback="true"/>
                <PING timeout="2000" num_initial_members="3"
                      up_thread="true" down_thread="true"/>
                <MERGE2 min_interval="10000" max_interval="20000"/>
                <FD shun="true" up_thread="true" down_thread="true"
                    timeout="2500" max_tries="5"/>
                <VERIFY_SUSPECT timeout="3000" num_msgs="3"
                      up_thread="true" down_thread="true"/>
                <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
                      max_xmit_size="8192"
                      up_thread="true" down_thread="true"/>
                <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
                      down_thread="true"/>
                <pbcast.STABLE desired_avg_gossip="20000"
                      up_thread="true" down_thread="true"/>
                <FRAG frag_size="8192"
                      down_thread="true" up_thread="true"/>
                <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
                      shun="true" print_local_addr="true"/>
                <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
              </Config>
              


              • 4. Re: Nodes having difficulty discovering each other
                teknokrat

                JGroups version is whatever was shipped with 4.0.3SP1. Where can I find this information?

                • 5. Re: Nodes having difficulty discovering each other
                  belaban

                  JGroups version: http://wiki.jboss.org/wiki/Wiki.jsp?page=FAQ
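
                  If memory serves, the shipped jar can also report its own version through the org.jgroups.Version class; a rough sketch (printVersion() as the accessor is an assumption about the 2.x API, and running the class directly, java -cp jgroups.jar org.jgroups.Version, should print the same thing):

                  // Hedged check: JGroups 2.x keeps its release string in
                  // org.jgroups.Version. printVersion() as the accessor is an
                  // assumption; the class can also be run directly from the jar.
                  public class JGroupsVersionCheck {
                      public static void main(String[] args) {
                          System.out.println(org.jgroups.Version.printVersion());
                      }
                  }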

                  I suggest you simplify this and test whether JGroups works correctly across this cluster first: http://wiki.jboss.org/wiki/Wiki.jsp?page=TestingJBoss.

                  • 6. Re: Nodes having difficulty discovering each other
                    teknokrat

                    Yes, I've tried this.

                    1. I have set up the ViewDemo test on all machines. Eventually (after about half an hour) three of the nodes are recognised as a cluster group; the fourth node is never detected. (A sketch of what ViewDemo effectively does is at the end of this post.)

                    2. I have set up a multicast receiver/sender pair on the nodes and have sent messages. Not all messages are received; reception is sporadic, but it does occur on all nodes.

                    3. I have set up the Draw test on the nodes and cannot get it to work. Drawing on one node never replicates to another node.

                    I would like to reiterate that I doubt very much that this is a JGroups problem. I have set up the same configuration elsewhere and everything works OK. I have scanned the machines, and ports 45566 and 45577 show up as open.

                    Are there any network settings that can affect multicast packets? Can any other devices interfere with the discovery process? I know that there is a DB cluster on this network using Windows clustering.
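
                    For reference, what ViewDemo exercises boils down to roughly this (a sketch against the JGroups 2.x pull-mode API; the no-arg default-stack constructor and the receive() loop, with views delivered by default, are assumptions about that era's API):

                    import org.jgroups.JChannel;
                    import org.jgroups.View;

                    // Rough sketch of a ViewDemo-style membership watch
                    // (JGroups 2.x pull API; the default stack and default
                    // view delivery are assumptions).
                    public class ViewWatch {
                        public static void main(String[] args) throws Exception {
                            JChannel ch = new JChannel();   // default UDP stack
                            ch.connect("ViewDemoTest");     // group name must match on every node
                            System.out.println("joined, view: " + ch.getView());
                            while (true) {
                                Object evt = ch.receive(0); // block until the next event
                                if (evt instanceof View)    // print each membership change
                                    System.out.println("new view: " + evt);
                            }
                        }
                    }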

                    • 7. Re: Nodes having difficulty discovering each other
                      belaban

                      A few more suggestions:

                      - Firewall? Check iptables -L to see whether you have any rules (Linux).
                      - Do you bind to the right interfaces, using either bind_addr in the XML or the -Dbind.address sysprop? (See the interface-listing sketch below.)
                      - Use ethereal/snoop/tcpdump to see where your IP multicast packets are going.
                      - Do you use IP Bonding (Linux) or IP Multipathing (Solaris)?
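
                      For the bind question, a quick way to list every interface and address the JVM can see is the sketch below (plain JDK, nothing JGroups-specific; the class name is just for illustration):

                      import java.net.InetAddress;
                      import java.net.NetworkInterface;
                      import java.util.Enumeration;

                      // List each NIC and its addresses so a sensible value for
                      // bind_addr / -Dbind.address can be picked; pure JDK, runs on Java 5.
                      public class ListInterfaces {
                          public static void main(String[] args) throws Exception {
                              Enumeration<NetworkInterface> ifs =
                                      NetworkInterface.getNetworkInterfaces();
                              while (ifs.hasMoreElements()) {
                                  NetworkInterface ni = ifs.nextElement();
                                  System.out.println(ni.getName() + " (" + ni.getDisplayName() + "):");
                                  Enumeration<InetAddress> addrs = ni.getInetAddresses();
                                  while (addrs.hasMoreElements())
                                      System.out.println("  " + addrs.nextElement().getHostAddress());
                              }
                          }
                      }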

                      • 8. Re: Nodes having difficulty discovering each other
                        teknokrat

                        I am on Windows 2003, so there is no iptables, IP bonding, or IP multipathing. I have run the nodes with the -b, --host, and -Dbind.address= startup options, to no avail.

                        Running ethereal shows the following (a short snippet of a typical capture):

                        No.  Time        Source        Destination  Protocol  Info
                        250  201.789711  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
                        255  209.885937  10.21.201.18  228.1.2.3    UDP       Source port: 1515  Destination port: 45566
                        261  219.355072  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
                        269  230.261611  10.21.201.18  228.1.2.3    UDP       Source port: 1515  Destination port: 45566
                        272  235.888214  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
                        281  248.396233  10.21.201.18  228.1.2.3    UDP       Source port: 1515  Destination port: 45566
                        287  253.354952  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
                        


                        Nothing unusual, except that no UDP packets are being received! I have the feeling I am missing something obvious, but so help me...



                        • 9. Re: Nodes having difficulty discovering each other
                          belaban

                          Are you sure that the mcast_addr and mcast_port are the same on all 4 machines?
                          Where did you run ethereal? This shows that, on the machine where you ran it, you do receive multicasts to 45566 and 45577 (the latter is not the cluster you showed) from 10.21.201.18:1515 and 10.21.201.18:1518. We're seeing traffic from two clusters here.

                          • 10. Re: Nodes having difficulty discovering each other
                            teknokrat

                            I've rechecked the config files. Both cluster-service.xml and tc5-cluster.xml have the same config on both machines, and furthermore it is identical to the default config; i.e., I have not changed anything except setting loopback="true", because these are Windows machines.

                            The ethereal output above is from 10.21.201.18, which shows it sending out UDP packets. The log for 10.21.201.17 was identical: UDP packets were being sent but not received, at least not from another machine. Do you need to see more ethereal output? I figured only the UDP part would be interesting, but perhaps other protocols matter too. I was seeing a lot of TCP bad-checksum errors!

                            • 11. Re: Nodes having difficulty discovering each other
                              belaban

                              I give up: this is definitely a network problem, which you need to fix. It's hard to diagnose without having access to the network itself...
                              Sorry,

                              • 12. Re: Nodes having difficulty discovering each other
                                teknokrat

                                I really appreciate you taking the time to provide some help. I also think it's a network issue, but you just try convincing the network guys of this...

                                thank you very much

                                • 13. Re: Nodes having difficulty discovering each other
                                  smarlow

                                  I believe that network switches can be configured to prevent UDP multicast packets from being forwarded. I Googled for this and found the following sales info for one network switch that implies this is true:

                                  "Once the port is operational,the network administrator can use both regular and extended ACLs to control access to and through the network, enabling control policies that can permit or deny traffic based on a wide variety of identification characteristics, such as source/destination MAC addresses, source/destination IP addresses, and TCP/UDP ports/sockets or well-known port numbers?further protecting and restricting network access from malicious users." source link http://www.foundrynet.com/products/l23wiringcloset/fastiron/FIedgePOEDatasheet.html

                                  -Scott