10 Replies Latest reply on Nov 5, 2007 8:46 AM by Bela Ban

    Nodes not join Cluster - UDP discarded Message

    Cody Addison Newbie

      Hello again,

      Before I begin, let me state that I have already read each of the following links:

      http://wiki.jboss.org/wiki/Wiki.jsp?page=Probe

      http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsPING

      http://wiki.jboss.org/wiki/Wiki.jsp?page=TestingJBoss

      http://www.jgroups.org/javagroupsnew/docs/manual/html/ch02.html#ItDoesntWork

      That being said, will anyone explain to me explicitly why my nodes will not join an existing cluster?

      I have a working cluster of 3 nodes, each joining and communicating effectively.

      I have created 3 more nodes which are to join the same cluster. The configurations of each node are mirror images of each other with the exception of their respective node names/ip addresses.

      I have configured each node in my existing cluster from the default configuration "all". I simply changed the PartitionName.

      I understand that the moderators are very busy here, but these links only provide the top layer of information that we need.

      Someone, anyone PLEASE HELP!

      thanks again...

        • 1. Re: Nodes not join Cluster - UDP discarded Message
          Bela Ban Master

          - Are the 3 additional boxes in the same subnet as the others ?
          - Do they bind to correct addresses ? 127.0.0.1 is *not* one !
          - Any firewalls on ? If so, turn off to see whether the JOIN is successful
          - Do you use VLANs ? If so, the 3 additional boxes need to be in the same VLAN as the others. Make sure VLANs don't drop IP multicast packets
          - If nothing else works, you can always fall back to TCP:TCPPING and list your 6 nodes in TCPPING explicitly
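          As a rough illustration of the bind-address check above, the following sketch lists local IPv4 addresses that are not loopback and sit on an interface that is up and reports multicast support. It uses only the standard java.net API; the class and method names are made up for this example and are not part of JBoss or JGroups.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of JBoss or JGroups.
final class BindAddressCheck {

    // A candidate bind address is IPv4, not loopback (127.0.0.1) and not the wildcard 0.0.0.0.
    static boolean isCandidate(InetAddress addr) {
        return addr instanceof Inet4Address
                && !addr.isLoopbackAddress()
                && !addr.isAnyLocalAddress();
    }

    // Convenience overload intended for IP literals such as "192.168.202.11".
    static boolean isCandidate(String ipLiteral) {
        try {
            return isCandidate(InetAddress.getByName(ipLiteral));
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Collects "interface -> address" for every candidate on an interface
    // that is up and reports multicast support.
    static List<String> candidates() throws SocketException {
        List<String> result = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result;
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            if (!nic.isUp() || !nic.supportsMulticast()) {
                continue;
            }
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                if (isCandidate(addr)) {
                    result.add(nic.getName() + " -> " + addr.getHostAddress());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws SocketException {
        for (String line : candidates()) {
            System.out.println(line);
        }
    }
}
```

          If the address passed with -b does not show up in this list on a given box, that box is a good place to start looking.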

          • 2. Re: Nodes not join Cluster - UDP discarded Message
            Cody Addison Newbie

            Thank you for the reply Bela,

            Before I answer your questions, I thought that I should also mention that I am using vmware Server w/Centos5 to develop my cluster.

            I have successfully configured two separate VM's for my cluster. (vmware Server only allows up to 4 virtual NICs per vm)

            I started 3 nodes on one vm, and they form a cluster.

            Then I start 3 more nodes on the other vm and they form a cluster.

            I then changed the cluster-configurations on the 2nd set of nodes, to match those of the 1st set of nodes.

            I start the 1st set of nodes. (ips .11, .12, .13). I start the 2nd set of nodes (.21, .22, .23) on the other vm.

            I am using UDP transport with the configurations of 'all'. I can see the traffic taking place, but my 2nd set of nodes is unable to JOIN the 1st set of nodes.

            I know my problem exists in my UDP configuration, but that's as far as I've gotten.


            Are the 3 additional boxes in the same subnet as the others ?


            -Yes, each shares the same subnet addr. I simply copied the network configurations and only made modifications to the actual host names and ip addrs.

            Do they bind to correct addresses ? 127.0.0.1 is *not* one !


            -Yes, I have the -b option included in the startup scripts of each instance/node.

            Any firewalls on ? If so, turn off to see whether the JOIN is successful


            -No, at the initial configuration of VMware server, we disabled SELinux and any other firewall that might interfere.

            Do you use VLANs ? If so, the 3 additional boxes need to be in the same VLAN as the others. Make sure VLANs don't drop IP multicast packets


            -??? How can I find this out?

            P.S. I've said this before, but just to make it clear, I am a newbie to all of these concepts, so please work with me here.

            Sorry for any stupid, obvious questions.

            If nothing else works, you can always fall back to TCP:TCPPING and list your 6 nodes in TCPPING explicitly


            This was my next plan, but from what I've read, there is extra network traffic using this approach.


            Oh yeah, Here is an excerpt from a node in the 1st set:

            2007-10-31 22:55:40,984 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32789] discarded message from non-member 192.168.202.21:32796, my view is [192.168.202.11:32789|0] [192.168.202.11:32789]
            2007-10-31 22:55:41,516 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32789] discarded message from non-member 192.168.202.21:32796, my view is [192.168.202.11:32789|0] [192.168.202.11:32789]
            2007-10-31 22:55:48,172 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32787] discarded message from non-member 192.168.202.21:32794, my view is [192.168.202.11:32787|0] [192.168.202.11:32787]
            2007-10-31 22:55:48,832 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32787] discarded message from non-member 192.168.202.21:32794, my view is [192.168.202.11:32787|0] [192.168.202.11:32787]
            2007-10-31 22:55:50,509 DEBUG [org.jboss.web.tomcat.service.session.JBossCacheManager] Looking for sessions that have expired ...
            2007-10-31 22:55:53,055 WARN [org.jgroups.protocols.pbcast.GMS] merge responses from subgroup coordinators <= 1 ([]). Cancelling merge
            2007-10-31 22:55:53,056 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
            2007-10-31 22:55:53,056 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
            2007-10-31 22:55:55,512 WARN [org.jgroups.protocols.pbcast.GMS] merge responses from subgroup coordinators <= 1 ([]). Cancelling merge
            2007-10-31 22:55:55,513 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
            2007-10-31 22:55:55,513 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
            2007-10-31 22:56:00,514 DEBUG [org.jboss.web.tomcat.service.session.JBossCacheManager] Looking for sessions that have expired ...
            2007-10-31 22:56:00,590 WARN [org.jgroups.protocols.pbcast.GMS] merge responses from subgroup coordinators <= 1 ([]). Cancelling merge
            
            
            


            boolean isBad = "NOT GOOD".equals(this.Network_traffic);
            System.out.print(isBad);

            ->NOT GOOD

            Thanks again : )

            • 3. Re: Nodes not join Cluster - UDP discarded Message
              Cody Addison Newbie

               

              boolean isBad = "NOT GOOD".equals(this.Network_traffic);
              System.out.print(isBad);

              ->NOT GOOD


              Correction:

              ->true

              • 4. Re: Nodes not join Cluster - UDP discarded Message
                Bela Ban Master

                My guess would be that you have a visibility issue between your VMWare instances. I suggest you follow the instructions in section 2.8 and subsequent (http://www.jgroups.org/javagroupsnew/docs/manual/html/ch02.html#ItDoesntWork) to see whether multicast traffic between the 2 VMWare instances is received.

                • 5. Re: Nodes not join Cluster - UDP discarded Message
                  Cody Addison Newbie

                  Ok, I've tested my connection from one vm to the other using McastSenderTest and McastReceiverTest. I used the send_on_all_interfaces, and receive_on_all_interfaces options and the responses were displayed correctly.

                  That leads me to believe that my cluster configurations are where the problem lies. (cluster-service.xml)

                  Could you direct me in solving the issue of the "NAKACK discarded message" error?

                  What are the issues surrounding such an error?

                  Once again thank you Bela...

                  • 6. Re: Nodes not join Cluster - UDP discarded Message
                    Bela Ban Master

                    You could try using those settings in cluster-service.xml too, although send_on_all_interfaces is not a good option, as it increases traffic dramatically.

                    • 7. Re: Nodes not join Cluster - UDP discarded Message
                      Cody Addison Newbie

                      Hello again Bela,

                      Thank you for the replies.

                      Instead of "send_on_all_interfaces", I can use "bind_addr", right?

                      (and that is the address of the current node...?)

                      I have found out that some of my network configurations were wrong...

                      I still do not understand why nodes on the 2nd machine are unable to JOIN the cluster. Nowhere in my logs am I seeing a "JOIN" operation taking place. It's like every message that gets sent is being discarded.

                      I have increased the # of initial hosts from 3 to 6, as well as increased the Join_timeout and retry_timeouts.

                      -initial hosts maybe?

                      what else must I do...? This is driving me crazy!

                      -any help is appreciated.

                      • 8. Re: Nodes not join Cluster - UDP discarded Message
                        Bela Ban Master

                        I suggest you use the McastSender/ReceiverTests with a bind_addr until they find each other, and then use that bind_addr in your config.
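                        For readers who want to see what such a test does in principle, here is a minimal sketch of a multicast sender and receiver pinned to one bind address. It is not the source of the JGroups McastSenderTest/McastReceiverTest; the class name, the group 224.10.10.10 and the port 5555 are assumptions for illustration, so substitute the values from your own configuration.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.NetworkInterface;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; NOT the JGroups McastSenderTest/McastReceiverTest.
final class McastProbe {
    // Assumed values; replace with the multicast address and port from your own config.
    static final String GROUP = "224.10.10.10";
    static final int PORT = 5555;

    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(DatagramPacket packet) {
        return new String(packet.getData(), packet.getOffset(), packet.getLength(),
                StandardCharsets.UTF_8);
    }

    // Finds the interface that owns bindAddr, so traffic does not silently use another NIC.
    static NetworkInterface interfaceFor(InetAddress bindAddr) throws IOException {
        NetworkInterface nic = NetworkInterface.getByInetAddress(bindAddr);
        if (nic == null) {
            throw new IOException("no local interface has address " + bindAddr);
        }
        return nic;
    }

    // Sends one multicast datagram out of the interface that owns bindAddr.
    static void send(InetAddress bindAddr, String text) throws IOException {
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.setNetworkInterface(interfaceFor(bindAddr));
            byte[] data = encode(text);
            socket.send(new DatagramPacket(data, data.length, InetAddress.getByName(GROUP), PORT));
        }
    }

    // Joins the group on the interface that owns bindAddr and prints whatever arrives.
    static void receiveForever(InetAddress bindAddr) throws IOException {
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            socket.joinGroup(new InetSocketAddress(InetAddress.getByName(GROUP), PORT),
                    interfaceFor(bindAddr));
            byte[] buffer = new byte[1024];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
                socket.receive(packet);
                System.out.println(packet.getAddress().getHostAddress() + ": " + decode(packet));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: McastProbe send|recv <bind address>");
            return;
        }
        InetAddress bindAddr = InetAddress.getByName(args[1]);
        if ("recv".equals(args[0])) {
            receiveForever(bindAddr);
        } else {
            send(bindAddr, "hello from " + args[1]);
        }
    }
}
```

                        Run it with recv on one machine and send on the other, each with its own address. If nothing arrives for a given address, that address is probably not a good bind_addr candidate either.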

                        • 9. Re: Nodes not join Cluster - UDP discarded Message
                          Cody Addison Newbie

                          Ok, I have found the source of my problems. I had configured each instance correctly; the issue was with VMware Server. VMware auto-generates MAC addresses and, for whatever reason, it was generating the same MAC address for two different machines.

                          I was able to change the settings so that a new MAC address was generated, and everything worked fine.
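                          For anyone hitting the same thing, a check along these lines can flag the clash once the MAC address of each VM has been collected by hand (for example from the VM settings or the OS network tools). This is only a sketch: the class name, the VM names and the MAC values below are made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; the host names and MAC values used in main are invented.
final class DuplicateMacFinder {

    // Normalises case and separator so "00-0C-29-AA-BB-01" equals "00:0c:29:aa:bb:01".
    static String normalize(String mac) {
        return mac.trim().toLowerCase(Locale.ROOT).replace('-', ':');
    }

    // Returns MAC -> hosts for every MAC address that is used by more than one host.
    static Map<String, List<String>> duplicates(Map<String, String> macByHost) {
        Map<String, List<String>> hostsByMac = new TreeMap<>();
        for (Map.Entry<String, String> entry : macByHost.entrySet()) {
            hostsByMac.computeIfAbsent(normalize(entry.getValue()), k -> new ArrayList<>())
                    .add(entry.getKey());
        }
        // Drop every MAC that belongs to a single host only.
        hostsByMac.values().removeIf(hosts -> hosts.size() < 2);
        return hostsByMac;
    }

    public static void main(String[] args) {
        Map<String, String> macByHost = new LinkedHashMap<>();
        macByHost.put("vm1", "00:0C:29:AA:BB:01");
        macByHost.put("vm2", "00-0c-29-aa-bb-01"); // same MAC as vm1, different notation
        macByHost.put("vm3", "00:0C:29:AA:BB:02");
        System.out.println(duplicates(macByHost)); // prints {00:0c:29:aa:bb:01=[vm1, vm2]}
    }
}
```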

                          I would like to say thank you once more for all of your help. You have been very patient with my posts and I appreciate that. I take back all of the bad things that I was going to say... : )

                          Thanks again.

                          • 10. Re: Nodes not join Cluster - UDP discarded Message
                            Bela Ban Master

                            Good that you found it; problems like these are very hard to solve in general...