10 Replies Latest reply on Nov 5, 2007 8:46 AM by belaban

Nodes not join Cluster - UDP discarded Message

jboss_cody Oct 31, 2007 2:41 PM

Hello again,

Before I begin, let me state that I have already read each of the following links:

http://wiki.jboss.org/wiki/Wiki.jsp?page=Probe

http://wiki.jboss.org/wiki/Wiki.jsp?page=JGroupsPING

http://wiki.jboss.org/wiki/Wiki.jsp?page=TestingJBoss

http://www.jgroups.org/javagroupsnew/docs/manual/html/ch02.html#ItDoesntWork

That being said, could someone explain to me explicitly why my nodes will not join an existing cluster?

I have a working cluster of 3 nodes, each joining and communicating effectively.

I have created 3 more nodes which are to join the same cluster. The configurations of each node are mirror images of each other with the exception of their respective node names/ip addresses.

I configured each node in my existing cluster from the default "all" configuration; I simply changed the PartitionName.

I understand that the moderators are very busy here, but these links only provide the top layer of information that we need.

Someone, anyone PLEASE HELP!

thanks again...

1. Re: Nodes not join Cluster - UDP discarded Message

belaban Nov 1, 2007 12:52 PM (in response to jboss_cody)

- Are the 3 additional boxes in the same subnet as the others?
- Do they bind to correct addresses? 127.0.0.1 is *not* one!
- Any firewalls on? If so, turn them off to see whether the JOIN is successful
- Do you use VLANs? If so, the 3 additional boxes need to be in the same VLAN as the others. Make sure the VLANs don't drop IP multicast packets
- If nothing else works, you can always fall back to TCP:TCPPING and list your 6 nodes in TCPPING explicitly
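Two items on this checklist can be checked mechanically. The sketch below is a hypothetical helper (the class name `ClusterChecks` is made up, not part of JBoss or JGroups) that rejects loopback bind addresses such as 127.0.0.1 and builds a TCPPING-style `initial_hosts` value in the `host[port],host[port],...` form. It assumes Java 9+, assumes the six nodes live on the 192.168.202.x subnet seen in the log later in this thread, and assumes port 7800; use whatever port your TCP transport actually binds to.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class; not part of JBoss or JGroups.
public class ClusterChecks {

    // Other nodes cannot reach a loopback address such as 127.0.0.1,
    // so it is never a usable bind address for a cluster member.
    static boolean isUsableBindAddress(String addr) {
        try {
            return !InetAddress.getByName(addr).isLoopbackAddress();
        } catch (UnknownHostException e) {
            return false; // a name that does not resolve cannot be bound either
        }
    }

    // Builds an initial_hosts value in the host[port] form used by TCPPING,
    // e.g. "192.168.202.11[7800],192.168.202.21[7800]".
    static String initialHosts(List<String> hosts, int port) {
        return hosts.stream()
                .map(h -> h + "[" + port + "]")
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Assumed addresses: the 192.168.202.x subnet from the log in this thread.
        List<String> nodes = List.of(
                "192.168.202.11", "192.168.202.12", "192.168.202.13",
                "192.168.202.21", "192.168.202.22", "192.168.202.23");
        for (String n : nodes) {
            System.out.println(n + " usable as bind address: " + isUsableBindAddress(n));
        }
        System.out.println("initial_hosts=" + initialHosts(nodes, 7800)); // port 7800 is an assumption
    }
}
```

Every node would list the same `initial_hosts` string, which is why TCPPING works even when multicast is blocked: discovery no longer depends on IP multicast at all.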
2. Re: Nodes not join Cluster - UDP discarded Message

jboss_cody Nov 1, 2007 1:56 PM (in response to jboss_cody)
Thank you for the reply Bela,

Before I answer your questions, I should also mention that I am using VMware Server with CentOS 5 to develop my cluster.

I have successfully configured two separate VMs for my cluster (VMware Server only allows up to 4 virtual NICs per VM).

I started 3 nodes on one VM, and they formed a cluster.

Then I started 3 more nodes on the other VM, and they formed a cluster.

I then changed the cluster configurations on the 2nd set of nodes to match those of the 1st set.

I start the 1st set of nodes (IPs .11, .12, .13), then I start the 2nd set of nodes (.21, .22, .23) on the other VM.

I am using the UDP transport with the 'all' configuration. I can see the traffic taking place, but my 2nd set of nodes is unable to JOIN the 1st set.

I know my problem exists in my UDP configuration, but that's as far as I've gotten.

Are the 3 additional boxes in the same subnet as the others?

-Yes, they all share the same subnet address. I simply copied the network configurations and only modified the actual host names and IP addresses.

Do they bind to correct addresses? 127.0.0.1 is *not* one!

-Yes, I have the -b option included in the startup scripts of each instance/node.

Any firewalls on? If so, turn them off to see whether the JOIN is successful

-No. When we initially configured VMware Server, we disabled SELinux and any other firewall that might interfere.

Do you use VLANs? If so, the 3 additional boxes need to be in the same VLAN as the others. Make sure the VLANs don't drop IP multicast packets

-??? How can I find this out?

P.S. I've said this before, but just to make it clear, I am a newbie to all of these concepts, so please work with me here.

Sorry for any stupid, obvious questions.

If nothing else works, you can always fall back to TCP:TCPPING and list your 6 nodes in TCPPING explicitly

This was my next plan, but from what I've read, there is extra network traffic using this approach.

Oh yeah, here is an excerpt from a node in the 1st set:

2007-10-31 22:55:40,984 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32789] discarded message from non-member 192.168.202.21:32796, my view is [192.168.202.11:32789|0] [192.168.202.11:32789]
2007-10-31 22:55:41,516 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32789] discarded message from non-member 192.168.202.21:32796, my view is [192.168.202.11:32789|0] [192.168.202.11:32789]
2007-10-31 22:55:48,172 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32787] discarded message from non-member 192.168.202.21:32794, my view is [192.168.202.11:32787|0] [192.168.202.11:32787]
2007-10-31 22:55:48,832 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32787] discarded message from non-member 192.168.202.21:32794, my view is [192.168.202.11:32787|0] [192.168.202.11:32787]
2007-10-31 22:55:50,509 DEBUG [org.jboss.web.tomcat.service.session.JBossCacheManager] Looking for sessions that have expired ...
2007-10-31 22:55:53,055 WARN [org.jgroups.protocols.pbcast.GMS] merge responses from subgroup coordinators <= 1 ([]). Cancelling merge
2007-10-31 22:55:53,056 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
2007-10-31 22:55:53,056 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
2007-10-31 22:55:55,512 WARN [org.jgroups.protocols.pbcast.GMS] merge responses from subgroup coordinators <= 1 ([]). Cancelling merge
2007-10-31 22:55:55,513 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
2007-10-31 22:55:55,513 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
2007-10-31 22:56:00,514 DEBUG [org.jboss.web.tomcat.service.session.JBossCacheManager] Looking for sessions that have expired ...
2007-10-31 22:56:00,590 WARN [org.jgroups.protocols.pbcast.GMS] merge responses from subgroup coordinators <= 1 ([]). Cancelling merge
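Read line by line, the NAKACK warnings say that the node on 192.168.202.11 has a view containing only itself (`[192.168.202.11:32789|0] [192.168.202.11:32789]`), so messages from 192.168.202.21 arrive from a sender outside that view and are discarded. The sketch below is a hypothetical log-scanning helper (`DiscardLogCheck` is a made-up name and uses no JGroups API) that pulls the sender and the view members out of such a warning, which makes singleton views easy to spot. The regular expression assumes the exact message format shown in the log above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses the text of NAKACK "discarded message" warnings.
public class DiscardLogCheck {

    // Group 1: the sender; group 2: the member list of the receiver's current view.
    private static final Pattern DISCARD = Pattern.compile(
            "discarded message from non-member ([^,\\s]+), my view is \\[[^\\]]*\\] \\[([^\\]]*)\\]");

    // Returns the sender of the discarded message, or null if the line does not match.
    static String sender(String line) {
        Matcher m = DISCARD.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Returns the members of the receiver's view, or an empty list if the line does not match.
    static List<String> viewMembers(String line) {
        Matcher m = DISCARD.matcher(line);
        if (!m.find()) return List.of();
        return Arrays.asList(m.group(2).split(",\\s*"));
    }

    public static void main(String[] args) {
        String line = "2007-10-31 22:55:40,984 WARN [org.jgroups.protocols.pbcast.NAKACK] "
                + "192.168.202.11:32789] discarded message from non-member 192.168.202.21:32796, "
                + "my view is [192.168.202.11:32789|0] [192.168.202.11:32789]";
        List<String> members = viewMembers(line);
        System.out.println("sender: " + sender(line));
        System.out.println("view members: " + members);
        // A view of size 1 means this node never got anyone else into its cluster.
        System.out.println("singleton view: " + (members.size() == 1));
    }
}
```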

boolean isBad = "NOT GOOD".equals(this.Network_traffic);
System.out.print(isBad);

->NOT GOOD

Thanks again : )
3. Re: Nodes not join Cluster - UDP discarded Message

jboss_cody Nov 1, 2007 2:03 PM (in response to jboss_cody)

boolean isBad = "NOT GOOD".equals(this.Network_traffic);
System.out.print(isBad);

->NOT GOOD

Correction:

->true
4. Re: Nodes not join Cluster - UDP discarded Message

belaban Nov 1, 2007 5:49 PM (in response to jboss_cody)

My guess would be that you have a visibility issue between your VMware instances. I suggest you follow the instructions in section 2.8 and the sections after it (http://www.jgroups.org/javagroupsnew/docs/manual/html/ch02.html#ItDoesntWork) to see whether multicast traffic between the 2 VMware instances is received.
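The test described there boils down to one host sending to a multicast group while another host listens on it. The JGroups tools (`McastSenderTest`, `McastReceiverTest`) are the ones to actually use; purely to show the mechanics, here is a minimal stand-in built on the standard `java.net.MulticastSocket` class. The class name `McastCheck`, the group address 228.8.8.8, and port 45588 are placeholders chosen for illustration, not JBoss or JGroups defaults.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the JGroups multicast test tools.
// Run "java McastCheck receive" on one VM, then "java McastCheck" on the other.
public class McastCheck {

    static final String GROUP = "228.8.8.8"; // placeholder group address
    static final int PORT = 45588;           // placeholder port

    // True if the address is in the multicast range (224.0.0.0/4 for IPv4).
    static boolean isMulticastGroup(String addr) {
        try {
            return InetAddress.getByName(addr).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    static void send(String text) throws IOException {
        try (MulticastSocket sock = new MulticastSocket()) {
            sock.setTimeToLive(1); // TTL 1 keeps packets on the local subnet
            byte[] data = text.getBytes(StandardCharsets.UTF_8);
            sock.send(new DatagramPacket(data, data.length, InetAddress.getByName(GROUP), PORT));
        }
    }

    static void receive() throws IOException {
        try (MulticastSocket sock = new MulticastSocket(PORT)) {
            sock.joinGroup(InetAddress.getByName(GROUP)); // deprecated since Java 14, still works
            byte[] buf = new byte[1024];
            while (true) {
                DatagramPacket p = new DatagramPacket(buf, buf.length);
                sock.receive(p); // blocks until a packet arrives
                System.out.println(p.getAddress() + ": "
                        + new String(p.getData(), 0, p.getLength(), StandardCharsets.UTF_8));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0 && args[0].equals("receive")) {
            receive();
        } else {
            send("hello from " + InetAddress.getLocalHost().getHostAddress());
        }
    }
}
```

If the receiver on one VM never prints the sender's message from the other VM, the problem is in the network between the VMs rather than in the JBoss configuration.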
5. Re: Nodes not join Cluster - UDP discarded Message

jboss_cody Nov 2, 2007 9:58 AM (in response to jboss_cody)

OK, I've tested the connection from one VM to the other using McastSenderTest and McastReceiverTest. I used the send_on_all_interfaces and receive_on_all_interfaces options, and the responses were displayed correctly.

That leads me to believe that the problem lies in my cluster configuration (cluster-service.xml).

Could you help me solve the "NAKACK discarded message" error?

What are the issues surrounding such an error?

Once again thank you Bela...
6. Re: Nodes not join Cluster - UDP discarded Message

belaban Nov 2, 2007 3:34 PM (in response to jboss_cody)

You could try using those settings in cluster-service.xml too, although send_on_all_interfaces is not a good option, as it increases traffic dramatically.
7. Re: Nodes not join Cluster - UDP discarded Message

jboss_cody Nov 2, 2007 3:58 PM (in response to jboss_cody)

Hello again Bela,

Thank you for the replies.

Instead of "send_on_all_interfaces", I can use "bind_addr", right?

(and that is the address of the current node...?)

I have found out that some of my network configurations were wrong...

I still do not understand why nodes on the 2nd machine are unable to JOIN the cluster. Nowhere in my logs am I seeing a "JOIN" operation taking place. It's like every message that gets sent is being discarded.

I have increased the number of initial hosts from 3 to 6, and also increased the join_timeout and retry timeouts.

-initial hosts maybe?

What else must I do? This is driving me crazy!

-any help is appreciated.
8. Re: Nodes not join Cluster - UDP discarded Message

belaban Nov 3, 2007 4:21 AM (in response to jboss_cody)

I suggest using McastSenderTest/McastReceiverTest with a bind_addr until they find each other, and then using that bind_addr in your config.
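Following that advice means trying each of the host's own addresses as bind_addr in turn. The sketch below (hypothetical class `BindAddrCandidates`, standard `java.net.NetworkInterface` API only) lists the IPv4 addresses of interfaces that are up, not loopback, and multicast-capable, which are the natural candidates to try. Restricting to IPv4 is an assumption that matches the 192.168.202.x addresses in this thread.

```java
import java.io.UncheckedIOException;
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper: lists local addresses worth trying as bind_addr.
public class BindAddrCandidates {

    // Returns "interface -> address" entries for IPv4 addresses on interfaces
    // that are up, not loopback, and support multicast.
    static List<String> candidates() {
        List<String> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
            if (all == null) return result; // defensive: treat "no enumeration" as no interfaces
            for (NetworkInterface ni : Collections.list(all)) {
                if (!ni.isUp() || ni.isLoopback() || !ni.supportsMulticast()) continue;
                for (InetAddress a : Collections.list(ni.getInetAddresses())) {
                    if (a instanceof Inet4Address) {
                        result.add(ni.getName() + " -> " + a.getHostAddress());
                    }
                }
            }
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // Try each printed address as bind_addr until sender and receiver see each other.
        candidates().forEach(System.out::println);
    }
}
```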
9. Re: Nodes not join Cluster - UDP discarded Message

jboss_cody Nov 5, 2007 8:35 AM (in response to jboss_cody)

OK, I have found the source of my problems. I had configured each instance correctly; it was an issue with VMware Server. VMware auto-generates MAC addresses, and for whatever reason it was generating the same MAC address for two different machines.

I was able to change the settings so that a new MAC address was generated, and everything worked fine.

I would like to say thank you once more for all of your help. You have been very patient with my posts and I appreciate that. I take back all of the bad things that I was going to say... : )

Thanks again.
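Duplicate MAC addresses are easy to miss because each VM looks fine on its own. One way to catch them is to print every interface's hardware address on each VM and compare the results. The sketch below (hypothetical class `MacCheck`) formats the output of the standard `NetworkInterface.getHardwareAddress()` call and flags any MAC that appears on more than one host. It assumes one MAC per VM in the comparison map, and the two entries in `main` are made-up example data, not the actual addresses from this thread.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for spotting VMs that were given the same MAC address.
public class MacCheck {

    // Formats a hardware address as colon-separated lowercase hex, e.g. "00:0c:29:ab:01:ff".
    static String format(byte[] mac) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < mac.length; i++) {
            if (i > 0) sb.append(':');
            sb.append(String.format("%02x", mac[i])); // %x prints a byte as unsigned
        }
        return sb.toString();
    }

    // Prints the MAC of every local interface that has one (loopback usually has none).
    static void printLocalMacs() throws SocketException {
        Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
        if (all == null) return;
        for (NetworkInterface ni : Collections.list(all)) {
            byte[] mac = ni.getHardwareAddress();
            if (mac != null) System.out.println(ni.getName() + " " + format(mac));
        }
    }

    // Given host -> MAC pairs collected from each VM, returns the MACs seen on more than one host.
    static List<String> duplicates(Map<String, String> macByHost) {
        Map<String, Integer> counts = new HashMap<>();
        for (String mac : macByHost.values()) counts.merge(mac.toLowerCase(), 1, Integer::sum);
        List<String> dups = new ArrayList<>();
        counts.forEach((mac, n) -> { if (n > 1) dups.add(mac); });
        Collections.sort(dups);
        return dups;
    }

    public static void main(String[] args) throws SocketException {
        printLocalMacs();
        // Made-up example data: two VMs that ended up with the same MAC.
        Map<String, String> seen = Map.of(
                "vm1", "00:0c:29:12:34:56",
                "vm2", "00:0C:29:12:34:56");
        System.out.println("duplicate MACs: " + duplicates(seen));
    }
}
```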
10. Re: Nodes not join Cluster - UDP discarded Message

belaban Nov 5, 2007 8:46 AM (in response to jboss_cody)

Good that you found it; problems like these are very hard to solve in general...
