1. Re: Nodes not join Cluster - UDP discarded Message
belaban Nov 1, 2007 12:52 PM (in response to jboss_cody)
- Are the 3 additional boxes in the same subnet as the others?
- Do they bind to the correct addresses? 127.0.0.1 is *not* one!
- Any firewalls on? If so, turn them off to see whether the JOIN is successful.
- Do you use VLANs? If so, the 3 additional boxes need to be in the same VLAN as the others. Make sure the VLANs don't drop IP multicast packets.
- If nothing else works, you can always fall back to TCP:TCPPING and list your 6 nodes in TCPPING explicitly.
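For readers working through the checklist above, the bind-address question can be checked from Java itself. The following is only a minimal sketch: the class `BindAddressCheck` and its methods are hypothetical helpers, not part of JBoss or JGroups, and it assumes that a "correct" address means a non-loopback, non-wildcard address on an interface that is up and multicast-capable.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper for illustration; not part of JBoss or JGroups.
class BindAddressCheck {

    // A loopback (127.x.x.x, ::1) or wildcard (0.0.0.0) address is not a
    // useful bind address for a cluster that spans several machines.
    static boolean isUsable(InetAddress addr) {
        return !addr.isLoopbackAddress() && !addr.isAnyLocalAddress();
    }

    // Convenience overload for an IP literal such as "192.168.202.11".
    static boolean isUsableBindAddress(String ipLiteral) {
        try {
            return isUsable(InetAddress.getByName(ipLiteral));
        } catch (UnknownHostException e) {
            return false; // not a parseable or resolvable address
        }
    }

    // Lists "interface address" pairs on interfaces that are up and
    // report multicast support.
    static List<String> candidateBindAddresses() throws SocketException {
        List<String> result = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result; // no interfaces reported
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            if (!nic.isUp() || !nic.supportsMulticast()) {
                continue;
            }
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                if (isUsable(addr)) {
                    result.add(nic.getName() + " " + addr.getHostAddress());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws SocketException {
        for (String candidate : candidateBindAddresses()) {
            System.out.println(candidate);
        }
    }
}
```

Running it on each box and comparing the output with the address given to the startup script shows whether a node is about to bind to an address the other boxes can actually reach.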
2. Re: Nodes not join Cluster - UDP discarded Message
jboss_cody Nov 1, 2007 1:56 PM (in response to jboss_cody)
Thank you for the reply, Bela.
Before I answer your questions, I should mention that I am using VMware Server with CentOS 5 to develop my cluster.
I have successfully configured two separate VMs for my cluster. (VMware Server only allows up to 4 virtual NICs per VM.)
I started 3 nodes on one VM, and they form a cluster.
Then I started 3 more nodes on the other VM, and they form a cluster.
I then changed the cluster configuration on the 2nd set of nodes to match that of the 1st set.
I start the 1st set of nodes (IPs .11, .12, .13), then the 2nd set of nodes (.21, .22, .23) on the other VM.
I am using the UDP transport with the 'all' configuration. I can see the traffic taking place, but the nodes in my 2nd set are unable to JOIN the 1st set.
I know my problem exists in my UDP configuration, but that's as far as I've gotten.
Are the 3 additional boxes in the same subnet as the others?
-Yes, they all share the same subnet address. I simply copied the network configurations and only modified the actual host names and IP addresses.
Do they bind to the correct addresses? 127.0.0.1 is *not* one!
-Yes, I have the -b option included in the startup scripts of each instance/node.
Any firewalls on? If so, turn them off to see whether the JOIN is successful.
-No, at the initial configuration of VMware Server, we disabled SELinux and any other firewall that might interfere.
Do you use VLANs? If so, the 3 additional boxes need to be in the same VLAN as the others. Make sure the VLANs don't drop IP multicast packets.
-??? How can I find out this info?
P.S. I've said this before, but just to make it clear, I am a newbie to all of these concepts, so please work with me here.
Sorry for any stupid, obvious questions.
If nothing else works, you can always fall back to TCP:TCPPING and list your 6 nodes in TCPPING explicitly.
-This was my next plan, but from what I've read, there is extra network traffic using this approach.
Oh yeah, here is an excerpt from a node in the 1st set:
2007-10-31 22:55:40,984 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32789] discarded message from non-member 192.168.202.21:32796, my view is [192.168.202.11:32789|0] [192.168.202.11:32789]
2007-10-31 22:55:41,516 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32789] discarded message from non-member 192.168.202.21:32796, my view is [192.168.202.11:32789|0] [192.168.202.11:32789]
2007-10-31 22:55:48,172 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32787] discarded message from non-member 192.168.202.21:32794, my view is [192.168.202.11:32787|0] [192.168.202.11:32787]
2007-10-31 22:55:48,832 WARN [org.jgroups.protocols.pbcast.NAKACK] 192.168.202.11:32787] discarded message from non-member 192.168.202.21:32794, my view is [192.168.202.11:32787|0] [192.168.202.11:32787]
2007-10-31 22:55:50,509 DEBUG [org.jboss.web.tomcat.service.session.JBossCacheManager] Looking for sessions that have expired ...
2007-10-31 22:55:53,055 WARN [org.jgroups.protocols.pbcast.GMS] merge responses from subgroup coordinators <= 1 ([]). Cancelling merge
2007-10-31 22:55:53,056 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
2007-10-31 22:55:53,056 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
2007-10-31 22:55:55,512 WARN [org.jgroups.protocols.pbcast.GMS] merge responses from subgroup coordinators <= 1 ([]). Cancelling merge
2007-10-31 22:55:55,513 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
2007-10-31 22:55:55,513 ERROR [org.jgroups.protocols.pbcast.GMS] coords or merge_id == null
2007-10-31 22:56:00,514 DEBUG [org.jboss.web.tomcat.service.session.JBossCacheManager] Looking for sessions that have expired ...
2007-10-31 22:56:00,590 WARN [org.jgroups.protocols.pbcast.GMS] merge responses from subgroup coordinators <= 1 ([]). Cancelling merge
boolean isBad = "NOT GOOD".equals(this.Network_traffic);
System.out.print(isBad);
->NOT GOOD
Thanks again : )
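For readers trying to interpret NAKACK warnings like the ones in this post, it can help to pull out two things from each line: who sent the discarded message, and who the receiving node currently has in its view. The sketch below is only an illustration. `DiscardedMessageScan` is a hypothetical class, the regular expression is written against the exact wording of the log lines quoted in this thread, and the assumption that view members are separated by commas is mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of JBoss or JGroups.
class DiscardedMessageScan {

    // Group 1: the sender that was rejected.
    // Group 2: the member list of the receiver's current view.
    // The pattern follows the wording of the warnings quoted in this thread.
    private static final Pattern DISCARDED = Pattern.compile(
        "discarded message from non-member (\\S+), my view is \\[[^\\]]*\\] \\[([^\\]]*)\\]");

    // Returns the rejected sender, or null if the line is not such a warning.
    static String senderOf(String line) {
        Matcher m = DISCARDED.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Returns the members of the receiver's view (empty if no match).
    // Assumes members are comma-separated inside the brackets.
    static List<String> viewMembers(String line) {
        List<String> members = new ArrayList<>();
        Matcher m = DISCARDED.matcher(line);
        if (m.find()) {
            for (String member : m.group(2).split(",")) {
                String trimmed = member.trim();
                if (!trimmed.isEmpty()) {
                    members.add(trimmed);
                }
            }
        }
        return members;
    }

    public static void main(String[] args) {
        String line = "2007-10-31 22:55:40,984 WARN [org.jgroups.protocols.pbcast.NAKACK] "
            + "192.168.202.11:32789] discarded message from non-member 192.168.202.21:32796, "
            + "my view is [192.168.202.11:32789|0] [192.168.202.11:32789]";
        System.out.println("sender " + senderOf(line) + " is not in view " + viewMembers(line));
    }
}
```

Applied to the excerpt above, each warning shows a view that contains only the receiving node itself, so the sender on the other VM was never admitted as a member.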
3. Re: Nodes not join Cluster - UDP discarded Message
jboss_cody Nov 1, 2007 2:03 PM (in response to jboss_cody)
boolean isBad = "NOT GOOD".equals(this.Network_traffic);
System.out.print(isBad);
->NOT GOOD
Correction:
->true
4. Re: Nodes not join Cluster - UDP discarded Message
belaban Nov 1, 2007 5:49 PM (in response to jboss_cody)
My guess would be that you have a visibility issue between your VMware instances. I suggest you follow the instructions in section 2.8 and subsequent (http://www.jgroups.org/javagroupsnew/docs/manual/html/ch02.html#ItDoesntWork) to see whether multicast traffic between the 2 VMware instances is received.
5. Re: Nodes not join Cluster - UDP discarded Message
jboss_cody Nov 2, 2007 9:58 AM (in response to jboss_cody)
Ok, I've tested my connection from one VM to the other using McastSenderTest and McastReceiverTest. I used the send_on_all_interfaces and receive_on_all_interfaces options, and the responses were displayed correctly.
That leads me to believe that the problem lies in my cluster configuration (cluster-service.xml).
Could you direct me in solving the issue of the "NAKACK discarded message" error?
What are the issues surrounding such an error?
Once again, thank you Bela...
6. Re: Nodes not join Cluster - UDP discarded Message
belaban Nov 2, 2007 3:34 PM (in response to jboss_cody)
You could try using those settings for cluster-service.xml too.
That said, send_on_all_interfaces is not a good option, as it increases traffic dramatically.
7. Re: Nodes not join Cluster - UDP discarded Message
jboss_cody Nov 2, 2007 3:58 PM (in response to jboss_cody)
Hello again Bela,
Thank you for the replies.
Instead of "send_on_all_interfaces", I can use "bind_addr", right?
(And that is the address of the current node...?)
I have found out that some of my network configurations were wrong...
I still do not understand why nodes on the 2nd machine are unable to JOIN the cluster. Nowhere in my logs am I seeing a "JOIN" operation taking place. It's as if every message that gets sent is being discarded.
I have increased the number of initial hosts from 3 to 6, and also increased the join_timeout and retry_timeouts values.
-initial hosts maybe?
What else must I do...? This is driving me crazy!
-Any help is appreciated.
8. Re: Nodes not join Cluster - UDP discarded Message
belaban Nov 3, 2007 4:21 AM (in response to jboss_cody)
I suggest running the McastSender/ReceiverTests with a bind_addr until they find each other, and then using that bind_addr in your config.
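For readers who want to see what a test of this kind does, here is a minimal sketch written against the plain `java.net` API. It is not the JGroups McastSenderTest/McastReceiverTest code: the class `McastProbe`, its argument order, and the message text are all hypothetical. It sends or receives one datagram on a multicast group through the interface that owns a given bind address.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.NetworkInterface;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

// Hypothetical probe for illustration; not the JGroups test classes.
class McastProbe {

    // True if the literal is an IP multicast address (224.0.0.0/4 or ff00::/8).
    static boolean isMulticastGroup(String ipLiteral) {
        try {
            return InetAddress.getByName(ipLiteral).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Usage: java McastProbe send|receive <bind_addr> <mcast_addr> <port>
    public static void main(String[] args) throws Exception {
        if (args.length != 4 || !isMulticastGroup(args[2])) {
            System.err.println("usage: McastProbe send|receive <bind_addr> <mcast_addr> <port>");
            return;
        }
        // Find the interface that owns the requested bind address.
        NetworkInterface nic = NetworkInterface.getByInetAddress(InetAddress.getByName(args[1]));
        if (nic == null) {
            System.err.println("no local interface has address " + args[1]);
            return;
        }
        int port = Integer.parseInt(args[3]);
        InetSocketAddress group = new InetSocketAddress(InetAddress.getByName(args[2]), port);

        try (MulticastSocket socket = new MulticastSocket(port)) {
            socket.setNetworkInterface(nic); // interface used for outgoing multicast
            if (args[0].equals("send")) {
                byte[] payload = ("hello from " + args[1]).getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(payload, payload.length, group));
                System.out.println("sent " + payload.length + " bytes to " + group);
            } else {
                socket.joinGroup(group, nic); // join on the chosen interface only
                byte[] buf = new byte[1024];
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet); // blocks until a datagram arrives
                String text = new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8);
                System.out.println(text + " (from " + packet.getAddress().getHostAddress() + ")");
                socket.leaveGroup(group, nic);
            }
        }
    }
}
```

Start the receiver on one VM and the sender on the other, each with its own candidate bind address. If the receiver never prints anything, multicast is not getting through on that interface pair.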
9. Re: Nodes not join Cluster - UDP discarded Message
jboss_cody Nov 5, 2007 8:35 AM (in response to jboss_cody)
Ok, I have found the source of my problems. I had configured each instance correctly; it was an issue with VMware Server. VMware auto-generates MAC addresses and, for whatever reason, it was generating the same MAC address for two different machines.
I was able to change the settings so that a new MAC address was generated, and everything worked fine.
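For anyone who lands here with the same symptom, a duplicate MAC address can be spotted by printing the MAC addresses on each VM and comparing them. The following is a minimal sketch under stated assumptions: `DuplicateMacCheck` is a hypothetical helper, and the host-to-MAC map passed to `findDuplicates` is something you would assemble by hand from the output on each machine.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of JBoss, JGroups or VMware.
class DuplicateMacCheck {

    // Formats raw MAC bytes as lower-case, colon-separated hex.
    static String formatMac(byte[] mac) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < mac.length; i++) {
            if (i > 0) {
                sb.append(':');
            }
            sb.append(String.format("%02x", mac[i] & 0xff));
        }
        return sb.toString();
    }

    // Given host -> MAC, returns MAC -> hosts for every MAC that is
    // claimed by more than one host (comparison ignores letter case).
    static Map<String, List<String>> findDuplicates(Map<String, String> macByHost) {
        Map<String, List<String>> hostsByMac = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : macByHost.entrySet()) {
            String mac = entry.getValue().toLowerCase(Locale.ROOT);
            hostsByMac.computeIfAbsent(mac, k -> new ArrayList<>()).add(entry.getKey());
        }
        hostsByMac.values().removeIf(hosts -> hosts.size() < 2);
        return hostsByMac;
    }

    // Prints the MAC address of each local interface that reports one.
    public static void main(String[] args) throws SocketException {
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return; // no interfaces reported
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            byte[] mac = nic.getHardwareAddress(); // null for loopback or if unavailable
            if (mac != null) {
                System.out.println(nic.getName() + " " + formatMac(mac));
            }
        }
    }
}
```

Run `main` on each VM, then feed the collected values to `findDuplicates`; any entry it returns names the machines that share a MAC address.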
I would like to say thank you once more for all of your help. You have been very patient with my posts and I appreciate that. I take back all of the bad things that I was going to say... : )
Thanks again.
10. Re: Nodes not join Cluster - UDP discarded Message
belaban Nov 5, 2007 8:46 AM (in response to jboss_cody)
Good that you found it. Problems like these are very hard to solve in general...