-
1. Re: Nodes having difficulty discovering each other
teknokrat Nov 22, 2005 5:40 AM (in response to teknokrat)
Here is a sample log:
[org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2418 (additional data: 17 bytes), returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
[org.jgroups.protocols.UDP] sending message to nbp02:2418 (additional data: 17 bytes) (src=nbp02:2418 (additional data: 17 bytes)), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]], UDP=[UDP:group_addr=ladbrokes]}
[org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2418 (additional data: 17 bytes), src: nbp02:2418 (additional data: 17 bytes) (2 headers), size = 0 bytes]
[org.jgroups.protocols.PING] received FIND_INITAL_MBRS_RSP, rsp=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]
[org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 1 rsps
[org.jgroups.protocols.PING] initial mbrs are [[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
[org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
[org.jgroups.protocols.PING] FIND_INITIAL_MBRS
[org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 0 rsps
[org.jgroups.protocols.UDP] sending message to 230.1.2.7:45577 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[UDP:group_addr=Tomcat-Cluster]}
[org.jgroups.protocols.UDP] looped back local message [dst: 230.1.2.7:45577, src: nbp02:2421 (2 headers), size = 0 bytes]
[org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2421, returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
[org.jgroups.protocols.UDP] received (mcast) 102 bytes from /10.21.201.18:2422 (size=102 bytes)
[org.jgroups.protocols.UDP] discarded own loopback multicast packet
[org.jgroups.protocols.UDP] sending message to nbp02:2421 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]], UDP=[UDP:group_addr=Tomcat-Cluster]}
[org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2421, src: nbp02:2421 (2 headers), size = 0 bytes]
[org.jgroups.protocols.PING] received FIND_INITAL_MBRS_RSP, rsp=[own_addr=nbp02:2421, coord_addr=nbp02:2421]
[org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 1 rsps
[org.jgroups.protocols.PING] initial mbrs are [[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
[org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
[org.jgroups.protocols.PING] FIND_INITIAL_MBRS
[org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 0 rsps
[org.jgroups.protocols.UDP] sending message to 228.1.2.3:45566 (src=nbp02:2418 (additional data: 17 bytes)), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[UDP:group_addr=ladbrokes]}
[org.jgroups.protocols.UDP] looped back local message [dst: 228.1.2.3:45566, src: nbp02:2418 (additional data: 17 bytes) (2 headers), size = 0 bytes]
[org.jgroups.protocols.UDP] received (mcast) 114 bytes from /10.21.201.18:2419 (size=114 bytes)
[org.jgroups.protocols.UDP] discarded own loopback multicast packet
[org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2418 (additional data: 17 bytes), returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
[org.jgroups.protocols.UDP] sending message to nbp02:2418 (additional data: 17 bytes) (src=nbp02:2418 (additional data: 17 bytes)), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]], UDP=[UDP:group_addr=ladbrokes]}
[org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2418 (additional data: 17 bytes), src: nbp02:2418 (additional data: 17 bytes) (2 headers), size = 0 bytes]
[org.jgroups.protocols.PING] received FIND_INITAL_MBRS_RSP, rsp=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]
[org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 1 rsps
[org.jgroups.protocols.PING] FIND_INITIAL_MBRS
[org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 0 rsps
[org.jgroups.protocols.UDP] sending message to 230.1.2.7:45577 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[UDP:group_addr=Tomcat-Cluster]}
[org.jgroups.protocols.UDP] looped back local message [dst: 230.1.2.7:45577, src: nbp02:2421 (2 headers), size = 0 bytes]
[org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2421, returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
[org.jgroups.protocols.UDP] received (mcast) 102 bytes from /10.21.201.18:2422 (size=102 bytes)
[org.jgroups.protocols.UDP] discarded own loopback multicast packet
[org.jgroups.protocols.UDP] sending message to nbp02:2421 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]], UDP=[UDP:group_addr=Tomcat-Cluster]}
[org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2421, src: nbp02:2421 (2 headers), size = 0 bytes]
-
2. Re: Nodes having difficulty discovering each other
belaban Nov 22, 2005 6:42 AM (in response to teknokrat)
What's your config? Do you have an ip_ttl that is higher than 1?
What's your JGroups version?
-
3. Re: Nodes having difficulty discovering each other
teknokrat Nov 22, 2005 7:32 AM (in response to teknokrat)
I am using the default config. I have tried different settings, but to no avail. This is the config:
<Config>
    <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45566"
         ip_ttl="8" ip_mcast="true"
         mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
         ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
         loopback="true"/>
    <PING timeout="2000" num_initial_members="3" up_thread="true" down_thread="true"/>
    <MERGE2 min_interval="10000" max_interval="20000"/>
    <FD shun="true" up_thread="true" down_thread="true" timeout="2500" max_tries="5"/>
    <VERIFY_SUSPECT timeout="3000" num_msgs="3" up_thread="true" down_thread="true"/>
    <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
                   max_xmit_size="8192" up_thread="true" down_thread="true"/>
    <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
             down_thread="true"/>
    <pbcast.STABLE desired_avg_gossip="20000" up_thread="true" down_thread="true"/>
    <FRAG frag_size="8192" down_thread="true" up_thread="true"/>
    <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true"
                print_local_addr="true"/>
    <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
</Config>
-
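A minimal sketch of how the mcast_addr value in the config above could resolve, assuming the usual ${property:default} placeholder convention: the effective multicast group depends on whether the jboss.partition.udpGroup system property is set on a given machine, so identical XML does not by itself guarantee an identical group. The class and method names here are hypothetical and are not JBoss or JGroups API, and the override value is only an example.

```java
import java.util.Properties;

public class PlaceholderSketch {
    // Hypothetical helper: resolves a "${name:default}" string against the given
    // properties. Strings without that exact shape are returned unchanged.
    static String resolve(String value, Properties props) {
        if (!value.startsWith("${") || !value.endsWith("}")) {
            return value;
        }
        String body = value.substring(2, value.length() - 1);
        int colon = body.indexOf(':');
        String name = colon < 0 ? body : body.substring(0, colon);
        String fallback = colon < 0 ? null : body.substring(colon + 1);
        String resolved = props.getProperty(name, fallback);
        // Leave the placeholder untouched if there is neither a value nor a default.
        return resolved != null ? resolved : value;
    }

    public static void main(String[] args) {
        String raw = "${jboss.partition.udpGroup:228.1.2.3}";

        Properties unset = new Properties();
        Properties overridden = new Properties();
        overridden.setProperty("jboss.partition.udpGroup", "230.1.2.7"); // example value

        System.out.println(resolve(raw, unset));      // prints 228.1.2.3
        System.out.println(resolve(raw, overridden)); // prints 230.1.2.7
    }
}
```

If one node were started with that property set and another without it, the two would send discovery requests to different groups and never see each other, which is one thing worth ruling out before blaming the network.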
4. Re: Nodes having difficulty discovering each other
teknokrat Nov 22, 2005 7:36 AM (in response to teknokrat)
The JGroups version is whatever shipped with 4.03sp1. Where can I find this information?
-
5. Re: Nodes having difficulty discovering each other
belaban Nov 22, 2005 8:10 AM (in response to teknokrat)
JGroups version: http://wiki.jboss.org/wiki/Wiki.jsp?page=FAQ
I suggest you simplify this and first test whether JGroups works correctly across this cluster: http://wiki.jboss.org/wiki/Wiki.jsp?page=TestingJBoss
-
6. Re: Nodes having difficulty discovering each other
teknokrat Nov 22, 2005 8:38 AM (in response to teknokrat)
Yes, I've tried this.
1. I have set up the ViewDemo test on all machines. Eventually (after about half an hour) three of the nodes are recognised as a cluster group. The fourth node is never detected.
2. I have set up a multicast receiver/sender pair on the nodes and have sent messages. Not all messages are received. Reception is sporadic but does occur on all nodes.
3. I have set up the Draw test on the nodes and cannot get it to work. Drawing on one node is never duplicated on another node.
I would like to reiterate that I doubt very much that this is a JGroups problem. I have set up the same configuration elsewhere and everything works OK. I have scanned the machines, and ports 45566 and 45577 show up as open.
Are there any network settings that can affect multicast packets? Can any other devices interfere with the discovery process? I know that there is a DB cluster on this network using Windows clustering.
-
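The multicast sender/receiver check described in item 2 of the post above can be sketched with plain java.net sockets, independent of JGroups. This is an illustrative sketch, not the tool used in the thread: the group address and TTL are taken from the posted config, while the port (45599) is an assumption chosen so the probe does not collide with the cluster's own port, and the class name is hypothetical.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class McastProbe {
    static final String GROUP = "228.1.2.3"; // mcast_addr default from the posted config
    static final int PORT = 45599;           // assumption: a free port, not the cluster's 45566
    static final int TTL = 8;                // ip_ttl from the posted config

    // Numbered payload so the receiver can spot gaps in the sequence.
    static byte[] payload(int seq) {
        return ("probe-" + seq).getBytes(StandardCharsets.UTF_8);
    }

    static void send(int count) throws Exception {
        InetAddress group = InetAddress.getByName(GROUP);
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.setTimeToLive(TTL);
            for (int seq = 0; seq < count; seq++) {
                byte[] data = payload(seq);
                socket.send(new DatagramPacket(data, data.length, group, PORT));
                Thread.sleep(1000); // one probe per second
            }
        }
    }

    static void receive(int timeoutMillis) throws Exception {
        InetAddress group = InetAddress.getByName(GROUP);
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            // null defers the interface choice to the OS; pass a NetworkInterface to pin it.
            socket.joinGroup(new InetSocketAddress(group, PORT), null);
            socket.setSoTimeout(timeoutMillis);
            byte[] buf = new byte[256];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                try {
                    socket.receive(packet);
                } catch (SocketTimeoutException e) {
                    System.out.println("no packet within " + timeoutMillis + " ms");
                    return;
                }
                String text = new String(packet.getData(), 0, packet.getLength(),
                        StandardCharsets.UTF_8);
                System.out.println(packet.getAddress().getHostAddress() + " -> " + text);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java McastProbe send|recv");
        } else if (args[0].equals("send")) {
            send(30);
        } else {
            receive(10000);
        }
    }
}
```

Running `recv` on one node and `send` on another, then swapping roles, shows per direction whether probes arrive at all and whether sequence numbers are skipped, which matches the sporadic reception reported above.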
7. Re: Nodes having difficulty discovering each other
belaban Nov 22, 2005 9:17 AM (in response to teknokrat)
A few more suggestions:
- Firewall? Check iptables -L to see whether you have any rules (Linux).
- Do you bind to the right interfaces, using either bind_addr in the XML or the -Dbind.address system property?
- Use ethereal/snoop/tcpdump to see where your IP multicast packets are going.
- Do you use IP Bonding (Linux) or IP Multipathing (Solaris)?
-
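For the interface-binding suggestion above, a small sketch that lists the local interfaces the JVM reports as up and multicast-capable, together with their addresses. It uses only standard java.net calls; the class name is hypothetical, and the output depends entirely on the machine it runs on.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

public class McastInterfaces {
    // Lists "interface-name address" for every address on an interface that is
    // up and reports multicast support. Purely informational.
    static List<String> candidates() {
        List<String> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> all = NetworkInterface.getNetworkInterfaces();
            if (all == null) {
                return result; // no interfaces reported
            }
            for (NetworkInterface nif : Collections.list(all)) {
                if (!nif.isUp() || !nif.supportsMulticast()) {
                    continue;
                }
                for (InetAddress addr : Collections.list(nif.getInetAddresses())) {
                    result.add(nif.getName() + " " + addr.getHostAddress());
                }
            }
        } catch (SocketException e) {
            throw new IllegalStateException("could not enumerate interfaces", e);
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : candidates()) {
            System.out.println(line);
        }
    }
}
```

On a multi-homed host, the address on the cluster LAN (10.21.201.x in this thread) is presumably the one to supply as bind_addr or -Dbind.address; if more than one candidate appears, an unpinned socket may end up on the wrong interface.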
8. Re: Nodes having difficulty discovering each other
teknokrat Nov 22, 2005 10:38 AM (in response to teknokrat)
I am on Windows 2003, so there is no iptables, IP bonding, or IP multipathing. I have run the nodes with the options -b and --host and with -Dbind.address=, to no avail.
Running ethereal shows the following (a short snippet of a typical log):
No.  Time        Source        Destination  Protocol  Info
250  201.789711  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
255  209.885937  10.21.201.18  228.1.2.3    UDP       Source port: 1515  Destination port: 45566
261  219.355072  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
269  230.261611  10.21.201.18  228.1.2.3    UDP       Source port: 1515  Destination port: 45566
272  235.888214  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
281  248.396233  10.21.201.18  228.1.2.3    UDP       Source port: 1515  Destination port: 45566
287  253.354952  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
Nothing unusual, except that no UDP packets are being received! I have the feeling I am missing something obvious, but so help me...
-
9. Re: Nodes having difficulty discovering each other
belaban Nov 23, 2005 3:36 AM (in response to teknokrat)
Are you sure that the mcast_addr and mcast_port are the same on all 4 machines?
Where did you run ethereal? The capture shows that on the machines on which you ran it, you do receive multicasts to 45566 and 45577 (the latter is not the cluster you showed) from 10.21.201.18:1515 and 10.21.201.18:1518. We're seeing traffic from 2 clusters here.
-
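The reading of the ethereal snippet can be reproduced mechanically by grouping the captured rows by multicast destination and collecting the distinct senders for each. The sketch below hard-codes the seven rows posted earlier in the thread (source, destination, destination port); the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class CaptureSummary {
    // Rows copied from the ethereal snippet in the thread:
    // source address, destination group, destination port.
    static final String[][] ROWS = {
        {"10.21.201.18", "230.1.2.7", "45577"},
        {"10.21.201.18", "228.1.2.3", "45566"},
        {"10.21.201.18", "230.1.2.7", "45577"},
        {"10.21.201.18", "228.1.2.3", "45566"},
        {"10.21.201.18", "230.1.2.7", "45577"},
        {"10.21.201.18", "228.1.2.3", "45566"},
        {"10.21.201.18", "230.1.2.7", "45577"},
    };

    // Maps each destination ("group:port") to the set of hosts seen sending to it.
    static Map<String, Set<String>> sendersByGroup(String[][] rows) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String[] row : rows) {
            result.computeIfAbsent(row[1] + ":" + row[2], k -> new TreeSet<>()).add(row[0]);
        }
        return result;
    }

    public static void main(String[] args) {
        sendersByGroup(ROWS).forEach((group, senders) ->
            System.out.println(group + " <- " + senders));
        // prints:
        // 228.1.2.3:45566 <- [10.21.201.18]
        // 230.1.2.7:45577 <- [10.21.201.18]
    }
}
```

Two destinations appear, which corresponds to the two clusters noted above, and each has a single sender. On a node that was actually receiving traffic from its peers, one would expect additional source addresses to show up in each set.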
10. Re: Nodes having difficulty discovering each other
teknokrat Nov 23, 2005 5:47 AM (in response to teknokrat)
I've rechecked the config files. Both cluster-service.xml and tc5-cluster.xml have the same config on both machines, and it is identical to the default config, i.e. I have not changed anything except setting loopback=true because these are Windows machines.
The ethereal output above is from 10.21.201.18 and shows it sending out UDP packets. The log for 10.21.201.17 was identical. UDP packets were being sent but not received, at least not from another machine. Do you need to see more ethereal output? I figured only the UDP part would be interesting, but perhaps other protocols matter too. I was seeing a lot of TCP bad checksum errors!
-
11. Re: Nodes having difficulty discovering each other
belaban Nov 23, 2005 7:35 AM (in response to teknokrat)
I give up: this is definitely a network problem, which you need to fix. It's hard to diagnose without having access to the network itself...
Sorry,
-
12. Re: Nodes having difficulty discovering each other
teknokrat Nov 23, 2005 8:11 AM (in response to teknokrat)
I really appreciate you taking the time to provide some help. I also think it's a network issue, but you just try and convince the network guys of this...
Thank you very much.
-
13. Re: Nodes having difficulty discovering each other
smarlow Nov 23, 2005 4:30 PM (in response to teknokrat)
I believe that network switches can be configured to prevent UDP packets from being broadcast. I googled for this and found the following sales info for one network switch that implies that this is true:
"Once the port is operational, the network administrator can use both regular and extended ACLs to control access to and through the network, enabling control policies that can permit or deny traffic based on a wide variety of identification characteristics, such as source/destination MAC addresses, source/destination IP addresses, and TCP/UDP ports/sockets or well-known port numbers - further protecting and restricting network access from malicious users."
Source link: http://www.foundrynet.com/products/l23wiringcloset/fastiron/FIedgePOEDatasheet.html
-Scott