13 Replies Latest reply on Nov 23, 2005 4:30 PM by smarlow

    Nodes having difficulty discovering each other

    teknokrat

      This is my setup: I am building a 4-node cluster. I have tried JBoss 4.0.3 and 4.0.3SP1. The machines are Windows 2003 servers and the JVM is Java 5. Everything has been updated. I am using the default cluster-service.xml. All the machines are on the same network segment, and no firewalls are in place. The machines can all see (ping) one another.

      The problem I am having is that the nodes are not discovering each other. Sometimes three of the nodes form a group, but that takes a while, and the fourth node is never detected. The network settings of the machines are identical. I have tried using ViewDemo, with the same results. I have set up multicast receiver/sender pairs, and reception is sporadic. I have run the Draw demo and get no communication between the instances.

      To add insult to injury, I have set up a similar cluster for another company which worked 'out of the box'. I am not a network engineer, so I have no idea what I can troubleshoot to fix this problem. Likewise, the network engineer knows nothing about JBoss clustering, so he doesn't know what settings to check.

      Can anyone offer some advice on what I can do to troubleshoot this issue? I am logging all of JGroups at trace level, but I am not seeing anything that looks like a problem. What network settings should I get the network engineer to examine that could affect the multicast traffic between the machines?
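
      For reference, the multicast receiver/sender test I mention above is essentially the following standalone program, independent of JGroups entirely (plain java.net.MulticastSocket; the group and port mirror the cluster-service.xml defaults, so adjust them if yours differ). Run it with the argument send on one box and with no arguments on the others:

      import java.net.DatagramPacket;
      import java.net.InetAddress;
      import java.net.MulticastSocket;

      // Standalone multicast check, independent of JGroups. The group and
      // port mirror the cluster-service.xml defaults.
      public class McastCheck {
          static final String GROUP = "228.1.2.3";
          static final int PORT = 45566;

          public static void main(String[] args) throws Exception {
              InetAddress group = InetAddress.getByName(GROUP);
              MulticastSocket sock = new MulticastSocket(PORT);
              sock.joinGroup(group);
              if (args.length > 0 && args[0].equals("send")) {
                  // Sender: one datagram per second; every node should print it.
                  byte[] msg = "hello from sender".getBytes();
                  while (true) {
                      sock.send(new DatagramPacket(msg, msg.length, group, PORT));
                      Thread.sleep(1000);
                  }
              } else {
                  // Receiver: print whatever arrives on the group.
                  byte[] buf = new byte[1024];
                  while (true) {
                      DatagramPacket p = new DatagramPacket(buf, buf.length);
                      sock.receive(p);
                      System.out.println("from " + p.getAddress() + ": "
                              + new String(p.getData(), 0, p.getLength()));
                  }
              }
          }
      }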

      thank you

        • 1. Re: Nodes having difficulty discovering each other
          teknokrat

          Here is some sample log output:

          [org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2418 (additional data: 17 bytes), returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
          [org.jgroups.protocols.UDP] sending message to nbp02:2418 (additional data: 17 bytes) (src=nbp02:2418 (additional data: 17 bytes)), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]], UDP=[UDP:group_addr=ladbrokes]}
          [org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2418 (additional data: 17 bytes), src: nbp02:2418 (additional data: 17 bytes) (2 headers), size = 0 bytes]
          [org.jgroups.protocols.PING] received FIND_INITAL_MBRS_RSP, rsp=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 1 rsps
          [org.jgroups.protocols.PING] initial mbrs are [[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
          [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
          [org.jgroups.protocols.PING] FIND_INITIAL_MBRS
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 0 rsps
          [org.jgroups.protocols.UDP] sending message to 230.1.2.7:45577 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[UDP:group_addr=Tomcat-Cluster]}
          [org.jgroups.protocols.UDP] looped back local message [dst: 230.1.2.7:45577, src: nbp02:2421 (2 headers), size = 0 bytes]
          [org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2421, returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
          [org.jgroups.protocols.UDP] received (mcast) 102 bytes from /10.21.201.18:2422 (size=102 bytes)
          [org.jgroups.protocols.UDP] discarded own loopback multicast packet
          [org.jgroups.protocols.UDP] sending message to nbp02:2421 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]], UDP=[UDP:group_addr=Tomcat-Cluster]}
          [org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2421, src: nbp02:2421 (2 headers), size = 0 bytes]
          [org.jgroups.protocols.PING] received FIND_INITAL_MBRS_RSP, rsp=[own_addr=nbp02:2421, coord_addr=nbp02:2421]
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 1 rsps
          [org.jgroups.protocols.PING] initial mbrs are [[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
          [org.jgroups.protocols.MERGE2] initial_mbrs=[[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
          [org.jgroups.protocols.PING] FIND_INITIAL_MBRS
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 0 rsps
          [org.jgroups.protocols.UDP] sending message to 228.1.2.3:45566 (src=nbp02:2418 (additional data: 17 bytes)), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[UDP:group_addr=ladbrokes]}
          [org.jgroups.protocols.UDP] looped back local message [dst: 228.1.2.3:45566, src: nbp02:2418 (additional data: 17 bytes) (2 headers), size = 0 bytes]
          [org.jgroups.protocols.UDP] received (mcast) 114 bytes from /10.21.201.18:2419 (size=114 bytes)
          [org.jgroups.protocols.UDP] discarded own loopback multicast packet
          [org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2418 (additional data: 17 bytes), returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]]
          [org.jgroups.protocols.UDP] sending message to nbp02:2418 (additional data: 17 bytes) (src=nbp02:2418 (additional data: 17 bytes)), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]], UDP=[UDP:group_addr=ladbrokes]}
          [org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2418 (additional data: 17 bytes), src: nbp02:2418 (additional data: 17 bytes) (2 headers), size = 0 bytes]
          [org.jgroups.protocols.PING] received FIND_INITAL_MBRS_RSP, rsp=[own_addr=nbp02:2418 (additional data: 17 bytes), coord_addr=nbp02:2418 (additional data: 17 bytes)]
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 1 rsps
          [org.jgroups.protocols.PING] FIND_INITIAL_MBRS
          [org.jgroups.protocols.PING] waiting for initial members: time_to_wait=2000, got 0 rsps
          [org.jgroups.protocols.UDP] sending message to 230.1.2.7:45577 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_REQ, arg=null], UDP=[UDP:group_addr=Tomcat-Cluster]}
          [org.jgroups.protocols.UDP] looped back local message [dst: 230.1.2.7:45577, src: nbp02:2421 (2 headers), size = 0 bytes]
          [org.jgroups.protocols.PING] received GET_MBRS_REQ from nbp02:2421, returning [PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]]
          [org.jgroups.protocols.UDP] received (mcast) 102 bytes from /10.21.201.18:2422 (size=102 bytes)
          [org.jgroups.protocols.UDP] discarded own loopback multicast packet
          [org.jgroups.protocols.UDP] sending message to nbp02:2421 (src=nbp02:2421), headers are {PING=[PING: type=GET_MBRS_RSP, arg=[own_addr=nbp02:2421, coord_addr=nbp02:2421]], UDP=[UDP:group_addr=Tomcat-Cluster]}
          [org.jgroups.protocols.UDP] looped back local message [dst: nbp02:2421, src: nbp02:2421 (2 headers), size = 0 bytes]
          
          


          • 2. Re: Nodes having difficulty discovering each other
            belaban

            What's your config? Do you have an ip_ttl that is higher than 1?
            What's your JGroups version?

            • 3. Re: Nodes having difficulty discovering each other
              teknokrat

              I am using the default config. I have tried different settings, but to no avail. This is the config:

              <Config>
                <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45566"
                     ip_ttl="8" ip_mcast="true"
                     mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
                     ucast_send_buf_size="800000" ucast_recv_buf_size="150000"
                     loopback="true"/>
                <PING timeout="2000" num_initial_members="3"
                      up_thread="true" down_thread="true"/>
                <MERGE2 min_interval="10000" max_interval="20000"/>
                <FD shun="true" up_thread="true" down_thread="true"
                    timeout="2500" max_tries="5"/>
                <VERIFY_SUSPECT timeout="3000" num_msgs="3"
                      up_thread="true" down_thread="true"/>
                <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
                      max_xmit_size="8192"
                      up_thread="true" down_thread="true"/>
                <UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10"
                      down_thread="true"/>
                <pbcast.STABLE desired_avg_gossip="20000"
                      up_thread="true" down_thread="true"/>
                <FRAG frag_size="8192"
                      down_thread="true" up_thread="true"/>
                <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
                      shun="true" print_local_addr="true"/>
                <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
              </Config>
              


              • 4. Re: Nodes having difficulty discovering each other
                teknokrat

                JGroups version is whatever was shipped with 4.0.3SP1. Where can I find this information?

                • 5. Re: Nodes having difficulty discovering each other
                  belaban

                  JGroups version: http://wiki.jboss.org/wiki/Wiki.jsp?page=FAQ
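
                  If memory serves, the shipped jar can also report its own version through the org.jgroups.Version class; a rough sketch (printVersion() as the accessor is an assumption about the 2.x API, and running the class directly, java -cp jgroups.jar org.jgroups.Version, should print the same thing):

                  // Hedged check: JGroups 2.x keeps its release string in
                  // org.jgroups.Version. printVersion() as the accessor is an
                  // assumption; the class can also be run directly from the jar.
                  public class JGroupsVersionCheck {
                      public static void main(String[] args) {
                          System.out.println(org.jgroups.Version.printVersion());
                      }
                  }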

                  I suggest you simplify this and test whether JGroups works correctly across this cluster first: http://wiki.jboss.org/wiki/Wiki.jsp?page=TestingJBoss.

                  • 6. Re: Nodes having difficulty discovering each other
                    teknokrat

                    Yes, I've tried this.

                    1. I have set up the ViewDemo test on all machines. Eventually (after about half an hour) three of the nodes are recognised as a cluster group; the fourth node is never detected. (A sketch of what ViewDemo effectively does is at the end of this post.)

                    2. I have set up a multicast receiver/sender pair on the nodes and have sent messages. Not all messages are received; reception is sporadic, but it does occur on all nodes.

                    3. I have set up the Draw test on the nodes and cannot get it to work. Drawing on one node never replicates to another node.

                    I would like to reiterate that I doubt very much that this is a JGroups problem. I have set up the same configuration elsewhere and everything works OK. I have scanned the machines, and ports 45566 and 45577 show up as open.

                    Are there any network settings that can affect multicast packets? Can any other devices interfere with the discovery process? I know that there is a DB cluster on this network using Windows clustering.
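
                    For reference, what ViewDemo exercises boils down to roughly this (a sketch against the JGroups 2.x pull-mode API; the no-arg default-stack constructor and the receive() loop, with views delivered by default, are assumptions about that era's API):

                    import org.jgroups.JChannel;
                    import org.jgroups.View;

                    // Rough sketch of a ViewDemo-style membership watch
                    // (JGroups 2.x pull API; the default stack and default
                    // view delivery are assumptions).
                    public class ViewWatch {
                        public static void main(String[] args) throws Exception {
                            JChannel ch = new JChannel();   // default UDP stack
                            ch.connect("ViewDemoTest");     // group name must match on every node
                            System.out.println("joined, view: " + ch.getView());
                            while (true) {
                                Object evt = ch.receive(0); // block until the next event
                                if (evt instanceof View)    // print each membership change
                                    System.out.println("new view: " + evt);
                            }
                        }
                    }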

                    • 7. Re: Nodes having difficulty discovering each other
                      belaban

                      A few more suggestions:

                      - Firewall? Check iptables -L to see whether you have any rules (Linux).
                      - Do you bind to the right interfaces, using either bind_addr in the XML or the -Dbind.address sysprop? (See the interface-listing sketch below.)
                      - Use ethereal/snoop/tcpdump to see where your IP multicast packets are going.
                      - Do you use IP Bonding (Linux) or IP Multipathing (Solaris)?
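
                      For the bind question, a quick way to list every interface and address the JVM can see is the sketch below (plain JDK, nothing JGroups-specific; the class name is just for illustration):

                      import java.net.InetAddress;
                      import java.net.NetworkInterface;
                      import java.util.Enumeration;

                      // List each NIC and its addresses so a sensible value for
                      // bind_addr / -Dbind.address can be picked; pure JDK, runs on Java 5.
                      public class ListInterfaces {
                          public static void main(String[] args) throws Exception {
                              Enumeration<NetworkInterface> ifs =
                                      NetworkInterface.getNetworkInterfaces();
                              while (ifs.hasMoreElements()) {
                                  NetworkInterface ni = ifs.nextElement();
                                  System.out.println(ni.getName() + " (" + ni.getDisplayName() + "):");
                                  Enumeration<InetAddress> addrs = ni.getInetAddresses();
                                  while (addrs.hasMoreElements())
                                      System.out.println("  " + addrs.nextElement().getHostAddress());
                              }
                          }
                      }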

                      • 8. Re: Nodes having difficulty discovering each other
                        teknokrat

                        I am on Windows 2003, so there is no iptables, IP bonding, or IP multipathing. I have run the nodes with the -b, --host, and -Dbind.address= startup options, to no avail.

                        Running ethereal shows the following (a short snippet of a typical capture):

                        No.  Time        Source        Destination  Protocol  Info
                        250  201.789711  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
                        255  209.885937  10.21.201.18  228.1.2.3    UDP       Source port: 1515  Destination port: 45566
                        261  219.355072  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
                        269  230.261611  10.21.201.18  228.1.2.3    UDP       Source port: 1515  Destination port: 45566
                        272  235.888214  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
                        281  248.396233  10.21.201.18  228.1.2.3    UDP       Source port: 1515  Destination port: 45566
                        287  253.354952  10.21.201.18  230.1.2.7    UDP       Source port: 1518  Destination port: 45577
                        


                        Nothing unusual, except that no UDP packets are being received! I have the feeling I am missing something obvious, but so help me...



                        • 9. Re: Nodes having difficulty discovering each other
                          belaban

                          Are you sure that the mcast_addr and mcast_port are the same on all 4 machines?
                          Where did you run ethereal? This shows that, on the machine where you ran it, you do receive multicasts to 45566 and 45577 (the latter is not the cluster you showed) from 10.21.201.18:1515 and 10.21.201.18:1518. We're seeing traffic from two clusters here.

                          • 10. Re: Nodes having difficulty discovering each other
                            teknokrat

                            I've rechecked the config files. Both cluster-service.xml and tc5-cluster.xml have the same config on both machines, and furthermore it is identical to the default config; i.e., I have not changed anything except setting loopback="true", because these are Windows machines.

                            The ethereal output above is from 10.21.201.18, which shows it sending out UDP packets. The log for 10.21.201.17 was identical: UDP packets were being sent but not received, at least not from another machine. Do you need to see more ethereal output? I figured only the UDP part would be interesting, but perhaps other protocols matter too. I was seeing a lot of TCP bad-checksum errors!

                            • 11. Re: Nodes having difficulty discovering each other
                              belaban

                              I give up: this is definitely a network problem, which you need to fix. It's hard to diagnose without having access to the network itself...
                              Sorry,

                              • 12. Re: Nodes having difficulty discovering each other
                                teknokrat

                                I really appreciate you taking the time to provide some help. I also think it's a network issue, but you just try convincing the network guys of this...

                                thank you very much

                                • 13. Re: Nodes having difficulty discovering each other
                                  smarlow

                                  I believe that network switches can be configured to prevent UDP multicast packets from being forwarded. I Googled for this and found the following sales info for one network switch that implies this is true:

                                  "Once the port is operational,the network administrator can use both regular and extended ACLs to control access to and through the network, enabling control policies that can permit or deny traffic based on a wide variety of identification characteristics, such as source/destination MAC addresses, source/destination IP addresses, and TCP/UDP ports/sockets or well-known port numbers?further protecting and restricting network access from malicious users." source link http://www.foundrynet.com/products/l23wiringcloset/fastiron/FIedgePOEDatasheet.html

                                  -Scott