1. Re: Basic TCP Cluster of two nodes fails to recover from a network...
kpandey Jun 9, 2006 2:29 AM (in response to kpandey)
Some more details on the test setup:
Node A is running on Windows XP with a NAT address of, say, 10.0.1.4.
Node B is running on Red Hat 9 with a NAT address of 10.0.1.5. -
2. Re: Basic TCP Cluster of two nodes fails to recover from a network...
belaban Jun 9, 2006 7:50 AM (in response to kpandey)
Do you have a MERGE2 protocol in your stack?
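For context (an illustrative 2.x-era snippet, not taken from the poster's configuration): MERGE2 is the protocol that periodically probes for coexisting subgroup coordinators and triggers a merge when it finds any, with probes fired at a random interval between min_interval and max_interval milliseconds. A typical entry sits just above the discovery protocol in the stack:
<!-- probe for rival coordinators every 5-10 s and merge subgroups back together -->
<MERGE2 min_interval="5000" max_interval="10000"/>
Without a merge protocol, two subgroups created by a network split are never rejoined automatically.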
-
3. Re: Basic TCP Cluster of two nodes fails to recover from a network...
kpandey Jun 9, 2006 8:20 PM (in response to kpandey)
Yes, here are my JGroups TCP settings:
<TCP bind_addr="10.0.1.62" start_port="7800" loopback="true"/>
<TCPPING initial_hosts="10.0.1.62[7800],10.0.1.61[7800]" port_range="3" timeout="3500"
         num_initial_members="2" up_thread="true" down_thread="true"/>
<MERGE2 min_interval="5000" max_interval="10000"/>
<FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true"/>
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
<pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
               retransmit_timeout="3000"/>
<pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false"/>
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
            print_local_addr="true" down_thread="true" up_thread="true"/>
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
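A timing note on the values above (my arithmetic, not something stated in the thread): FD heartbeats its neighbour in the view and only suspects it after max_tries unanswered pings of timeout ms each, and VERIFY_SUSPECT then double-checks before GMS excludes the member:
<!-- suspicion needs roughly timeout x max_tries = 2500 ms x 5 = ~12.5 s of silence -->
<FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true"/>
<!-- a suspected member then gets another 1500 ms to prove it is alive before exclusion -->
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>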
I have similar settings on the other node. -
4. Re: Basic TCP Cluster of two nodes fails to recover from a network...
kpandey Jun 12, 2006 10:19 PM (in response to kpandey)
I have more information on this issue.
I have a three-node setup using TCP for JGroups. It all works fine: if I stop a node and restart it, or kill -9 it and restart it, the oldest node becomes Master and all is well.
Now, while testing network error conditions, I'm running into problems. In the normal working case I have three nodes whose DefaultPartition CurrentView is
[10.0.1.48:1099, 10.0.2.130:1099, 10.0.1.61:1099]
Now I unplug the network cable from 10.0.1.61.
I see the following debug trace on 10.0.2.130:
02:05:18,276 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
02:05:18,278 INFO [DefaultPartition] Suspected member: 10.0.1.48:7800 (additional data: 14 bytes)
02:05:18,280 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 3, delta: -2) : [10.0.2.130:1099]
02:05:18,281 INFO [DefaultPartition] I am (10.0.2.130:1099) received membershipChanged event:
02:05:18,281 INFO [DefaultPartition] Dead members: 2 ([10.0.1.48:1099, 10.0.1.61:1099])
02:05:18,282 INFO [DefaultPartition] New Members : 0 ([])
02:05:18,282 INFO [DefaultPartition] All Members : 1 ([10.0.2.130:1099])
I do not understand why it thought 10.0.1.48 was dead as well.
The debug trace on 10.0.1.48 is:
19:50:43,033 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
19:50:44,611 WARN [FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
19:50:45,533 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
19:50:46,122 WARN [CoordGmsImpl] I am the coord and I'm being suspected -- will probably leave shortly
19:50:46,132 WARN [GMS] checkSelfInclusion() failed, 10.0.1.48:7810 is not a member of view [10.0.2.130:7810|3] [10.0.2.130:7810]; discarding view
19:50:46,517 WARN [FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
19:50:48,023 WARN [GMS] checkSelfInclusion() failed, 10.0.1.48:7800 (additional data: 14 bytes) is not a member of view [10.0.2.130:7800 (additional data: 15 bytes)|3] [10.0.2.130:7800 (additional data: 15 bytes)]; discarding view
19:50:48,032 WARN [CoordGmsImpl] I am the coord and I'm being suspected -- will probably leave shortly
19:50:48,033 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
19:50:48,034 INFO [DefaultPartition] Suspected member: vallance-lnx:7800 (additional data: 14 bytes)
Why is 10.0.1.48 a suspect?
The result is that both 10.0.1.48 and 10.0.2.130 now run in Master mode, and not in a cluster.
Upon connecting the network cable back to 10.0.1.61, the cluster goes through several view changes and finally settles down to the following view on all three nodes:
[10.0.2.130:1099, 10.0.1.61:1099, 10.0.1.48:1099]
How do I troubleshoot this? I would expect 10.0.2.130 and 10.0.1.48 to never lose the cluster group, and 10.0.1.61 to join at the end as the newest member.
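One variation commonly suggested for pull-the-cable tests (an editorial sketch, not advice given in this thread; FD_SOCK is a standard JGroups protocol, but its placement and usefulness here are assumptions) is to pair the heartbeat-based FD with FD_SOCK, which holds a TCP socket open to a neighbour and raises a suspicion the moment that socket closes:
<!-- socket-based detection: fires immediately when a peer's process dies -->
<FD_SOCK down_thread="false" up_thread="false"/>
<!-- heartbeat-based detection: still needed, since a pulled cable does not close the socket -->
<FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true"/>
FD_SOCK mainly speeds up detection of crashed processes; the cable-pull case still relies on FD's timeouts.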
Testing on jboss-3.2.8sp1 and jdk1.5
Thanks
Kumar -
5. Re: Basic TCP Cluster of two nodes fails to recover from a network...
kpandey Jun 13, 2006 1:02 AM (in response to kpandey)
Here's a sample of the JGroups settings on 10.0.1.62. The others have similar settings (i.e., all three hosts are listed in initial_hosts):
<TCP bind_addr="10.0.1.62" start_port="7800" loopback="true"/>
<TCPPING initial_hosts="10.0.1.62[7800],10.0.1.61[7800],10.0.2.130[7800]" port_range="3" timeout="3500"
         num_initial_members="3" up_thread="true" down_thread="true"/>
<MERGE2 min_interval="5000" max_interval="10000"/>
<FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true"/>
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
<pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
               retransmit_timeout="3000"/>
<pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false"/>
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
            print_local_addr="true" down_thread="true" up_thread="true"/>
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
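To see why the views diverge, one standard troubleshooting step (a sketch assuming the stock JBoss 3.2.x conf/log4j.xml; the category name is the usual JGroups package, not something quoted in this thread) is to raise the JGroups log level so that FD, GMS, and MERGE2 report their suspicion and merge decisions:
<!-- in conf/log4j.xml: log JGroups protocol decisions at DEBUG -->
<category name="org.jgroups">
  <priority value="DEBUG"/>
</category>
-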
6. Re: Basic TCP Cluster of two nodes fails to recover from a network...
kosiakk Dec 18, 2009 9:48 AM (in response to kpandey)
Well, it seems that the problem is still there!
Instead of unplugging the network cable, you can suspend the process with a debugger breakpoint or something like that: the cluster splits into independent parts and never recovers by itself; you have to restart everything.
New nodes join the two clusters simultaneously, so the mess just grows.
Am I doing something wrong? Is this expected behaviour?
It looks like a show-stopper issue to me...