1. Re: Basic TCP Cluster of two nodes fails to recover from a network...
kpandey Jun 9, 2006 2:29 AM (in response to kpandey)
Some more details on the test setup:
Node A is running on Windows XP with a NAT address of, say, 10.0.1.4.
Node B is running on Red Hat 9 with a NAT address of 10.0.1.5. -
2. Re: Basic TCP Cluster of two nodes fails to recover from a network...
belaban Jun 9, 2006 7:50 AM (in response to kpandey)
Do you have a MERGE2 protocol in your stack?
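For context (an illustrative 2.x-era snippet, not taken from the poster's configuration): MERGE2 is the protocol that periodically probes for coexisting subgroup coordinators and triggers a merge when it finds any, with probes fired at a random interval between min_interval and max_interval milliseconds. A typical entry sits just above the discovery protocol in the stack:
<!-- probe for rival coordinators every 5-10 s and merge subgroups back together -->
<MERGE2 min_interval="5000" max_interval="10000"/>
Without a merge protocol, two subgroups created by a network split are never rejoined automatically.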
-
3. Re: Basic TCP Cluster of two nodes fails to recover from a network...
kpandey Jun 9, 2006 8:20 PM (in response to kpandey)
Yes, here are my JGroups TCP settings:
<TCP bind_addr="10.0.1.62" start_port="7800" loopback="true"/>
<TCPPING initial_hosts="10.0.1.62[7800],10.0.1.61[7800]" port_range="3" timeout="3500"
         num_initial_members="2" up_thread="true" down_thread="true"/>
<MERGE2 min_interval="5000" max_interval="10000"/>
<FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true"/>
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
<pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
               retransmit_timeout="3000"/>
<pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false"/>
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
            print_local_addr="true" down_thread="true" up_thread="true"/>
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
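A timing note on the values above (my arithmetic, not something stated in the thread): FD heartbeats its neighbour in the view and only suspects it after max_tries unanswered pings of timeout ms each, and VERIFY_SUSPECT then double-checks before GMS excludes the member:
<!-- suspicion needs roughly timeout x max_tries = 2500 ms x 5 = ~12.5 s of silence -->
<FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true"/>
<!-- a suspected member then gets another 1500 ms to prove it is alive before exclusion -->
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>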
I have similar settings on the other node. -
4. Re: Basic TCP Cluster of two nodes fails to recover from a network...
kpandey Jun 12, 2006 10:19 PM (in response to kpandey)
I have more information on this issue.
I have a three-node setup using TCP for JGroups. It all works fine: if I stop a node and restart it, or kill -9 it and restart it, the oldest node becomes Master and all is well.
Now, while testing network error conditions, I'm running into problems. In the normal working case I have three nodes whose DefaultPartition CurrentView is
[10.0.1.48:1099, 10.0.2.130:1099, 10.0.1.61:1099]
Now I unplug the network cable from 10.0.1.61.
I see the following debug trace on 10.0.2.130:
02:05:18,276 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
02:05:18,278 INFO [DefaultPartition] Suspected member: 10.0.1.48:7800 (additional data: 14 bytes)
02:05:18,280 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 3, delta: -2) : [10.0.2.130:1099]
02:05:18,281 INFO [DefaultPartition] I am (10.0.2.130:1099) received membershipChanged event:
02:05:18,281 INFO [DefaultPartition] Dead members: 2 ([10.0.1.48:1099, 10.0.1.61:1099])
02:05:18,282 INFO [DefaultPartition] New Members : 0 ([])
02:05:18,282 INFO [DefaultPartition] All Members : 1 ([10.0.2.130:1099])
I do not understand why it thought 10.0.1.48 was dead as well.
The debug trace on 10.0.1.48 is:
19:50:43,033 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
19:50:44,611 WARN [FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
19:50:45,533 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
19:50:46,122 WARN [CoordGmsImpl] I am the coord and I'm being suspected -- will probably leave shortly
19:50:46,132 WARN [GMS] checkSelfInclusion() failed, 10.0.1.48:7810 is not a member of view [10.0.2.130:7810|3] [10.0.2.130:7810]; discarding view
19:50:46,517 WARN [FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
19:50:48,023 WARN [GMS] checkSelfInclusion() failed, 10.0.1.48:7800 (additional data: 14 bytes) is not a member of view [10.0.2.130:7800 (additional data: 15 bytes)|3] [10.0.2.130:7800 (additional data: 15 bytes)]; discarding view
19:50:48,032 WARN [CoordGmsImpl] I am the coord and I'm being suspected -- will probably leave shortly
19:50:48,033 INFO [DefaultPartition] Suspected member: 10.0.1.61:7800 (additional data: 14 bytes)
19:50:48,034 INFO [DefaultPartition] Suspected member: vallance-lnx:7800 (additional data: 14 bytes)
Why is 10.0.1.48 a suspect?
The result is that both 10.0.1.48 and 10.0.2.130 now run in Master mode, and not in a cluster.
Upon connecting the network cable back to 10.0.1.61, the cluster goes through several view changes and finally settles down to the following view on all three nodes:
[10.0.2.130:1099, 10.0.1.61:1099, 10.0.1.48:1099]
How do I troubleshoot this? I would expect 10.0.2.130 and 10.0.1.48 to never lose the cluster group, and 10.0.1.61 to join at the end as the newest member.
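One variation commonly suggested for pull-the-cable tests (an editorial sketch, not advice given in this thread; FD_SOCK is a standard JGroups protocol, but its placement and usefulness here are assumptions) is to pair the heartbeat-based FD with FD_SOCK, which holds a TCP socket open to a neighbour and raises a suspicion the moment that socket closes:
<!-- socket-based detection: fires immediately when a peer's process dies -->
<FD_SOCK down_thread="false" up_thread="false"/>
<!-- heartbeat-based detection: still needed, since a pulled cable does not close the socket -->
<FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true"/>
FD_SOCK mainly speeds up detection of crashed processes; the cable-pull case still relies on FD's timeouts.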
Testing on jboss-3.2.8sp1 and jdk1.5
Thanks
Kumar -
5. Re: Basic TCP Cluster of two nodes fails to recover from a network...
kpandey Jun 13, 2006 1:02 AM (in response to kpandey)
Here's a sample of the JGroups settings on 10.0.1.62. The others have similar settings (i.e., all three hosts are listed in initial_hosts):
<TCP bind_addr="10.0.1.62" start_port="7800" loopback="true"/>
<TCPPING initial_hosts="10.0.1.62[7800],10.0.1.61[7800],10.0.2.130[7800]" port_range="3" timeout="3500"
         num_initial_members="3" up_thread="true" down_thread="true"/>
<MERGE2 min_interval="5000" max_interval="10000"/>
<FD shun="true" timeout="2500" max_tries="5" up_thread="true" down_thread="true"/>
<VERIFY_SUSPECT timeout="1500" down_thread="false" up_thread="false"/>
<pbcast.NAKACK down_thread="true" up_thread="true" gc_lag="100"
               retransmit_timeout="3000"/>
<pbcast.STABLE desired_avg_gossip="20000" down_thread="false" up_thread="false"/>
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="false"
            print_local_addr="true" down_thread="true" up_thread="true"/>
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
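To see why the views diverge, one standard troubleshooting step (a sketch assuming the stock JBoss 3.2.x conf/log4j.xml; the category name is the usual JGroups package, not something quoted in this thread) is to raise the JGroups log level so that FD, GMS, and MERGE2 report their suspicion and merge decisions:
<!-- in conf/log4j.xml: log JGroups protocol decisions at DEBUG -->
<category name="org.jgroups">
  <priority value="DEBUG"/>
</category>
-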
6. Re: Basic TCP Cluster of two nodes fails to recover from a network...
kosiakk Dec 18, 2009 9:48 AM (in response to kpandey)
Well, it seems that the problem is still there!
Instead of unplugging the network cable, you can suspend the process with a debugger breakpoint or something like that: the cluster splits into independent parts and never recovers by itself; you have to restart everything.
New nodes join the two clusters simultaneously, so the mess just grows.
Am I doing something wrong? Is this expected behaviour?
It looks like a show-stopper issue to me...