6 Replies Latest reply on Oct 26, 2006 10:38 AM by jbirkenmaier

Shunning not working

jbirkenmaier Oct 16, 2006 12:16 PM

Hi,

I have 2 JChannels, both of which have shun enabled. One channel is for JBoss itself and the other is for my PojoCache. I have a cluster consisting of {A,B,C}. I unplug the network cable for C. Each node detects the loss of the other node(s). I wait about 60 seconds, then plug the cable back in. What I want is for {A,B} to continue in their own cluster and C to continue in its own (separate) cluster. In other words, A and B reject C's effort to rejoin the cluster.

What is happening is that C rejoins A and B and my cache is merged, thus becoming corrupted with C's data. In other words, node C is not shunned by A and B, and nodes A and B aren't shunned by C. Each node is notified of the change via a merge view event. I still need to receive this notification that a node tried to rejoin the cluster. I just want the rejoin to fail without touching the cache.

Here is an excerpt from one XML config file:

jboss:service=Naming
jboss:service=TransactionManager
mycom.prox:type=Connector,transport=proxsocket
org.jboss.cache.JBossTransactionManagerLookup
OPTIMISTIC
true
true
${jboss.partition.name:DefaultPartition}
${prox.cluster.mode:LOCAL}

<UDP mcast_addr="228.1.3.4" mcast_port="48868" ip_ttl="64" ip_mcast="true"
mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
loopback="false" use_local_host="true"/>
<PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false"/>
<MERGE2 min_interval="10000" max_interval="20000"/>
<FD_SOCK down_thread="false" up_thread="false"/>
<FD timeout="1000" max_tries="8" down_thread="false" up_thread="false" shun="true"/>
<VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false"/>
<pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192" up_thread="false" down_thread="false"/>
<UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false"/>
<pbcast.STABLE desired_avg_gossip="700" up_thread="false" down_thread="false"/>
<VIEW_SYNC avg_send_interval="60000" down_thread="false" up_thread="false" />
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
<FRAG2 frag_size="60000" down_thread="false" up_thread="false"/>
<pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>

5000

true
true

1. Re: Shunning not working

jbirkenmaier Oct 16, 2006 2:00 PM (in response to jbirkenmaier)

I am also thinking that AUTO_RECONNECT is true when I want it to be false. Since PojoCache doesn't provide access to its JGroups channel, how would I gain access to it to be able to change this?
2. Re: Shunning not working

jbirkenmaier Oct 18, 2006 10:49 AM (in response to jbirkenmaier)

I have gained access to the JChannel for the PojoCache and executed the following:

channel.setOpt(Channel.AUTO_GETSTATE, Boolean.FALSE);
channel.setOpt(Channel.AUTO_RECONNECT, Boolean.FALSE);

However, this doesn't seem to have any effect. When I reconnect the network cable (for node 192.168.69.122), the following is logged:

08:40:00,175 INFO [dragoneyes] (UpHandler (STATE_TRANSFER)) New cluster view for partition dragoneyes: 3 ([192.168.69.122:1099, 192.168.69.230:1099] delta: 1)
08:40:00,176 INFO [dragoneyes] (AsynchViewChangeHandler Thread) Merging partitions...
08:40:00,176 INFO [dragoneyes] (AsynchViewChangeHandler Thread) Dead members: 0
08:40:00,176 INFO [dragoneyes] (AsynchViewChangeHandler Thread) Originating groups: [[192.168.69.122:34399|2] [192.168.69.122:34399], [192.168.69.230:32821|2] [192.168.69.230:32821]]
08:40:05,753 INFO [CoreTreeCacheListener] (UpHandler (STATE_TRANSFER)) viewChange MergeView::[192.168.69.122:34402|3] [192.168.69.122:34402, 192.168.69.230:32823], subgroups=[[192.168.69.122:34402|2] [192.168.69.122:34402], [192.168.69.230:32823|2] [192.168.69.230:32823]]

The cache is still being merged for a node that is supposed to be shunned. Any ideas? Thanks.
3. Re: Shunning not working

belaban Oct 20, 2006 4:24 AM (in response to jbirkenmaier)

Shunning *always* rejoins C. To prevent this:
- disable shunning
- Remove MERGE2 from the stack

But I'm not sure I recommend this, unless you have a way of killing C, because {A,B} and {C} will see each other's traffic and discard it.
4. Re: Shunning not working

jbirkenmaier Oct 23, 2006 11:28 AM (in response to jbirkenmaier)

Actually, that is the plan: when node C detected the merge, it would exit with a status code of 10 thus causing a restart of JBoss. C would then join the cluster as a new node. That part of it WAS working. When I commented out the MERGE2 entry in the xml file, it stopped the cache from merging just like I wanted but it had the unwanted effect of no longer sending the merge event. So node C no longer restarts.

You see, I used the merge event to tell me that the network was reconnected so that the node could restart itself. Is there some way to still get a notification (of any kind) without merging the cache?
5. Re: Shunning not working

belaban Oct 24, 2006 10:39 AM (in response to jbirkenmaier)

Well, you can catch the viewAccepted() callback and check whether the argument is a View (no merge) or a MergeView (merge). In the latter case, do what you need to do. To prevent the cache from handling the merge itself, you may need to subclass TreeCache and override the callback.
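
For illustration, the View vs. MergeView check could look roughly like the sketch below. The View and MergeView classes here are simplified stand-ins for org.jgroups.View and org.jgroups.MergeView so that the example compiles without the JGroups jar, and MergeAwareListener is a hypothetical name, not a JGroups or JBoss Cache class. In a real deployment the same instanceof test would sit inside your own viewAccepted() implementation.

```java
import java.util.List;

// Simplified stand-ins for org.jgroups.View and org.jgroups.MergeView,
// defined here only so the sketch is self-contained.
class View {
    final List<String> members;
    View(List<String> members) { this.members = members; }
}

class MergeView extends View {
    final List<View> subgroups;
    MergeView(List<String> members, List<View> subgroups) {
        super(members);
        this.subgroups = subgroups;
    }
}

// Hypothetical listener: counts plain and merge views and runs a
// caller-supplied action (for example, a restart) only on a merge.
class MergeAwareListener {
    private final Runnable onMerge;
    int plainViews = 0;
    int mergeViews = 0;

    MergeAwareListener(Runnable onMerge) { this.onMerge = onMerge; }

    // Mirrors the shape of the viewAccepted(View) callback.
    public void viewAccepted(View newView) {
        if (newView instanceof MergeView) {
            // Partitions were merged back together.
            mergeViews++;
            onMerge.run();
        } else {
            // Ordinary membership change (join, leave, or crash).
            plainViews++;
        }
    }
}
```

The action is passed in as a Runnable so the detection logic stays separate from whatever the node does in response.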
6. Re: Shunning not working

jbirkenmaier Oct 26, 2006 10:38 AM (in response to jbirkenmaier)

What we ended up doing was to turn off the Merge completely for the cache and tap into the JChannel that JBoss uses to detect when the network goes down and comes back. By using an HAMembershipExtendedListener attached to the ClusterPartition MBean we get notification when there is a membership change in the JBoss cluster. Each node then decides whether to restart or not. This works just fine and we get no cache corruption.
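
In case it helps anyone else, here is a rough sketch of the idea, not our actual code. MembershipExtendedListener is a local stand-in modeled loosely on JBoss's HAMembershipExtendedListener so the example compiles on its own; check the real interface for the exact signatures. RestartPolicy and its rule (a node restarts if its pre-merge group was smaller than the largest group) are assumptions made up for this example, since the decision each node makes will depend on your application. The exit code 10 is the one mentioned earlier in this thread.

```java
import java.util.Vector;

// Local stand-in modeled loosely on HAMembershipExtendedListener;
// the real JBoss interface may differ in detail.
interface MembershipExtendedListener {
    void membershipChanged(Vector<String> dead, Vector<String> added,
                           Vector<String> all);
    void membershipChangedDuringMerge(Vector<String> dead, Vector<String> added,
                                      Vector<String> all,
                                      Vector<Vector<String>> originatingGroups);
}

// Hypothetical policy: after a merge, restart only if this node came
// from a group smaller than the largest originating group.
class RestartPolicy implements MembershipExtendedListener {
    static final int RESTART_EXIT_CODE = 10; // status code used in this thread

    private final String localNode;
    private final Runnable restartAction; // e.g. () -> System.exit(RESTART_EXIT_CODE)

    RestartPolicy(String localNode, Runnable restartAction) {
        this.localNode = localNode;
        this.restartAction = restartAction;
    }

    // Assumed rule; equal-sized groups mean nobody restarts.
    static boolean shouldRestart(String node, Vector<Vector<String>> groups) {
        int mine = 0;
        int largest = 0;
        for (Vector<String> group : groups) {
            largest = Math.max(largest, group.size());
            if (group.contains(node)) {
                mine = group.size();
            }
        }
        return mine < largest;
    }

    @Override
    public void membershipChanged(Vector<String> dead, Vector<String> added,
                                  Vector<String> all) {
        // Ordinary change: nothing to do in this sketch.
    }

    @Override
    public void membershipChangedDuringMerge(Vector<String> dead,
                                             Vector<String> added,
                                             Vector<String> all,
                                             Vector<Vector<String>> originatingGroups) {
        if (shouldRestart(localNode, originatingGroups)) {
            restartAction.run();
        }
    }
}
```

With originating groups {A,B} and {C}, this rule restarts C and leaves A and B running.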

Thanks for your help.