4 Replies Latest reply on Nov 2, 2006 9:43 PM by jyoonyang

shun=false, but

jyoonyang Nov 2, 2006 8:25 PM

Hi,

I am running a load test. One of my nodes is slower, and I started getting an "am being shunned, will leave and rejoin group..." warning, followed by ReplicationException: rsp=sender=foomachine:4524, retval=null, received=false, suspected=false

So after reading some Wiki pages, I set shun to false, but I still see the same behavior. Any ideas?

 <attribute name="ClusterConfig">
 <config>
 <!-- UDP: if you have a multihomed machine,
 set the bind_addr attribute to the appropriate NIC IP address, e.g bind_addr="192.168.0.2"
 -->
 <!-- UDP: On Windows machines, because of the media sense feature
 being broken with multicast (even after disabling media sense)
 set the loopback attribute to true -->
 <UDP mcast_addr="228.1.2.3" mcast_port="48877"
 ip_ttl="64" ip_mcast="true"
 mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
 ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
 loopback="false"/>
 <PING timeout="2000" num_initial_members="3"
 up_thread="false" down_thread="false"/>
 <MERGE2 min_interval="10000" max_interval="20000"/>
 <FD shun="false" timeout="2000" max_tries="3" />

 <FD_SOCK/>
 <VERIFY_SUSPECT timeout="1500"
 up_thread="false" down_thread="false"/>
 <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800"
 max_xmit_size="8192" up_thread="false" down_thread="false"/>
 <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10"
 down_thread="false"/>
 <pbcast.STABLE desired_avg_gossip="20000"
 up_thread="false" down_thread="false"/>
 <FRAG frag_size="8192"
 down_thread="false" up_thread="false"/>
 <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
 shun="false" print_local_addr="true"/>
 <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
 </config>
 </attribute>

Thanks,
Jennifer

1. Re: shun=false, but

brian.stansberry Nov 2, 2006 8:45 PM (in response to jyoonyang)

Shunning and the exception sound like two symptoms of the same problem -- a machine that's overtaxed. Preventing shunning doesn't solve the underlying problem.

Your FD protocol has a very short timeout/max_tries combination. With that, a busy machine that takes a while to respond to a heartbeat (perhaps just due to a long garbage collection) will get suspected. The default recommendation for FD now is timeout="10000" max_tries="5".
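As a rough sketch of that recommendation, the FD line in the posted ClusterConfig could be changed as shown here. The other attributes are copied from the original post. The timing arithmetic in the comments assumes that FD suspects a member after roughly timeout * max_tries milliseconds without a heartbeat reply, and the values are not tuned for any particular cluster.

```xml
<!-- Sketch only: timeout and max_tries are the values recommended above. -->
<!-- Assumed rule of thumb: time to suspicion is about timeout * max_tries. -->
<!--   posted config: 2000 * 3  = 6000 ms  -->
<!--   recommended:   10000 * 5 = 50000 ms -->
<FD shun="false" timeout="10000" max_tries="5" />
```

With the longer window, a node that pauses for a few seconds (for example during a long garbage collection) is less likely to be suspected, at the cost of slower detection of a node that has really failed.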

What's your SyncReplTimeout setting? Bumping it up will help prevent the exception.
2. Re: shun=false, but

jyoonyang Nov 2, 2006 9:19 PM (in response to jyoonyang)

Hi Brian,

Well, the underlying problem is that there are lots of cache operations on a slow machine. The test cases may not be realistic, but I would like to understand shunning better so that we are prepared in production.

I have two nodes in a cluster where items in the TreeCache must be replicated synchronously. It sounds like "shun" causes a slow node to be more or less ignored by the faster node. We need all the live nodes in the cluster to hold a replica of the TreeCache. When client requests are load-balanced, the TreeCache data must be available on either node.

So if "shun=false" doesn't prevent a shun, what does it do?

Thanks again,
Jennifer
3. Re: shun=false, but

brian.stansberry Nov 2, 2006 9:36 PM (in response to jyoonyang)

You mentioned reading the wiki page -- I assume you meant http://wiki.jboss.org/wiki/Wiki.jsp?page=Shunning.

Setting shun="false" will prevent JGroups from shunning the node, but that doesn't mean the performance of that node is acceptable to the JBoss Cache running on top of JGroups. When JBC replicates data to the other nodes in the cluster, it has a configurable timeout (SyncReplTimeout) that controls how long it will wait for those nodes to respond that they received and applied the replication. The ReplicationException you reported indicates that this timeout is being exceeded, which really is a different thing from shunning.
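To make the distinction concrete, here is a minimal sketch of where the replication timeout lives. It assumes the cache is configured through the same service XML as the ClusterConfig attribute posted earlier, with SyncReplTimeout as a sibling attribute. The 20000 ms value is only an assumed example, not a figure from this thread.

```xml
<!-- Sketch only: these sit next to the ClusterConfig attribute, not inside it. -->
<!-- The shun settings on FD and pbcast.GMS do not affect this timeout. -->
<attribute name="CacheMode">REPL_SYNC</attribute>
<!-- Assumed example value, in milliseconds; tune it for your own load. -->
<attribute name="SyncReplTimeout">20000</attribute>
```

A synchronous replication that does not get a response within this window fails with an exception whether or not the slow node has been shunned.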
4. Re: shun=false, but

jyoonyang Nov 2, 2006 9:43 PM (in response to jyoonyang)

Hi Brian,

It makes sense that the exception is caused by replication not occurring within SyncReplTimeout. But why was there a warning message in the log: "... am being shunned, will leave and rejoin group..."? That seems to indicate the node was "shunned".

However, reading the documentation on the "shun" attribute at http://www.jgroups.org/javagroupsnew/docs/manual/html/protlist.html#d0e3328, I am wondering if it controls whether automatic rejoin is allowed or not, i.e. shun=true really means auto-rejoin=true. Is this correct?

Thanks