Shunning and the exception sound like two symptoms of the same problem -- a machine that's overtaxed. Preventing shunning doesn't solve the underlying problem.
Your FD protocol has a very short timeout/max_tries combination. With that, a busy machine that takes a while to respond to a heartbeat (perhaps just due to a long garbage collection) will get suspected. The default recommendation for FD now is timeout="10000" max_tries="5".
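To make the arithmetic concrete, here is a rough sketch. It assumes FD suspects a member after about timeout * max_tries milliseconds without a heartbeat reply. The FdWindow class and its methods are hypothetical helpers, not part of JGroups, and the "short" values of 2500/2 are made-up illustration values rather than your actual settings.

```java
// Hypothetical helper, not a JGroups class: approximates how long FD
// waits for heartbeat replies before it suspects a member.
final class FdWindow {

    // Approximate suspicion window: max_tries missed heartbeats, each
    // waited on for `timeout` milliseconds.
    static long suspicionWindowMillis(long timeoutMillis, int maxTries) {
        return timeoutMillis * maxTries;
    }

    // True if a pause of the given length (for example a long garbage
    // collection) is at least as long as the suspicion window.
    static boolean wouldBeSuspected(long pauseMillis, long timeoutMillis, int maxTries) {
        return pauseMillis >= suspicionWindowMillis(timeoutMillis, maxTries);
    }

    public static void main(String[] args) {
        // Recommended settings from this thread: timeout="10000" max_tries="5".
        System.out.println(suspicionWindowMillis(10_000, 5)); // prints 50000

        // An 8 second pause against made-up short settings (2500 ms, 2 tries)
        // versus the recommended ones.
        System.out.println(wouldBeSuspected(8_000, 2_500, 2));  // prints true
        System.out.println(wouldBeSuspected(8_000, 10_000, 5)); // prints false
    }
}
```

With the recommended values, a node has to stay silent for roughly 50 seconds before it is suspected, so an ordinary GC pause should no longer trigger suspicion.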
What's your SyncReplTimeout setting? Bumping it up will help prevent the exception.
Well, the underlying problem is that there are lots of cache operations on a slow machine. The test cases may not be realistic, but I would like to understand shunning better so that we will be prepared in production.
I have two nodes in a cluster where items in the TreeCache must be replicated synchronously. It sounds like "shun" causes a slow node to be more or less ignored by the faster node. We need every live node in the cluster to hold a replicated copy of the TreeCache. Because client requests are load-balanced, the TreeCache data must be available on either node.
So if "shun=false" doesn't prevent a shun, what does it do?
You mentioned reading the wiki page -- I assume you meant http://wiki.jboss.org/wiki/Wiki.jsp?page=Shunning.
Setting shun="false" will prevent JGroups from shunning the node, but that doesn't mean the performance of that node is acceptable to the JBoss Cache running on top of JGroups. When JBC replicates data to the other nodes in the cluster, it has a configurable timeout (SyncReplTimeout) that controls how long it will wait for those nodes to respond that they received and applied the replication. The ReplicationException you reported indicates that this timeout is being exceeded, which really is a different thing from shunning.
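As a rough illustration of why the timeout fires regardless of the shun setting, here is a minimal sketch of a synchronous call that waits a bounded time for a peer's acknowledgement. It is not JBoss Cache code: SyncReplSketch, awaitAck and the simulated peer delay are all hypothetical, and it only uses java.util.concurrent.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not the JBoss Cache implementation: the caller
// waits up to syncReplTimeoutMillis for a simulated peer to acknowledge.
final class SyncReplSketch {

    // Returns true if the peer acknowledged in time, false if the wait
    // timed out (the situation reported as a ReplicationException).
    static boolean awaitAck(long peerDelayMillis, long syncReplTimeoutMillis) {
        ExecutorService peer = Executors.newSingleThreadExecutor();
        try {
            // The simulated peer takes peerDelayMillis to apply the change.
            Future<Object> ack = peer.submit(() -> {
                Thread.sleep(peerDelayMillis);
                return null;
            });
            try {
                ack.get(syncReplTimeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                ack.cancel(true); // stop waiting on the slow peer
                return false;
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("simulated peer failed", e);
        } finally {
            peer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(awaitAck(10, 1_000)); // fast peer
        System.out.println(awaitAck(500, 50));   // slow peer, short timeout
    }
}
```

Nothing in this wait depends on group membership: a peer that is still a full member but slow to respond causes the same timeout, which is why raising SyncReplTimeout helps where shun="false" does not.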
It makes sense that the exception is caused by replication not occurring within SyncReplTimeout. But why was there a warning message in the log: "... am being shunned, will leave and rejoin group..."? That seems to indicate the node was "shunned".
However, reading the documentation on the "shun" attribute http://www.jgroups.org/javagroupsnew/docs/manual/html/protlist.html#d0e3328 I am wondering whether it actually controls if automatic rejoin is allowed, i.e. shun=true really means auto-rejoin=true. Is this correct?