I have a 32-node WebLogic cluster spread across 4 different machines. I use TreeCache as a distributed cache but, for unknown network reasons, each node can only see the nodes on its own machine; in effect there are 4 separate caches with 8 nodes each.
One morning I saw strange logs (well, I saw no logs at all) from the nodes on one machine: only the nodes on that machine were silent, while the other 24 were working normally.
As Windows teaches: don't worry, simply restart the malfunctioning nodes ;)
Well... the nodes did not come up: they were all stuck during the deployment of the webapp that uses TreeCache.
The only log message each node kept repeating was:
2007-03-08 10:22:17,890 WARN - join(10.0.0.5:47656) sent to 10.0.0.5:32768 timed out, retrying
Have you tested multicast on the network using one of the JGroups demo programs? (See the troubleshooting docs on www.jgroups.org to make sure the JGroups channels on each node can see the rest of the cluster.)
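If you can't get at the JGroups demo programs right away, a rough check with plain java.net sockets can at least show whether multicast packets travel between two hosts. This is only a sketch, not a JGroups tool: the class name McastCheck is made up, and the group address and port below are assumed placeholders that you should replace with the mcast_addr and mcast_port from your own TreeCache/JGroups config. Start it with "receive" on one machine and "send" on another.

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JGroups or TreeCache.
class McastCheck {
    // Assumed placeholders: use the values from your own JGroups config.
    static final String GROUP = "228.8.8.8";
    static final int PORT = 45566;

    // True if the given literal IP address is in the multicast range.
    static boolean isMulticast(String addr) {
        try {
            return InetAddress.getByName(addr).isMulticastAddress();
        } catch (Exception e) {
            return false;
        }
    }

    // Sends one small datagram to the multicast group.
    static void send(String text) throws Exception {
        try (MulticastSocket socket = new MulticastSocket()) {
            byte[] data = text.getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(data, data.length,
                    InetAddress.getByName(GROUP), PORT));
        }
    }

    // Joins the group and prints every packet until none arrives within the timeout.
    static void receive(int timeoutMillis) throws Exception {
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            // null lets the OS pick the default network interface
            socket.joinGroup(new InetSocketAddress(InetAddress.getByName(GROUP), PORT), null);
            socket.setSoTimeout(timeoutMillis);
            byte[] buf = new byte[1024];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                try {
                    socket.receive(packet);
                } catch (SocketTimeoutException e) {
                    System.out.println("no packet within " + timeoutMillis + " ms");
                    return;
                }
                String text = new String(packet.getData(), 0, packet.getLength(),
                        StandardCharsets.UTF_8);
                System.out.println(packet.getAddress().getHostAddress() + ": " + text);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0 && args[0].equals("send")) {
            send("hello from " + InetAddress.getLocalHost().getHostName());
        } else {
            receive(30000);
        }
    }
}
```

If the receiver only ever prints packets sent from its own machine, that would match the "4 separate caches" picture and point at the network (switch, routing, TTL or interface binding) rather than at TreeCache itself.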