NAKACK Attack ... Shun Me the Way
snacker Aug 6, 2010 6:48 PM
(pardon the pun)
We are using:
product | version |
---|---|
jboss-cache core | 3.1.0.GA |
jboss-cache pojo | 3.0.0.GA |
jgroups | 2.6.10.GA |
jboss app server | 5.1.0.GA |
We are running 2 jboss app servers on different machines (same hardware & network config).
We have had many instances where "something happens" (network or whatever... we don't know) that causes jgroups to start erroring.
(Usually this is after the servers have been running for many hours... often a week/month or more.)
We get the initial error messages, then many "NAKACK ... discarded message from non-member...".
Eventually one or more of the caches stops responding and we have to restart both instances.
This last time, it looks like the following happened:
server 111 | server 222 | comments |
---|---|---|
... jboss server starts up and joins cluster ... | ... jboss server starts up and joins cluster ... | |
... app server, caches and clustering work normally for hours/days/weeks/months... | ... app server, caches and clustering work normally for hours/days/weeks/months... | |
GMS: failed to collect ACKs from srv222@7820, srv222@7821 and srv111@7820 | server 222 failed to collect ACKs from... itself ??? | |
recreated cluster view using srv222@7820 & 7821 and srv111@7821 | srv111@7820 is not in the list. | |
new view received from srv222 and applied | still missing its own member (srv111@7820) | |
now rejects messages from itself @7820 | also rejects messages from srv111@7820 | |
... continues hundreds of times for many hours ... | ... continues hundreds of times for many hours ... | over ~20 hours we get 700+ of these messages (per server). |
shutdown jboss instance | ||
recreated cluster view which includes only itself (srv222@7820 & 7821) | ||
restart jboss instance | ||
recreates cluster view using srv222@7820 & 7821 and srv111@7820 | ||
new view received from srv222 and applied | ||
recreates cluster view using srv222@7820 & 7821 and srv111@7820 & 7821 | ||
new view received from srv222 and applied | ||
shutdown jboss instance | ||
recreated cluster view which includes only itself (srv111@7820 & 7821) | ||
restart jboss instance | ||
recreates cluster view using srv111@7820 & 7821 and srv222@7820 | ||
new view received from srv111 and applied | ||
recreates cluster view using srv111@7820 & 7821 and srv222@7820 & 7821 | ||
new view received from srv111 and applied | ||
... app server, caches and clustering work normally for hours/days/weeks/months... | ... app server, caches and clustering work normally for hours/days/weeks/months... |
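To put a number on the "continues hundreds of times" rows above, something like the following could tally the discard warnings per sender in each server log. This is only a rough sketch: the class name `DiscardCounter` is made up, and it assumes the sender address is the first token after the phrase "discarded message from non-member", which may not match the exact log layout in the attached files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JGroups or JBoss.
final class DiscardCounter {
    // Assumption: the sender address directly follows the phrase and
    // ends at the first whitespace or comma.
    private static final Pattern DISCARD =
        Pattern.compile("discarded message from non-member\\s+([^\\s,]+)");

    // Returns how many discard warnings were logged per sender address.
    static Map<String, Integer> countBySender(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = DISCARD.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DiscardCounter <server.log>");
            return;
        }
        // Reads the whole file as UTF-8; fine for small logs.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countBySender(lines)
            .forEach((sender, n) -> System.out.println(sender + "\t" + n));
    }
}
```

Running it against each server's log separately would show whether both nodes are rejecting the same address.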
So some things are puzzling to me:
- Why is server 111 discarding messages from itself?
- Why is a new view containing both servers, with both ports from each, never created unless one of the servers is restarted?
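My current guess at the first question, written out as a toy model: if a node only accepts messages from addresses listed in the view it has installed, then a node that is missing from its own view will reject its own messages too. The sketch below is plain Java and does not use the JGroups API; `ViewFilterSketch` and its methods are hypothetical names, and the membership lists are the ones from the table above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not JGroups code: a node that drops any message
// whose sender is not in the currently installed view.
final class ViewFilterSketch {
    private final String self;
    private Set<String> view = new LinkedHashSet<>();

    ViewFilterSketch(String self) {
        this.self = self;
    }

    // Replaces the current view with the given member list.
    void installView(List<String> members) {
        view = new LinkedHashSet<>(members);
    }

    // A message is accepted only if its sender is a member of the view.
    boolean accepts(String sender) {
        return view.contains(sender);
    }

    // True if this node appears in the view it has installed.
    boolean inOwnView() {
        return view.contains(self);
    }

    public static void main(String[] args) {
        ViewFilterSketch node = new ViewFilterSketch("srv111@7820");

        // View as applied after the failure: srv111@7820 is missing.
        node.installView(Arrays.asList("srv222@7820", "srv222@7821", "srv111@7821"));
        System.out.println("in own view: " + node.inOwnView());
        System.out.println("accepts own messages: " + node.accepts("srv111@7820"));

        // View as applied after a restart: all four addresses present.
        node.installView(Arrays.asList(
            "srv222@7820", "srv222@7821", "srv111@7820", "srv111@7821"));
        System.out.println("accepts own messages: " + node.accepts("srv111@7820"));
    }
}
```

If that model is right, the discards are only a symptom, and the real question is why the incomplete view never gets replaced.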
If we added 'shun="true"' to the FC element in the config, would that force one or both of them to recreate the cluster and by doing so add both of the servers and ports to the cluster view?
Also, what is the difference between FC.shun and GMS.shun?
It looks like both could be set to "true" to try to get the cluster to repair itself:
http://docs.jboss.org/jbossclustering/cluster_guide/5.0/draft/en-US/pdf/Clustering_Guide.pdf
http://docs.jboss.org/jbossclustering/cluster_guide/5.1/pdf/Clustering_Guide.pdf
Is there a reason why we wouldn't want to set "shun=true" for either of these?
I have attached the log files for servers 111 and 222 and the configs we use for the cache which used ports 7820 & 7821.
- server-222.log.zip 4.8 KB
- server-111.log.zip 3.5 KB
- examplecache-service.xml 1.9 KB