6 Replies Latest reply on Aug 10, 2010 4:43 PM by snacker

NAKACK Attack ... Shun Me the Way

snacker Aug 6, 2010 6:48 PM

(pardon the pun)

We are using:

product	version
jboss-cache core	3.1.0.GA
jboss-cache pojo	3.0.0.GA
jgroups	2.6.10.GA
jboss app server	5.1.0.GA

We are running 2 jboss app servers on different machines (same hardware & network config).

We have had many instances where "something happens" (network or whatever... we don't know) that causes jgroups to start erroring.

(Usually this is after the servers have been running for many hours... often a week/month or more.)

We get the initial error messages, then many "NAKACK ... discarded message from non-member...".

Eventually one or more of the caches stops responding and we have to restart both instances.

So this last time it looks like this is what happened:

server 111	server 222	comments
... jboss server starts up and joins cluster ...	... jboss server starts up and joins cluster ...
... app server, caches and clustering work normally for hours/days/weeks/months...	... app server, caches and clustering work normally for hours/days/weeks/months...
	GMS : failed to collect ACKS from srv222@7820, srv222@7821 and srv111@7820	server 222 failed to collect ACKS from... itself ???
	recreated cluster view using srv222@7820 & 7821 and srv111@7821	srv111@7820 is not in the list.
new view received from srv222 and applied		still missing its own member (srv111@7820)
now rejects messages from itself @7820	also rejects messages from srv111@7820
... continues hundreds of times for many hours ...	... continues hundreds of times for many hours ...	over ~20 hours we get 700+ of these messages (per server).
shutdown jboss instance
	recreated cluster view which includes only itself (srv222@7820 & 7821)
restart jboss instance
	recreates cluster view using srv222@7820 & 7821 and srv111@7820
new view received from srv222 and applied
	recreates cluster view using srv222@7820 & 7821 and srv111@7820 & 7821
new view received from srv222 and applied
	shutdown jboss instance
recreated cluster view which includes only itself (srv111@7820 & 7821)
	restart jboss instance
recreates cluster view using srv111@7820 & 7821 and srv222@7820
	new view received from srv111 and applied
recreates cluster view using srv111@7820 & 7821 and srv222@7820 & 7821
	new view received from srv111 and applied
... app server, caches and clustering work normally for hours/days/weeks/months...	... app server, caches and clustering work normally for hours/days/weeks/months...

So some things are puzzling to me:

Why is server 111 discarding messages from itself?
Why is a new view that contains both servers, with both ports from each, never created unless one of the servers is restarted?
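
To make the first question concrete, here is a minimal sketch of how a member can end up rejecting its own messages. It assumes, as the "discarded message from non-member" log text suggests, that a message is dropped whenever its sender is not in the currently installed view. This is a toy model and not JGroups code; `ViewFilterSketch`, `installView` and `accept` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model only: NOT JGroups code. All names are hypothetical.
class ViewFilterSketch {
    // Members of the currently installed view, e.g. "srv111@7820".
    private Set<String> view = new HashSet<>();

    // Replace the current view with a new membership list.
    void installView(String... members) {
        view = new HashSet<>(Arrays.asList(members));
    }

    // Assumption: a message is accepted only if its sender is in the view.
    boolean accept(String sender) {
        return view.contains(sender);
    }

    public static void main(String[] args) {
        ViewFilterSketch srv111 = new ViewFilterSketch();

        // The broken view from the table above: srv111@7820 is missing.
        srv111.installView("srv222@7820", "srv222@7821", "srv111@7821");
        System.out.println(srv111.accept("srv111@7820")); // false: own message discarded
        System.out.println(srv111.accept("srv222@7820")); // true

        // Once a view containing all four members is installed, delivery resumes.
        srv111.installView("srv222@7820", "srv222@7821", "srv111@7820", "srv111@7821");
        System.out.println(srv111.accept("srv111@7820")); // true
    }
}
```

In this model nothing recovers until a complete view is installed, which matches the observation that the discards continue for hours until a restart forces a new view.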

If we added 'shun="true"' to the FD element in the config, would that force one or both of them to recreate the cluster and, by doing so, add both servers and both ports to the cluster view?

Also, what is the difference between FD.shun and GMS.shun?

It looks like both could be set to "true" to try to get the cluster to repair itself:

http://docs.jboss.org/jbossclustering/cluster_guide/5.0/draft/en-US/pdf/Clustering_Guide.pdf

http://docs.jboss.org/jbossclustering/cluster_guide/5.1/pdf/Clustering_Guide.pdf

Is there a reason why we wouldn't want to set "shun=true" for either of these?

I have attached the log files for servers 111 and 222 and the configs we use for the cache which used ports 7820 & 7821.

1. Re: NAKACK Attack ... Shun Me the Way

snacker Aug 9, 2010 8:25 PM (in response to snacker)

I can reproduce this behavior if I do the following:

1) start srv111 with debugging enabled:
-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=12345

2) connect the debugger:
jdb -connect com.sun.jdi.SocketAttach:timeout=5000,port=12345,hostname=localhost

3) "suspend" all of the threads:
> suspend
All threads suspended.

4) wait for srv222 to start reporting NAKACK errors

5) "resume" the threads
> resume
All threads resumed.

6) the views are now broken: each one is missing at least one port from srv111 or srv222.

I'm going to set 'shun="true"' to see if the clusters try to reform correctly.
2. Re: NAKACK Attack ... Shun Me the Way

snacker Aug 9, 2010 5:26 PM (in response to snacker)

Setting both FD.shun = true and pbcast.GMS.shun = true gives worse behavior than before.

1) get a lot of these errors on srv222 after srv111 is resumed : socket address for srv111:7825 could not be fetched, retrying
2) whatever is using port 63842 fails to restore the view between srv222 and srv111. srv222 keeps discarding the messages from srv111.
3) I'm guessing due to #2 I'm seeing a lot of these merge errors on srv111:

[GMS] Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [srv222:56288, srv111:63842]
[GMS] merge was supposed to be cancelled at merge participant srv111:63842 (merge_id=[srv111:63842|1281382891380]), but it is not since merge ids do not match
[GMS] GMS flush by coordinator at srv111:63842 failed
[GMS] Since flush failed at srv111:63842 rejected merge to srv111:63842, merge_id=[srv111:63842|1281382891380]
[GMS] merge_id ([srv111:63842|1281382891380]) or this.merge_id (null) is null (sender=srv111:63842).

BTW:
http://docs.jboss.org/jbossclustering/cluster_guide/5.1/pdf/Clustering_Guide.pdf pg 119 states
"JGroups allows applications to configure a channel such that shunning leads to automatic rejoins and state transfer"

Does anyone know "how" this is done?
I can't find any documentation on how to configure such a channel.
3. Re: NAKACK Attack ... Shun Me the Way

belaban Aug 10, 2010 3:42 AM (in response to snacker)

Your test is the same as CTRL-Z'ing a node and then fg'ing it again. The expected behavior is that the node is either shunned and rejoins, or a merge happens. I ran your test on the latest 2.6 release (2.6.16) and it worked after setting port_range="2". In some cases, e.g. when other programs are running on ports > 7820, you may have to set port_range to a higher value still.
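
A rough illustration of why port_range matters when each server runs members on two consecutive ports (7820 and 7821). The sketch assumes port_range is read as the number of consecutive ports probed per configured host, starting at the configured port; the exact bounds may differ between JGroups versions, so check the source of the release in use. `probedPorts` is a hypothetical helper, not a JGroups API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT JGroups code. probedPorts is a hypothetical helper.
class PortRangeSketch {
    // Assumption: discovery probes portRange consecutive ports from basePort.
    static List<Integer> probedPorts(int basePort, int portRange) {
        List<Integer> ports = new ArrayList<>();
        for (int p = basePort; p < basePort + portRange; p++) {
            ports.add(p);
        }
        return ports;
    }

    public static void main(String[] args) {
        // With a range of 1, the member on 7821 is never probed.
        System.out.println(probedPorts(7820, 1)); // [7820]
        // With a range of 2, both members on the host are found.
        System.out.println(probedPorts(7820, 2)); // [7820, 7821]
    }
}
```

Under this reading, a member that had to bind to a higher port because 7820 or 7821 was taken by another program would only be found with a correspondingly larger range.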

Note that in 2.8 and higher, shunning has been removed [1]. Also, if you have merges of overlapping partitions [2], then 2.6.x won't help you, and you'd have to upgrade to 2.8 (or better, 2.10).

2.8 and higher won't work with JBoss 5.x though; you'd have to use JBoss 6.x.

[1] http://belaban.blogspot.com/2009/06/shunning-has-been-shunned.html

[2] http://belaban.blogspot.com/2009/04/those-damn-edge-cases.html
4. Re: NAKACK Attack ... Shun Me the Way

snacker Aug 10, 2010 1:13 PM (in response to belaban)

I will try with FD.shun="false" and increasing the port_range to "2" or higher.

If I use FD.shun="true" and GMS.shun="false" the behavior is much better.
I do see the views being recreated, and eventually they settle to contain both servers with both ports.

However the problem described in #2 & #3 still happens (the ones referencing port 63842 & 56288 and the merge errors).
From what I can tell from the jmx-console these correspond to the "DefaultPartition-HAPartitionCache", "DefaultPartition-SessionCache" and "DefaultPartition".
The port numbers are random and appear to be assigned when the jboss server starts.
However, once these start erroring (i.e. one of them starts discarding messages) the view never gets updated, and "PingDest" displayed in the jmx-console is null.
Furthermore, the jmx-console shows "Shun" == true for these, but shunning apparently does not take effect in this case.
The only way to get the view corrected is to restart the jboss instance whose messages are being discarded.
Is this configured in ./deploy/cluster/jgroups-channelfactory.sar/META-INF/jgroups-channelfactory-stacks.xml ??

Thanks for your help!

(BTW, our jboss instances are started under a different user name, that is why I use jdb to suspend the threads instead of ctrl-z/fg)
5. Re: NAKACK Attack ... Shun Me the Way

belaban Aug 10, 2010 1:37 PM (in response to snacker)

If you have a JGroups standalone example that fails, e.g. with org.jgroups.demos.Draw, I'll take a look. (Note that I tested under 2.6.16).
6. Re: NAKACK Attack ... Shun Me the Way

snacker Aug 10, 2010 4:43 PM (in response to belaban)

I tried FD.shun="false" and port_range="2", but it gave messages I hadn't seen before:

server 111:
[NAKACK] (requester=srv222:7826, local_addr=srv111:7825) message srv111:7825::57 not found in retransmission table of srv111:7825:[64 : 86 (86) (size=22, missing=0, highest stability=64)]

server 222:
[NAKACK] (requester=srv222:7826, local_addr=srv222:7825) message srv222:7825::282 not found in retransmission table of srv222:7825:[334 : 340 (340) (size=6, missing=0, highest stability=334)]
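
One possible reading of these messages, sketched as a toy model: a sender keeps its messages for retransmission only until they are reported stable, then purges them, so a late request for an already purged sequence number cannot be served. The sketch assumes that everything up to and including the "highest stability" number is purged, which fits the server 111 line (range 64 : 86, size=22). It is not JGroups code, and `RetransmitTableSketch` and its methods are hypothetical names.

```java
import java.util.TreeMap;

// Toy model only: NOT JGroups code. All names are hypothetical.
class RetransmitTableSketch {
    // Sent messages kept for retransmission, keyed by sequence number.
    private final TreeMap<Long, String> sent = new TreeMap<>();

    void add(long seqno, String payload) {
        sent.put(seqno, payload);
    }

    // Assumption: everything up to and including the stable seqno is purged.
    void stable(long seqno) {
        sent.headMap(seqno, true).clear();
    }

    // Returns the stored message, or null if it was never sent or already purged.
    String retransmit(long seqno) {
        return sent.get(seqno);
    }

    int size() {
        return sent.size();
    }

    public static void main(String[] args) {
        RetransmitTableSketch table = new RetransmitTableSketch();
        for (long s = 1; s <= 86; s++) {
            table.add(s, "msg-" + s);
        }
        table.stable(64);

        System.out.println(table.size());         // 22
        System.out.println(table.retransmit(57)); // null
        System.out.println(table.retransmit(70)); // msg-70
    }
}
```

If this reading is right, the requester was out of contact long enough for the sender to have purged the messages it is now asking for.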

I will see if I can try one of the standalone examples.
This is on a headless server, so I'm not sure if I'll be able to try the "Draw" example.

BTW, there is no pre-compiled jgroups.jar for 2.6.16, right?
It looks like I need to get the source from here and build it locally...?
http://javagroups.cvs.sourceforge.net/viewvc/javagroups/JGroups/?pathrev=JGroups_2_6_16

Is 2.6.16 ok to use in a "production" environment?
