8 Replies Latest reply on Sep 27, 2006 4:14 AM by belaban

Replication Problem When Nodes Have Gone Away

jbirkenmaier Sep 22, 2006 4:56 PM

Hi. Here's the problem in a nutshell. 3-node cluster with shared tree cache. Nodes 1 and 2 go away at around the same time (via an unplugged network cable). Node 3 gets notification within 10-12 seconds that Node 1 is gone and makes a few changes to the cache (within a transaction). Cache tries to replicate to Node 2 (not knowing it has gone away) and fails (ReplicationException). Node 3 thinks that his local cache has been updated but it hasn't because of the replication failure. Node 3 receives notification that Node 2 has gone away after ~50 seconds and again updates his cache, which works because there is no one left to replicate to.

There are two things I need help with:
1. I need to have my local cache update even when it fails to replicate.
2. Why does it take so long to receive notification that the second node has gone away when they were both on the same network cable that I unplugged? My JGroups timeout is set to 12 seconds max (counting retries). The two JGroups viewChange notifications are sometimes more than 60 seconds apart.

Thanks for the help!
Jim

1. Re: Replication Problem When Nodes Have Gone Away

ben.wang Sep 23, 2006 2:49 AM (in response to jbirkenmaier)

1. You can use the REPL_ASYNC CacheMode to have the local cache updated regardless of whether the remote nodes succeed or not.

2. You should tweak the JGroups FD settings to get the failure detection behavior you want. See http://wiki.jboss.org/wiki/Wiki.jsp?page=FDVersusFD_SOCK
2. Re: Replication Problem When Nodes Have Gone Away

jbirkenmaier Sep 25, 2006 10:04 AM (in response to jbirkenmaier)

Is there any way to solve #1 by using REPL_SYNC?
3. Re: Replication Problem When Nodes Have Gone Away

jbirkenmaier Sep 25, 2006 3:54 PM (in response to jbirkenmaier)

Hi. Exactly which JGroups FD settings should be tweaked?
4. Re: Replication Problem When Nodes Have Gone Away

brian.stansberry Sep 25, 2006 5:09 PM (in response to jbirkenmaier)

Suggest you have a look at http://wiki.jboss.org/wiki/Wiki.jsp?page=FDVersusFD_SOCK, particularly the last bit.

I suspect the reason it's taking a long time to get the 2nd notification is that your code that handles the first notification is blocking the JGroups thread that sends up view change notifications. If you're not already doing this: when you get a view change notification, spawn a thread and do your puts into the cache in that thread.
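
A minimal sketch of that idea, using only the JDK. The class and method names here (AsyncViewChangeHandler, viewChange, updateCache) are hypothetical stand-ins, not the JGroups or JBoss Cache listener API, and the cache puts are replaced by a list append. The point is only that the callback hands the work to another thread and returns right away.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a membership listener; not the real listener interface.
class AsyncViewChangeHandler {
    // A single worker thread keeps view changes in the order they arrived.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final List<String> handled = new CopyOnWriteArrayList<>();

    // Called on the notification thread, so it must return quickly.
    void viewChange(List<String> newMembers) {
        List<String> snapshot = new ArrayList<>(newMembers); // copy before handing off
        worker.submit(() -> updateCache(snapshot));
    }

    // Runs on the worker thread; the real cache puts would go here.
    private void updateCache(List<String> members) {
        handled.add("view=" + members);
    }

    List<String> handledViews() {
        return handled;
    }

    // Stops accepting work and waits for queued tasks; false on timeout or interrupt.
    boolean shutdownAndWait(long timeoutMs) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        AsyncViewChangeHandler handler = new AsyncViewChangeHandler();
        List<String> view = new ArrayList<>();
        view.add("node2");
        view.add("node3");
        handler.viewChange(view);
        handler.shutdownAndWait(2000);
        System.out.println(handler.handledViews());
    }
}
```

A single-threaded executor is used rather than a new thread per notification so that two view changes arriving close together are still processed in order.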

Also, beginning in 1.3.0, JBC introduced a new class org.jboss.cache.config.Option, which can be passed to overloaded versions of the main api calls (e.g. get, put, remove). You can create an Option with property "cacheModeLocal" set to "true"; if you pass that to a put or remove, the call will not replicate. See the Option class javadocs for more.
5. Re: Replication Problem When Nodes Have Gone Away

jbirkenmaier Sep 26, 2006 9:46 AM (in response to jbirkenmaier)

How would all of this apply while I am using a PojoCacheMBean?
6. Re: Replication Problem When Nodes Have Gone Away

brian.stansberry Sep 26, 2006 10:29 AM (in response to jbirkenmaier)

The FD/FD_SOCK stuff would be the same, as would be the concept of handling the view change notification asynchronously.

But, I think your question was more directed toward the Option API, and it looks like PojoCacheMBean doesn't expose overloaded getObject(), putObject(), removeObject() methods that take an Option. So that doesn't help you :(.
7. Re: Replication Problem When Nodes Have Gone Away

jbirkenmaier Sep 26, 2006 10:54 AM (in response to jbirkenmaier)

I already use a separate thread to handle the viewChange event. The event processing time takes less than 1 millisecond so that isn't the problem. The documentation states that in membership {A,B,C}, A connects to B, B connects to C, and C connects back to A. If A goes away, C will know it rather quickly but what about B? It connects to C which is still there. How does B learn about A's disappearance? By the way, I am using the TCP KEEP_ALIVE with FD_SOCK and FD timeout.
8. Re: Replication Problem When Nodes Have Gone Away

belaban Sep 27, 2006 4:14 AM (in response to jbirkenmaier)

"jbirkenmaier" wrote:
The documentation states that in membership {A,B,C}, A connects to B, B connects to C, and C connects back to A. If A goes away, C will know it rather quickly but what about B?

- C sends a SUSPECT(A) message to the group
- B notes that it is the new coordinator
- B becomes coordinator and sends a VIEW{B,C} to the group
- Therefore C gets the view as well and can act on it
This should take only a few milliseconds from the SUSPECT to the VIEW events
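
The sequence above can be modelled in a few lines of plain Java. This is an illustrative sketch only: the class and method names are hypothetical and none of it is JGroups code. It assumes that the member which notices a crash is the one whose connection pointed at the crashed member, and that the first member of a view acts as coordinator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only; names are hypothetical, not JGroups API.
class SuspectToViewSketch {

    // In the ring from the thread (A connects to B, B to C, C back to A),
    // the member connected to X is X's predecessor, so it notices X going away.
    static String detectorOf(List<String> view, String crashed) {
        int i = view.indexOf(crashed);
        if (i < 0) {
            throw new IllegalArgumentException("not a member: " + crashed);
        }
        return view.get((i - 1 + view.size()) % view.size());
    }

    // The new view is the old view minus the suspected member, order preserved.
    static List<String> viewWithout(List<String> view, String suspected) {
        List<String> next = new ArrayList<>(view);
        next.remove(suspected);
        return next;
    }

    // Assumption: the first member of a view acts as coordinator.
    static String coordinatorOf(List<String> view) {
        return view.get(0);
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("A", "B", "C");
        String detector = detectorOf(view, "A");       // C sends SUSPECT(A)
        List<String> next = viewWithout(view, "A");    // VIEW{B,C}
        String coordinator = coordinatorOf(next);      // B installs the new view
        System.out.println(detector + " suspects A; " + coordinator + " sends view " + next);
    }
}
```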
