8 Replies Latest reply on Sep 27, 2006 4:14 AM by belaban

    Replication Problem When Nodes Have Gone Away

    jbirkenmaier

      Hi. Here's the problem in a nutshell. I have a 3-node cluster with a shared tree cache. Nodes 1 and 2 go away at around the same time (via an unplugged network cable). Node 3 is notified within 10-12 seconds that Node 1 is gone and makes a few changes to the cache (within a transaction). The cache tries to replicate to Node 2 (not knowing it has also gone away) and fails with a ReplicationException. Node 3 thinks its local cache has been updated, but it hasn't, because of the replication failure. Node 3 is notified that Node 2 has gone away only after ~50 seconds and updates its cache again; this time the update works because there is no one left to replicate to.

      There are two things I need help with:
      1. I need my local cache to be updated even when replication fails.
      2. Why does it take so long to be notified that the second node has gone away, when both nodes were on the same network cable that I unplugged? My JGroups timeout is set to 12 seconds max (counting retries), yet the two JGroups viewChange notifications are sometimes more than 60 seconds apart.
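
      To make the 12-second figure concrete: with the JGroups FD protocol, the total failure-detection time is roughly timeout x max_tries. The fragment below is only a sketch of a setting that would give 12 seconds; the values are assumed for illustration and are not copied from my actual configuration.

      ```xml
      <!-- Illustrative values only (assumed): 4000 ms x 3 tries = 12 s before a member is suspected -->
      <FD timeout="4000" max_tries="3"/>
      ```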

      Thanks for the help!
      Jim