Just to add to what Shane has indicated.
I think whether the cache entries are migrated to this new (failed) node really depends upon the point of failure (timeout). If the failure occurs during invalidation request processing, then the node has already applied the state. The node performing the invalidation appears to keep going and finishes.
As a result, we may end up in a state where a failed cache node holds valid entries that are not accessible to the rest of the cluster.
Good call Kapil.
This leads me to believe that perhaps the join process should be as follows.
- Send State Request
- Send Update Topology Request
- Send Invalidation Request
If the state request (rehashing) fails, then the topology should not be updated and the cluster should continue to function properly.
If the topology request fails, the cluster should continue to function properly.
If the invalidation request fails, there may be issues. I'm wondering whether an invalidation request is actually necessary. Could it instead be part of the update topology request, such that rather than sending a list of entries to invalidate, each node determines which entries to invalidate based on the topology change?
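To make the idea concrete, here is a minimal sketch of each node deriving its own invalidation set from the topology change alone, with no explicit invalidation request. All names are illustrative, and the `hashCode`-mod-cluster-size ownership function is a toy stand-in for whatever consistent hash the cluster really uses; the derivation idea is the same either way.

```java
import java.util.*;

// Hypothetical sketch: on a topology update, each node compares entry
// ownership before and after the change and invalidates locally, instead
// of processing an explicit invalidation request from the joiner.
public class LocalInvalidation {

    // Toy ownership function: the entry belongs to the node at index
    // hash(key) mod cluster size. A real cluster would use a consistent
    // hash, but the local-derivation idea is unchanged.
    static int ownerOf(String key, List<String> topology) {
        return Math.floorMod(key.hashCode(), topology.size());
    }

    // Entries this node owned under the old topology but no longer owns
    // under the new one are exactly the entries it should invalidate.
    static Set<String> entriesToInvalidate(String self,
                                           Set<String> localKeys,
                                           List<String> oldTopology,
                                           List<String> newTopology) {
        Set<String> stale = new HashSet<>();
        for (String key : localKeys) {
            boolean ownedBefore =
                oldTopology.get(ownerOf(key, oldTopology)).equals(self);
            boolean ownedAfter =
                newTopology.get(ownerOf(key, newTopology)).equals(self);
            if (ownedBefore && !ownedAfter) {
                stale.add(key);
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        List<String> oldTopology = List.of("nodeA", "nodeB");
        List<String> newTopology = List.of("nodeA", "nodeB", "nodeC");
        Set<String> localKeys = Set.of("k1", "k2", "k3", "k4");
        // Only keys whose ownership moved away from nodeA are reported.
        for (String key : entriesToInvalidate("nodeA", localKeys,
                                              oldTopology, newTopology)) {
            System.out.println("invalidate " + key);
        }
    }
}
```

With this approach there is no separate invalidation RPC to time out: the topology update message itself carries everything each node needs.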
That will be helpful. Do you have any thoughts regarding the join process as a whole? We actually have two issues with the join process: invalidation and state transfer. Either one can time out and fail. Deriving invalidations from the topology change would certainly eliminate the timeout problem with the invalidation. However, both issues are a result of updating the topology before completing the rehash. If either step fails, the node enters a FAILED state, yet the join still completes 'successfully' and the node remains in the topology.
Perhaps the simplest thing to do would be to remove the node from the topology if either of these two tasks fails?
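The proposed ordering plus that rollback could be sketched as follows. This is only an illustration of the control flow under the assumption that each step is an RPC that signals a timeout by throwing; none of these method names are the real API.

```java
import java.util.*;

// Hypothetical sketch of the proposed join ordering: update the topology
// only after state transfer succeeds, and roll the joiner back out of the
// topology if a later step fails. Names are illustrative, not the real API.
public class JoinProcess {

    final List<String> topology = new ArrayList<>(List.of("nodeA", "nodeB"));

    // Stand-ins for the real RPCs; either may time out (throw).
    void sendStateRequest(String joiner) { /* rehash state to the joiner */ }
    void sendUpdateTopology(String joiner) { topology.add(joiner); }
    void invalidateStaleEntries(String joiner) { /* per-node invalidation */ }

    boolean join(String joiner) {
        try {
            sendStateRequest(joiner);       // 1. rehash first
        } catch (RuntimeException timeout) {
            return false;                   // topology untouched; cluster unaffected
        }
        sendUpdateTopology(joiner);         // 2. then publish the new topology
        try {
            invalidateStaleEntries(joiner); // 3. finally invalidate
        } catch (RuntimeException timeout) {
            topology.remove(joiner);        // roll back: drop the joiner
            return false;
        }
        return true;
    }
}
```

The key property is that a failure at any step leaves the published topology containing only nodes that completed the join, so the rest of the cluster keeps functioning.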
In addition to this -- what should the application layer do if the node enters a FAILED state? I didn't see any listeners that would allow the cache (or cache manager) to restart.
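For discussion's sake, the kind of hook the application layer would need might look like the following. To be clear, this is entirely hypothetical: no such listener exists today (that is the point of the question), and every name here is made up.

```java
// Hypothetical only: a callback the application layer could register to
// react to a cache entering a FAILED state, e.g. by restarting the cache.
public class FailedStateDemo {

    interface CacheStateListener {
        void onFailed(String cacheName);
    }

    static class Cache {
        private CacheStateListener listener;
        private String state = "RUNNING";
        final String name;

        Cache(String name) { this.name = name; }

        void setListener(CacheStateListener l) { this.listener = l; }

        // Entering FAILED notifies the application layer, which can
        // then decide whether to restart the cache.
        void enterFailedState() {
            state = "FAILED";
            if (listener != null) listener.onFailed(name);
        }

        void restart() { state = "RUNNING"; }

        String state() { return state; }
    }

    public static void main(String[] args) {
        Cache cache = new Cache("userCache");
        cache.setListener(name -> cache.restart()); // restart on failure
        cache.enterFailedState();
        System.out.println(cache.state()); // prints RUNNING
    }
}
```

Something along these lines would at least give the application a chance to restart rather than silently sitting on a FAILED node.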