Thinking at the challenge of giving one unified view of 2 physical entities, I was particularly interested in the brain split issue, and how much of a risk it is for the consistency of the cache. We need to consider 3 states: nodes form a cluster, nodes are running separated, and nodes re-form the original cluster.
When nodes are running separated it is not easy to figure out if the other node is down or not reachable. I guess you could try pinging other servers in the vicinity of the lost one, but still that would not be a proof. in terms of consistency the only things you could do are:
* consider that your partner is dead and provide as much availability as possible. if your assumption is wrong and that the other node is actually running, your risks are 1) your business logic processes with obsolete values already updated on the other side and 2) you allow updates that make values obsoletes on the other side, and you put yourself in a situation of potential conflicting updates
* you could disallow updates, then your only risk would be on using obsolete values
* you last option might be to just ignore the cache, providing you can get the value elsewhere (from a DB for instance). but then this is risky because if your partner is really dead, you might have to handle a lot more load and that exactly the situation where you would want to take advantage of a cache.
Reconnecting seperated nodes involve a merge algorithm. as far as merge strategies are concerned, I have come to think that there are 2 categories of applications that use caching in terms of requirements for consistency.
1) the first one is best effort: low probability of brain split, low frequency of updates, high read throughput requirement, non critical for business
2) the second one is consistency first: still low probability of brain split, medium frequency of updates, critical need for consistency
for the best effort case, the 'pick one node' strategy (like coherence does) is probably good enough. people take their chance. it works most of the time like a charm, and when it does not, it is not critical.
being myself in the financial industry, I place myself in the second category. we would rather not cache anything and slow down our processes, rather than calculate with obsolete values. it is better for us to make a batch fail, than to export wrong data to a DB, and find out later through reconciliation that the work is worthless. the cache has to be as much the truth as the database (for the cache aside pattern).
I think that the safest for me would be to wipe out clean all the caches. we usually have a DB or a service behind the cache, so we would ramp up slowly and gradually (if necessary we can also launch an asynch. re-warm process). of course this could be optimized if none of the caches had been updated (values read were never been obsoletes), or if all updates had been made on a single node (some values read were obsoletes, but at least there can't be any conflict with 2 concurrent updates).
I suspect that many people would react differently to caching products if they realized that in a 2 nodes cluster that gets separated, with default algorithms (like coherence of infinispan do), assuming that writes continue during the separation, 50% of the updates get lost on reconnection. of course real brain splits do not happen everyday, and usually do not last long enough for writes to occur, but damages can be important and last time: if the wrong side of the cluster is chosen, only a new put would get us out of an obsolete data. If I am using the cache aside pattern, I could be with one value in DB and one value in cache for hours if not days.
For those reasons, I could see 2 default strategies :
- something arbitrary like in coherence: pick the side of the cluster with the most members, and if equal pick the side that has the oldest member. the intent is not to have the truth but to get back into business as quickly as possible.
- wipe out (with some special case optimizations): if you cannot trust the data (obviously assuming you can do without), get rid of it. this could be coupled with an asynchronous pre-warm processus to get the performances back as quickly as possible.
What do you think?