1 Reply Latest reply on Apr 22, 2010 7:21 AM by manik

Merge strategies

vsevel Apr 15, 2010 12:28 PM

Thinking at the challenge of giving one unified view of 2 physical entities, I was particularly interested in the brain split issue, and how much of a risk it is for the consistency of the cache. We need to consider 3 states: nodes form a cluster, nodes are running separated, and nodes re-form the original cluster.

When nodes are running separated it is not easy to figure out if the other node is down or not reachable. I guess you could try pinging other servers in the vicinity of the lost one, but still that would not be a proof. in terms of consistency the only things you could do are:

* consider that your partner is dead and provide as much availability as possible. if your assumption is wrong and that the other node is actually running, your risks are 1) your business logic processes with obsolete values already updated on the other side and 2) you allow updates that make values obsoletes on the other side, and you put yourself in a situation of potential conflicting updates

* you could disallow updates, then your only risk would be on using obsolete values

* you last option might be to just ignore the cache, providing you can get the value elsewhere (from a DB for instance). but then this is risky because if your partner is really dead, you might have to handle a lot more load and that exactly the situation where you would want to take advantage of a cache.

Reconnecting seperated nodes involve a merge algorithm. as far as merge strategies are concerned, I have come to think that there are 2 categories of applications that use caching in terms of requirements for consistency.

1) the first one is best effort: low probability of brain split, low frequency of updates, high read throughput requirement, non critical for business

2) the second one is consistency first: still low probability of brain split, medium frequency of updates, critical need for consistency

for the best effort case, the 'pick one node' strategy (like coherence does) is probably good enough. people take their chance. it works most of the time like a charm, and when it does not, it is not critical.

being myself in the financial industry, I place myself in the second category. we would rather not cache anything and slow down our processes, rather than calculate with obsolete values. it is better for us to make a batch fail, than to export wrong data to a DB, and find out later through reconciliation that the work is worthless. the cache has to be as much the truth as the database (for the cache aside pattern).

I think that the safest for me would be to wipe out clean all the caches. we usually have a DB or a service behind the cache, so we would ramp up slowly and gradually (if necessary we can also launch an asynch. re-warm process). of course this could be optimized if none of the caches had been updated (values read were never been obsoletes), or if all updates had been made on a single node (some values read were obsoletes, but at least there can't be any conflict with 2 concurrent updates).

I suspect that many people would react differently to caching products if they realized that in a 2 nodes cluster that gets separated, with default algorithms (like coherence of infinispan do), assuming that writes continue during the separation, 50% of the updates get lost on reconnection. of course real brain splits do not happen everyday, and usually do not last long enough for writes to occur, but damages can be important and last time: if the wrong side of the cluster is chosen, only a new put would get us out of an obsolete data. If I am using the cache aside pattern, I could be with one value in DB and one value in cache for hours if not days.

For those reasons, I could see 2 default strategies :

- something arbitrary like in coherence: pick the side of the cluster with the most members, and if equal pick the side that has the oldest member. the intent is not to have the truth but to get back into business as quickly as possible.

- wipe out (with some special case optimizations): if you cannot trust the data (obviously assuming you can do without), get rid of it. this could be coupled with an asynchronous pre-warm processus to get the performances back as quickly as possible.

What do you think?

Vincent

1. Re: Merge strategies

manik Apr 22, 2010 7:21 AM (in response to vsevel)
Vincent Sevel wrote:

Thinking at the challenge of giving one unified view of 2 physical entities, I was particularly interested in the brain split issue, and how much of a risk it is for the consistency of the cache. We need to consider 3 states: nodes form a cluster, nodes are running separated, and nodes re-form the original cluster.

When nodes are running separated it is not easy to figure out if the other node is down or not reachable. I guess you could try pinging other servers in the vicinity of the lost one, but still that would not be a proof.

I believe JGroups' MERGE protocol does good enough a job at detecting a split brain in good time? Or do you not feel so?

in terms of consistency the only things you could do are:
* consider that your partner is dead and provide as much availability as possible. if your assumption is wrong and that the other node is actually running, your risks are 1) your business logic processes with obsolete values already updated on the other side and 2) you allow updates that make values obsoletes on the other side, and you put yourself in a situation of potential conflicting updates
* you could disallow updates, then your only risk would be on using obsolete values
* you last option might be to just ignore the cache, providing you can get the value elsewhere (from a DB for instance). but then this is risky because if your partner is really dead, you might have to handle a lot more load and that exactly the situation where you would want to take advantage of a cache.

Reconnecting seperated nodes involve a merge algorithm. as far as merge strategies are concerned, I have come to think that there are 2 categories of applications that use caching in terms of requirements for consistency.

1) the first one is best effort: low probability of brain split, low frequency of updates, high read throughput requirement, non critical for business
2) the second one is consistency first: still low probability of brain split, medium frequency of updates, critical need for consistency

for the best effort case, the 'pick one node' strategy (like coherence does) is probably good enough. people take their chance. it works most of the time like a charm, and when it does not, it is not critical.

Right, so from an Infinispan perspective, any merge handling would need to be configurable, to specify:
What to do when a split brain is detected. Either:
Shut down
Read-only
Business-as-usual (and deal with inconsistencies later)
What to do when a split-brain heals. Note that this only applies to "Business-as-usual" detection policy since the other 2 will not need to deal with merging state. Pluggable handlers, and we'd ship with the following handlers:
WipeSmallerPartitionHandler (as you say, similar to coherence)
WipeAllCaches

And of course provide for other forms of merge handlers in the future - using timestamps and shared clocks, multiple versions, etc. But for a start, I think these are quick-wins and as you suggest, good enough for most cases.

Cheers
Manik
Actions