2 Replies Latest reply on Mar 9, 2011 1:45 PM by shane_dev

Lazy / On Demand State Transfer

shane_dev Mar 3, 2011 2:00 PM

This came up during a discussion on eviction and the cost of rehashing. The idea is that, instead of performing a state transfer when a node joins, we simply fetch entries on demand. This is very similar to the suggestion made earlier about eviction and the cluster cache loader.

It seems that we can simply disable rehash, and rely on a mechanism such as the cluster cache loader to fetch 'should be owned' entries from the 'original owning' server when necessary.

For example,

Node 1
- Primary Owner Of: A, B
- Secondary Owner Of: E, F
Node 2
- Primary Owner Of: C, D
- Secondary Owner Of: A, B
Node 4
- Primary Owner Of: E, F
- Secondary Owner Of: C, D

Now we introduce Node 3. If rehashing is disabled, it does not have, say, E from Node 4, which it now owns. Nor does it have C and D from Node 2, for which it is now a secondary owner.

If a GET is performed on Node 3 for E, it should fetch it from the original owner, Node 4, and store it locally.

Or, put another way, fetch it from the node that is now the secondary owner: Node 4.

If a GET is performed on Node 3 for C or D, it should fetch it from the current owner, Node 2, and store it locally.
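
To make the GET path concrete, here is a minimal sketch in Python. It is a toy model, not Infinispan code: the `Node` class, the `lazy_get` function and the `previous_owners` argument are all made up for illustration, and plain dicts stand in for the cache stores.

```python
class Node:
    """A toy cache node: a name plus a local key/value store (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.store = {}


def lazy_get(node, key, previous_owners):
    """Return the value for key, pulling it from a previous owner on a local miss.

    previous_owners: nodes that held the key before the join, in lookup order.
    """
    if key in node.store:
        return node.store[key]
    for owner in previous_owners:
        if owner is not node and key in owner.store:
            node.store[key] = owner.store[key]  # fetch, then store locally
            return node.store[key]
    return None  # nobody has it


# Topology from the example, before Node 3 joins.
node1, node2, node3, node4 = (Node(n) for n in ("Node 1", "Node 2", "Node 3", "Node 4"))
node1.store.update({"A": "a", "B": "b", "E": "e", "F": "f"})
node2.store.update({"C": "c", "D": "d", "A": "a", "B": "b"})
node4.store.update({"E": "e", "F": "f", "C": "c", "D": "d"})

# Node 3 joins empty; a GET for E is served from Node 4 and kept locally.
value = lazy_get(node3, "E", previous_owners=[node4, node1])
```

After the call, Node 3 holds E and later GETs for it are served locally. How a real node would learn who the previous owners were is left open here.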

The only problem I see at the moment is that the original secondary owners still have their entries even though they should have been invalidated. In this case, Node 1 would still have E. This would need to be handled.
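
One way to picture the cleanup is a pass that drops a key from every node that is no longer one of its owners. The sketch below is a toy model in Python with made-up names (`Node`, `invalidate_stale`); it assumes the new owner list is already known and that Node 3 has already fetched E.

```python
class Node:
    """A toy cache node: a name plus a local key/value store (hypothetical)."""
    def __init__(self, name, entries=None):
        self.name = name
        self.store = dict(entries or {})


def invalidate_stale(key, nodes, new_owners):
    """Remove key from every node that is not a current owner.

    Returns the names of the nodes that were cleaned.
    """
    owner_names = {n.name for n in new_owners}
    cleaned = []
    for node in nodes:
        if node.name not in owner_names and key in node.store:
            del node.store[key]
            cleaned.append(node.name)
    return cleaned


node1 = Node("Node 1", {"A": "a", "B": "b", "E": "e", "F": "f"})
node3 = Node("Node 3", {"E": "e"})  # assumption: E was already fetched on demand
node4 = Node("Node 4", {"E": "e", "F": "f", "C": "c", "D": "d"})

# After the join, E is owned by Node 3 and Node 4, so Node 1's copy is stale.
cleaned = invalidate_stale("E", [node1, node3, node4], new_owners=[node3, node4])
```

Only Node 1 loses its copy of E; its other entries are untouched.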

I'm inclined to think that we may be able to solve two problems here. One is the performance hit of state transfers; the other is the inconsistency caused by not propagating evictions. Or, perhaps we can now make state transfer async, since this 'on demand' approach will work until the transfer is complete.

Thoughts?

1. Lazy / On Demand State Transfer

galder.zamarreno Mar 9, 2011 9:00 AM (in response to shane_dev)

Well, the problem that I see with your solution is failover. So, imagine a node goes down and you do no rebalancing. If numOwners were two, keys that belonged to that node would only be present on one node, and if that one fails...
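
The failover concern can be illustrated with a small Python sketch that counts surviving copies per key when nodes fail and nothing is rebalanced. The `surviving_copies` function is made up for illustration, and the ownership table is the three-node layout from the original post (numOwners = 2).

```python
def surviving_copies(owners_by_key, failed_nodes):
    """Map each key to the number of its owners that are still alive."""
    failed = set(failed_nodes)
    return {
        key: sum(1 for owner in owners if owner not in failed)
        for key, owners in owners_by_key.items()
    }


# Primary owner first, secondary owner second.
owners_by_key = {
    "A": ["Node 1", "Node 2"], "B": ["Node 1", "Node 2"],
    "C": ["Node 2", "Node 4"], "D": ["Node 2", "Node 4"],
    "E": ["Node 4", "Node 1"], "F": ["Node 4", "Node 1"],
}

# Node 4 fails and no rebalancing happens: C, D, E and F drop to one copy.
after_one = surviving_copies(owners_by_key, ["Node 4"])

# Node 1 then fails as well: E and F have no copies left.
after_two = surviving_copies(owners_by_key, ["Node 4", "Node 1"])
lost = sorted(key for key, copies in after_two.items() if copies == 0)
```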
2. Lazy / On Demand State Transfer

shane_dev Mar 9, 2011 1:45 PM (in response to galder.zamarreno)

I would agree. Unfortunately, this is a difficult problem to solve. I've now worked on a few projects that all suffered from the same issue and it is so severe that going to production becomes questionable.

The cache was so large that joins continually failed. The joining cache would stay in the INSTANTIATED state until it timed out and moved to FAILED. I've mentioned this in another thread. Basically, we're trying to find something between no rehashing and a full rehash: something that doesn't kill our cluster because state transfer takes 10 minutes, yet gives us the opportunity to eventually end up fully balanced.
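
As a rough picture of that middle ground, the Python sketch below copies entries in small batches in the background while reads fall back to the old owner for anything not yet transferred. It is a toy model only: plain dicts stand in for the nodes, and `transfer_in_batches`, `get` and `batch_size` are names made up for illustration.

```python
def transfer_in_batches(source, target, keys, batch_size):
    """Copy keys from source to target one batch at a time.

    Yields the number of keys processed so far after each batch, so the
    caller can interleave other work between batches.
    """
    keys = list(keys)
    processed = 0
    for start in range(0, len(keys), batch_size):
        for key in keys[start:start + batch_size]:
            if key in source:
                # Keep any entry that was already fetched on demand.
                target.setdefault(key, source[key])
            processed += 1
        yield processed


def get(key, local, fallback):
    """Read from local, pulling the entry from fallback on a miss."""
    if key not in local and key in fallback:
        local[key] = fallback[key]
    return local.get(key)


source = {f"k{i}": i for i in range(10)}   # old owner
target = {}                                # joining node, starts empty

transfer = transfer_in_batches(source, target, sorted(source), batch_size=4)
progress = [next(transfer)]                # first batch only: k0..k3
value = get("k9", target, source)          # not transferred yet, fetched on demand
progress.extend(transfer)                  # remaining batches run later
```

The joining node can serve reads from the first batch onward, and it ends up with the full set once the generator is exhausted.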