3 Replies Latest reply on Mar 14, 2013 6:53 AM by paulpa63

    Infinispan disaster recovery/durability

    cbeer

      Hi all,

       

      I have some basic questions to help get my head around how Infinispan handles persistence in several disaster recovery scenarios. Before setting out to derive the answers empirically, I thought I'd go to the source. We're using ModeShape and storing all of our content in Infinispan. We want to ensure our content is as durable as possible, by using clustering, cache chaining, etc. I think I have a grasp on clustering, so these questions are mainly about using multiple cache loaders.

       

      First, we've configured our loaders with passivation="false" (this ensures content is stored not only in RAM, but also in one (or all) of the loaders?):

         

         <loaders passivation="false" shared="false" preload="false">

      And then we have two FileCacheStore instances, in this case pointing to two different directories:

            <loader class="org.infinispan.loaders.file.FileCacheStore" purgeOnStartup="false">
               <properties>
                  <!-- We have to set the location where we want to store the data. -->
                  <property name="location" value="/path/a"/>
                  <property name="fsyncMode" value="perWrite"/>
               </properties>
            </loader>

            <loader class="org.infinispan.loaders.file.FileCacheStore" purgeOnStartup="false">
               <properties>
                  <!-- We have to set the location where we want to store the data. -->
                  <property name="location" value="/path/b"/>
                  <property name="fsyncMode" value="perWrite"/>
               </properties>
            </loader>
         </loaders>

       

       

      So, in a terrible disaster, we lose the content in /path/b. Is there anything we need to do to repopulate the 2nd cache loader?

       

      Now we lose the content in /path/a.  Is there anything we need to do to repopulate the 1st cache loader?

       

      I'm also interested in the behavior of fetchPersistentState on either of the cache loaders (which, I confess, I barely grasp right now). I believe that in this case it doesn't really matter which cache loader we set fetchPersistentState on, right?

       

       

      Now, instead of two different FileCacheStores, we decide to write our own CacheStore that persists content to, say, Amazon Glacier, where storage is relatively cheap but retrieval will take quite some time. In that case, I assume we'd want to set fetchPersistentState on the FileCacheStore, and transfer to the other store using <async>? Now, when we use <async>, we'd like a way to check if the content was actually successfully stored at some future point in time (because we're paranoid). It sounds like we'll have to reach deep into Infinispan to get a handle to the cache store and check ourselves, right?

       

       

      Finally, let's say we run out of room on our spinning disk, but our Glacier cache store is "infinitely" deep. Does Infinispan provide a mechanism (similar to eviction, I guess) to remove content from a cache store (on the assumption that it's in a different cache store)? Or would we have to roll our own?

 

      Thanks,

      Chris

        • 1. Re: Infinispan disaster recovery/durability
          mircea.markus

          So, in a terrible disaster, we lose the content in /path/b. Is there anything we need to do to repopulate the 2nd cache loader?

          Now we lose the content in /path/a.  Is there anything we need to do to repopulate the 1st cache loader?

          We don't have out-of-the-box tooling for that, but copying the dir from one place to the other while the cache is stopped should work.

          Another thing you can do is iterate over the keys in the local cache (Cache.keySet() returns the keys in the local node) and re-write them to the cache using the org.infinispan.context.Flag.CACHE_MODE_LOCAL flag.
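          The second suggestion can be sketched roughly as follows (a non-authoritative sketch against the Infinispan 5.x API current at the time; the class and method names around the loop are mine):

          ```java
          // Sketch: re-write every locally held entry so that a restored
          // (now empty) cache store is repopulated. CACHE_MODE_LOCAL keeps
          // the writes from being replicated across the cluster.
          import org.infinispan.AdvancedCache;
          import org.infinispan.Cache;
          import org.infinispan.context.Flag;

          public class StoreRepopulation {

              public static <K, V> void repopulateStores(Cache<K, V> cache) {
                  AdvancedCache<K, V> localOnly =
                          cache.getAdvancedCache().withFlags(Flag.CACHE_MODE_LOCAL);
                  // keySet() returns the keys held on the local node
                  for (K key : cache.keySet()) {
                      V value = cache.get(key);
                      if (value != null) {
                          // the put passes back through the configured loaders,
                          // repopulating the one that lost its data
                          localOnly.put(key, value);
                      }
                  }
              }
          }
          ```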

           

          I'm also interested in the behavior of fetchPersistentState on either of the cache loaders (which, I confess, I barely grasp right now.) I believe in this case, it doesn't really matter which cache loader we set fetchPersistentState on, right?

          Yes, and that particular loader is used when the cache starts, to populate memory with whatever the cache store holds. The only constraint is that you can have at most one cache loader with fetchPersistentState set to true.
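          In your two-FileCacheStore configuration that would look something like this (a sketch; note only one loader carries the attribute):

          ```xml
          <loaders passivation="false" shared="false" preload="false">
             <!-- Only this loader may set fetchPersistentState="true" -->
             <loader class="org.infinispan.loaders.file.FileCacheStore"
                     fetchPersistentState="true" purgeOnStartup="false">
                <properties>
                   <property name="location" value="/path/a"/>
                </properties>
             </loader>
             <loader class="org.infinispan.loaders.file.FileCacheStore"
                     fetchPersistentState="false" purgeOnStartup="false">
                <properties>
                   <property name="location" value="/path/b"/>
                </properties>
             </loader>
          </loaders>
          ```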

           

          Now, instead of two different FileCacheStores, we decide to write our own CacheStore that persists content to, say, Amazon Glacier, where storage is relatively cheap but retrieval will take quite some time

          We already have a cloud cache store based on jclouds; you might want to see if that can be used, or extended.

           

          In that case, I assume we'd want to set fetchPersistentState on the FileCacheStore, and transfer to the other store using <async>?

          That sounds sensible.
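          Roughly, that layout could look like the following sketch (the GlacierCacheStore class name is hypothetical, standing in for whatever custom or jclouds-derived store you end up with):

          ```xml
          <loaders passivation="false" shared="false" preload="false">
             <!-- Fast local store; seeds memory at startup -->
             <loader class="org.infinispan.loaders.file.FileCacheStore"
                     fetchPersistentState="true">
                <properties>
                   <property name="location" value="/path/a"/>
                </properties>
             </loader>
             <!-- Slow archival store; written in the background -->
             <loader class="com.example.GlacierCacheStore">
                <async enabled="true"/>
             </loader>
          </loaders>
          ```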

           

          Now, when we use <async>, we'd like a way to check if the content was actually successfully stored at some future point in time (because we're paranoid). It sounds like we'll have to reach deep into Infinispan to get a handle to the cache store and check ourselves, right?

          That, or expose the information you want via JMX in the cloud cache store (or your new implementation).

           

          Finally, let's say we run out of room on our spinning disk, but our Glacier cache store is "infinitely" deep. Does Infinispan provide a mechanism (similar to eviction, I guess) to remove content from a cache store (on the assumption that it's in a different cache store)? Or would we have to roll-our-own?

          You can define expiration for the entries you add, but this cannot be configured on a per-cache/per-loader basis, i.e. the entry will be expired from all cache loaders at the same time. Of course you can ignore the expiration in your custom loader (or one extended from our cloud cache store), which I think would do the trick for you.
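          For reference, expiration is configured once per cache and applies to memory and every store alike, e.g. (a sketch; the cache name is illustrative):

          ```xml
          <namedCache name="content">
             <!-- lifespan is in milliseconds; 2592000000 ms = 30 days -->
             <expiration lifespan="2592000000"/>
          </namedCache>
          ```

          A custom loader that simply skips the purge-expired callback would then retain its copies past that lifespan.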

          • 2. Re: Infinispan disaster recovery/durability
            cbeer

            Thanks Mircea, I've had a chance to play with these ideas and have some more questions:

             

             

            Mircea Markus wrote:

             

            So, in a terrible disaster, we lose the content in /path/b. Is there anything we need to do to repopulate the 2nd cache loader?

            Now we lose the content in /path/a.  Is there anything we need to do to repopulate the 1st cache loader?

            We don't have out-of-the-box tooling for that, but copying the dir from one place to the other while the cache is stopped should work.

            Another thing you can do is iterate over the keys in the local cache (Cache.keySet() returns the keys in the local node) and re-write them to the cache using the org.infinispan.context.Flag.CACHE_MODE_LOCAL flag.

             

             

            The online-restore approach seems to meet our needs just fine. It meshes nicely with a related problem we're also trying to tackle: data fixity of large binary blobs (e.g. how do we know the content in a cache is what we put into it). I've been able to do a little work here and things look promising.

             

            However, what about in a clustered scenario? Is it appropriate to write custom Commands for the task? (I had a brief look at the Map/Reduce functionality, and it seems like it expects to operate on a common Cache<>.) A custom Command also seems like it'll give us more specificity in targeting the Addresses to run on, which may be a benefit.

             

             

            Now, instead of two different FileCacheStores, we decide to write our own CacheStore that persists content to, say, Amazon Glacier, where storage is relatively cheap but retrieval will take quite some time

            We already have a cloud cache store based on jcloudes, you might want to see if that can be used, or extended.

             

             

            Thanks, I'll take a look again. At the least, it might address putting content into Glacier (or elsewhere). When we tackled this same problem without ISPN, one of the challenges we had was persisting some Glacier-specific state (mapping our identifiers to their identifiers). I guess, with ISPN, there's nothing stopping us from spinning up another Cache instance to do that mapping.

             

             

            Now, when we use <async>, we'd like a way to check if the content was actually successfully stored at some future point in time (because we're paranoid). It sounds like we'll have to reach deep into Infinispan to get a handle to the cache store and check ourselves, right?

            That or expose the information you want as JMX in the cloud cache store (or your new implementation).

             

            (Sorry, I haven't done due diligence in asking this related question, but:) in a clustered-distributed scenario, is there an API we should use that, given a NodeKey, tells us which Addresses are supposed to hold that key?

            • 3. Re: Infinispan disaster recovery/durability
              paulpa63

              Hi Chris,

               

              I read this discussion with interest as it relates to some of the requirements for our system.  I was particularly interested in your comment re "data fixity of large binary blobs", something which is also of great concern to us.  (I have recently raised a discussion article, "Data Location", which has unfortunately not received any response.)  I was wondering whether you would be prepared to divulge a few details of the approach you have been investigating / developing to handle this problem.

               

              Paul