1 Reply Latest reply on Sep 4, 2014 7:29 PM by sannegrinovero

Infinispan as a document store?

brenuart Sep 2, 2014 12:14 PM

Dear all,

We plan to use Infinispan as our primary storage mechanism for unstructured data (mainly XML documents but also PDF and JPG files). The idea is to leverage Infinispan features like:

- transactional capabilities

- clustering to provide HA access to our document repository

- cross-site replication (for fast disaster recovery)

In addition we may also want to rely on Infinispan's query functionalities to index and search documents.

We need to store about 100,000,000 of those documents, each with an average size of 50 KB. The document store will be queried at a rate of about 3 documents/second, most of the time by "primary key".

Moreover, ModeShape on top of Infinispan also looks interesting, as it would bring "document management features" out of the box.

We are wondering whether Infinispan is a good solution for such a scenario and, if not, why it would "fail".

Any advice or suggestions are welcome. It would be great if someone could share their experience with a similar setup...

Thanks in advance,

/Bertrand

1. Re: Infinispan as a document store?

sannegrinovero Sep 4, 2014 7:29 PM (in response to brenuart)

Hi,
since Infinispan shines when you can keep all of its data in memory, your first step would be to check whether you have enough memory across your servers to fit it all (with some memory to spare as well, of course).
I guess you won't be able to keep it all in memory, so you are probably looking into the CacheStore capabilities of Infinispan to keep the "hot" data in memory while offloading most of it to a different storage engine, probably the disks on each server?
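
As a rough illustration of that memory check, here is a back-of-the-envelope sketch using the figures from the question. It counts raw payload only: per-entry overhead (keys, metadata, JVM object headers) is ignored, 50 KB is taken as 50,000 bytes, and the two copies per entry and four servers are assumptions picked for the example, not recommendations.

```java
// Back-of-the-envelope sizing. Raw payload only; real memory use will be higher.
final class SizingEstimate {

    // Total payload in bytes for the whole data set.
    static long rawBytes(long documents, long avgBytesPerDocument) {
        return Math.multiplyExact(documents, avgBytesPerDocument);
    }

    // Payload each server would hold if every entry is kept `copies` times
    // and the entries are spread evenly over `nodes` servers.
    static long bytesPerNode(long rawBytes, int copies, int nodes) {
        return Math.multiplyExact(rawBytes, (long) copies) / nodes;
    }

    public static void main(String[] args) {
        long raw = rawBytes(100_000_000L, 50_000L); // 100M documents of ~50 KB each
        long perNode = bytesPerNode(raw, 2, 4);     // assumed: 2 copies, 4 servers
        System.out.printf("raw payload: %.1f TB%n", raw / 1e12);
        System.out.printf("per server:  %.1f TB%n", perNode / 1e12);
    }
}
```

With these inputs the raw payload comes to about 5 TB, which is why keeping everything on the heap is unlikely to be practical on a handful of servers.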

Also consider the index size. Since the index is usually significantly smaller than the data, we often aim at having a full replica of the index on each node. That's not strictly a requirement, as you can distribute it too, but the more "diluted" this information is, the slower your queries will be.
That said, it's designed to do millions of queries per second, so since you aim at just 3 per second this might not be a problem. Still, I would advise verifying with a POC how large your index would get for the full dataset. I can't give you an estimate of the index size, as this depends heavily on the indexing options you intend to apply, which in turn depend on which queries you will need to be able to run. If you're just indexing some metadata, fully replicating the index should not be a problem at all.
It might be worth considering that Infinispan 7 will also be able to run queries without the need for any index: as with traditional databases, queries will be slower, but you get a benefit in write performance and scalability. Bear in mind though that an indexless query will simply iterate over all entries.

It's worth remembering that Infinispan was designed as a cache, not as an ACID database: the transactional features are meant to let it participate in JTA transactions with other resources (go/no go for a batch of changes), but strict durability hasn't been a design objective; as a cache, it's expected that you can reload data from another system in case of catastrophic failure.

Considering the very low load you're expecting, and assuming you will keep most data off-heap in a CacheStore, you should be fine with just a couple of servers. Probably the hardest part will be loading all the initial data. In this case I generally suggest not loading it up front at all, but developing a custom CacheStore that loads entries "on demand" from the existing source.
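
To make the "on demand" idea concrete, here is a minimal plain-Java sketch of the read-through pattern. It deliberately does not use Infinispan's actual CacheStore interfaces, whose details vary between versions; `OnDemandStore`, `sourceLoader` and `maxInMemory` are hypothetical names invented for this illustration. A bounded in-memory map holds the "hot" entries, and a miss falls back to the existing source.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustration only: a bounded in-memory map that loads missing entries
// from an existing source on first access (not Infinispan's real SPI).
final class OnDemandStore<K, V> {
    private final int maxInMemory;
    private final Function<K, V> sourceLoader; // e.g. a lookup in the legacy system
    private final Map<K, V> hot;
    private int loads;

    OnDemandStore(int maxInMemory, Function<K, V> sourceLoader) {
        this.maxInMemory = maxInMemory;
        this.sourceLoader = sourceLoader;
        // Access-ordered LinkedHashMap: the least recently used entry is evicted first.
        this.hot = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > OnDemandStore.this.maxInMemory;
            }
        };
    }

    // Returns the in-memory value if present, otherwise loads it from the source.
    synchronized V get(K key) {
        V value = hot.get(key);
        if (value == null) {
            value = sourceLoader.apply(key);
            loads++;
            if (value != null) {
                hot.put(key, value);
            }
        }
        return value;
    }

    synchronized int loadCount() { return loads; }

    synchronized int inMemoryCount() { return hot.size(); }
}
```

For example, `new OnDemandStore<String, byte[]>(10_000, id -> legacy.fetch(id))` (with `legacy.fetch` standing in for whatever call reads a document from the existing system) would only ever pull the documents that are actually requested, so no bulk import is needed before going live.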