I'm in the process of building an architecture for a specific use case.
My colleague and I have been spending quite some time trying to find a proper solution, and we've managed to get to our current point.
Allow me to describe our use case:
- We need to process objects at a high rate. The goal is about 50k objects per second in a distributed solution; our hope is about 10k per second on a single server (obviously, we'll be happy with more).
- The objects are roughly 2kb in size, but at peak can reach up to 100kb.
- The objects need to be persistent, as we can't allow data to be lost.
- We need to index the objects by different fields within them. The indexes need not be persistent, since we can always rebuild them from the objects themselves.
- Each object is read at least once (and in most cases exactly once), and is most likely deleted right afterwards. The order in which the objects are read is random.
- For now, there's absolutely no expiration for any of the objects or indexes.
Due to the nature of the use case, we cannot use a relational database to store the objects (both in terms of throughput and scaling out).
So far, we've mostly looked into NoSQL solutions, the primary candidate being Cassandra.
We've had fair success with it, but when it came to implementing our whole process (which is not described in the use case above), performance went downhill. We realized we weren't really utilizing the machine's memory, so we decided to look into a solution involving a distributed cache (or data grid) to hold mostly the indexes, as those are key to our processing of the events. Basically, we need to group the objects according to some indexes; once we have a complete group, we can read its objects in a batch and process them all together.
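To make the index-then-batch idea above concrete, here is a minimal sketch of the secondary index we have in mind. It maps an index value (some field of the object) to the set of object keys sharing that value; a complete group can then be taken out and its objects fetched in one batch. The class and method names are illustrative only, and this is plain Java, not Infinispan API; in practice the map would live in the distributed cache.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of the index-then-batch pattern: a secondary index
// from an index field value to the keys of objects carrying that value.
class GroupIndex {
    // index field value -> keys of objects that share it
    private final ConcurrentMap<String, Set<String>> index = new ConcurrentHashMap<>();

    // Record that the object under objectKey has this index field value.
    public void add(String fieldValue, String objectKey) {
        index.computeIfAbsent(fieldValue, v -> ConcurrentHashMap.newKeySet())
             .add(objectKey);
    }

    // Remove and return the whole group, so its objects can be read as a
    // batch, processed together, and then deleted from the object store.
    public Set<String> takeGroup(String fieldValue) {
        Set<String> keys = index.remove(fieldValue);
        return keys != null ? keys : Collections.emptySet();
    }
}
```

Since reads are batch-oriented and the index is rebuildable, losing it on a crash only costs a re-scan of the persisted objects, which matches our requirements.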
In our research, we came across Infinispan as a possible solution for our use case.
So far, I've asked quite a few questions in the IRC channel and got a lot of help from Manik, Mircea, Vladimir and Patrick, but as Vladimir suggested, I'm bringing this up as a forum discussion for the benefit of the community.
We started out using IS 4.0.0.FINAL, but recently downloaded and installed IS 4.1.0.CR1.
Right now, our first issue is deciding whether to use one of IS's cache stores as the backing store instead of Cassandra. In this setup, we configure the objects' cache to be write-through (i.e., whatever is written to the cache is automatically written to disk as well). To simplify the test, we're currently only writing objects, not reading or deleting them.
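For reference, this is roughly how we configure the write-through file store in the 4.x XML config, via the `<loaders>` element (cache name and location are placeholders, and attribute names should be checked against the 4.1 schema):

```xml
<namedCache name="objects">
   <loaders passivation="false" shared="false" preload="false">
      <loader class="org.infinispan.loaders.file.FileCacheStore"
              fetchPersistentState="false" purgeOnStartup="false">
         <properties>
            <property name="location" value="/var/data/objects-store"/>
         </properties>
         <!-- no <async> element, so writes to the store are synchronous
              (write-through) -->
      </loader>
   </loaders>
</namedCache>
```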
We've tried using JDBM and FileCacheStore, but we seem to get relatively poor results: IIRC, with JDBM we get about 100 objects per second, and with FileCacheStore about 400 objects per second.
When compared with Cassandra, we were able to get about 10k objects per second (for writes alone, no concurrent reads and deletes).
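For what it's worth, the kind of write micro-benchmark behind these numbers looks roughly like the sketch below. This is a hypothetical harness, not our actual test code: a plain ConcurrentHashMap stands in for the cache, and in the real test the `cache.put(...)` call would go against the Infinispan cache (or Cassandra client) instead.

```java
import java.util.concurrent.*;

// Hypothetical write-throughput harness: `threads` workers each put
// `perThread` objects of ~2kb into a map (stand-in for the cache),
// and we report objects per second.
class WriteBench {
    static double run(int threads, int perThread) {
        ConcurrentMap<String, byte[]> cache = new ConcurrentHashMap<>();
        byte[] payload = new byte[2048]; // ~2kb object
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    // in the real test this is a put against the cache store
                    cache.put(id + ":" + i, payload);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (threads * (double) perThread) / seconds; // objects/second
    }
}
```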
We can't use the BDB solution due to licensing issues.
I admit I already got a few suggestions on what to check from the people mentioned above, but since I'm not at work, I can't repeat them here.
I'd appreciate any suggestions on how to proceed. We'd prefer to use IS alone, without Cassandra, but if there's no other choice, we may have to use it anyway (either via a separate process or by implementing a new cache store interface).
Thanks for the help, and have a great weekend.