8 Replies Latest reply on Jul 6, 2012 11:37 AM by dex80526

how to retrieve a given number of cache entries in batch

dex80526 Jun 6, 2012 4:41 PM

I stored data in an Infinispan cache store and need a way to retrieve a set of entries in batch. In other words, I am looking for functionality like paging, or a select with given filters/criteria.

For example, List<CacheData> list = cache.get(first_100);

or List<CacheData> list = cache.get(older than 1 day);

I know this might be invalid, or behave differently, for different configurations (DIST, REPL).

Any ideas or suggestions?

1. Re: how to retrieve a given number of cache entries in batch

galder.zamarreno Jun 12, 2012 10:22 AM (in response to dex80526)

This is not currently possible AFAIK
2. Re: how to retrieve a given number of cache entries in batch

galder.zamarreno Jun 12, 2012 10:23 AM (in response to galder.zamarreno)

Well, not possible unless you enable the query module and index the contents, and then use Lucene queries to get what you want.

Or, you use map/reduce or distributed executors to retrieve all data in memory and apply the filter that you want.
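
As a rough illustration of the "apply the filter that you want" part, here is a minimal sketch that scans in-memory entries and keeps the ones older than a cutoff. It uses a plain java.util.Map as a stand-in for the cache (Infinispan's Cache implements ConcurrentMap, so the loop shape carries over). CacheData and its createdAtMillis field are assumptions made for this example; the thread never shows what CacheData contains.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical value type: the fields below are assumed for illustration.
final class CacheData {
    final String payload;
    final long createdAtMillis; // assumed field, needed for an age filter

    CacheData(String payload, long createdAtMillis) {
        this.payload = payload;
        this.createdAtMillis = createdAtMillis;
    }
}

final class AgeFilter {
    // Scans every entry visible in 'entries' and keeps those created
    // strictly before the cutoff. This is a full scan, not an indexed lookup.
    static List<CacheData> olderThan(Map<String, CacheData> entries, long cutoffMillis) {
        List<CacheData> result = new ArrayList<CacheData>();
        for (CacheData value : entries.values()) {
            if (value.createdAtMillis < cutoffMillis) {
                result.add(value);
            }
        }
        return result;
    }
}
```

For "older than 1 day" the cutoff would be something like System.currentTimeMillis() minus 24 hours. The cost grows with the number of entries scanned, which is why the indexed query route may be preferable for very large caches.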
3. Re: how to retrieve a given number of cache entries in batch

galder.zamarreno Jun 12, 2012 10:24 AM (in response to galder.zamarreno)

Also, if you preload the data, the in-memory data will contain everything that's in the cache store.
4. Re: how to retrieve a given number of cache entries in batch

dex80526 Jun 12, 2012 11:12 AM (in response to galder.zamarreno)

This is a big problem for us, since a cache could have millions of entries, and we cannot use passivation because we want the cache store to always hold the full data set for persistence purposes. In this setting, we have no good way to serve API calls such as "get the first 100 entries". I posted another question on a related issue: keeping all data in the cache store but loading only part of it into memory.

I am running into other scalability (performance) issues, which I'll post soon. After a few months of using Infinispan, it seems to work nicely for small data sets. Now we see many issues when we try to scale up in terms of data and cluster size.
Thanks.
5. Re: how to retrieve a given number of cache entries in batch

galder.zamarreno Jun 18, 2012 6:06 AM (in response to dex80526)

If you're using a replicated cache, entrySet() should get you all entries and you should be able to loop and retrieve first 100, but I guess with such a big cache you'd be using DIST.
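
A minimal sketch of that loop, written against java.util.Map so it runs without a cluster (FirstEntries and firstValues are names made up for this example). Note that "first" only means "first in iteration order", which for a hash-based map is arbitrary:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class FirstEntries {
    // Returns at most 'limit' values in whatever order the map iterates.
    // Values are copied into a new list so the caller does not hold on
    // to the live entry-set view.
    static <K, V> List<V> firstValues(Map<K, V> cache, int limit) {
        List<V> batch = new ArrayList<V>();
        Iterator<Map.Entry<K, V>> it = cache.entrySet().iterator();
        while (it.hasNext() && batch.size() < limit) {
            batch.add(it.next().getValue());
        }
        return batch;
    }
}
```

Calling firstValues(cache, 100) would stop after 100 entries instead of walking the whole set.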

An alternative that does not require query/indexing is to use map/reduce: with DIST, you could load the first 100 entries on each node and do what you need to do, by calling entrySet() on each node and looping through the local collection.

The problem with this type of operation is how to deal with concurrent modifications. An entry set, or any of the sets we provide, is a living entity, since we don't make copies of the set when it is requested. You just get an immutable iterator over the contents, so the results might vary with concurrent modifications.
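
The shape of that idea, sketched without any Infinispan classes: each node takes a bounded batch from its local entries, and the caller merges the partial lists. Here each node's local data is simulated by a plain Map, and PerNodeBatch, localFirst and combine are hypothetical names, not part of the map/reduce or distributed executor APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class PerNodeBatch {
    // "Map" step: what each node would run against its own local entries.
    static <K, V> List<V> localFirst(Map<K, V> localEntries, int limit) {
        List<V> out = new ArrayList<V>();
        for (V value : localEntries.values()) {
            if (out.size() >= limit) break;
            out.add(value);
        }
        return out;
    }

    // "Reduce" step: concatenate the per-node lists and trim to the limit.
    static <V> List<V> combine(List<List<V>> partials, int limit) {
        List<V> merged = new ArrayList<V>();
        for (List<V> partial : partials) {
            for (V value : partial) {
                if (merged.size() >= limit) return merged;
                merged.add(value);
            }
        }
        return merged;
    }
}
```

If the same entry is held on more than one node, the merge would also need to de-duplicate by key, which this sketch skips.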
6. Re: how to retrieve a given number of cache entries in batch

dex80526 Jun 18, 2012 2:21 PM (in response to galder.zamarreno)

I have not tried the query/indexing stuff yet. From my reading, it seems geared more towards full-text search (I might be wrong). In that case, it does not fit our needs.

I'll check out map/reduce.

In our case, the clusters are not big (mainly 2, 3, or 5 nodes); most are 2- or 3-node clusters. In a 2-node cluster, replication and DIST are pretty much the same. We use Infinispan for 2 purposes: 1) providing an in-memory cache (local or distributed/replicated) and 2) persisting data through a cache store. We have to scale up to millions of entries.

The current Infinispan services (cache and cache store with a small data set) meet the above use cases well. But we have a challenge scaling up to large data sets and high load (with 2/3-node replication). We compared performance with replication disabled and enabled on 2-node and 3-node clusters. The difference is so huge that we have to provide an option to disable the replication (async) if the environment has high load.

The next challenge for us is to scale up in terms of the number of cache entries. Eviction makes sense for this: we can specify the maxEntries kept in memory while keeping all data on disk. But we have to come up with a mechanism to provide "paging"-style functionality that retrieves entries not yet in memory when our clients need them, such as the API calls mentioned earlier.

To me, it would be really useful if ISPN implemented something like this, for example in the cache store/cache API. I understand the behaviour could differ between cluster modes.
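
One shape such a paging mechanism could take is paging by key order: the client passes the last key it saw and gets the next batch after it. The sketch below is only an illustration of that idea and is not an Infinispan or cache store API; a java.util.NavigableMap stands in for whatever sorted key index the application or store would have to maintain, and KeyPager and nextPage are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

final class KeyPager {
    // Returns up to pageSize entries whose keys sort strictly after lastKey.
    // Pass null as lastKey to get the first page.
    static <K, V> List<Map.Entry<K, V>> nextPage(NavigableMap<K, V> store, K lastKey, int pageSize) {
        Map<K, V> remaining = (lastKey == null) ? store : store.tailMap(lastKey, false);
        List<Map.Entry<K, V>> page = new ArrayList<Map.Entry<K, V>>();
        for (Map.Entry<K, V> entry : remaining.entrySet()) {
            if (page.size() >= pageSize) break;
            page.add(entry);
        }
        return page;
    }
}
```

A client would call nextPage(store, null, 100) for the first batch, then pass the key of the last entry it received to fetch the following one. Because each page starts from a key rather than a numeric offset, entries inserted or removed between calls do not shift the later pages.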
7. Re: how to retrieve a given number of cache entries in batch

galder.zamarreno Jul 5, 2012 7:47 PM (in response to dex80526)

Hmmm, did you do any network/JGroups tuning at all to try to improve speed of replication? Seems like when you go into a cluster things start to slow down.

At the JGroups/network level, there are some enhancements that can be done, see: https://community.jboss.org/docs/DOC-11595

A lot of the suggestions there are explained in greater detail in http://www.informatica.com/downloads/1568_high_perf_messaging_wp/Topics-in-High-Performance-Messaging.htm
8. Re: how to retrieve a given number of cache entries in batch

dex80526 Jul 6, 2012 11:37 AM (in response to galder.zamarreno)

Galder: Thanks for the links. No, I have not tried specific kernel/system level tuning yet.