8 Replies Latest reply on Jul 6, 2012 11:37 AM by dex80526

    how to retrieve a given number of cache entries in batch

    dex80526

      I stored data in an Infinispan cache store and need a way to retrieve a set of entries in batch. In other words, I am looking for functionality like paging, or a select with given filters/criteria.

       

      For example,  List<CacheData> list = cache.get(first_100);

                          or List<CacheData> list = cache.get(older than 1 day);

       

      I know this might be invalid, or behave differently, for different configurations (DIST, REPL).

       

      Any ideas or suggestions?

        • 1. Re: how to retrieve a given number of cache entries in batch
          galder.zamarreno

          This is not currently possible AFAIK

          • 2. Re: how to retrieve a given number of cache entries in batch
            galder.zamarreno

            Well, not possible unless you enable the query module and index the contents, and then use Lucene queries to get what you want.

             

            Or, you use map/reduce or distributed executors to retrieve all data in memory and apply the filter that you want.
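The "load everything in memory and apply a filter" option can be sketched roughly as follows. This is only an illustration under assumptions: a plain `java.util.Map` stands in for the data already loaded in memory, and the `CacheData` class with its `createdAt` field is hypothetical, since the thread never shows what `CacheData` contains. No Infinispan API is used here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AgeFilterSketch {
    // Hypothetical value type; the real CacheData is not shown in the thread.
    static final class CacheData {
        final String payload;
        final Instant createdAt;

        CacheData(String payload, Instant createdAt) {
            this.payload = payload;
            this.createdAt = createdAt;
        }
    }

    // Returns the values created before (now - maxAge).
    // 'entries' is a plain Map standing in for in-memory cache contents.
    static List<CacheData> olderThan(Map<String, CacheData> entries, Duration maxAge, Instant now) {
        Instant cutoff = now.minus(maxAge);
        List<CacheData> result = new ArrayList<>();
        for (CacheData value : entries.values()) {
            if (value.createdAt.isBefore(cutoff)) {
                result.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        Map<String, CacheData> inMemory = new LinkedHashMap<>();
        inMemory.put("a", new CacheData("two days old", now.minus(Duration.ofDays(2))));
        inMemory.put("b", new CacheData("one hour old", now.minus(Duration.ofHours(1))));

        // Equivalent in spirit to the "older than 1 day" request from the question.
        for (CacheData d : olderThan(inMemory, Duration.ofDays(1), now)) {
            System.out.println(d.payload);
        }
    }
}
```

The cost is a full scan of whatever is in memory, which is why this only works if all entries are actually loaded.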

            • 3. Re: how to retrieve a given number of cache entries in batch
              galder.zamarreno

              Also, if you preload the data, the in-memory data will contain everything that's in the cache store.

              • 4. Re: how to retrieve a given number of cache entries in batch
                dex80526

                This is a big problem for us, since a cache could have millions of entries, and we cannot use passivation because we want the cache store to always hold the full contents of the cache for persistence. In this setting, we do not have a good way to serve API calls such as "get the first 100 entries". I posted another question on a related issue: keeping all data in the cache store but loading only part of it into memory.

                 

                I am running into other scalability (performance) issues, which I'll post about soon. After a few months of using Infinispan, it seems to work nicely for small data sets. Now we see many issues when we try to scale up in terms of data and cluster size.

                Thanks.

                • 5. Re: how to retrieve a given number of cache entries in batch
                  galder.zamarreno

                  If you're using a replicated cache, entrySet() should get you all entries and you should be able to loop and retrieve the first 100, but I guess with such a big cache you'd be using DIST.

                   

                  An alternative that does not require query/indexing is to use map/reduce. With DIST, you could load the first 100 entries on each node and do what you need to do, by calling entrySet() on each node and looping through the local collection.
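The per-node idea can be sketched like this. It is a stand-in, not the Infinispan map/reduce API: each node's local data is modelled as a plain `java.util.Map`, and the class and method names are made up for illustration. The sketch assumes each node iterates its local entries in key order (a `TreeMap` here); without a defined order, "first 100" just means "some 100".

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FirstNPerNodeSketch {
    // "Map" step: take at most n entries from one node's local entry set.
    static <K, V> List<Map.Entry<K, V>> firstLocal(Map<K, V> localData, int n) {
        List<Map.Entry<K, V>> out = new ArrayList<>();
        for (Map.Entry<K, V> e : localData.entrySet()) {
            if (out.size() >= n) {
                break;
            }
            // Copy the entry so the result does not depend on the live map.
            out.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        return out;
    }

    // "Reduce" step: merge the per-node results, order by key, keep the first n.
    // Merging through a TreeMap also collapses a key that shows up on more than one node.
    static <K extends Comparable<K>, V> List<Map.Entry<K, V>> firstOverall(List<Map<K, V>> nodes, int n) {
        TreeMap<K, V> merged = new TreeMap<>();
        for (Map<K, V> node : nodes) {
            for (Map.Entry<K, V> e : firstLocal(node, n)) {
                merged.put(e.getKey(), e.getValue());
            }
        }
        return firstLocal(merged, n);
    }

    public static void main(String[] args) {
        Map<Integer, String> node1 = new TreeMap<>();
        node1.put(1, "a");
        node1.put(3, "c");
        node1.put(5, "e");
        Map<Integer, String> node2 = new TreeMap<>();
        node2.put(2, "b");
        node2.put(3, "c");
        node2.put(4, "d");

        List<Map<Integer, String>> nodes = new ArrayList<>();
        nodes.add(node1);
        nodes.add(node2);
        System.out.println(firstOverall(nodes, 3));
    }
}
```

Each node only ever ships at most n entries, so the amount of data moved stays bounded by n times the number of nodes regardless of cache size.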

                   

                  The problem with this type of operation is how to deal with concurrent modifications. An entry set, or any of the sets we provide, is a living entity, since we don't copy the set when it is requested. You just get an immutable iterator over the contents, so the results might vary with concurrent modifications.
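To illustrate the difference between a live view and a copy, here is a small sketch using `java.util.concurrent.ConcurrentHashMap` as a stand-in for the cache (an assumption; Infinispan's own collections may behave differently in detail). The `snapshot` helper is hypothetical: it copies up to `limit` entries so that later writes no longer affect the result.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SnapshotSketch {
    // Copies up to 'limit' entries into a new list.
    // The copy is stable afterwards, but it is not an atomic point-in-time view:
    // writes that happen while the loop runs may or may not be included.
    static <K, V> List<Map.Entry<K, V>> snapshot(Map<K, V> live, int limit) {
        List<Map.Entry<K, V>> copy = new ArrayList<>();
        for (Map.Entry<K, V> e : live.entrySet()) {
            if (copy.size() >= limit) {
                break;
            }
            copy.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        return copy;
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, String> live = new ConcurrentHashMap<>();
        live.put("k1", "v1");
        live.put("k2", "v2");

        Set<Map.Entry<String, String>> view = live.entrySet(); // live view, no copy made
        List<Map.Entry<String, String>> snap = snapshot(live, 100);

        live.put("k3", "v3"); // a later write

        System.out.println("view size: " + view.size());     // the view sees the new entry
        System.out.println("snapshot size: " + snap.size()); // the copy does not
    }
}
```

Copying trades memory for stability, which matters once the cache holds millions of entries; limiting the copy to the page actually requested keeps that cost small.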

                  • 6. Re: how to retrieve a given number of cache entries in batch
                    dex80526

                    I have not tried the query/indexing stuff yet. From my reading, it seems geared more towards full-text search (I might be wrong). In that case, it does not fit our needs.

                     

                    I'll check out map/reduce.

                     

                    In our case, the clusters are not big (mainly 2, 3 or 5 nodes), and most are 2- or 3-node clusters. In a 2-node cluster, replication and DIST are pretty much the same. We use Infinispan for 2 purposes: 1) providing an in-memory cache (local or distributed/replicated) and 2) persisting data through a cache store. We have to scale up to millions of entries.

                     

                    The current Infinispan services (cache and cache store with a small data set) meet the above use cases well. But we have a challenge scaling up to large data and high load (with 2/3-node replication). We compared performance with replication disabled and enabled on 2-node and 3-node clusters. The difference is so huge that we have to provide an option to disable replication (async) if the environment has high load.

                     

                    The next challenge for us is to scale up in terms of the number of cache entries. Eviction makes sense for this: we can specify the maxEntries kept in memory while keeping all data on disk. But we have to come up with a mechanism that provides "paging"-style functionality to retrieve entries that are not in memory yet when our clients need them, such as the API calls mentioned earlier.
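One possible shape for such a paging mechanism is key-based paging: the client passes the last key it saw and gets the next page of entries after it. The sketch below is an assumption-heavy illustration, not something Infinispan provides: a `ConcurrentSkipListMap` stands in for a key-ordered index over everything in the cache store, and `nextPage` is a hypothetical helper.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

class KeyPagingSketch {
    // Returns up to pageSize entries with keys strictly greater than lastKey,
    // or from the beginning when lastKey is null.
    static <K, V> List<Map.Entry<K, V>> nextPage(NavigableMap<K, V> store, K lastKey, int pageSize) {
        NavigableMap<K, V> rest = (lastKey == null) ? store : store.tailMap(lastKey, false);
        List<Map.Entry<K, V>> page = new ArrayList<>();
        for (Map.Entry<K, V> e : rest.entrySet()) {
            if (page.size() >= pageSize) {
                break;
            }
            page.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        return page;
    }

    public static void main(String[] args) {
        // Stand-in for a key-ordered view of the full data set.
        NavigableMap<Integer, String> store = new ConcurrentSkipListMap<>();
        for (int i = 1; i <= 5; i++) {
            store.put(i, "value-" + i);
        }

        Integer lastKey = null;
        List<Map.Entry<Integer, String>> page;
        // Walk the store two entries at a time until an empty page comes back.
        while (!(page = nextPage(store, lastKey, 2)).isEmpty()) {
            System.out.println(page);
            lastKey = page.get(page.size() - 1).getKey();
        }
    }
}
```

Paging by last key rather than by numeric offset means entries inserted or removed before the current position do not shift later pages, which fits a cache that is modified concurrently.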

                     

                    To me, it would be really useful if ISPN implemented something like this, for example in the cache store/cache API. I understand the behaviour could differ between cluster modes.

                    • 7. Re: how to retrieve a given number of cache entries in batch
                      galder.zamarreno

                      Hmmm, did you do any network/JGroups tuning at all to try to improve speed of replication? Seems like when you go into a cluster things start to slow down.

                       

                      At the JGroups/network level, there are some enhancements that can be done, see: https://community.jboss.org/docs/DOC-11595

                       

                      A lot of the suggestions there are explained in greater detail in http://www.informatica.com/downloads/1568_high_perf_messaging_wp/Topics-in-High-Performance-Messaging.htm

                      • 8. Re: how to retrieve a given number of cache entries in batch
                        dex80526

                        Galder: Thanks for the links. No, I have not tried specific kernel/system level tuning yet.