11 Replies Latest reply on Jun 16, 2011 12:18 PM by sannegrinovero

    Multi Tier Use Case

    cbo_

      The use case involves 3 tiers of machines.  The tier that I see as hosting state is 10 machines.  There is another tier of machines with multiple VMs each; this tier has dozens of machines and will do a limited amount of gets and puts.  The last tier numbers in the 100s and will primarily do get/query.  I would say query straight away, but perhaps a hacked key can be created such that query is not necessary.

       

      Total cache size will not be very large, most likely having fewer than 10K entries and not changing frequently.  Entries are fairly lightweight.

       

      Replication won't work for us since one requirement is to avoid depending on a choke point during a mass startup (i.e. the coordinator in replication mode).  As described by the above tiers, a mass startup could include several hundred VMs.

       

      Distribution mode (with all tiers holding state) seems ugly, as all nodes would own/manage keys.  We would prefer that be limited to some smaller set of hosts, especially since the tier with 100s of machines consists primarily of getters.

       

      We were thinking HotRod (HR) Servers in Dist mode for the 10 machines that are hosting state.  The rest would be HR clients.

       

      There is a requirement to do some querying on value objects.  Since query (I believe) is not yet supported by HR, we may need to build some interceptor to either do something with this "query" or implement a kind of indirection to another cache to handle the mapping.  Just ideas.  I put "query" in quotes since it seems a query for a value object could be turned into a get by key, with the value object used as a key into some other cache.  All of this is based on the idea that query is not supported in HR at this point.

       

      Really want to solicit other ideas and keep it simple.

       

      Would like to build a prototype this week either way.

       

      I appreciate the feedback.  Thank you.

        • 1. Re: Multi Tier Use Case
          manik

          You are right that HR endpoints on the 10 "data nodes" are the right approach.  As for querying, the way things stand at the moment, if you can avoid it and do direct key lookups, that would work best.

           

          The next best thing would be a slightly more convoluted pattern involving:

           

          • Data tier runs Hot Rod
          • "Client" tier (that needs querying) runs local mode Infinispan instances, with a RemoteCacheStore to point to the HR servers
          • "Client" Infinispan instances enable querying.

           

          This would mean that while data resides on the 10 nodes, clients maintain the indexes.  Indexes (based on Lucene) could be shared (NFS) or maintained individually.  But the tricky thing may be rebuilding that index if the client nodes were restarted independently of the data tier.

           

          Personally, I think that if you have a deterministic access pattern (and not ad-hoc "full text search"), you are better off encoding this in keys for direct lookup.
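          If it helps, the key-encoding idea could be sketched like this in plain Java (the key format, class, and method names are purely illustrative, not any Infinispan API):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: encode the properties you would otherwise query on
// into a deterministic composite key, so the "query" becomes a direct get().
public class CompositeKeyLookup {

    // Hypothetical key format: property1 and property2 joined by a separator.
    static String keyFor(String property1, String property2) {
        return property1 + "|" + property2;
    }

    public static void main(String[] args) {
        Map<String, String> cache = new HashMap<>(); // stand-in for the remote cache
        cache.put(keyFor("x", "y"), "valueObject42");

        // The query "property1=x and property2=y" becomes a single key lookup:
        System.out.println(cache.get(keyFor("x", "y"))); // valueObject42
    }
}
```

          With Hot Rod, the same composite string would simply be used as the key for a remote get.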

          • 2. Re: Multi Tier Use Case
            sannegrinovero

            Hi Craig,

            Correct, Query doesn't currently run via Hot Rod.  It wouldn't be hard to add, as we only need to define some extra commands and think about how best to serialize the queries; but Query also doesn't need to run over Hot Rod, since it can't take advantage of Hot Rod's smart routing features, so at the moment I guess any messaging library would achieve the same result (in case you can't wait for Hot Rod support).  My favourite architecture is to keep the index and search engine inside the grid, connecting from the clients only to send queries and receive result lists.

             

            Consider that both Query and the Lucene indexes (which should never, ever be shared on NFS; Manik keeps forgetting that little detail) are well suited for full-text search and some specific kinds of data mining; if all you need is grouping by some specific property, then the Map/Reduce API via the distributed executor could suit you better.

             

            Definitely you should take advantage of the key/value efficiency of Infinispan: if you have a very limited number of puts, you could cache the query results in an additional cache in the grid, keyed by the query type and parameters.  These result sets could be kept in sync with an event listener, either by directly updating the pre-computed lists in your listener or by triggering the Lucene or M/R query to update the relevant results.
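            A rough sketch of that result-caching idea, using plain Java maps as stand-ins (the key scheme and class names are made up for illustration; this is not the Infinispan listener API):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for an additional "query results" cache, keyed by the
// query type plus its parameters. A listener on the data cache would call
// refresh() whenever a relevant value changes.
public class QueryResultCache {
    private final Map<String, List<String>> results = new HashMap<>();

    // Hypothetical key scheme: query type, then parameters.
    static String queryKey(String queryType, String... params) {
        return queryType + ":" + String.join(",", params);
    }

    // Called by the (hypothetical) event listener to update a pre-computed list.
    public void refresh(String queryType, List<String> freshResult, String... params) {
        results.put(queryKey(queryType, params), freshResult);
    }

    public List<String> lookup(String queryType, String... params) {
        return results.getOrDefault(queryKey(queryType, params), Collections.emptyList());
    }
}
```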

            • 3. Re: Multi Tier Use Case
              cbo_

              Thanks to both for the ideas.  I wanted to explore the map/reduce idea some more at the moment and solicit some additional feedback.

               

              Sanne,

              You mentioned that if I needed grouping by some specific property I could apply the Map/Reduce API via the distributed executor.  I wonder if we can take an example a bit further so that I can better appreciate the value of M/R for this use case, since I am new to the M/R approach.  I assume we are talking about some custom interceptor within the HR Server cluster that would apply the M/R API?

               

              The example I'd like to see if we can take further is as follows:

              Let's say there are 2 properties in my value object that I want to match on for my query.  Call them property1 and property2.  I need to apply an "and" type relationship, hitting both properties to satisfy my query: property1=x and property2=y.  Is the approach to split efforts, finding all value objects that have property1=x and separately finding all value objects that have property2=y, and then reduce those separate results into a combined result?  And does this imply that finding the properties of a value object would be accomplished by examining each value object (i.e. walking the entire cache)?

               

              Thank you.

              • 4. Re: Multi Tier Use Case
                sannegrinovero

                Hi Craig,

                you don't need a custom interceptor; all the needed support is part of Infinispan 5.  See also http://infinispan.blogspot.com/2011/01/distributed-executors-come-to.html

                 

                Matching two properties is very easy; in fact I showed a similar example at the last JUDCon in Boston.  You can find my slides here: the presentation is named "Advanced Queries on the Infinispan Data Grid" (especially page 12).

                 

                Is the approach to split efforts to find all value objects that have property1=x and separately put an effort to find all value objects that have property2=y and then reduce those separate results to a combined result?

                No, it will iterate over all values, and for each of them you can verify both conditions before collecting it.
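                In plain Java, the iteration could look like this (a simplified stand-in for the map phase; this is not the actual Infinispan Mapper interface, and the names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch: walk every entry once and test BOTH conditions before
// collecting. In a real Map/Reduce task this loop body would live in the
// Mapper and run in parallel on each node owning a slice of the data.
public class TwoPropertyScan {

    public static class ValueObject {
        public final String property1;
        public final String property2;
        public ValueObject(String p1, String p2) { property1 = p1; property2 = p2; }
    }

    // Collect the keys of all entries matching property1 == x AND property2 == y.
    public static List<String> matchingKeys(Map<String, ValueObject> cache, String x, String y) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, ValueObject> e : cache.entrySet()) {
            ValueObject v = e.getValue();
            if (x.equals(v.property1) && y.equals(v.property2)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }
}
```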

                 

                And, does this imply that finding the properties of value object would be accomplished by examining each value object (i.e. walking the entire cache)?

                Yes, but it can be parallelized very well, and each object replica is examined only once.  About parallelization, I must warn that it's not fully implemented yet, so it might currently not be as fast as it should be.  It's similar to a database table scan, but of course it's in memory, so it's not as bad as it sounds.

                If you need indexes, then you need Lucene.  But preparing indexes has a cost as well; if you can store the result in the cache as mentioned above, keyed under the query parameters, at least you can reuse the result multiple times.

                 

                I have a working test of the code mentioned in the JUDCon slides; I will try to polish it and include it in Infinispan so you can have a look, but really the code is super simple.

                • 5. Re: Multi Tier Use Case
                  cbo_

                  Hi Sanne,

                   

                  Sounds like I would want to implement a class that extends Mapper, which would iterate over all values looking for a match on both property1==x and property2==y, collecting each entry that matches.  Then a class that implements Reducer would return the "answer" to the lookup.

                   

                  How does this fit into my topology, which was suggested to be a combination of HR Clients and HR Servers?  The cache is managed in the data tier by some 10 HR Servers.  What API call is made from the HR Client to the HR Server?

                  • 6. Re: Multi Tier Use Case
                    sannegrinovero

                    Currently HR Clients can't send search requests or custom commands like Mappers or Reducers.  As mentioned, it shouldn't be hard to implement and we could discuss that, but there are some alternatives available now which might also be a better fit for your scenario.  Please tell me what you think about the following, and whether it would work for your case:

                     

                    • You define the needed Map/Reduce tasks to perform the queries you need on the servers, not on the client.
                    • These tasks make sure that the output is available to clients under specific keys (if they were Strings, something simple like "emailsFrom:infinispan-dev|On:01/01/2011")
                    • A cache listener makes sure that when a value is changed, a new M/R task is triggered so the pre-calculated values are updated.
                    • HotRod clients connect to the server and download the pre-calculated result by requesting the key "emailsFrom:infinispan-dev|On:01/01/2011".

                     

                    I wouldn't store the resulting values in the query cache, but the keys of the matches.  This way, if you need to make sure they still match, you can afford some inconsistency by eventually dropping from the results any values which no longer exist or no longer match the conditions.  The worst that can happen is to miss some matches, but that should be acceptable, as just a moment earlier they weren't matching either.
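                    Keeping only keys in the result set might look like this (plain Java stand-ins; the names are illustrative, not any real API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch: the pre-computed result holds the KEYS of the matches.
// At read time, keys whose entry has disappeared are silently dropped, so a
// slightly stale result set degrades gracefully.
public class KeyOnlyResults {

    public static List<String> fetchValues(List<String> resultKeys, Map<String, String> dataCache) {
        List<String> values = new ArrayList<>();
        for (String key : resultKeys) {
            String v = dataCache.get(key); // a remote get via Hot Rod in the real setup
            if (v != null) {
                values.add(v); // tolerate stale keys: worst case, a match is missed
            }
        }
        return values;
    }
}
```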

                     

                    Of course, this would not be feasible if you have far more writes than reads, as you would trigger a high number of mapping tasks with no benefit; it also makes no sense if the number of possible queries and query parameter combinations is very high.

                    • 7. Re: Multi Tier Use Case
                      cbo_

                      Sounds to me like that solution would only work for predefined queries.  You have addressed the result set changing by introducing the listener, but the query itself seems to be static in this solution.  That would not fit this use case.

                       

                      Unfortunately, what seems to be of interest would be a sort of tier that could execute these lookups (DE or M/R) against the cache tier, but not host state.

                       

                      So far, I guess the solution proposed above by Manik could simplify things and work, but it seems that solution would keep full copies of the cache contents on all the client tier VMs.  Plus, if using Lucene indexes, that solution would also take a hit constructing the indexes.  These applications are cycled daily, so that hit would be there daily, or any time the app is recycled.

                       

                      Any other ideas..............?

                      • 8. Re: Multi Tier Use Case
                        sannegrinovero

                        Seems you have addressed the resultset changing by introducing the listener, but the query itself I think seems to be static in this solution.

                        I had understood that you are always querying on the same two fields.  Of course the requested field values might change, but assuming you don't have many of them, you could pre-compute them all.

                         

                         

                        So far, I guess the solution proposed above by Manik could simplify things and work, but it seems that solution would have full copies of the cache contents on all the client tier VMs.  Plus, if using Lucene indexes, that solution would also take a hit on constructing the indexes.  These applications are cycled daily so that hit would be there daily or anytime the app is recylced.

                        I thought keeping a copy of all the data on each client would not be an option.  If it is, you could run an M/R task locally and avoid any remote interaction.

                         

                        Also, if you have lots of memory on the clients, you could extend Manik's suggestion of running a local Infinispan instance backed by a remote cache loader, and download not only the values but the index as well: using the Lucene Infinispan Directory, you store the index in the cache, Query keeps it in sync, and when the Hot Rod clients connect they download the index too.

                        There are two limitations with this:

                        • Your client caches must be read-only; you don't want to write to the local copy of the index.
                        • Occasionally, when running a query, it might need to fetch a fresh copy of an index segment, which can make some query executions seem much slower than others.  This can be prevented by using an index-warming background thread.

                         

                        This is if you need to be very flexible with queries; if possible, as Manik also said, I'd pre-store the needed results under specific keys.  Isn't that the simpler solution?

                         

                        I still feel a bit like I haven't fully understood the problem.

                        • 9. Re: Multi Tier Use Case
                          cbo_

                          Hi Sanne,

                           

                          The desire is there to do a query by just 1, 2, or perhaps 3 fields, but the field value combinations will vary significantly.  Sorry if the earlier example oversimplified things.

                           

                          It is not desirable to keep full copies of the cache on each client tier node, but that is what I understood Manik's solution to imply.  I wanted to state that understanding in case I am not correct about it (not because it is desirable or needed).  Did I understand that correctly?

                           

                          I like the idea to try to improve query times by caching the indexes.  But, there are very infrequent cases when the client tier will need to update the cache.

                           

                          Hope we can keep trying on this to figure out a best solution.

                          • 10. Re: Multi Tier Use Case
                            sannegrinovero

                            I like the idea to try to improve query times by caching the indexes.  But, there are very infrequent cases when the client tier will need to update the cache.

                            Ok we can elaborate on this direction. What about this:

                            • You use Infinispan Query on the grid to keep the fields indexed, and the index in synch with the values.
                            • Infinispan Query stores the index in the Infinispan Directory, on the same grid (possibly using different caches).
                            • The clients connect via HotRod to fetch the values by key lookup, and have a local index copy which replicates the one stored in the grid.
                            • Clients make updates not on the local index but by sending updates to the grid; these will be reflected in the grid-stored index and then replicated locally.

                             

                            So on the clients you can either:

                            • Use Query but don't register (or disable) the event listener, so it won't ever write to the index
                            • Don't use Query and instead use the Lucene API directly to perform searches; you only find keys, so you still look up the values from the grid
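                            The second option could be sketched like this, with plain Java maps standing in for the Lucene index and the grid (none of this is the real Lucene or Hot Rod API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: the local index search only yields KEYS; the values are
// then fetched from the grid by key lookup.
public class IndexThenLookup {

    public static List<String> search(Map<String, Set<String>> localIndex, String term,
                                      Map<String, String> grid) {
        List<String> hits = new ArrayList<>();
        for (String key : localIndex.getOrDefault(term, Collections.emptySet())) {
            String value = grid.get(key); // a Hot Rod get in the real setup
            if (value != null) {
                hits.add(value);
            }
        }
        return hits;
    }
}
```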

                             

                            Limitations:

                            • You likely want enough memory on the clients to cache as much of the index as possible, or it will be very slow and network-intensive.  The index is usually much smaller than the data, but that depends on your data and indexing needs; even if you could tell me, I can't easily predict the index size, so you'll have to try it.

                             

                            In practice, you could simplify the implementation a lot if you use replication to store the index and allow your clients to join the grid, at least for the index caches; you can still use Hot Rod to store/load the values.  This implies that when your clients boot they download a copy of the index into memory; you could then also have them passivate part of it to local disk using a non-shared cache loader.  I never tested this configuration, but it looks like a good combination to me.  What do you think?

                            • 11. Re: Multi Tier Use Case
                              sannegrinovero

                              Ah, I have to clarify this: since Infinispan currently requires the same configuration on each node joining, you don't just need to run separate caches for the values and the indexes; you'll need two different CacheManagers: one for the values and one for the index.

                              On the index one you'll use replication on both servers and clients, while on the values one you'll have distribution on the servers, with clients connecting via Hot Rod.