6 Replies Latest reply on Oct 30, 2013 6:55 PM by markaddleman

    Caching: A more sophisticated engine<->translator conversation

    markaddleman

      There are plenty of times when the result set cache already contains sufficient or nearly sufficient data to answer a query.  For example, the client executes SELECT * FROM T followed by SELECT COUNT(*) FROM T.  I suspect that the query planner could fairly easily detect this condition and rewrite the query appropriately.  The translator could be consulted to determine whether the engine ought to use the cache or the data source.
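      A minimal sketch of that first case, assuming a naive cache keyed by normalized SQL text (the class and method names below are hypothetical, not Teiid APIs):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: answer SELECT COUNT(*) FROM T from an already
// cached SELECT * FROM T result set instead of hitting the source.
public class CountFromCache {
    private final Map<String, List<Object[]>> resultSetCache = new HashMap<>();

    public void cache(String sql, List<Object[]> rows) {
        resultSetCache.put(sql.trim().toUpperCase(), rows);
    }

    /** Returns a count when the full-table scan of the same table is cached. */
    public Optional<Long> tryAnswerCount(String countSql) {
        // Deliberately naive pattern match: SELECT COUNT(*) FROM <table>
        String upper = countSql.trim().toUpperCase();
        if (!upper.startsWith("SELECT COUNT(*) FROM ")) {
            return Optional.empty();
        }
        String table = upper.substring("SELECT COUNT(*) FROM ".length()).trim();
        List<Object[]> cached = resultSetCache.get("SELECT * FROM " + table);
        return cached == null ? Optional.empty()
                              : Optional.of((long) cached.size());
    }
}
```

A real planner would of course match on the parsed command rather than on SQL strings; the point is only that the rewrite itself is mechanical once the cache hit is recognized.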

       

      A more interesting case is when the cache contains a subset of the data needed for a result.  A key use case is append-only data sources like a log file.  The client executes SELECT * FROM LOG and the results get cached.  The log grows, and then the client executes SELECT * FROM LOG again.  Here, the cache might hold a substantial part of the result set.  It might be most efficient for the translator to retrieve only the new rows, insert them into the cache, and then inform the engine that the cache should be used.
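      The append-only case might be sketched like this; the high-water mark and the loader signature are assumptions for illustration, not anything Teiid provides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongFunction;

// Illustrative sketch of the append-only case: the cache remembers a
// high-water mark (e.g. the last log line consumed) and a re-query only
// pulls rows past that mark from the source, serving the rest from cache.
public class AppendOnlyCache {
    private final List<String> cachedRows = new ArrayList<>();
    private long highWaterMark = 0;

    /** The source function fetches rows with index >= the given offset. */
    public List<String> query(LongFunction<List<String>> source) {
        List<String> delta = source.apply(highWaterMark);
        cachedRows.addAll(delta);          // only the new rows are fetched
        highWaterMark += delta.size();
        return cachedRows;                 // full result served from cache
    }
}
```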

       

      I think the situation gets complicated really fast when you start combining the first case with the second.  Another complicating factor is cache aging.

       

      Thoughts?

        • 1. Re: Caching: A more sophisticated engine<->translator conversation
          shawkins

          > I suspect that the query planner could pretty easily determine this condition existed and rewrite the query appropriately.

           

          In trivial cases, yes, using existing cache entries to pull results is a good idea - but in general it is much more complicated and leads to tracking cache contents down to the row level or, worse, the individual value level.  Tracking at an "entity" or row level makes some sense, as this is effectively what ORM solutions like Hibernate do - but the real benefit there lies in the complexity of the entity, such that joins are performed lazily/proactively and a whole object graph is effectively cached.  For arbitrary SQL there is no notion of a higher-level entity, and any usage of joins, alternative projections, etc. makes reuse of similar cache entries in many circumstances as expensive as pulling new results.

           

          So while we could do a better job in some trivial cases - such as a SELECT * FROM a single table - there is still a lot of complexity in what constitutes a matching cache hit there, in terms of updates performed outside of Teiid, non-deterministic effects at the translator and below, etc., to where we would still prefer that users be declarative in their caching needs.  And in the case of SELECT * FROM blah, more than likely an internal materialized view should be used.  Another consideration, even in the trivial case, is that as data volume grows you will need indexes to speed up pk and other access (whereas a result set cache entry is just a simple linear store).

           

          > It might be most efficient for the translator to retrieve only the new rows, insert them into the cache and then inform the engine that the cache should be used.

           

          Here again, internal materialization - or a materialization table in general - is updatable.

          • 2. Re: Caching: A more sophisticated engine<->translator conversation
            markaddleman

            Agreed - the complexity quickly becomes overwhelming in the general case.  I am intrigued by the idea that in some cases, like logging, we could get a big benefit though.  As you point out, to keep the complexity at bay, a more sophisticated caching strategy will probably need to take application semantics into account.  As a (hopefully) low cost first step, what do you think about an API into Teiid's caching behavior?  My naive notion of the caching API is a translator that must expose a three-column schema: user id, query and data.  The data is represented as an iterable of arrays of objects.  I hasten to add this is all speculative.  I think the idea is interesting and I'd probably play around with it, but we have no pressing need for this functionality at the moment.
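            A speculative sketch of that three-column notion, with entries keyed by user id and query and the data held as an iterable of row arrays (none of these names are real Teiid interfaces):

```java
import java.util.HashMap;
import java.util.Map;

// Speculative sketch of the proposed caching API: a (user id, query, data)
// store that a translator could expose or replace. Purely illustrative.
public class ResultCache {
    private final Map<String, Iterable<Object[]>> entries = new HashMap<>();

    private static String key(String userId, String query) {
        return userId + "\u0000" + query; // NUL keeps the two parts unambiguous
    }

    public void put(String userId, String query, Iterable<Object[]> data) {
        entries.put(key(userId, query), data);
    }

    /** Returns the cached rows, or null on a miss. */
    public Iterable<Object[]> get(String userId, String query) {
        return entries.get(key(userId, query));
    }

    public void invalidate(String userId, String query) {
        entries.remove(key(userId, query));
    }
}
```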

            • 3. Re: Caching: A more sophisticated engine<->translator conversation
              shawkins

              > As a (hopefully) low cost first step, what do you think about an API into Teiid's caching behavior?


              The user/session/sql/parameters/vdb form our cache keys.  But this is specifically for result set caching - where we just hold a memory-safe linear structure.  There are no indexes or understanding of what the data represents.  These structures are appendable - although that would currently not be thread-safe.  Are you saying that you want access to the API so that translators can proactively populate/invalidate/modify entries?  My initial reaction would be that this should be done within the translator itself.  Otherwise it seems better to go with a materialized view, which would have the queryable/updatable properties that you're looking for.
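              As a rough illustration, a composite key along those lines might look like the following (the field names are assumptions, not Teiid's actual internals):

```java
import java.util.List;
import java.util.Objects;

// Rough illustration of the composite result-set cache key described
// above (user/session/sql/parameters/vdb). Field names are assumptions.
public final class CacheKey {
    private final String user;
    private final String session;   // null for vdb-scoped entries
    private final String sql;
    private final List<Object> parameters;
    private final String vdb;

    public CacheKey(String user, String session, String sql,
                    List<Object> parameters, String vdb) {
        this.user = user;
        this.session = session;
        this.sql = sql;
        this.parameters = parameters;
        this.vdb = vdb;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheKey)) return false;
        CacheKey k = (CacheKey) o;
        return Objects.equals(user, k.user)
                && Objects.equals(session, k.session)
                && Objects.equals(sql, k.sql)
                && Objects.equals(parameters, k.parameters)
                && Objects.equals(vdb, k.vdb);
    }

    @Override
    public int hashCode() {
        return Objects.hash(user, session, sql, parameters, vdb);
    }
}
```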

              • 4. Re: Caching: A more sophisticated engine<->translator conversation
                markaddleman

                > But this is specifically for result set caching - where we just hold a memory safe linear structure.  There are no indexes or understanding of what the data represents

                 

                Part of what I'm looking for is the ability to replace the existing cache implementation with my own - something that takes application semantics into consideration.

                 

                > Otherwise it seems better to go with a materialized view, which would have the queryable/updatable properties that you're looking for.

                 

                One quality I'm looking for is lazily loading the cache, with a smaller load scope and invalidation scope than a materialized view (which, as I understand it, has both load and invalidation scoped to the entire table).

                 

                Thinking about our application:

                1. Queries typically (though not always) include the full primary key in the where clause, or the primary key is fully qualified in joins
                2. Translators typically (though not always) support order-by and do not support group-by
                3. We have lots of tables that are medium sized (10k - 100k rows) that would be expensive to materialize entirely but small subsets of the table would be very useful to cache
                4. Some of our tables are append-only and, depending on user behavior, we might want to cache subsets of the table along with any inserted rows

                 

                Putting this all together, I'm thinking that if the query presented to the translator is not an aggregate query and includes all of the primary key columns (something that is almost always the case in our application), it may be useful to cache each non-primary-key column individually in a key-value store.  The key would be some scope (vdb, user, session, etc.) + table name + primary key.  Perhaps, for efficiency, the values would be indexed in a btree.  By caching the data at the atomic (cellular?) level, it could be reused in other queries and easily invalidated.  The approach would be to intercept the query pushed to the translator, compare its data requirements against the cache to determine any missing data, take a cost-based approach to obtaining that missing data, cache any newly retrieved data, and return the full data set to Teiid.
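                A toy version of that cell-level idea - a real backing store might be Redis or Riak, and every name here is illustrative:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of the proposed cell-level cache: each non-key
// column value is stored under (scope, table, primary key, column), so
// individual cells can be reused across queries and invalidated per row.
public class CellCache {
    private final Map<String, Object> cells = new HashMap<>();

    private static String key(String scope, String table, String pk, String column) {
        return String.join("|", scope, table, pk, column);
    }

    public void put(String scope, String table, String pk, String column, Object value) {
        cells.put(key(scope, table, pk, column), value);
    }

    public Optional<Object> get(String scope, String table, String pk, String column) {
        return Optional.ofNullable(cells.get(key(scope, table, pk, column)));
    }

    /** Invalidate every cached cell of one row, e.g. after an update. */
    public void invalidateRow(String scope, String table, String pk) {
        cells.keySet().removeIf(k -> k.startsWith(scope + "|" + table + "|" + pk + "|"));
    }
}
```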

                 

                Since the data would be retrieved in a piecemeal fashion, it may be useful for the translator to report that its result set is in response to a different query than was pushed to it.  For example, if the engine had pushed SELECT a,b,c FROM t ORDER BY a to the translator but some data was retrieved from cache and some from the source, it would be nice if the translator could report that the actual data returned matched SELECT a,b,c FROM t, so the engine would know it had to apply the ordering.  The same rationale applies to aggregates: the translator may support group by, but for any given query the cache may contain all of the individual rows necessary to compute the aggregate.  In this case, it would be nice if the translator could indicate that it prefers to answer a different query, and the engine would re-plan taking the desired query into account.  Obviously, the translator's preferred query would have to be "wider" than the original query.
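                One way to picture the handshake: the translator reports whether it actually applied the pushed ORDER BY, and when it did not, the engine sorts the rows itself.  Everything below is hypothetical, not an existing Teiid mechanism:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed handshake: the translator reports
// whether it applied the pushed ORDER BY; when it did not (e.g. rows came
// from cache), the engine re-applies the sort itself.
public class ReplanSketch {
    public static final class TranslatorResult {
        final List<Object[]> rows;
        final boolean orderingApplied; // false: engine must still sort
        public TranslatorResult(List<Object[]> rows, boolean orderingApplied) {
            this.rows = rows;
            this.orderingApplied = orderingApplied;
        }
    }

    /** Engine-side step for a pushed SELECT a,b,c FROM t ORDER BY a. */
    @SuppressWarnings("unchecked")
    public static List<Object[]> finish(TranslatorResult result, int orderColumn) {
        if (result.orderingApplied) {
            return result.rows;
        }
        List<Object[]> sorted = new ArrayList<>(result.rows);
        sorted.sort((x, y) ->
                ((Comparable<Object>) x[orderColumn]).compareTo(y[orderColumn]));
        return sorted;
    }
}
```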

                 

                Talking through this, I believe there are two new notions for Teiid:

                1. A facility to replace the existing cache implementation with another implementation.  I don't think the actual interface matters very much but it seems natural (to me) that the interface is a translator.
                2. A re-plan mechanism in the case that the translator wants to answer a different question than the engine asked

                 

                > My initial reaction would be that this should be done within the translator itself.

                 

                Yes.  I could start implementing some of this using a delegating translator.  Obviously, the only missing piece is the re-plan mechanism, but for now I would simply avoid consulting the cache for queries that contain order-by or group-by clauses.
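                That interim gating could look like this; the two lookup functions are placeholders standing in for the real cache and source paths, not Teiid translator APIs:

```java
import java.util.List;
import java.util.function.Function;

// Sketch of the interim approach: only consult the cache for queries
// without ORDER BY / GROUP BY, since without a re-plan mechanism the
// engine would not know to re-apply those operations itself.
public class CachingDelegate {
    private final Function<String, List<Object[]>> cache;  // null on a miss
    private final Function<String, List<Object[]>> source;

    public CachingDelegate(Function<String, List<Object[]>> cache,
                           Function<String, List<Object[]>> source) {
        this.cache = cache;
        this.source = source;
    }

    static boolean cacheable(String sql) {
        String upper = sql.toUpperCase();
        return !upper.contains("ORDER BY") && !upper.contains("GROUP BY");
    }

    public List<Object[]> execute(String sql) {
        if (cacheable(sql)) {
            List<Object[]> hit = cache.apply(sql);
            if (hit != null) {
                return hit;
            }
        }
        return source.apply(sql); // ordered/grouped queries bypass the cache
    }
}
```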

                • 5. Re: Caching: A more sophisticated engine<->translator conversation
                  shawkins

                  > By caching the data at the atomic (cellular?) level, it could be reused in other queries and easily invalidated.

                   

                  That seems far too expensive.  It seems like you really want to focus on what the data/queries represent rather than on a generic system.  Effectively it seems like you are saying there are logically related sets of queries (typically the same from clause / aggregation / etc.) that all retrieve effectively the same data - so conceptually a view.  So what you are getting at could be accomplished through the paradigm of materialization rather than result set caching.  Materialization tables are updatable (although we need to add some ease of use in this regard, since internal materialization only supports single-row update/insert/delete) and understand how to answer queries looking for a subset of the information.  What may be missing from this picture are more generic trigger capabilities so that you can intercept/control the freshness of the data on demand.

                   

                  > Compare its data requirements to the cache and determine any missing data.  Given the missing data, take a cost-based approach to obtaining the remaining data.  Any newly retrieved data is cached and the full data set is returned to Teiid.

                   

                  There are lazy/incremental approaches to materialization that roughly fit here as well.  The use of a dirty flag and/or version number on materialized rows can give you a way to ensure freshness with less than a full refresh.
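                  The dirty/version idea could be sketched roughly as follows (illustrative only, not Teiid's materialization internals): marking the table dirty bumps a version counter, and a later read reloads only the rows it actually touches.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of version-stamped materialized rows: a stale row
// (version older than the table's current version) is refreshed lazily,
// on access, rather than rebuilding the whole table.
public class VersionedRows {
    private static final class Row {
        final Object value;
        final long version;
        Row(Object value, long version) { this.value = value; this.version = version; }
    }

    private final Map<String, Row> rows = new HashMap<>();
    private long currentVersion = 0;

    /** Signal that source data changed; cached rows become stale lazily. */
    public void markDirty() {
        currentVersion++;
    }

    /** Fetch one row, reloading it from the source only when stale. */
    public Object get(String pk, Function<String, Object> loader) {
        Row r = rows.get(pk);
        if (r == null || r.version < currentVersion) {
            r = new Row(loader.apply(pk), currentVersion);
            rows.put(pk, r);
        }
        return r.value;
    }
}
```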

                  • 6. Re: Caching: A more sophisticated engine<->translator conversation
                    markaddleman

                    > By caching the data at the atomic (cellular?) level, it could be reused in other queries and easily invalidated.

                     

                    > That seems far too expensive.

                    Maybe.  I'm thinking about something somewhat exotic like using Riak or Redis.

                     

                    > Materialization tables are updatable (although we need to add some ease of use in this regard since internal materialization only supports single row update/insert/delete) and understand how to answer queries looking for a subset of the information.

                    Interesting.  I had tried issuing an update / insert against a materialized view and the command was passed down to the source.  I must be doing something wrong, so is there a test or something you can point me to?

                     

                    > What may be missing from this picture are more generic trigger capabilities so that you intercept/control the freshness of the data on demand.

                    Yes.  The more control we have over the cache (whether the cache is implemented as a materialized view or something else), the better.  Simply being aware of queries against the cache would allow for a great deal of control.  I see the cache operating in a pipeline-like architecture between the query engine and the data source, where each link in the pipe is a translator: the engine would issue a query to the cache, and its translator decides whether to serve data from the cache or from the source.  Talking it through in those terms makes me think there is enough Teiid support in 8.6 to build it: the cache would be served through an external mat view whose translator has the smarts to either serve the data from the mat view's source or pass the query along to the underlying source using the TeiidConnection provided in the CommandContext.

                     

                    I think the implementation is going to end up being somewhat convoluted, particularly around setting up the VDB (a lot of duplicated metadata) and, perhaps, rewriting the query to hit the original source.