30 Replies · Latest reply on May 29, 2013 3:00 PM by shawkins
      • 15. Re: Programatically invalid cache using cache API?
        markaddleman

        > The default implementation would do nothing but, obviously, execution factories would override it to 'prep' in some way for the upcoming query such as invalidating the cache.  I'm not sure what other useful things this method could do.

         

        It's probably best to clarify which problem you're trying to address: caching a specific source query at a "user command" level, or taking a new approach to the existing session-level (and above) caching.

        The problem is that there are two ways to invalidate the cache: through an event asynchronous to query processing or through a computation done synchronous to query processing.

         

        An example of an event asynchronous to query processing is a database trigger that fires on table modification.  I can see how this is easily hooked up to the EventDistributor.

         

        I don't see how to hook up cache invalidation synchronous to query processing.  For example, suppose the data source provided a cheap call to get the last modified timestamp for a table.  On each query, I'd want the translator to compare the data source's last modified timestamp to the time the data was cached.  Since getCacheDirective() is currently only called when the cache is already invalidated, I don't see a clear path to hooking up to the EventDistributor.
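        The check I have in mind could be as simple as a timestamp comparison done on each query, before the cached entry is used.  A minimal sketch, with illustrative names only (this is not the Teiid API):

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the synchronous check described above: on each
// query, compare the source's cheap last-modified timestamp against the
// time the table's results were cached. All names are illustrative.
public class StalenessCheck {
    // table name -> time (millis) its results were cached
    private final ConcurrentHashMap<String, Long> cachedAt = new ConcurrentHashMap<>();

    public void recordCached(String table, long whenMillis) {
        cachedAt.put(table, whenMillis);
    }

    // true when the source reports a modification after the cache time,
    // i.e. the cached entry should be invalidated before this query runs
    public boolean isStale(String table, long sourceLastModifiedMillis) {
        Long cached = cachedAt.get(table);
        return cached != null && sourceLastModifiedMillis > cached;
    }
}
```

        The translator would run something like isStale() when consulted for each query, and only then trigger an invalidation.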

        • 16. Re: Programatically invalid cache using cache API?
          shawkins

          > I don't see how to hook up cache invalidation synchronous to query processing.

           

          Synchronous invalidation of the existing CacheDirective-based mechanism would be based upon either:

          - you could make the EventDistributor available to the Translator and then perform a proactive invalidation based upon the incoming command

          - Teiid could add a new flag to the CacheDirective to instruct it to be invalidating (my previous proposal of setting the initial scope to NONE won't actually work).

           

          In either of the above there is also the consideration that the current getCacheDirective method does not pass a connection.

           

          > Since getCacheDirective() is currently only called when the cache is already invalidated

           

          getCacheDirective is consulted prior to the logical creation of the Execution for each source command.

          • 17. Re: Programatically invalid cache using cache API?
            markaddleman

            > Since getCacheDirective() is currently only called when the cache is already invalidated

             

            > getCacheDirective is consulted prior to the logical creation of the Execution for each source command.

            Ah ha.  That's the source of my confusion. 

             

            > Synchronous invalidation of the existing CacheDirective-based mechanism would be based upon either:

            > - you could make the EventDistributor available to the Translator and then perform a proactive invalidation based upon incoming command

            > - Teiid could add a new flag to the CacheDirective to instruct it to be invalidating (my previous proposal of setting the initial scope to NONE won't actually work).

            I don't have a strong preference either way.  Is the EventDistributor an asynchronous operation?  Obviously, we'd want to block query processing until the cache invalidation operation is complete.  I hasten to add that this isn't a requirement across a cluster, only within the local Teiid instance.  I don't have a problem requiring each Teiid instance to invalidate caches independently.

             

            Having said that, perhaps adding a CacheScope.INVALIDATE_LOCAL_CACHE is the better approach.  Its intent is pretty clear, and the approach allows for CacheScope.INVALIDATE_CLUSTER_CACHE in the future.
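            To make the INVALIDATE_LOCAL_CACHE idea concrete, here is a simplified stand-in for the directive (the scope values and class are illustrative only, not the real CacheDirective API):

```java
// Simplified stand-in for the proposal above; the Scope values are
// illustrative, not the actual Teiid CacheDirective API.
public class Directive {
    public enum Scope {
        NONE, SESSION, USER, VDB,
        INVALIDATE_LOCAL_CACHE // proposed: purge the local entry, then repopulate
    }

    private final Scope scope;

    public Directive(Scope scope) {
        this.scope = scope;
    }

    // whether consulting this directive should first purge the existing entry
    public boolean invalidates() {
        return scope == Scope.INVALIDATE_LOCAL_CACHE;
    }

    public Scope getScope() {
        return scope;
    }
}
```

            The engine could then check invalidates() when the directive is returned, dropping the existing entry before the new Execution is created.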

            • 18. Re: Programatically invalid cache using cache API?
              shawkins

              > Is the EventDistributor an asynchronous operation?

               

              Yes, it is based upon timestamping - and has the limitation that we track only down to the millisecond.

               

              > Obviously, we'd want to block query processing until the cache invalidation operation is complete.

               

              Our approach, at least with internal materialization, is to support both a lazy invalidation (don't use the cached item, but overwrite it when successful) and a full invalidation.  If we are specifically talking about a user query scope, there isn't much difference between the two.

               

              Cluster vs. local invalidation would be a separate topic and may not be something that we have a lot of direct control over.

              • 19. Re: Programatically invalid cache using cache API?
                markaddleman

                > Is the EventDistributor an asynchronous operation?

                 

                Yes, it is based upon timestamping - and has the limitation that we track only down to the millisecond.

                This is entirely acceptable.

                 

                Reviewing the conversation, here's my stab at the conclusions so far:

                • TEIID-2139 will include sharing cached results across plans with the introduction of hints to direct which nodes ought to be cached
                • No change is needed to invalidate the cache since the EventDistributor can be used to invalidate caches at the table scope
                • The engine behavior around continuous queries should change to only re-execute the query if there's a chance that the results have changed (I suspect we need to discuss this point further)
                • No new cache scope is needed
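                As a sketch of the second bullet, table-scoped invalidation driven by a data-modification event might look like this (a simplified model in the spirit of the EventDistributor, not its actual interface):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal model of table-scoped invalidation via a data-modification
// event, as in the conclusions above. This is a simplified stand-in
// inspired by Teiid's EventDistributor, not the real interface.
public class TableCache {
    // "schema.table" -> time (millis) its results were cached
    private final Map<String, Long> cacheTimes = new ConcurrentHashMap<>();

    public void cache(String schemaDotTable, long nowMillis) {
        cacheTimes.put(schemaDotTable, nowMillis);
    }

    // asynchronous path: a data-modification event purges the table's entry
    public void dataModification(String schema, String table) {
        cacheTimes.remove(schema + "." + table);
    }

    public boolean isCached(String schemaDotTable) {
        return cacheTimes.containsKey(schemaDotTable);
    }
}
```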
                • 20. Re: Programatically invalid cache using cache API?
                  markaddleman

                  ping

                  • 21. Re: Programatically invalid cache using cache API?
                    shawkins

                    Sorry I missed following up.

                     

                    > TEIID-2139 will include sharing cached results across plans with the introduction of hints to direct which nodes ought to be cached

                     

                    I don't think that can function at a "node" level, as that would be quite a complicated hint.  More likely it would be akin to session/user scoped materialized views - with the option to specify particular subsets to materialize.  Then as long as what is cached is a superset of what is requested in the user query, we'll use the cached version.

                     

                    > No change is needed to invalidate the cache since the EventDistributor can be used to invalidate caches at the table scope

                    > No new cache scope is needed

                     

                    As long as you are comfortable with the session scope, yes.  Otherwise a "user query" scope and "source query" level invalidation would require new features.

                     

                    > The engine behavior around continuous queries should change to only re-execute the query if there's a chance that the results have changed (I suspect we need to discuss this point further)

                     

                    Yes, more may be needed on this.  With the current logic that means you'll specify a session-scoped result and the user query will re-execute off the cached results until the entry is invalidated.  If you want to pause all execution (no reprocessing of cached results), then that would take more thought, as you aren't currently able, for example, to throw a DataNotAvailableException when returning the cache directive.

                    • 22. Re: Programatically invalid cache using cache API?
                      markaddleman

                      > I don't think that can function at a "node" level, as that would be quite a complicated hint.  More likely it would be akin to session/user scoped materialized views - with the option to specify particular subsets to materialize.  Then as long as what is cached is a superset of what is requested in the user query, we'll use the cached version.

                      I think some examples of what you're thinking would help clarify. 

                       

                      > The engine behavior around continuous queries should change to only re-execute the query if there's a chance that the results have changed (I suspect we need to discuss this point further)

                       

                      > Yes, more may be needed on this.  With the current logic that means you'll specify a session-scoped result and the user query will re-execute off the cached results until the entry is invalidated.  If you want to pause all execution (no reprocessing of cached results), then that would take more thought, as you aren't currently able, for example, to throw a DataNotAvailableException when returning the cache directive.

                      There are two aspects that concern me:  First, if the entire results of a query are cached, it's a waste of cycles for the engine to requery, so it seems that the caching system ought to (conceptually) throw DataNotAvailable after returning null from its result set (I don't think the caching system is implemented anything like a translator, but I don't know how else to describe the desired behavior - separately, it raises interesting questions about what it would mean to implement the caching system as a translator).  Obviously, throwing DataNotAvailable raises the issue of when to restart the execution.  It only makes sense to restart the execution when the cache is invalidated - if the underlying data source can't supply new data then it can throw DNA.  This brings me to my second point:  The engine ought to restart the execution when any (relevant) portion of the cache is invalidated.

                      • 23. Re: Programatically invalid cache using cache API?
                        shawkins

                        > I think some examples of what you're thinking would help clarify.

                         

                        Scoping materialization to the user/session should be pretty straightforward.  The more complicated situation is on demand or subset caching (the former being a lazy approach, the latter explicit).  For example suppose you have a query "select * from view where blah" - having the ability to materialize the entire view at the user/session level is really only useful as a workaround to the global scoping limitation of materialization when in fact the views/resources may imply more specific results (such as using row level hasRole checks or the new row-level/column masking permissions).  Ideally at least for security based user variations we'll have a better built-in store soon.  Otherwise the more common issue is that you're really only concerned with a subset of the view in this and subsequent queries.  One approach is to allow the caching layer to build the cached results of the view lazily - something like "select * from /*+ cache(scope:user lazy) */ view where blah".  In practice this is somewhat difficult (especially without row identity) and may still require source queries with each access to verify that the cached rows contain the results needed.  The other is to be explicit in the cache hint about what subset you want to initially fetch - "select * from /*+ cache(scope:user select:'cols' where:'some predicate') */ view where blah" - here again though you still have the problem of how to handle usages of the view that request values out of the cached range.

                         

                        > First, if the entire results of a query are cached, it's a waste of cycles for the engine to requery so it seems that the caching system ought to (conceptually) throw DataNotAvailable after returning null from its result set

                         

                        Maybe some classification of continuous source results would help.  As I see it we have:

                         

                        - static - expected to remain the same for each access in each reuse

                        - streaming - expected not to return a result null terminator

                        - windowed - logically streaming results, but terminated by some windowing (likely time) mechanism.  A complication here would be if not all of the source windows are the same.

                         

                        What you are describing implies that you only want to cache static results and that windowed results would be left to throw a DataNotAvailableException if they haven't hit their window start.  A global source hint or execution payload can be used to communicate a window to all sources, but there's still no built-in handling implied.  Alternatively some notion of windowing could be added to the RequestOptions and some basic handling could be moved into the engine if needed.
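                        A rough sketch of how that classification could drive behavior (all names are illustrative, not engine API):

```java
// Sketch of the static/streaming/windowed classification above and the
// caching/deferral behavior it implies. Names are illustrative only.
public class ResultKinds {
    public enum Kind { STATIC, STREAMING, WINDOWED }

    // STATIC results are safe to serve from cache on each reuse; STREAMING
    // results never return a terminator, so a complete cached result does
    // not apply; WINDOWED results fall in between.
    public static boolean cacheable(Kind kind) {
        return kind == Kind.STATIC;
    }

    // WINDOWED results should signal "not available" until their window
    // opens (in Teiid terms, throw a DataNotAvailableException).
    public static boolean shouldDeferUntilWindow(Kind kind, long nowMillis, long windowStartMillis) {
        return kind == Kind.WINDOWED && nowMillis < windowStartMillis;
    }
}
```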

                         

                        Circling back to the earlier comments, the notion of a static result still needs to be qualified - data modification events should still invalidate entries, and/or you may want lower scope/invalidation control options, etc.

                         

                        > separately, it raises interesting questions about what it would mean to implement the caching system as a translator

                         

                        Yes, if your results are small, then you can consider integrating/managing cached forms of the results on your own to fully control the lifecycle.  Otherwise, to use the built-in logic, Teiid would have to expose more of the caching logic to the translator than just the current CacheDirective approach.
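                        For the small-results case, the translator-managed option might look roughly like this (a sketch with illustrative names, not the Teiid API):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of the "manage cached forms of the results on your own" option:
// a translator-side cache keyed by source command text, suitable only
// for small results. Names and structure are illustrative.
public class TranslatorSideCache {
    private final Map<String, List<Object>> results = new ConcurrentHashMap<>();

    // return cached rows for a command, computing (and caching) them on a miss
    public List<Object> fetch(String command, Supplier<List<Object>> source) {
        return results.computeIfAbsent(command, c -> source.get());
    }

    // full lifecycle control: the owner decides exactly when an entry dies
    public void invalidate(String command) {
        results.remove(command);
    }
}
```

                        The point is that the translator itself owns invalidation here, rather than relying on the engine's CacheDirective handling.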

                         

                        > This brings me to my second point:  The engine ought to restart the execution when any (relevant) portion of the cache is invalidated.

                         

                        This has several considerations.  An enhancement could be added such that data modification events are also processed against active continuous queries, but that seems to go against the general flow (not to mention if a strict DataNotAvailable has been thrown).  At worst though, with the current logic, you would get the updated values when the next window is executed - although I can see from your perspective that if the current window has begun by utilizing cached results it would be nice to somehow restart it.

                        • 24. Re: Programatically invalid cache using cache API?
                          shawkins

                          I did respond to this with a somewhat lengthy response, but the forum filters are hard at work... At the very least it's a hard reminder not to use the word s t r e a m i n g.  So if the moderation works the response will get approved, otherwise I'll post something else later.

                          • 25. Re: Programatically invalid cache using cache API?
                            markaddleman

                              I've learned to copy & paste my responses to notepad before trying to post.

                            • 26. Re: Programatically invalid cache using cache API?
                              shawkins

                              Yes, I've done that a good number of times too, but it looks like I can now moderate myself (and hopefully others) - see the above.

                              • 27. Re: Programatically invalid cache using cache API?
                                markaddleman

                                > The other is to be explicit in the cache hint about what subset you want to initially fetch

                                This got me thinking:  There are two dimensions to the caching problem:  subsetting columns and subsetting rows.  Subsetting columns seems like a pretty straightforward problem to solve but subsetting rows is, like you say, very hard.  Suppose we approach the caching problem as a partitioned materialized view problem instead.  Each partition would be a separate transformation expression and is materialized separately from other partitions as client queries match the partition.  The application could refresh (invalidate) whole partitions.  The existing materialized view functionality is (obviously) a corner case of this approach:  there is a single partition that can be refreshed. 

                                 

                                Of course, this approach puts the onus on the application developer to direct queries against the materialized view instead of some query planner magic but, in practice, I don't think this is much different than the cache hint approach.  In fact, I think the materialized view approach is a bit simpler than the cache hint approach.  I can imagine some creative use of ALTER VIEW to dynamically redefine partitions as the application learns more about the data.  Further, if Teiid allowed materialized views to be created on foreign tables (pretty sure there's a jira for that), the application could become very sophisticated in its cache management.

                                 

                                What do you think of this approach?

                                 

                                I'm not ignoring the rest of your post but I'm drawn into a last-minute firefight before our release and I wanted to put this idea into your head.

                                • 28. Re: Programatically invalid cache using cache API?
                                  shawkins

                                  > In fact, I think the materialized view approach is a bit simpler that the cache hint approach.

                                   

                                  Yes, and it allows for up-front design.  However it seems that users always want the ability to tune things further on just the user queries.  The other issue of a more specific scoping may eventually be generally available for materialized views anyway, which mitigates the query hint work as well.  At this point I don't expect TEIID-2139 to be part of the 8.4 release.

                                   

                                  > I can imagine some creative use of ALTER VIEW to dynamically redefine partitions as the application learns more about the data.

                                   

                                  Using non-static partitioning would be fairly advanced.  You would have to adjust both the parent partitioned view and the relevant materialized views (and possibly trigger the appropriate refreshes), all of which are unfortunately not coordinated.  You would probably have to take a staging approach -

                                   

                                  create part_mat_view as select ..., 1 as part_id from part1_1 union all select .., 2 as part_id from part1_2 ...;

                                   

                                  where each partN_M is a specific materialized subset.  Then issue an alter using a different set of partitioned views: select ..., 1 as part_id from part2_1 union all select .., 2 as part_id from part2_2 ... From there, each subsequent alter would alternate back to part1, and so on.

                                   

                                  > Further, if Teiid allowed materialized views to be created on foreign tables (pretty sure there's a jira for that), the application could become very sophisticated in its cache management.

                                   

                                  Yes, there is a JIRA, but it has not been prioritized given the simplicity of just adding a view.

                                  • 29. Re: Programatically invalid cache using cache API?
                                    markaddleman

                                    > You would have to adjust both the parent partitioned view and the relevant materialized views (and possibly trigger the appropriate refreshes) all of which are unfortunately not coordinated.

                                    Not sure I was clear in my earlier message.  I meant changing the view/table metadata object so that the selectTransform is an array.  The refreshMatView stored proc would be overloaded to add an additional parameter specifying the particular transform expression (i.e., partition) to be invalidated.  I don't think the CREATE VIEW DDL would have to change; rather, each piece of the UNION would map to a new element of the selectTransform array.  Obviously, this doesn't address the coordination problem.  I just want to be clear about the idea.

                                     

                                    > Further, if Teiid allowed materialized views to be created on foreign tables (pretty sure there's a jira for that), the application could become very sophisticated in its cache management.

                                    I forgot about external materialization.  I think that is fine for this use case.

                                     

                                    > However it seems that users always want the ability to tune things further on just the user queries.

                                    Fair enough.  I was hoping to change the problem enough to make it more tractable but if the solution misses the mark then there's no point.

                                     

                                    > What you are describing implies that you only want to cache static results and that windowed results would be left to throw a DataNotAvailableException if they haven't hit their window start.  A global source hint or execution payload can be used to communicate a window to all sources, but there's still no built-in handling implied.  Alternatively some notion of windowing could be added to the RequestOptions and some basic handling could be moved into the engine if needed.

                                     

                                    Yes, coordinating windows across multiple data sources is exactly the problem I'm trying to solve when I say I'd like to switch execution restart behavior from all data sources indicating data available to any data source indicating data available.  At this point, I need to thank you for walking me through the caching discussion.  Originally, I had planned to implement this window coordination mechanism in a delegating translator.  As I was thinking through it, though, I thought Teiid's caching mechanism wasn't up to the task (hence the original request).  The key for me is that we can expire a cache through either of two mechanisms:

                                    1. Synchronous to the query since getCacheDirective() is always consulted
                                    2. Asynchronous to the query using the EventDistributor

                                     

                                    If Teiid provided a coordinating mechanism, that would be convenient for us but I no longer think it's necessary.  My rough plan for the coordination looks like:

                                    1. Maintain a global map of request id to set of executions
                                    2. Delegating translator that catches DataNotAvailable exceptions and puts the appropriate request id and execution into the global map then rethrows DNA. 
                                    3. On a continuous execution, the delegating translator normally returns session cache scope from getCacheDirective.  The delegating translator can be put into an invalidate-cache mode in which it returns cache scope none exactly once then returns to its normal mode.
                                    4. Introduce some mechanism to intercept dataAvailable() calls (like a proxy for ExecutionContext)
                                    5. When the delegate translator calls dataAvailable(), the delegating translator is set to a mode to return cache scope none from getCacheDirective exactly once and then returns to normal behavior.  Also on dataAvailable(), consult the global map to call dataAvailable on all associated executions - normal getCacheDirective behavior here.
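                                    Steps 1, 2 and 5 above could be sketched as follows (all names hypothetical; the real mechanism would live in the delegating translator):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of steps 1-2 and 5 of the plan above: a global registry from
// request id to the executions blocked on DataNotAvailable, so that when
// one source signals data available the others can be woken too.
// All names are illustrative, not the Teiid translator API.
public class ExecutionRegistry {
    public interface Wakeable { void dataAvailable(); }

    private final Map<String, Set<Wakeable>> blocked = new ConcurrentHashMap<>();

    // step 2: record an execution that threw DataNotAvailable
    public void registerBlocked(String requestId, Wakeable execution) {
        blocked.computeIfAbsent(requestId, k -> ConcurrentHashMap.<Wakeable>newKeySet()).add(execution);
    }

    // step 5: one source has data - wake every execution for the request
    public int wakeAll(String requestId) {
        Set<Wakeable> set = blocked.remove(requestId);
        if (set == null) {
            return 0;
        }
        set.forEach(Wakeable::dataAvailable);
        return set.size();
    }
}
```

                                    The invalidate-cache mode toggling (steps 3-4) would sit alongside this, flipping what getCacheDirective returns for exactly one consultation.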

                                    Question:  If a translator asynchronously invalidates its cache using the EventDistributor, will an execution be created on the next query regardless of what the previous getCacheDirective returned?

                                     

                                    I'm not sure how to handle date/time based DNA.  In this case, it might be helpful if the execution context indicated the reason the execution was restarted - if restarted because of a timeout, the delegating translator would return cache scope none.

                                     

                                    Completely separately, it would be nice if Teiid would consult the cached results in different query contexts, but that's outside our immediate needs.