3 Replies Latest reply on Sep 23, 2013 9:43 AM by shawkins

Teiid 8.4 performance(caching/query plan) questions

anilallewar Sep 19, 2013 5:03 PM

Hi,

I am looking at Teiid caching to improve the performance and also minimize the time taken for query planning. Our observation has been that for large datasets, the time taken by Teiid to return data from the source (there is 1-1 mapping between source and view model) is significantly higher than for small datasets for the same entity. I have also noticed that the 1st query for an entity takes a large amount of time (say 18 secs) while subsequent execution of the same query takes 3-4 secs with no result-set caching.

In order to address this problems, I am looking into Teiid caching/query plans and have some questions about how Teiid internally manages it.

1. As per the documentation, the source tuples (batch of data fetched) is stored in the buffer manager which handles the memory and disk storage and we have the result-set cache which stores the result set cache entries. So the question is what is exactly held in the resultset cache? Is it mapping from the exact query to the location of source tuples that hold data to the select query? If the select resulted in more than 1 batch (and hence multiple source tuples) how are these multiple source tuples matched in the resultset cache?

2. Based on the documentation it looks like the resultset cache is already turned on but query results are not cached by default; is this correct? So the way to turn on caching for resultset would be to add a hint in the query like "/*+ cache/Select * from A" or have the statement run the "set resultSetCacheMode true" query first. If each subsequent query requires the same hint, would running the resultSetCacheMode be a better option for resultset caching?

3. What is the preparedplan-cache used for and what is held at each node of the prepared plan cache? Is the sql statement the key and parsed query plan the value in the datagrid? This question will help us determine how to tune plan cache.

4. How is the scope of the cache determined by Teiid to be session/user/vdb? It can be over-ridden in the query hint but I don't really want to use query hint because it is not standard JDBC and doesn't lend itself to ORM solutions like JPA.

5. What are the implications of moving Teiid memory off the heap? How is the reference to previous source tuples swapped out in case the off heap memory becomes full?

6. Based on the architecture writeup; Teiid uses cursors for all data access. How do the cursors interact in caching? We use CursoredStream to hold a reference to large data sets and read data in batches from the CursoredStream. Will Teiid cache the results from the CursoredStream opened on the Teiid JDBC connection?

7. How is the invalidation of resultset cache handled when suppose that the user add/update/delete a row from the entity and this affects the resultset?

8. If we need to reduce the amount of time taken for query planning and make the first instance of the same query take the same time; would firing the same query with "set noexec on" be a valid option? Are there any alternatives to improving the query plan as the noexec assumes that I know the queries that I need to prepare plan for beforehand.

9. I am assuming that support for Infinispan in server mode would be defined by the Infinispan subsystem. Based on how the cache and buffer manager are used we will think about whether to use Infinispan in embedded mode or client-server mode.

Thanks in advance in reading through this long list of questions

Anil

1. Re: Teiid 8.4 performance(caching/query plan) questions

shawkins Sep 19, 2013 7:28 PM (in response to anilallewar)

> I have also noticed that the 1st query for an entity takes a large amount of time (say 18 secs) while subsequent execution of the same query takes 3-4 secs with no result-set caching.

Given no result set caching you wouldn't expect such a variation in timing. Are you using prepared statements? Have you examined the logs or tried first issuing a the query with no exec on to determine if the time is being spent in planning? Otherwise there's not much from the Teiid side that would account for the variation - there could be an issues with how long it takes to get connections from your pool etc. I think you need to understand exactly how the time is being spent before looking for optimizations, but I'll take a stab at answering your questions just in case.

> So the question is what is exactly held in the resultset cache? Is it mapping from the exact query to the location of source tuples that hold data to the select query?

The result set cache is used for user query result caching and for source query caching (using the CacheDirective). It is keyed by the query and several other factors, such as the vdb, user, and session depending upon the scope.

> If the select resulted in more than 1 batch (and hence multiple source tuples) how are these multiple source tuples matched in the resultset cache?

The storage unit of the result set cache is a batch, but they are tracked as a unit. No single batch is purged from all layers of cache until the entire result is purged.

> Based on the documentation it looks like the resultset cache is already turned on but query results are not cached by default; is this correct?

Yes, since we are not aware of all data changes (as the sources can be accessed outside of Teiid) we cannot assume results should always be cached.

> So the way to turn on caching for resultset would be to add a hint in the query like "/*+ cache/Select * from A" or have the statement run the "set resultSetCacheMode true" query first.

Yes.

> If each subsequent query requires the same hint, would running the resultSetCacheMode be a better option for resultset caching?

Not necessarily. Setting the resultSetCacheMode executionProperty does not give you any of the options that are available with the cache hint.

> What is the preparedplan-cache used for and what is held at each node of the prepared plan cache? Is the sql statement the key and parsed query plan the value in the datagrid? This question will help us determine how to tune plan cache.

A prepared plan is the just as you would expect. It the processing plan generated by the optimizer for some given sql. The key is primarily the sql, but also must be tracked by vdb and possibly the session. At this time prepared plans are not distributed/replicated across a cluster as the plan is not serializable. I'm not sure how much tuning of the plan cache you'll need to do. It's more of a question of when should you be using PreparedStatements which are primarily responsible for creating a prepared plan entry - although if you are using CallableStatements or in general calling stored procedures then we are likely creating prepared plans under the covers.

> How is the scope of the cache determined by Teiid to be session/user/vdb? It can be over-ridden in the query hint but I don't really want to use query hint because it is not standard JDBC and doesn't lend itself to ORM solutions like JPA.

The scope is determined by the sql. By the functions called, session tables referenced etc. It needs to be overriden though because we currently do not know if for example you're using a source that will return different results based upon the user. The hint is just fine from a JDBC perspective (since it's just a sql comment), but yes it may be difficult to express correctly through JPA.

> What are the implications of moving Teiid memory off the heap? How is the reference to previous source tuples swapped out in case the off heap memory becomes full?

You have to first understand the memory layers in Teiid. There is the heap layer - this is the estimated heap memory held by the buffermanager (in dual queues) and potentially in use by plans. We'll proactively and based upon memory pressure copy/move values to a fixed memory buffer. At this point the batches have a know (and generally smaller) size. This fixed memory buffer layer can be either on or off heap. Generally off-heap will be better performance when you have volumes of data buffered through Teiid - as you will avoid GC sweeps through a large amount of memory and on some platforms can allocate more memory than you generally can to your java process as heap memory. Entries that are evicted from the fixed memory buffer spill to disk.

> Based on the architecture writeup; Teiid uses cursors for all data access. How do the cursors interact in caching?

I'm not sure what you are asking. The general principle is that we want to avoid buffering data whenever possible. Thus as much as data can be cursored/streamed in batches we'll do it. When you cache you generally end up with a copy of the entire results and then cursor on whichever batch is being accessed.

> We use CursoredStream to hold a reference to large data sets and read data in batches from the CursoredStream. Will Teiid cache the results from the CursoredStream opened on the Teiid JDBC connection?

I'm not familiar with CursoredStream. How specifically does it interact with the client?

> How is the invalidation of resultset cache handled when suppose that the user add/update/delete a row from the entity and this affects the resultset?

Changes are tracked to a table not row level, but of course by default we'll only know about changes issued through Teiid. From there the cache is configured with a max staleness setting to say how long you want entries to stay in the cache even after they are stale. There is an api to make Teiid aware of source changes outside of Teiid.

> If we need to reduce the amount of time taken for query planning and make the first instance of the same query take the same time; would firing the same query with "set noexec on" be a valid option? Are there any alternatives to improving the query plan as the noexec assumes that I know the queries that I need to prepare plan for beforehand.

You would use the noexec (or a detailed log) to confirm the time spent in planning, but ideally you would not use that as a mechanism to warm the prepared plan cache. If you can provide more details on your scenario, then we could recommend next steps.

> I am assuming that support for Infinispan in server mode would be defined by the Infinispan subsystem. Based on how the cache and buffer manager are used we will think about whether to use Infinispan in embedded mode or client-server mode.

It's more of a question of what benefit do you expect from the replication of results. Infinispan in the context of this discussion is mainly being used to replicate the keys for vdb/user result set caching. The result set cache data is replicated on demand directly through jgroups.

> Thanks in advance in reading through this long list of questions

No problem. They're good questions. Hopefully we'll get your issue addressed without too much hassle.
Actions
2. Re: Teiid 8.4 performance(caching/query plan) questions

anilallewar Sep 20, 2013 5:42 PM (in response to shawkins)

Thanks Steven!!

I'll work more on analyzing the plans and ask more questions as needed. Based on your feedback, I have a few more questions.

1. I still couldn't get as to what is stored as the value for each key in the resultset cache. Is it the whole batch of results or just a pointer to the batch stored in BufferManager? My assumption would be that the BufferManager memory is separate from the one used by Infinispan for the cache; is that correct?
2. From a prepared statement perspective, if we have more concurrent users and the application uses prepared statement would it be fair to say the size of the preparedplan-cache should be increased to cater to concurrent users and also keeping the plan for most SQLs in memory?
3. What is the default scope of the cache session/user OR vdb? Is there a way to specify the default scope or the scope of the query without using query hints?
4. With reference to Teiid cursors, my assumption is that Teiid first prepares all the results in the BufferManager (after source joins/transformations etc) and then these result batches are streamed to the client. Is this assumption correct?
5. The CursoredStream is an implementation provided on EclipseLink JPA that reads the resultset from the JDBC query and provides that in batches to the calling program. http://www.eclipse.org/eclipselink/api/1.1/org/eclipse/persistence/queries/CursoredStream.html. In our case the CursoredStream in built on top of data received through Teiid JDBC driver and the socket connection to Teiid will be kept open till the cursor is being read.
6. I am assuming that when client creates a new object for say entity A and we have results for entity A in cache, those results won't be invalidated automatically. Would it be fair to say that if I set the max staleness parameter to 0 or -1, the resultset would be evicted from the cache immediately? Is there a programmatic way to invalidate that entry and remove the resultset from the BufferManager?
7. There is an api to make Teiid aware of source changes outside of Teiid. => I would be very interested in knowing more details about this as part of another use case. Can you please point me to the relevant documentation?
8. The result set cache data is replicated on demand directly through jgroups. => Is this the replication of BufferManager?

Thanks again!!

Anil
Actions
3. Re: Teiid 8.4 performance(caching/query plan) questions

shawkins Sep 23, 2013 9:43 AM (in response to anilallewar)

> I still couldn't get as to what is stored as the value for each key in the resultset cache. Is it the whole batch of results or just a pointer to the batch stored in BufferManager? My assumption would be that the BufferManager memory is separate from the one used by Infinispan for the cache; is that correct?

The value is effectively a reference to a memory structure (TupleBuffer) that holds keys for each of the batches in the result that are stored by the BufferManager. Infinispan only handles the caching/replication of the keys, since at the time the integration work was performed Infinispan did not provide good handling for what would be effectively large cache values such as would be the case here with the entire result.

> From a prepared statement perspective, if we have more concurrent users and the application uses prepared statement would it be fair to say the size of the preparedplan-cache should be increased to cater to concurrent users and also keeping the plan for most SQLs in memory?

It's not really a question of the query load per se, unless you are using prepared statements for all sql. The rule of thumb would be to use a prepared statement for any repeated sql. If you make extensive use of stored procedures and you have an estimation of the total distinct queries that will be issued via prepared statements and that vastly exceeds the default cache size, then yes you should increase it - but the memory footprint of plans can be on the order of 10KB or more.

> What is the default scope of the cache session/user OR vdb? Is there a way to specify the default scope or the scope of the query without using query hints?

VDB is the default for user query result set cache entries. There is no current mechanism to change the default beyond the hint.

> With reference to Teiid cursors, my assumption is that Teiid first prepares all the results in the BufferManager (after source joins/transformations etc) and then these result batches are streamed to the client. Is this assumption correct?

Not unless the query is issued under a transaction. If not, then we will avoid buffering when possible.

> Would it be fair to say that if I set the max staleness parameter to 0 or -1, the resultset would be evicted from the cache immediately? Is there a programmatic way to invalidate that entry and remove the resultset from the BufferManager?
> There is an api to make Teiid aware of source changes outside of Teiid. => I would be very interested in knowing more details about this as part of another use case. Can you please point me to the relevant documentation?

If you issue an update through Teiid and the staleness is 0, then yes affected entries will be purged immediately. The programatic way to invalidate based upon data changes it to use the EventDistributor.dataModification method. It's touched on in Internal Materialization - Teiid 8.6 (draft) - Project Documentation Editor and Programmatic Control - Teiid 8.6 (draft) - Project Documentation Editor

> The result set cache data is replicated on demand directly through jgroups. => Is this the replication of BufferManager?

It's the selective replication of result set cache TupleBuffers, which yes are stored in BufferManager.
Actions

Go to original post