11 Replies Latest reply on Dec 14, 2011 11:25 AM by shawkins

    Control over External vs Internal Materialization

    markaddleman

      I'll preface this by saying, I have not explored external materialization much.

       

      In order to know how best to materialize a view, we often need to examine something about the query context, either the query itself or some hint or payload data that lets us know the UI's intent for the data. In some instances, we will be generating large graphs of data that may be properly materialized in a graph database, for instance. Or, for the same data source, we may be doing more OLAP type processing where a relational database is the better backing store for the materialized view.

       

      My question is: Is there a way to plug into the Teiid materialization mechanism to examine the query, execution payload as well as the CREATE TABLE statement in order to more intelligently decide which is the proper backing store?

        • 1. Re: Control over External vs Internal Materialization
          shawkins

          External materialization just requires setting a materialization target on a particular view that is marked as materialized.  The target is typically a table in a source model.

           

          With the materialization target property set the engine redirects the query to the external target.  It is up to you to control how the data is loaded and refreshed in the external target.  In particular the cache hint and system procedures does not apply.  The basic approach is to generate a ddl from the view model and perform the initial load using INSERT INTO mat_target (....) SELECT * FROM view OPTION NO CACHE.  Refreshing can be as simple as a full refresh against an intermediate target and then swapping the live target with the intermediate one.

           

          At some point we'd like to take more of responsibility in maintaining external materialized views, but we've found that load/refresh strategies vary so greatly that something out of the box is rarely useful.  That is if you have performance considerations that go beyond built-in materialization, then you probably also need custom incremental refresh strategies and other considerations that would be hard for us to generally capture as a feature.

           

          As for using the execution context, I'm not sure that should come into play when considering view materialization.  It seems like you would know at design-time where the data should belong, rather than relying on something at execution time.  Are you referring more to the idea of temp table materialization on a particular source?  There is a long standing issue https://issues.jboss.org/browse/TEIID-196 along these lines that could be pushed up to the 8.x timeframe.

           

          Steve

          • 2. Re: Control over External vs Internal Materialization
            markaddleman

            I'm thinking about this problem through the lens of a particular use case:  We have data sources that produce very volatile data.  The primary keys are stable but the non-primary key columns change on a second by second basis.  Think hooking up something like linux top as a Teiid view.  Our users can use filtering to select a particular set of primary keys.  Once the set of primary keys are defined, we want to freeze the entire set of data so the user can perform various operations and receive consistent results.  For example, get a list of process ids, order by memory consumption, then order by CPU consumption, then order by memory consumption again.  The result of the third query should be the same as the first.  Initially, this gave rise to the desire to materialize a result set on the fly and direct subsequent queries against that result set to the materialized view.  It seems like using temp tables is another good way of achieving what is essentially read consistency within a transaction context.

             

            The complication comes that the user may want to see a graph of parent-child processes (this example is a bit contrived but I hope it gets the point across).  We will want to provide a frozen snapshot of data.  Suppose the graph is complex enough to warrant a graph database.  The only way I would know the appropriate materialization strategy is within the execution context of the query. 

             

            I could do this at a layer above Teiid and it may end up making sense to do that.  At this point, I'm trying to puzzle through the various implementation options and decide how best to proceed.

            • 3. Re: Control over External vs Internal Materialization
              markaddleman

              I read through https://issues.jboss.org/browse/TEIID-196 but it doesn't quite address my use case.  I can see the desire to have user defined materialization targets on a per data source basis (that's what I understand of the ticket) but I'm not familiar enough with EM to know how that differs from what is available today.  My ultimate desire is to define both the materialization strategy and the data materialized at query time.

              • 4. Re: Control over External vs Internal Materialization
                shawkins

                Creating a snaphot of a subset of data is a good use for temp tables.  There is some overlap with a materialization concept, but since this is on a per user query basis, it is not the same feature.  Materialization either internal or external is for what amounts to relatively fixed dataset defined by a view.

                 

                You are correct that the primary intent of TEIID-196 is to direct all temporary tables to a given source, which may not have much appeal (and isn't generally a good idea except for those that are federating against a single source or are making heavy use of a single external materialization target).  However the secondary intent of TEIID-196 is to be able to issue a CREATE LOCAL TEMPORARY TABLE statement with additional metadata, such as schema qualifying the table name to a particular source, so that the table resides on a source and not in the Teiid instance.  This gives fine grained control over where the data should reside, and should be the same thing you are trying to achieve by selectively shunting data to your graph db.  You would still have to populate any temp table(s) created on the graph db using relational insert statements which may be cumbersome.

                 

                A complicating factor is scoping like is touched on in https://issues.jboss.org/browse/TEIID-1822.  If the tables are session specific, but have the same structure regardless of the session, you could consider handing session awareness in a custom translator.  That is when the first query comes in accessing a given session scoped table you would create it as needed.  The missing piece to that approach, which admittedly pushes a lot of complexity to the custom translator, is that translators do not have a callback for when a session is removed.

                • 5. Re: Control over External vs Internal Materialization
                  markaddleman

                  I think I get it.  The new CREATE TABLE capabilities in TEIID-196 would be very important to us as we move toward data set snapshot functionality.  Notions of sessions are not very important, I think.  Currently, our client (a typical web app) uses connection pooling to connect to Teiid so Teiid's notion of a session isn't very helpful.  Instead, we would likely use global temp tables that the app associates with a user session and either explicit DROP TABLE semantics or a temp table timeout mechanism.

                   

                  This brings up a few questions in my mind:

                   

                      Is using a connection pool to connect to Teiid a good approach?

                      In the discussion of TEIID-1327 we had talked about an ability to manipulate the client SQL before it reaches the planner.  It seems like a general capability to insert a translator-like thing to manipulate client SQL would be a good solution to this and many other problems that we have.  I suppose we could define two VDBs in a pipeline:  The lower one would expose our data sources and the upper one would delegate requests through a TeiidExecutionFactory which could apply all of our SQL transformation rules (such as very fine-grained security, app level session management, temp table population, etc).  I wonder about the performance impact of this implementation?

                  • 6. Re: Control over External vs Internal Materialization
                    shawkins

                    Based upon your usage a session still seems like a useful concept, but you have the age old problem of how to integrate that with a web app.  A connection pool is usually sufficient, but depending on your usage you could do better. 

                     

                    Teiid sessions are light-weight, as are embedded connections.  Since I believe you are are using embedded connections in a non-clustered environment, you could just create and hold an embedded connection in your web session (you may have to wrap it in another object since it is not serializable).  This would probably eliminate the need to introduce global tables and would also even allow you to perform more efficient cursoring by just holding ResultSet objects over web operations rather than re-issuing a query with a different offset (note that there are some other approaches that will be applicable to cursoring shortly.  Such as the odbc cursoring commands will also be available through JDBC in Teiid 8 and at some point we will look into improving resultset caching to be aware of limits).

                     

                    This approach applies to remote clients as well.  In a remote client scenario we have the underlying code that allows for session reestablishment.  With an enhancment, web conversations could easily be sticky to a particular instance/session.  Altough reattaching a statement would be significantly more work for remote clients.

                     

                    As for modifying the incoming SQL, you could use the AuthorizationValidator for that purpose with the understanding that your using a non-public engine command object API.  Note that in 7.4 via https://issues.jboss.org/browse/TEIID-1326 you can issue alter statements to change views as needed.  The benefit of altering views is that your using standard ddl, we hold a cached resolved form for efficency, and you may not then need to modify your incoming sql.  If you don't need to manipulate the command, you should just plugin a PolicyDecider for security instead (assuming data roles are not sufficient).

                     

                    As for the performance of a two stage solution.  The overhead of query translation into translator commands, then to sql and back again shouldn't be too much of a concern especially if you are using prepared plans in your second stage.

                     

                    Steve

                    • 7. Re: Control over External vs Internal Materialization
                      markaddleman

                      Hi Steve -

                       

                      Picking this thread back up...

                       

                      I've been thinking about associating a connection with a user's web session as you suggested for the sole purpose of getting access to local temp tables across multiple user requests.  Cursoring is only a performance issue for us and it looks like using the LIMIT clause gets us good enough performance without complicating the code like holding onto result set resources.

                       

                      The only question now is related to threading.  Currently, our application is read-only so there are no commit scope issues.  My concern comes down to this: is there a problem sharing the embedded connection object across threads?  It is quite possible, for example, that the user will issue two simulataneous requests for data.

                       

                      In the somewhat distant future, we may introduce transaction semantics into our application but, by that point, I'm hoping Teiid gets glocal temp tables we can go back to a typical web-app model.

                      • 8. Re: Control over External vs Internal Materialization
                        shawkins

                        Mark,

                         

                        Connections and Statements are not thread-safe.  Different statements may be used by different threads, but a Statement also unfortunately still has some interaction with Connection state that is not thread-safe - namely when a statement/resultset is closed or a set/transaction statement is issued.  So you would want to use minimal syncrhonization to ensure proper concurrency.  You can also submit a JIRA for Statement usage from the same connection to be concurrent, since it is not a large effort.

                         

                        Steve

                        • 9. Re: Control over External vs Internal Materialization
                          markaddleman

                          As long as we can (1) issue concurrent PreparedStatement.execute*() calls and (2) iterate over multiple result sets from a single connection objects in a thread safe manner, the synchronization issues don't bother me too much.  Of course, it would be easier if they were entirely thread safe so I'll submit a ticket for that. 

                          • 10. Re: Control over External vs Internal Materialization
                            markaddleman

                            Our app uses prepared statements exclusively.  Is this a model of properly synchronizing the Connection object?

                             

                            Connection c = ...;

                            PreparedStatement p;

                            synchronized (c) {

                               p = c.prepareStatement(...);

                               p.setXYZ(...);

                               p.setXYZ(...);

                            }

                            ResultSet rs = p.executeQuery();

                            while (rs.next()) {}

                            synchronized(c) {

                               rs.close();

                               p.close();

                            }

                            • 11. Re: Control over External vs Internal Materialization
                              shawkins

                              Yes assuming that you aren't using transactions (which may also mean that you would want to change your autoCommitTxn to OFF), that synchronization would be sufficient.  There would be no issue with the queries executing in parallel.

                               

                              Steve