1 2 Previous Next 21 Replies Latest reply on Nov 21, 2013 6:01 PM by markaddleman

    Multiple columns for the multisource discriminator column.

    markaddleman

      We're taking a holistic view of the performance of our application (hence all the earlier questions about caching) and one of the issues that pops up repeatedly is more intelligently routing queries from Teiid to sources.  We have several instances where we could use multi-source models but the discriminating data between sources are across multiple columns.  Partitioned unions offer the intelligent query planning that we want but I like the dynamism offered by multi-source models from [TEIID-2527] Dynamically change the sources in a multi-source model - JBoss Issue Tracker.  The upshot is a request to allow multiple columns as the discriminator.

        • 1. Re: Multiple columns for the multisource discriminator column.
          shawkins

          Neither multi-source nor partitioned unions understands multi-column partitioning. 

           

          Iit would be straight-forward to enhance the partitioning logic to look for array expressions in the where clause:

           

          SELECT ... WHERE (x.e2, x.e3) = (1, 'a')

          union all

          SELECT ... WHERE (y.e2, y.e3) = (2, 'a')

          ...

           

          Beyond that and we'd likely have to introduce a more formal metadata construct for telling Teiid the partitioning information.

           

          With the current design of multi-source, the multi-source element is an internal construct - so there isn't a concept of external discriminating data.

           

          Which of these directions were you thinking?

          • 2. Re: Multiple columns for the multisource discriminator column.
            markaddleman

            Ideally, I'm thinking an improvement to the multi-source feature so that would mean more formal metadata.  Again, the big advantage of that is the ability to add/remove sources dynamically.  I'm not sure what you mean by "the multi-source element is an internal construct - so there isn't a concept of external discriminating data" so I'm not sure how that affects the difference in the partitioned union approach versus multi-source.  I guess my question comes down to, does/would Teiid treat the multi-source model as a partitioned-union when it comes to query planning?

            • 3. Re: Multiple columns for the multisource discriminator column.
              shawkins

              > so I'm not sure how that affects the difference in the partitioned union approach versus multi-source.

               

              With multi-source the partitioning column is effectively forced upon the user.  It represents the source names as defined in the vdb and the engine will plan around and substitute those names as appropriate.  There is no notion that the data itself has a natural partitioning column.

               

              > I guess my question comes down to, does/would Teiid treat the multi-source model as a partitioned-union when it comes to query planning?

               

              It's not exactly the same.  I've tried to ensure that all the same logic applies to each, but in the multi-source case we do not create the intermediate unioned planning structure - that is done after optimization (and with TEIID-2527 is actually done at processing time).

              • 4. Re: Multiple columns for the multisource discriminator column.
                markaddleman

                Here are the some relevant facts about our use case:

                • Each partition represents a physical data source and each physical source is uniquely identified through one or more columns
                • We'd want the physical data sources to become part of the VDB when they are activated and drop out of the VDB when the are deactivated
                • Physical data sources activate & deactivate infrequently
                • These physical data sources know their partition key when they are activated

                 

                Here's what I understand of the trade-offs of the two approaches that you outline:

                • The multi-source model approach requires that we know the partition key at the time Admin.addSource is called.  While this could be inconvenient, I don't think it presents a big problem.
                • The partitioned-union approach requires us to redeploy the VDB every time a physical source is activated or deactivated.  Not a big deal since it occurs infrequently.
                • The partitioned-union approach requires us to enhance our query generation capability.  This isn't a big problem.
                • The partitioned-union approach potentially allows us more data-driven partitioning and to defer the partition naming to query time.  While I like the flexibility, I'm not sure we need it.
                • The partitioned-union approach would require multiple models all having identical metadata.  How does that affect memory costs?

                 

                How about this wild idea:  What if Teiid had a new model type, a shared-metadata model?  At VDB deploy time, a regular model is supplied.  At runtime, a new model could be deployed that shares its metadata with the deploy-time model.  The new model would be a separate named object within the VDB thus could be addressed specifically or, to emulate the existing multi-source model feature, it could be part of a partitioned-union.  Perhaps with this approach, we would gain all the advantages of multi-source (dynamic add/remove and lower memory cost) with the advantages of a partitioned union (deferring the partition naming to query time).

                • 5. Re: Multiple columns for the multisource discriminator column.
                  rareddy

                  These requirements seem some what similar to of distributed database, where a participating partitioned node comes in and out dynamically and has more resilience in terms of data availability when nodes fail. I thought partitioned tables more dependent upon the partition key (row key) data ranges that they host, than key definition itself. If key defined at a later time, does't data need to be redistributed among the nodes?

                   

                  As per using the common metadata model or not it really depends upon how the table is sharded by rows or columns. The current multi-source model does share the metadata model. I guess Teiid can use row based sharding, then use views on top to mimic column based to define either scenario.

                   

                  This is interesting topic, we should definitely improve our capabilities in this space.


                  Ramesh..

                  • 6. Re: Multiple columns for the multisource discriminator column.
                    markaddleman

                    > These requirements seem some what similar to of distributed database,

                     

                    That's an interesting observation and I think correct.  Our use case isn't about distributed databases since our sources all have distinct data sets which is why I'm happy with (hopefully) relatively simple extensions to the existing feature set.  Speculating a bit on distributed database storing redundant:  The cost of accessing data from one node may be significantly different than the cost on another node (I'm thinking of a lazy hierarchical storage system).  Some of the use cases get crazy complicated and I can see needing to either (1) open up the Teiid planner to allow the applications to influence the query execution or (2) moving all this problem up to the query level and making it the application's problem.

                     

                    Somewhat off-topic but related to #2:  In our situation, we are growing more sophisticated logic to generate queries.  In some cases, we issue queries to Teiid in order to build the query to answer user questions.  Unfortunately, SQL isn't really sophisticated enough to concisely express these meta-query influenced queries.  As a result, we have a paper-clip and bailing wire solution using Velocity in some cases and Java code that manipulates the Teiid language objects in other cases.  As a simple but illustrative example, we have a bunch of tables that are unioned together to form a data set.  We don't know what the set of tables is a-priori so we write Velocity scripts like this:

                    SELECT *

                    FROM (

                         #foreach($t in $db.query("select SchemaName, Name from sys.tables where Name='magic_table_name' and SchemaName like 'magic_schema_pattern'"))

                              SELECT * FROM ${t.SchemaName}.${t.Name} AS T

                              #if( $velocityHasNext ) UNION #end

                         #end

                    ) AS T1

                     

                    My point is that as data partitioning schemes becomes more sophisticated, I see the need to involve application semantics in the query execution somehow and SQL is a bad solution for this sort of thing.

                     

                    Anyway, this is unrelated to my more immediate problem:  multi-source or partitioned-union that supports multi-column partitions

                    • 7. Re: Multiple columns for the multisource discriminator column.
                      shawkins

                      The template above will work just fine using a procedure an dynamic sql.  Presuming that the template is actually defining a view, then there is too much of a performance hit for encapsulating as a procedure. 

                       

                      > Each partition represents a physical data source and each physical source is uniquely identified through one or more columns

                       

                      In multi-source the sources are partitioned by source name, so user queries need to be aware of the source names and use them appropriately to pick partitions or subsets.  Deferring this to the data does not make sense with the current design.  Can you give an example of how the multiple column partitioning would work?  And then is it for application level reasons why you wouldn't want to map that to the multi-source source name construct? For example by making the source name represent data via a simple concat: "part1-part2"

                       

                      > The partitioned-union approach would require multiple models all having identical metadata.  How does that affect memory costs?

                       

                      The memory requirement is increased, but we do attempt to canonicalize most of the string metadata so the footprint isn't quite as high as fully duplicate entries.

                      • 8. Re: Multiple columns for the multisource discriminator column.
                        markaddleman

                        > The template above will work just fine using a procedure an dynamic sql.  Presuming that the template is actually defining a view, then there is too much of a performance hit for encapsulating as a procedure.

                         

                        We used to expose these velocity templates as tables/views backed by a custom translator which would resolve the template and connect back into Teiid to execute the query.  We have recently switched to the following pattern:

                        1. At startup, we define views using DDL models of the form CREATE VIEW v AS SELECT null, null, null
                        2. Once the VDB is ready, we execute the velocity script and issue ALTER VIEW v AS <resolved velocity template>
                        3. Periodically, we recheck the resolved velocity script.  If it's different than before, we re-issue the ALTER VIEW

                         

                        This gets us the best of both worlds, I think.

                         

                        > Can you give an example of how the multiple column partitioning would work?

                         

                        Sure.  We might have a table representing the results of the Linux 'top' command.  In a cloud environment, we would have a collector running on each VM makes performance data, such as from top, available through Teiid tables.  To make the example somewhat contrived but realistic, we might have multiple groups of VMs where each group is uniquely identified by customer-id and within each group, we require that VM be uniquely named.  The source key is [customer-id, machine-name] because that uniquely identifies each collector running on each machine.  Each collector knows its customer-id and machine-name so it can publish fully-qualified data. 

                         

                        We want to push queries down to the collectors rather than publish collector data to a central repository thus the importance of routing queries intelligently to collectors that match the user's specified customer-id and machine-name.  We could introduce filtering intelligence into the translator but I'd rather not.

                         

                        > And then is it for application level reasons why you wouldn't want to map that to the multi-source source name construct? For example by making the source name represent data via a simple concat: "part1-part2"

                         

                        We could do that but it introduces complexity either on the user or on the application query generator.  Customers often use naming conventions to subdivide their VMs, for example prefixing the name of each machine as DB* if it's a database or JBOSS* if it's a jboss server.  So, if the user wants data from all of their databases, SELECT * FROM t WHERE customer-id='xyz' AND machine-name LIKE 'DB%', the application would have to disentangle the source column.  This isn't impossible if we use XML in the source column and make clever use of the XML query facilities but it is rather complex.  

                         

                        Perhaps a simpler approach would be maintaining a join-table of [customer-id, machine-name, source].  The database query from above would be represented as SELECT source-xref.customer-id, source-xref.machine-name, t.* FROM t JOIN source-xref ON top.source=source-xref.source WHERE source-xref.customer='xyz' AND source-xref.machine-name LIKE 'DB%'.  With appropriate query hints, I think that we can ensure that Teiid hits the source-xref table first to resolve all of the sources and then normal multi-source processing would handle the query routing.

                        • 9. Re: Multiple columns for the multisource discriminator column.
                          markaddleman

                          > This isn't impossible if we use XML in the source column and make clever use of the XML query facilities but it is rather complex. 

                           

                          It occurs to me that we could create a view on top of each multi-source table that disentangles the source-name and projects the multi-column equivalent.  This would keep our query generator logic simple but it would require a view for each table.  Maybe a clever application of a delegating translator could each the effort required to create the views...

                          • 10. Re: Re: Multiple columns for the multisource discriminator column.
                            shawkins

                            With a 2 part partitioning scheme and a source name like "<source customer-id='xyz' machine-name='db100'/>" you would have a view with a query such as:

                             

                            {code}SELECT xpathValue(SOURCE_NAME, '/source/@customer-id') AS customer_id, xpathValue(SOURCE_NAME, '/source/@machine-name') as machine_name ... FROM multisource.tbl{code}

                             

                            Then yes with the literal comparisons shown above, we would winnow the potential access nodes prior to execution.

                             

                            Beyond this we would have to substantially update the multi-source feature to support more than one multisource column - since the metadata and the logic are current geared toward just a single value.  Perhaps allowing for a non-string type (which could even be an array) would be a possible enhancement, then you wouldn't have to parse the value, just access array fields.  Or is the string representation is a little more predictable/restricted you could just use substring functions.

                            1 of 1 people found this helpful
                            • 11. Re: Re: Multiple columns for the multisource discriminator column.
                              markaddleman

                              > Perhaps allowing for a non-string type (which could even be an array) would be a possible enhancement, then you wouldn't have to parse the value, just access array fields

                               

                              I see this as a minor improvement as it is a tiny bit easier for the collectors to fill out arrays of data rather than form XML.  At this stage, I'm not interested in filing a JIRA for the enhancement, though.  We can work with the XML approach first and see how well it works.

                               

                              Thanks guys

                              • 12. Re: Re: Multiple columns for the multisource discriminator column.
                                shawkins

                                There is a new system function called tokenize, that we could promote to a properly documented function that takes a string (e.g. 'xzy/db100') and a delimiter and returns a string array:

                                 

                                tokenize(SOURCE_NAME, '/')[1] as customer_id, tokenize(SOURCE_NAME, '/')[2] as machine_id

                                 

                                and of course a udf could be used as well.

                                • 13. Re: Re: Multiple columns for the multisource discriminator column.
                                  markaddleman

                                  > There is a new system function called tokenize

                                   

                                  I'm not excited by using simple tokens to separate out columns because you end up having to be careful about the characters in the names themselves.

                                   

                                  On a somewhat different topic, I'd like to expose the Admin.addSource, removeSource, etc functions in a VDB to make it easier for our clients to add/remove data sources.  It looks like the Admin object is only available when Teiid is deployed in Jboss.  Is there some way to get at the underlying constructs without going through JBoss and security?  Although it isn't a big deal (we normally deploy in Jboss) but it's easier to test with embedded server and I'm thinking of implementing different security anyway - I think we'd want to secure acces to add/remove sources by model rather than a VDB-wide check.

                                  • 14. Re: Multiple columns for the multisource discriminator column.
                                    shawkins

                                    > Is there some way to get at the underlying constructs without going through JBoss and security?

                                     

                                    Not a good way.  The working deployment location is managed by JBoss and is not expected to be directly manipulated.

                                     

                                    > Although it isn't a big deal (we normally deploy in Jboss) but it's easier to test with embedded server and I'm thinking of implementing different security anyway - I think we'd want to secure acces to add/remove sources by model rather than a VDB-wide check.

                                     

                                    So this breaks down along:

                                    1. Is there a need for a more formal/pluggable deployment model for embedded - such that metadata caching and other server concerns are available there too.

                                    2. Should embedded expose an Admin interface?  Or at least allow access to some of the same operations in other ways?

                                     

                                    Really we can't do 2 unless 1 is in place as we are manipulating the vdb.xml files.  Or a compromise for testing would be to just allow the vdb runtime state to be updated without any notion of modifying a deployment.

                                     

                                    The thinking so far has been that anyone wanting advanced deployment management/admin features would be a user of the full server.

                                    1 2 Previous Next