5 Replies Latest reply on Apr 10, 2013 2:50 PM by Steven Hawkins

    Possible additional optimization:  consolidate UNION against same data set

    Mark Addleman Master

      Our app generates a lot of queries like this:

      • SELECT a FROM t WHERE x='abc'
      • SELECT b FROM t WHERE x='abc'

       

      I'd like to execute these queries in parallel with

      SELECT a FROM t WHERE x='abc'

      UNION

      SELECT b FROM t WHERE x='abc'

       

      These queries could be rewritten as SELECT a,b FROM t WHERE x='abc' but the decision whether to process the client query as a union or a two column select,

       

      For some translators, processing the two, single column selects in parallel is going to be preferred and in other cases, processing the single query will be preferred.  The translators support union or not depending on the performance characteristics of the data source. 

       

      I'd like to propose a new Teiid optimization to combine queries within a union when they have the same data set (not sure of the proper term; I mean table + where clause) if the translator does not support union.  So, if a translator does not support union,

      SELECT a FROM t WHERE x='abc'

      UNION

      SELECT b FROM t WHERE x='abc'

      would be rewritten as

      SELECT a,b FROM t WHERE x='abc'

       

      Thoughts?

        • 1. Re: Possible additional optimization:  consolidate UNION against same data set
          Steven Hawkins Master

          This is covered more broadly under https://issues.jboss.org/browse/TEIID-2139 and is basically working via a uncommitted patch - but a similar test was stubbed out https://github.com/teiid/teiid/blob/master/engine/src/test/java/org/teiid/query/processor/TestCommonTableOptimizations.java

           

          > These queries could be rewritten as SELECT a,b FROM t WHERE x='abc'

           

          This builds on the shared tuple source logic because while the source query could rewriten to that, the results from there would still need to be fed into their respective parts of the plan - typically with post processing to match the original source query (projecting a subset, applying a specific filter, etc.). 

           

          It was not commited because the logic to implement it more fully - to share joined tables, distinct, multiple full scan predicates, etc. - was getting too expansive and in most typical case sources would simply support union/union all anyway.  This led me to want to pursue a more hint driven approach on views, but that was never completed either.

           

          Steve

          • 2. Re: Possible additional optimization:  consolidate UNION against same data set
            Mark Addleman Master

            Interesting, I didn't expect that kind of rewriting to fall under TEIID-2139 but as you mention it, I can see it.  One of our key use cases is would benefit from the specific rewriting case described above.  We could do it in our query generation layer, but it seems like leaking information from the translator into layers of the application where it doesn't belong though I wouldn't let purity keep us from implementing the rewriting because I expect the benefit to be pretty big.  I think we'll plan on implementing a fairly naive system in our query generator and see how TEIID-2139 progresses.   We can always delete our logic in favor of a more general mechanism when, and if, it becomes available.

             

            Thanks for the insight.  Can you elaborate a bit on what you envision for a "more hint driven approach on views?" 

            • 3. Re: Possible additional optimization:  consolidate UNION against same data set
              Steven Hawkins Master

              > Can you elaborate a bit on what you envision for a "more hint driven approach on views?"

               

              A hint to the optimizer to share a table/view appearing multiple times in the query would allow for shared results without the issue we currently have which is that the optimization process may create source queries that cannot effectively shared by a post optimization check.

              • 4. Re: Possible additional optimization:  consolidate UNION against same data set
                Mark Addleman Master

                Would there be a reason to avoid using the hint all the time?  To answer my own question, I suppose the semantics of certain data sources might make that undesirable such as a random number generator.  It seems to me the default ought to be to share results while a hint would be required to avoid the behavior.

                • 5. Re: Possible additional optimization:  consolidate UNION against same data set
                  Steven Hawkins Master

                  > Would there be a reason to avoid using the hint all the time?

                   

                  Yes. If you were to assume that any instance of a view / table were to have its results shared, that could dramatically alter the approach to optimization and processing.  Most importantly you would typically only want to share full scan access - and even then if there are conflicting limits or applicablity of distinct you can end up bringing back many more results than the queries would have individually.  You could only share index scans against the same index (e.g. just an IN against the same key).  And ff you share an index scan and full scan access, then you have effectively slowed the index access down to a full scan as well.  There are further complications as you would be creating a more serial plan (the greater parallalism of the non-shared case may be exploitable to reduce the latency of results), potentially buffering more resutls for a greater period of time than otherwise needed, and as you point out with determism.