11 Replies Latest reply on Aug 20, 2012 3:55 PM by shawkins

    Federated polymorphic foreign keys

    markaddleman

      A thought for weekend hammock time:

      Our application has a pattern that I expect is somewhat common across apps based on Teiid:  We have several tables across different data sources that share a common set of columns.  We union these tables together in a view which shows the common columns as well as a column containing a literal indicating the underlying table name. The key problem (pun intended) is the primary key structure varies between the underlying tables.  Currently, the view reforms each primary key as an XML document in a single column. 

      The problem with this approach is we have tables that have a foreign key into the view.  These tables, obviously, encode the foreign keys in the same XML format as the view.  Unfortunately, Teiid cannot push these join queries down to the underlying data source so is pretty inefficient.

       

      This got me thinking… What if Teiid introduced a new type (perhaps represented as XML) so it could innately understand the underlying column structure.

      For example, suppose I have two tables T1 with primary key structure [a] and T2 with primary key structure [b,c].  I imagine I could union these together in a view defined as

      create view V (
      pkey SPECIAL_TEIID_KEY_TYPE
      ) as
      select createSpecialKeyType(a) from T1
      union
      select createSpecialKeyType(b,c) from T2

       

      Now, I could create a table, T3, with a foreign key into V.  I imagine I could join T3 and V with a straightforward query:  select * from T3, V where T3.fkey=V.pkey

      It seems to me that Teiid has a good shot at encoding enough information into values of this new type so that the planner could execute two queries. 
      T1:  select * from T3, T1 where extractValue(T3, a)=T1.a
      T2:  select * from T3, T1 where extractValue(T3, b)=T2.b and extractValue(T3, c)=T2.c

       

      Thoughts? 

        • 1. Re: Federated polymorphic foreign keys
          shawkins

          Hi Mark,

           

          There are a couple of issues here.  Let's start with the join push-down.  When using partitioned views, e.g.

           

          create view parent as select '1' as source, x, y from a union al select '2' as source, x1, y1 from b;

          create view child as select '1' as source, x, y from c union al select '2' as source, x1, y1 from d;

           

          The optimizer should automatically break down the join "select * from parent, source where parent.source = child.source" as the union all of constituent joins (a x c) and (b x d).  Any additional join or other predicates will push-down appropriately.

           

          Using XML to emulate a struct/complex type is appealing, but being non-comparable makes it unsuitable as a key type and as you are getting at would require some special semantics and join handling.  The simplistic workaround would be to make a compound pk as the super-set of the literal column and all of the respective pk columns involved in the union.  In your example that would make the union primary key the source column, a, b, and c.  Any table that has a foriegn key would then need to match that number of pk columns and project nulls for any missing values. 

           

          Steve

          • 2. Re: Federated polymorphic foreign keys
            markaddleman

            re partitioned unions:  Understood.  We could gain some if our current implementation of primary key structure used two columns, one for the type-name (constant) and the other is XML encoded of the primary keys. 

             

            re using a superset of columns:  I'll have to give this some thought.  One of my thoughts about the special data type is I could write a data source + translator that understood the data type and store it directly (a graph database is appealing here).  Perhaps I can achieve the same thing by a superset of columns.  It is certainly less convenient.

             

            As it stands right now, our existing approach to the problem is adequate (combination of XML, special application logic to form proper queries and some special handling in data soruces).  We aren't suffering from any performance problems yet...

             

            I'm thinking about alternative implementations:  Can stored procedures or functions take variable number of input parameters?  I'm toying with an idea that we could create a function that would take in the literal and primary key columns and return our XML form.  Using a function would only make our queries a little easier to read.  Alternatively, if stored procedures can take variable parameter list, then we could write a stored procedure to take in a fully-qualified primary key, look up the corresponding row from the appropriate table (that should be fast since the corresponding query could be pushed down) and return the result in XML form.

            • 3. Re: Federated polymorphic foreign keys
              shawkins

              Functions can have variable arguments, see FunctionParameter.setVarargs.  However the stored procedure resolving logic does not currently allow it.  XMLFOREST may also be of some interest.

               

              Steve 

              • 4. Re: Federated polymorphic foreign keys
                markaddleman

                The simplistic workaround would be to make a compound pk as the super-set of the literal column and all of the respective pk columns involved in the union.  In your example that would make the union primary key the source column, a, b, and c.  Any table that has a foriegn key would then need to match that number of pk columns and project nulls for any missing values.

                I don't see this as a practical solution for a large number of tables.  We have hundreds of tables (and growing) each of which is potentially a related to a a few other tables through a foreign key.  Projecting all of the keys would yield, potentially, a few thousand columns.  I fear, at some point, we run into some physical limits (memory, stack space...)

                 

                Absent a new Teiid type, I think the best solution is to XML-ify or stringiify the primary keys to collapse them to a single column. 

                • 5. Re: Federated polymorphic foreign keys
                  markaddleman

                  I've been reading through the Teiid 8 docs (on a side note, the docs are excellent!  I really appreciate the descriptions of the internals) looking for ideas to help with the polymorphic foreign key problem.  Here's what approach that I've come up with that I'd like to run by you.  Again, the overall goal is to achieve federated polymorphic foreign keys with good pushdown support and as little custom translator work as possible.

                   

                  Suppose we created two functions functions like:

                  CREATE VIRTUAL FUNCTION fk(source_table_name string, dest_table_name string)
                  RETURNS object
                  OPTIONS (JAVA_CLASS ..., JAVA_METHOD ..., VARARGS 'true', DETERMINISM <not sure yet> )
                  CREATE VIRTUAL FUNCTION pk(table_name string)
                  RETURNS object
                  OPTIONS (JAVA_CLASS ..., JAVA_METHOD ..., VARARGS 'true', DETERMINISM <not sure yet> )
                  CREATE VIRTUAL FUNCTION convert_to_universal_key_representation(table_name string, value object)
                  RETURNS object
                  OPTIONS (JAVA_CLASS ..., JAVA_METHOD ..., VARARGS 'true', DETERMINISM <not sure yet> )

                  (I think I'm misunderstanding something in the docs:  I don't know how to declare the function as pushdown allowed.  Also, I'm not sure what happens if pushdown is required in the case where the query involves two data sources joined on the result of the function applied to each data source.)

                   

                  The key here is that the function returns some arbitrary Java object with well-defined a equals() method.

                   

                  To ensure that all translators support the fk function, we would use a delegating translator which declares the fk function in the getMetadata() and intercepts its use in createExecution():  The command expression would be rewritten to replace the call to fk with some rewritten form based on metadata.

                   

                  In the examples that follow, t1 contains a 'col1' which is an internal representation of a polymorphic foreign key.  t2 is a table with primary keys pk1 and pk2. t3 is a table with traditional foreign keys into t2, fk1 and fk2.

                   

                  Case 1:  Join t1 and t2, translator supports joins

                  1. Client query: SELECT t1.* FROM t1 JOIN t2 ON fk('t1', 't2')=pk('t2')
                  2. Teiid pushes entire query down to the translator
                  3. Delegating translator rewrites query as:  SELECT t1.* FROM t1 JOIN t2 JOIN t1.col1=internal_representation_pk(t2.pk1, t2.pk2)

                   

                  The implementation to generate the internal representation would be pluggable and driven by some metadata (translator property?)

                   

                  Case 2: Join t1 and t2, join cannot be pushed down

                  1. Client query: SELECT t1.* FROM t1 JOIN t2 ON fk('t1', 't2')=pk('t2')
                  2. Teiid breaks the query up into:
                    1. SELECT t1.*,fk('t1', 't2') FROM t1
                    2. SELECT pk(pk1, pk2) FROM t2
                  3. Delegating translator rewrites queries as:
                    1. SELECT t1.*, convert_to_universal_key_representation('t1', col1) FROM t1
                    2. SELECT convert_to_universal_key_representation('t2', pk1, pk2) FROM t2
                  4. Teiid performs join as normal.

                  Questions:

                  1. Is there a problem if the internal representation is an arbitrary, serializable Java object?  (assuming it has the proper equals/hashCode/compare semantics)
                  2. Assuming convert_to_univeral_key_representation is fast, does Teiid join these data as efficiently as traditional column approach?
                  3. It seems like some source-specific information is necessary to know how to perform convert_to_universal_key_representation('t1', col1) such as knowing that col1 is already in a source-specific internal representation and how to convert that to universal representation.

                   

                  Case 3: Join t2 and t3, join pushed down

                  1. Client query: SELECT t3.* FROM t3 JOIN t2 ON fk('t3', 't2')=pk('t2')
                  2. Delegating translator rewrites query as: SELECT t3.* FROM t3 JOIN t2 ON fkey1=pk1 AND fkey2=pk2

                   

                  Case 4: Join t2 and t3, translator does not support joins

                  1. Client query: SELECT t3.* FROM t3 JOIN t2 ON fk('t3', 't2')=pk('t2')
                  2. Teiid splits query into:
                    1. SELECT t3.*, fk('t3', 't2') FROM t3
                    2. SELECT pk('t2', pk1, pk2) FROM t2
                  3. Delegating translator rewrites queries as:
                    1. SELECT t3.*, convert_to_universal_key_representation('t3', fkey1, fkey2) FROM t3
                    2. SELECT convert_to_universal_key_representation('t2', pk1, pk2) FROM t2

                   

                  Functions fk and pk can be provided by the delegating translator.  fk needs to know foreign key relationships and if a column is in an source-specific, polymorphic fk format.  Information about foreign key relationships is available from Teiid metadata.  Translator properties could supply information indicating which columns are in source-specific formats.

                   

                  What do you think?  If you think this is a workable approach, I'm going to start on it.  My intention will be to design it for generic use and, if the implementation turns out well, I'd love to donate it back to the community.

                  • 6. Re: Federated polymorphic foreign keys
                    shawkins

                    > I think I'm misunderstanding something in the docs:  I don't know how to declare the function as pushdown allowed.  Also, I'm not sure what happens if pushdown is required in the case where the query involves two data sources joined on the result of the function applied to each data source.

                     

                    The DDL approach makes a simplifying assumption that either you want "CAN_PUSHDOWN"/VIRTUAL or a "MUST_PUSHDOWN"/FOREIGN function - although I see that the grammar is missing the expectation of using the FOREIGN keyword.  There is not an option to create a "CANNOT_PUSHDOWN" function.

                     

                    If you have a MUST_PUSHDOWN function and it cannot be evaluated at the source, then execution will fail.  So if it's possible that you want to run the funtion in Teiid, then create it as a VIRTUAL function.

                     

                    > To ensure that all translators support the fk function, we would use a delegating translator which declares the fk function in the getMetadata()

                     

                    Since you are adding fk as a single virtual function you would actually want to declare support for it in the getSupportedFunctions list.  Adding a function into getMetadata would create a bunch of distinct functions.  Here also there is a refinement in Teiid that is needed in 8.1, since we want non-system functions to be checked in the supported list by full name, rather than just their short name.

                     

                    > Questions:

                    > Is there a problem if the internal representation is an arbitrary, serializable Java object?  (assuming it has the proper equals/hashCode/compare semantics)

                     

                    Yes.  The type system will not allow you to use Object as a comparable type.  You would be unable to use it in equality, sorting, grouping, etc.  XML is also a problem.  The quickest workaround is to use a string or clob (with clob comparison enabled).

                     

                    > Assuming convert_to_univeral_key_representation is fast, does Teiid join these data as efficiently as traditional column approach?

                     

                    Given that the type must be comparable, then yes Teiid will use a sort merge join or whatever appropraite optimization to speed the join processing.

                     

                    > It seems like some source-specific information is necessary to know how to perform convert_to_universal_key_representation('t1', col1) such as knowing that col1 is already in a source-specific internal representation and how to convert that to universal representation.

                     

                    The function arguments in general will need to be column references and not literals.  At the very least aliasing complicates trying to use a literal as a table reference.  So you'll have something like "fk(t1.col1, t2.col1) = pk(t2.col1)" instead.

                     

                    > What do you think?  If you think this is a workable approach, I'm going to start on it.  My intention will be to design it for generic use and, if the implementation turns out well, I'd love to donate it back to the community.

                     

                    I think I need to give this problem more thought.  The initial biggest issue is the type representing the keys.  We'll want to flesh this out some more for sure.

                    • 7. Re: Federated polymorphic foreign keys
                      markaddleman

                      A quick thought:  In general, it seems very useful for Teiid to treat Java objects as a comparable type.  The onus would be on the application to provide proper equals and hashCode support.  I suspect that Teiid could treat java.lang.Comparable as optional, only allowing ORDER BY on objects that implement it.  With this capability, a stored procedure and some functions, applications could join any collection of objects with any other data source.  Datomic treats collections this way.

                       

                      A further embellishment would add support for scripting languages to achieve data access.  I imagine something like:

                      SELECT script_fn("groovy", "$.getValue") FROM (CALL java_iterable(iterable)) t

                       

                      Back to the original topic:

                      Yes, more thought is needed.  We're in the last stages of a release and wrapping up planning for our new one so I'll have to put this on the back burner for a couple of weeks but if you have questions about the use cases, please post here and I'll respond as quickly as I can.

                       

                      Thanks

                       

                      Message was edited by: Mark Addleman - Corrected my example and realized that the script_fn would probably have to take the table reference, something like SELECT script(t, "groovy", "$.getValue").  Not sure if that's allowed under current rules.

                      • 8. Re: Federated polymorphic foreign keys
                        shawkins

                        > A quick thought:  In general, it seems very useful for Teiid to treat Java objects as a comparable type.  The onus would be on the application to provide proper equals and hashCode support

                         

                        To be comparable, compareTo is required as interally we don't differentiate the other comparison operations (<, >, etc.) from =.

                         

                        > With this capability, a stored procedure and some functions, applications could join any collection of objects with any other data source.  Datomic treats collections this way.

                         

                        You'll need to provide a full example.  There is one step to map a collection to rows, there is another step to determine a relation between rows.

                         

                        > A further embellishment would add support for scripting languages to achieve data access.  I imagine something like:

                        SELECT script_fn("groovy", "$.getValue") FROM java_iterable(iterable)

                         

                        In you're example I would think that you still need to explicitly pass some object reference into the script function.  At a base level we currently support XQuery via XMLQuery/XMLTable.  Adding MVEL/Groovy/etc. can also be done and would have some corresponding scriptquery/scripttable constructs for scalar/tabular operations - e.g. SELECT t.* FROM SCRIPTTABLE('groovy', '$.iterator', PASSING collection_object COLUMNS x string PATH '$.getSomething') AS t ... - with the expectation that the row generation expression will evaluate to an iterator (or a collection) that Teiid would iterator over and evaluate the columns against.  Having a higher level short-cut, such as a java_iterable table function would be straight-forward.

                         

                        Of course if we do know the metadata of the object classes, then we can do more.  In this case we are also considering adding more support JPA style object query semantics are we further integrate with sources like Infinispan or at least provide additional options for existing object sources, such as salesforce.

                        • 9. Re: Federated polymorphic foreign keys
                          shawkins

                          https://issues.jboss.org/browse/TEIID-2061 has been logged to cover at least allowing object equality.

                          • 10. Re: Federated polymorphic foreign keys
                            markaddleman

                            Picking this back up after some additional hammock time.  Instead of thinking of this problem as a foreign key problem, I'm thinking of this as a query problem.  A foreign key can be represented as a query.  For example, suppose we have two tables, Person and PhoneNumber.  There is a one-to-many relationship between a person and phone number.

                            Person table:

                            PKEY - ID

                            Name - String

                             

                            PhoneNumber table:

                            ForeignQuery - A new data type called QUERY that returns a rowset from the people table

                            Number - String

                             

                            An example value of the PhoneNumber.ForeignQuery column is

                            SELECT * FROM Person WHERE pkey=1
                            

                            This would be the moral equivalent of storing a 1 as a foreign key.  The big difference, of course, is all the implicit relation stuff is now explicit.

                             

                            Here's where I think it gets interesting:  If I now added a ConferenceRoom table with its own primary key column and a different table structure than Person, I can still store its phone numbers directly in the PhoneNumber table with ForeignQuery values like

                            SELECT * FROM ConferenceRoom WHERE pkey=1
                            

                             

                            How might a client request something as simple as a list of persons and their phone numbers?  Perhaps something like

                            SELECT PhoneNumber.Number, Person.Name FROM PhoneNumber INNER JOIN PhoneNumber.ForeignQuery
                            

                            Executing this query is not very different than executing the query with a traditional foreign key.  For every row in PhoneNumber, execute the query described in the ForeignQuery column and associate the result set with the current PhoneNumber row (I believe there would be huge performance implications without improvements to caching).  Any ForeignQuery that references the ConferenceRoom table could be skipped.

                             

                            A client requesting data from both the ConferenceRoom and Person table is not too different: 

                            SELECT ConferenceRoom.Location, Person.Name FROM PhoneNumber OUTER JOIN PhoneNumber.ForeignQuery
                            

                            The engine would execute the query as described above and provide null the columns for Location or Name as appropriate. 

                             

                            If we had this magical QUERY data type, I think the following use case hangs together.

                             

                            Suppose we had a UI to present a list of phone numbers.  The user could select a phone number from the list and view the entity with that phone number.

                             

                            Under the approach described above, the client still must know that both conference rooms and people have a foreign key relationship to phone numbers.  Obviously, this becomes problematic as the number of things with phone numbers grows.  However, if the application's convention were to gaurantee that the result of PhoneNumber.ForeignQuery were a single row XML document of the form

                            <result-set source=[query that produced the result set]>
                               <row>
                                  <column name=... value=...>
                                   ...
                               </row>
                               ...
                            </result-set>
                            

                            then the client could fill the list of phone numbers by executing the query

                            SELECT PhoneNumber.ForeignKey.xmldoc as ThingsWithPhoneNumber, PhoneNumber.Number FROM PhoneNumber OUTER JOIN PhoneNumber.ForeignKey
                            

                            When the user selects a phone number, the client can mechanically form a query

                            from the value of ThingsWithPhoneNumber.

                            SELECT * FROM [query that produced the result set] WHERE c1=v1_1 AND c2=v1_2...
                            UNION
                            SELECT * FROM [query that produced the result set] WHERE c1=v2_1 AND c2=v2_2...
                            

                            and display the results entirely through metadata. 

                             

                            I've done a bit of googling and find lots of references to this polymophic foreign key problem but no traditional RDBMS that addresses it.  In fact, the only query system that fully addresses these issues are logic/relational systems like Datalog and Prolog/Kanren/core.logic (it was core.logic that has really motivated me to think about this problem more).  Neither of these systems suffer from SQL's everything-is-a-rectangle obsession but, that can be addressed by representing result sets as XML.  The big, new enabling functionality is the QUERY database. 

                             

                            I guess my biggest questions at this stage are, what do you think of the QUERY data type idea?  I don't think it breaks anything in the relational model and can be represented nicely within SQL.  Do you consider these kinds of extensions to SQL within Teiid's scope? 

                            • 11. Re: Federated polymorphic foreign keys
                              shawkins

                              I have to think a lot more about QUERY data type.  Just to make sure, are you proposing that it has a value per row?  How are you storing both Person and ConferenceRoom in this example?

                               

                              It may help me to think about the problem statement a little more traditionally.  With:

                               

                              > SELECT PhoneNumber.ForeignKey.xmldoc as ThingsWithPhoneNumber, PhoneNumber.Number FROM PhoneNumber ...

                               

                              absent a new data type it looks like you are proposing something like:

                               

                              {code}

                              SELECT XMLELEMENT(NAME "ThingsWithPhoneNumber", XMLAGG(ThingWithPhoneNumber)), PhoneNumber.Number FROM (

                               

                                SELECT CASE WHEN ConferenceRoomPhoneNumbers.crPk1 IS NOT NULL THEN XMLELEMENT(NAME "row", 'ConferenceRoom', XMLFOREST(ConferenceRoomPhoneNumbers.crPk1 ...)) as ThingWithPhoneNumber, PhoneNumber.Number FROM PhoneNumber LEFT OUTER JOIN ConferenceRoomPhoneNumbers ON ConferenceRoomPhoneNumbers.phPk = PhoneNumber.pk ...

                              UNOIN ALL

                              ...

                                SELECT CASE WHEN ... PhoneNumber LEFT OUTER JOIN PersonPhoneNumbers ...

                              ...

                              ) AS X GROUP BY PhoneNumber.Number

                              {code}

                               

                              Which would produce documents like:

                               

                              {code}

                              <ThingsWithPhoneNumber>

                                <row>ConferenceRoom

                                  <crPk1>value</crPk1>

                                  ...

                                </row>

                              ...

                              <row>Person

                                  <pPk1>value</pPk1>

                                  ...

                                </row>

                              </ThingsWithPhoneNumber>

                              {code}

                               

                              Note, you need a CASE expresssion in the SELECTs so that you get an empty document if there are no matches:

                              {code}

                              <ThingsWithPhoneNumber>

                              </ThingsWithPhoneNumber>

                              {code}

                               

                              Here crPk1 would be the actual name of the ConferenceRoom's first primary key column.  This also assumes that you have a proper pk/fk relationship via a ConferenceRoomPhoneNumbers, PersonPhoneNumbers etc. joining tables.

                               

                              Is this basically what you are trying to encapsulate with the QUERY type?  Would it be possible to introduce the additional joining tables for each relationship and generate a ThingsWithPhoneNumber view to take advantage of those? Performance-wise you could also cache the row xml in the joining table and you could also use the WITH clause to force a single scan against PhoneNumber.

                               

                              Am I understanding you correctly or am I off the mark?

                               

                              Steve