5 Replies Latest reply on Jul 2, 2015 5:41 PM by shawkins

    Further Integration of Teiid and Hadoop

    acoliver

      Quick introduction: I'm a long-time listener, first-time caller. I worked for JBoss pre-acquisition, founded a project called POI, which ported Microsoft's file formats to Java and which I donated to Apache (poi.apache.org), and served briefly on the board of the Open Source Initiative. I also run a big data consultancy called Mammoth Data... I've used Teiid in the past, but not recently, yet I do see it as a potential solution to some of the kinds of problems that I encounter now.

       

      While I understand the use case for Something <-- JDBC <-- Teiid <-- Hive, which already exists, I think there are other use cases that could be served by integrating Teiid in reverse - meaning allowing Teiid to provide data to Hive and Spark. This would be similar to what others have done with Solr (Playing with Apache Hive and SOLR | Chimpler, https://lucidworks.com/fusion/hadoop/) and Elasticsearch (Hadoop: Immediate Insight into Big Data | Elastic). The idea would be to reduce the burden of ETL on developers while using Spark, Hive, Hadoop, etc. for data processing, transformation, and analytics.

       

      Concretely:

      1. A Hive connector

      2. A Spark connector

      3. A Hive DDL generator based on the schema meta-model


      [Attached diagram: teiid Hadoop Level 1.png]


      One would then connect Hive to Teiid, generate the DDL for Hive, and then directly query Teiid via Hive or Spark SQL.
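      As a very rough sketch of the DDL-generator idea - every name here (the VDB, the view, and especially the Hive storage handler class) is made up for illustration - one could walk Teiid's JDBC metadata and print Hive DDL along these lines:

      import java.sql.*;

      // Hypothetical sketch: derive Hive external-table DDL from a Teiid VDB's JDBC metadata.
      // "myVdb", "views.customer_orders" and the storage handler class are placeholders.
      public class TeiidToHiveDdl {
          public static void main(String[] args) throws Exception {
              Class.forName("org.teiid.jdbc.TeiidDriver");
              try (Connection conn = DriverManager.getConnection(
                      "jdbc:teiid:myVdb@mm://teiid-host:31000", "user", "password")) {
                  DatabaseMetaData md = conn.getMetaData();
                  ResultSet cols = md.getColumns(null, "views", "customer_orders", "%");
                  StringBuilder ddl = new StringBuilder("CREATE EXTERNAL TABLE customer_orders (\n");
                  boolean first = true;
                  while (cols.next()) {
                      if (!first) ddl.append(",\n");
                      ddl.append("  ").append(cols.getString("COLUMN_NAME"))
                         .append(" ").append(toHiveType(cols.getInt("DATA_TYPE")));
                      first = false;
                  }
                  // The storage handler is the piece that would have to be written;
                  // the class name below is purely a placeholder.
                  ddl.append("\n) STORED BY 'org.example.hive.TeiidStorageHandler'");
                  System.out.println(ddl);
              }
          }

          // Very rough JDBC-to-Hive type mapping; a real generator would cover many more types.
          static String toHiveType(int jdbcType) {
              switch (jdbcType) {
                  case Types.INTEGER:   return "INT";
                  case Types.BIGINT:    return "BIGINT";
                  case Types.DOUBLE:    return "DOUBLE";
                  case Types.TIMESTAMP: return "TIMESTAMP";
                  default:              return "STRING";
              }
          }
      }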


      To go further, integrate the cache into Hadoop (speaking of the ecosystem) and have it overflow to HDFS. Configure the "staleness" of the data through metadata, as opposed to load processes (a la Pig/Sqoop/Oozie). This would allow data to be distributed throughout the cluster potentially "ahead of time." Additionally, the query engine could issue Spark SQL, thereby distributing the processing.


      The goal would be to avoid writing ETL or using Sqoop, leaving my data in place while still taking advantage of the distributed platform through a combination of caching and distribution.


      [Attached diagram: teiid Hadoop Level 2.png]


      Has anyone thought of doing something like this? Why? Why not?

        • 1. Re: Further Integration of Teiid and Hadoop
          rareddy

          Andrew,

           

          Welcome to the Teiid forum. Thank you for Apache POI; we use it to read data from Microsoft Excel in Teiid. Excellent framework.

           

          You raise a very interesting paradigm; frankly, we have not thought about it in the way you describe. We just added a task to investigate Spark and potential ways to change the compute model for Teiid 9.0, so there are no details there yet. I was under the impression that data under Hive --> HDFS always needs to be in a sharded format, so I never thought about putting Teiid at the bottom. But the ETL use case seems very useful: Teiid can, in one scoop, provide access to many different enterprise sources with a consistent interface, and caching sounds like a feasible approach.

           

          Is this something you are considering developing? We would love to participate and help however we can. I need to check out the Hive connector stuff to see what it takes.

           

          Thanks


          Ramesh

          • 2. Re: Further Integration of Teiid and Hadoop
            acoliver

            Thank you.

             

            Sorry for the latency. I couldn't get back in for a good long while.

             

            You raise an excellent point. So there are two ways of looking at this. On one hand, these large analytics projects are like 90% ETL and 10% whatever you wanted to do in the first place. On the other hand, if I built out a massive 100-node Hadoop/Spark cluster, do I really want to be limited to SQL/JDBC? If you rationalize the SQL into Spark, then what seems to "follow" to me is "what about the data?" Meaning, if the data is still in an RDBMS, cached in a 1-8 node Teiid cluster, I still have a big lead time from when I have the data to when the data is loaded into a Spark RDD (or DataFrame if you like). Then the next time I work on that working set... I do that again? So then it follows that I want to cache the data on the Hadoop side, but if I'm pulling data in and it doesn't fit in memory, or I want to preserve memory for working sets and not for cold storage, then it follows that I might want some combination of memory and disk. HDFS doesn't "shard" so much as take the place of how a SAN or RAID array stripes via blocks. What pieces of Teiid do you think could distribute easily? What pieces would need more brain surgery?
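            Just to make the memory-plus-disk idea concrete with today's pieces - the VDB, view, and connection details below are invented, and this uses nothing more than Spark's stock JDBC data source (the DataFrameReader API in Spark 1.4+) - the working set could be pulled from Teiid and cached on the Spark side roughly like this:

            import java.util.HashMap;
            import java.util.Map;

            import org.apache.spark.SparkConf;
            import org.apache.spark.api.java.JavaSparkContext;
            import org.apache.spark.sql.DataFrame;
            import org.apache.spark.sql.SQLContext;
            import org.apache.spark.storage.StorageLevel;

            public class TeiidSparkWorkingSet {
                public static void main(String[] args) {
                    SparkConf conf = new SparkConf().setAppName("teiid-working-set");
                    JavaSparkContext sc = new JavaSparkContext(conf);
                    SQLContext sqlContext = new SQLContext(sc);

                    // Point Spark's generic JDBC data source at a Teiid VDB.
                    // The URL, VDB, and view names are illustrative only.
                    Map<String, String> options = new HashMap<String, String>();
                    options.put("driver", "org.teiid.jdbc.TeiidDriver");
                    options.put("url", "jdbc:teiid:myVdb@mm://teiid-host:31000;user=user;password=password");
                    options.put("dbtable", "views.customer_orders");

                    DataFrame orders = sqlContext.read().format("jdbc").options(options).load();

                    // Keep the working set in memory and spill to the executors' local disk
                    // when it no longer fits - the memory-plus-disk combination described above.
                    orders.persist(StorageLevel.MEMORY_AND_DISK());
                    orders.registerTempTable("customer_orders");

                    sqlContext.sql("SELECT region, COUNT(*) FROM customer_orders GROUP BY region").show();
                }
            }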

             

            Yes, it is something I'm looking to collaborate on - thinking it through and potentially co-developing. The Hive and Pig connectors seem like low-hanging fruit to see what people think about the idea conceptually.

             

            -Andy

            • 3. Re: Further Integration of Teiid and Hadoop
              shawkins

              > Has anyone thought of doing something like this? Why? Why not?

               

              Ramesh, I don't think it's fair to say we haven't thought of doing this; rather, at first glance this is primarily using Teiid as an access layer.


              Last time I looked into it, Spark SQL and its optimizer had a very limited approach to pushdown - mainly just predicates, no joins, etc. So most of the ETL-replacement logic would have to take the form of views (or possibly procedures) in Teiid. Hive seems to be evolving in its pushdown capabilities, but I have not looked at it extensively.
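              To make that concrete, reusing the made-up names and JDBC options from the sketch earlier in the thread: what Spark plans against Teiid is essentially a scan plus the simple predicate below, so any join producing customer_orders has to already live inside the Teiid view definition.

              // Reuses the sqlContext and options from the earlier sketch (illustrative names only).
              // Only the predicate is pushed down through the JDBC source; the join behind
              // customer_orders must already be defined as a view in Teiid.
              DataFrame orders = sqlContext.read().format("jdbc").options(options).load();
              DataFrame emea = orders.filter(orders.col("region").equalTo("EMEA"));
              emea.show();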

               

              > What pieces of Teiid do you think could distribute easily? What pieces would need more brain surgery?

               

              How would you like it to be distributed?  Are you talking about processing, or in terms of cache utilization, etc.?

               

              > Yes, it is something I'm looking to collaborate on - thinking it through and potentially co-developing. The Hive and Pig connectors seem like low-hanging fruit to see what people think about the idea conceptually.

               

              Yes, if there is something we can do to have better support for Teiid from Hive, that would be good.

               

              Were you thinking that Spark's JDBC connectivity is sufficient, or would it need more customization?

              • 4. Re: Further Integration of Teiid and Hadoop
                acoliver

                Hi Steven,

                 

                To me it seems that you'd want both the data and the processing distributed. What I attempted to outline is that the queries would be processed by Spark and that the data would be distributed via a distributed cache and/or overflowed to disk (HDFS).

                 

                The pushdown behavior is described here: http://www.river-of-bytes.com/2014/12/filtering-and-projection-in-spark-sql.html

                 

                Some of the problems you allude to appear to have been addressed by Spark 1.3 ("What's new for Spark SQL in Spark 1.3"), which is captured in two diagrams:

                 

                Spark SQL blog - dataframes

                 

                https://databricks.com/wp-content/uploads/2015/03/Screen-Shot-2015-03-23-at-4.00.33-PM-1024x577.png

                For Hive, as mentioned, the connector and ideally a DDL generator (since hand-writing the DDL is the unpleasant part of dealing with the SOLR connector from Lucidworks, for instance). But that leaves us somewhat bottlenecked, whereas distributing the processing and data would obviously be more optimal (but more complicated).

                 

                I sort of think a REST interface would be preferable to JDBC. This is more of a gut feeling, but it is based on the issue of packing data into JDBC interfaces and then unpacking and converting to Hive (for instance) datatypes, and the delta between them. It seems to me you'd want to be more direct, and that would be more efficient in Spark or other places you might expose the data. It seems like that might be useful anyhow; if one wanted to consume the data outside of the Java ecosystem, that could be useful...

                • 5. Re: Further Integration of Teiid and Hadoop
                  shawkins

                  > Some of the problems you allude to appear to have been addressed by Spark 1.3 ("What's new for Spark SQL in Spark 1.3"), which is captured in two diagrams:

                   

                  From what I've seen, the individual JDBC source accesses planned by Spark are effectively a scan with predicates - not joins, nor anywhere near the full breadth of SQL supported by the source. That's not a huge concern, though, as needed complexity can be added through Teiid views.

                   

                  > To me it seems that you'd want both the data and the processing distributed. What I attempted to outline is that the queries would be processed by Spark and that the data would be distributed via a distributed cache and/or overflowed to disk (HDFS).

                   

                  So you would have both the Teiid and the Spark layer processing distributed? Currently Teiid's processing is only distributed insomuch as the sources are separate/remote processes. Otherwise the processing is controlled via a processing plan that exists only on a single instance. Introducing more distributed processing is possible, but the devil is in the details - whether it needs to be made resilient, how aware the optimizer should be of locality, etc. We haven't gone down that path because, when possible, we like to enhance and extend our traditional role in OLTP virtualization/integration rather than trying to solve high-end analytics.

                   

                  Can you flesh out the flow of an example query (with either simple relational or hierarchically structured data) as to what the responsibilities of Teiid would be in the given scenario?

                   

                  > I sort of think a REST interface would be preferable to JDBC. This is more of a gut feeling, but it is based on the issue of packing data into JDBC interfaces and then unpacking and converting to Hive (for instance) datatypes, and the delta between them. It seems to me you'd want to be more direct, and that would be more efficient in Spark or other places you might expose the data. It seems like that might be useful anyhow; if one wanted to consume the data outside of the Java ecosystem, that could be useful...

                   

                  JDBC would generally be the most efficient.  Teiid can provide REST access via OData, which provides some of the additional standardization/abstraction you are looking for.
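                  For example - the host, VDB, view, and exact OData URL layout below are invented for illustration and will vary by Teiid version and deployment - a client in any language (Java shown only for consistency with the sketches above) can query a view with a plain HTTP GET:

                  import java.io.BufferedReader;
                  import java.io.InputStreamReader;
                  import java.net.HttpURLConnection;
                  import java.net.URL;

                  public class TeiidODataExample {
                      public static void main(String[] args) throws Exception {
                          // Hypothetical OData query against a Teiid view; the path segments
                          // and $filter expression are illustrative, not a documented layout.
                          URL url = new URL("http://teiid-host:8080/odata/myVdb/customer_orders"
                                  + "?$filter=region%20eq%20%27EMEA%27&$format=json");
                          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                          conn.setRequestProperty("Accept", "application/json");
                          try (BufferedReader in = new BufferedReader(
                                  new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                              String line;
                              while ((line = in.readLine()) != null) {
                                  System.out.println(line);
                              }
                          }
                      }
                  }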