Open Source Data Quality and Profiling (OSDQ) is a data quality and data profiling tool available under the Apache 2 license. The project has been around for a few years and seems to have a very decent feature set. This discussion explores the possibilities for integrating OSDQ with Teiid.
Currently the Teiid runtime has a few features for data cleansing and data enrichment; however, Teiid tooling completely lacks any kind of data profiling or data lineage features. OSDQ seems to have a lot of features in this space, so Teiid could take a collaborative approach and provide these features to both the Teiid and OSDQ communities. The idea behind this forum discussion is to get the community to come up with requirements, suggestions, and alternatives.
Below are three tracks along which we could look more deeply at OSDQ.
1. Runtime Data Enrichment/Cleansing
The Teiid runtime does not hold or persist any data; however, providing extended on-the-fly data enrichment/cleansing capabilities through User Defined Functions (UDFs) based on OSDQ libraries would be very useful. I think this requires the lowest effort and only a very loose dependency on the OSDQ libraries. Here we need to identify the list of functions available in OSDQ and their signatures, and use them to define system or UDF functions in Teiid. Code changes may be required on the OSDQ side, as some functions may not have been designed for on-the-fly use. We also need to make sure that OSDQ has a public API that Teiid can make use of.
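To make the UDF idea a bit more concrete, here is a rough sketch of the kind of stateless, per-value Java functions this track has in mind. The class name `CleansingFunctions` and both methods are hypothetical and written only for illustration; they do not call any real OSDQ API. The method bodies mark where a call into an OSDQ library routine might eventually go.

```java
// Hypothetical holder for cleansing functions; all names here are illustrative.
// A real deployment would likely need a public class with public static methods
// so the runtime can invoke them.
final class CleansingFunctions {

    private CleansingFunctions() {
        // utility class, not meant to be instantiated
    }

    // Trims the value and collapses runs of whitespace into a single space.
    // Assumption: this stands in for an OSDQ routine; it is not one.
    static String normalizeWhitespace(String value) {
        if (value == null) {
            return null; // propagate SQL NULL unchanged
        }
        return value.trim().replaceAll("\\s+", " ");
    }

    // Strips everything except ASCII digits, e.g. to standardize phone numbers.
    // Assumption: again a placeholder for a library-backed implementation.
    static String digitsOnly(String value) {
        if (value == null) {
            return null; // propagate SQL NULL unchanged
        }
        return value.replaceAll("[^0-9]", "");
    }
}
```

The point of the sketch is the shape: each function takes one value, keeps no state between calls, and passes NULL through, which is what makes it usable on the fly inside a query. Whether OSDQ's existing routines fit this shape is exactly what needs to be checked.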
2. Use of profiling information in Query Optimization
The Teiid runtime currently depends on costing information supplied at design time, or on updates to the extension metadata that carry costing information. If OSDQ offers advanced techniques or algorithms for calculating costing information such as NULL count or distinct count, those could be used dynamically to gather information about a source, allowing more intelligent optimizations in query planning. Here again we need to identify such routines in OSDQ, measure their efficiency, and check whether they can be consumed externally from Teiid.
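As a baseline for comparison, the two statistics mentioned above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the `ColumnProfile` class is hypothetical, it is not an OSDQ or Teiid API, and it uses an exact in-memory set, which would not scale to large sources. Any OSDQ routine would presumably need to do better than this (for example by sampling or approximation) to be worth integrating.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical container for simple per-column statistics; names are illustrative.
final class ColumnProfile {
    final long rowCount;      // total number of values seen, including nulls
    final long nullCount;     // number of null values
    final long distinctCount; // number of distinct non-null values

    private ColumnProfile(long rowCount, long nullCount, long distinctCount) {
        this.rowCount = rowCount;
        this.nullCount = nullCount;
        this.distinctCount = distinctCount;
    }

    // Scans the column once, counting nulls and collecting distinct non-null values.
    // Exact but memory-hungry: the set grows with the number of distinct values.
    static ColumnProfile of(Iterable<?> values) {
        long rows = 0;
        long nulls = 0;
        Set<Object> distinct = new HashSet<>();
        for (Object value : values) {
            rows++;
            if (value == null) {
                nulls++;
            } else {
                distinct.add(value);
            }
        }
        return new ColumnProfile(rows, nulls, distinct.size());
    }
}
```

Nulls are excluded from the distinct count here, matching how `COUNT(DISTINCT col)` behaves in SQL. In practice the same numbers could also be obtained by pushing a `COUNT(*)`, `COUNT(col)`, `COUNT(DISTINCT col)` query down to the source, so part of the evaluation is whether OSDQ's approach is cheaper or more informative than that.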
3. Data Profiling UI
This is the most complicated and time-consuming part. Teiid Designer is an Eclipse-based framework for building Teiid VDB artifacts. It does not offer any data profiling or data quality management capabilities, and neither do any of the other management GUIs that Teiid offers. OSDQ has a Swing-based UI. So the technologies do not match so far, and the current trend is toward Web-based UIs. Here, IMO, the effort needs to be focused on:
- what core features OSDQ offers in this space as a library, excluding the UI part. Naturally these may be integrated with OSDQ's current UI, so we need to figure out whether they can be modularized enough to replace that UI with a web UI
- what features are on the OSDQ road map that we need to plan for
- how we can efficiently use the above features to come up with a Web-based UI. The questions here are:
  - what technologies to use (this may be simple, as the current web tooling efforts already went through this exercise)
  - what the UI flows are, and coming up with wireframes for them
  - resources, scoping, timelines, and implementation
These are just starting points for discussion, so please feel free to edit or add more comments.