-
1. Re: Custom Translator for cleansing
markaddleman May 24, 2013 10:29 AM (in response to pinkstondevin)1 of 1 people found this helpfulWhat do you mean by cleansing?
We often deal with "dirty" data using a combination of views, materialized views and caching.
-
2. Re: Custom Translator for cleansing
rareddy May 24, 2013 10:33 AM (in response to pinkstondevin)1 of 1 people found this helpfulYou can do that in your custom translator. Currently Teiid does not offer any data quality tools. Possibly you can inject like Drools rules engine and write them in there.
-
3. Re: Custom Translator for cleansing
pinkstondevin May 24, 2013 10:43 AM (in response to rareddy)Mark-
By cleansing I mean mainly just removing duplicates. By using views are you using joins/select distinct to generate unique values? The problem we have is maybe there is a phone number column - "(888)-888-8888", "888-888-8888". We need to remove that duplicate, and I was wondering if in the translator that could be done. We have the tools to remove this duplicate data, I was just wondering if it was possible in Teiid.
Does this setup efficiently cleanse data for you?
Ramesh-
Thank you for your response. This sounds like maybe the best option, i appreciate your quick feedback!
-
4. Re: Custom Translator for cleansing
markaddleman May 24, 2013 11:29 AM (in response to pinkstondevin)By using views are you using joins/select distinct to generate unique values? The problem we have is maybe there is a phone number column - "(888)-888-8888", "888-888-8888". We need to remove that duplicate, and I was wondering if in the translator that could be done.
We have format normalization problems with our data, too. We take this approach:
- Write a custom scalar function to normalize the data. This scalar function would be written as a custom translator but the translator would have no direct connection to the data source. In your case, if would be named normalize_phone_number(x) that takes any phone number string and returns the preferred format.
- Define a view that uses the normalization function. In your case, I imagine the view would be something like SELECT DISTINCT name, normalize_phone_number(dirty_data.phone_number) as phone_number FROM dirty_data
- If necessary, materialize the view or use query caching
-
5. Re: Custom Translator for cleansing
pinkstondevin May 24, 2013 2:57 PM (in response to markaddleman)Mark,
Thank you very much for taking the time to explain this, it's very helpful.
I really like the idea here and how it can stay inside of Teiid.
Thanks
-
6. Re: Custom Translator for cleansing
pinkstondevin Jun 6, 2013 11:15 AM (in response to pinkstondevin)Would you recommend using a Delegating Translator or a regular custom translator since i'm not connecting to the data source? I want to take advantage of the existing translator to talk to the physical data source, I am just trying to modify the results.
Thank you
-
7. Re: Custom Translator for cleansing
rareddy Jun 6, 2013 3:19 PM (in response to pinkstondevin)You can either extend current translator class and give it different name, or use delegating translator. If the changes are intrusive, first one may be preferred, otherwise I would say second one is easier.
-
8. Re: Custom Translator for cleansing
pinkstondevin Jun 10, 2013 10:26 AM (in response to rareddy)We are trying to modify the results using a delegating translator we have created. In the translator development documentation, there is an example of extending the BaseDelegatingExecutionFactory, however I was wondering if there were any more detailed examples available?
In our delegating translator, we are trying to get the next row, but frankly, I am little confused on how exactly we can pull in that row, and modify the data, then push it back to be returned to the user.
public ResultSetExecution createResultSetExecution(QueryExpression command, ExecutionContext executionContext, RuntimeMetadata metadata, Object connection) throws TranslatorException{
//Get the next row of data from "List<?> next() throws TranslatorException, DataNotAvailableException;"
return super.createResultSetExecution(command, executionContext, metadata, connection);
}
Thank you very much!
-
9. Re: Custom Translator for cleansing
shawkins Jun 10, 2013 11:27 AM (in response to pinkstondevin)Create a wrapping ResultSetExecution where you can override next:
public ResultSetExecution createResultSetExecution(QueryExpression command, ExecutionContext executionContext, RuntimeMetadata metadata, Object connection) throws TranslatorException{
final ResultSetExecution rse = super.createResultSetExecution(command, executionContext, metadata, connection);
return new ResultSetExecution() {
@Override
public void execute() throws TranslatorException {
rse.execute();
}
@Override
public void close() {
rse.close();
}
@Override
public void cancel() throws TranslatorException {
rse.cancel();
}
@Override
public List<?> next() throws TranslatorException, DataNotAvailableException {
List<?> row = rse.next();
//modify row
return row;
}
};
}
-
10. Re: Custom Translator for cleansing
pinkstondevin Jun 10, 2013 11:53 AM (in response to shawkins)Thank you Steven!
-
11. Re: Custom Translator for cleansing
pinkstondevin Jun 13, 2013 9:09 AM (in response to shawkins)I've deployed my delegating translator and implemented it in my VDB, however I'm not seeing results that the translator is being used. I've thrown in a few sysout to see if I'm entereing to no avail. I also tried returning null for the next row to see if my query would reutrn a blank dataset, but i received data back instead. I created a small mysql table with 4 columns, all based off the spelling of new york. "New York", "nyc", "newyork", "ny" - eventually planning to implement rules for cleaning. Right now I'm just trying to make sure I am actually pulling in that row of data and analyzing but, but it seems I'm not even getting into the next().
I implemented my delegator in my VDB like:
<property name="UseConnectorMetadata" value="true"/>
<model name="MySQLCleaner"><source name="cleansingMySQL-connector" translator-name="custom-delegator" connection-jndi-name="java:/cleansingMySQL"/></model>
<translator name="custom-delegator" type="cleaner"><property name="delegateName" value="mysql5"/></translator>
@Translator(name="cleaner")
public class cleaner extends BaseDelegatingExecutionFactory<Object, Object> {
@Override
public ResultSetExecution createResultSetExecution(QueryExpression command, ExecutionContext executionContext, RuntimeMetadata metadata, Object connection) throws TranslatorException {
final ResultSetExecution rse = super.createResultSetExecution(command, executionContext, metadata, connection);
return new ResultSetExecution() {
@Override
public List<?> next() throws TranslatorException, DataNotAvailableException {
List<?> row = rse.next();
int index = row.indexOf("New York");
System.out.println("index of new york: " + index);
return row; //tried using return null for a test as well
}
@Override
public void close() {
rse.close();
}
@Override
public void cancel() throws TranslatorException {
rse.cancel();
}
@Override
public void execute() throws TranslatorException {
rse.execute();
}
};
}
}
If i issue a select query (select * from newyork) "newyork is name of table", I receive correct results, however my translator did not print anything out. I had tried returning NULL as well as i have in a comment, but that didn't work.
Thank you
-
12. Re: Custom Translator for cleansing
shawkins Jun 13, 2013 9:40 AM (in response to pinkstondevin)> I created a small mysql table with 4 columns, all based off the spelling of new york.
Do you mean the column names have the different spellings or the data does?
> however my translator did not print anything out
From the above you should at least see "index of new york: " with at the very least -1 for the index. If that is not occuring then something else is wrong with the setup. Note that hot deployment of a translator does not currently affect running vdbs as they retain references to whatever instance of a translator they had when they first started.
> I had tried returning NULL as well as i have in a comment, but that didn't work.
If you just return null, then you should see no results at all.
Depending upon the details of your scenario, you may want to be more targeted about the cleansing as a linear search of every row returned through the system may be a significant overhead. For example you can search the command for projected columns that you want to modify - but that of course can get complicated when the source has broad pushdown support.
And not to complicate things, but another possiblity to consider for Teiid 8.4+ would be to use an any-authenticated role that specifies column masks using cleansing expressions:
{code}
...
<model name="MySQLCleaner"><source name="cleansingMySQL-connector" translator-name="mysql5" connection-jndi-name="java:/cleansingMySQL"/></model>
<data-role name="Cleanser" any-authenticated="true">
<description>Default cleansing</description>
<!- allow read - and any other applicable persmissions against the whole schema -->
<permission>
<resource-name>MySQLCleaner</resource-name>
<allow-read>true</allow-read>
</permission>
<!-- cleanse with a mask - which could be any expression including a udf -->
<permission>
<resource-name>MySQLCleaner.newyork.col</resource-name>
<mask>case when col LIKE_REGEX 'ny|nyc' then 'New York' else col end</mask>
</permission>
</data-role>
...
{code}
see https://github.com/teiid/teiid-quickstarts/tree/master/dynamicvdb-dataroles and the Reference for more. The additional consideration here is that masking is logically applied before values are used, so if the masking expression cannot be pushed down then there is generally a performance impact.
-
13. Re: Custom Translator for cleansing
pinkstondevin Jun 13, 2013 11:01 AM (in response to shawkins)> I created a small mysql table with 4 columns, all based off the spelling of new york.
Do you mean the column names have the different spellings or the data does?
Here is a snippit below. I just copied it ontop of itself for testing purposes. The first row is just column names.
nyOne, nyTwo, nyThree, nyFour
New York, nyc, newyork, NY
New York, nyc, newyork, NY
New York, nyc, newyork, NY
New York, nyc, newyork, NY
> however my translator did not print anything out
From the above you should at least see "index of new york: " with at the very least -1 for the index. If that is not occuring then something else is wrong with the setup. Note that hot deployment of a translator does not currently affect running vdbs as they retain references to whatever instance of a translator they had when they first started.
I actually return the full dataset right now. My translator is deployed successfuly, no build/compile errors or deployment erros. It looks like it doesn't really get into my delegating translator.
I assume I have something wrong with my setup/deployment then with my jar. It took me a while to successfully deploy the delegating translator. What do you mean by a "hot" deployment? I currently have it deployed under the modules directory, and created an entry in my config file for deployment.
When I used "return null", i was still receiving results.
Thank you for the suggestion, I will look into that. Using the mask could be very benefecial, I am expecting after this first iteration to have a huge overhead that will have to be corrected.
Thank you!
-
Cleaning-vdb.xml 545 bytes
-
module.xml 523 bytes
-
cleaner.jar 2.3 KB
-
-
14. Re: Custom Translator for cleansing
shawkins Jun 13, 2013 11:31 AM (in response to pinkstondevin)> What do you mean by a "hot" deployment?
Using a cli or admin api jar deployment containing a translator.
> When I used "return null", i was still receiving results.
Yes, that means that something still isn't correct with the setup. Make sure the vdb you're working with is the correct one (perhaps purposely use a bogus translator name, such as cleaner1 and make sure there is a deployment failure) and make sure that the custom translator instance is being picked up in any fashion - perhaps by adding a sys out in a default constructor. Otherwise you'll want to post what you have to see if we can reproduce or spot something that is amiss.
Steve