1 2 Previous Next 20 Replies Latest reply on Jun 14, 2013 12:56 PM by shawkins

    Custom Translator for cleansing

    pinkstondevin

      Hello all,

       

      Since Teiid does not have a native way to cleanse data, could extending a translator allow for the cleansing of data before it is brough into a Virtual Database?

       

      We have a few sources that contain dirty data, and are looking for a way to cleanse it without having to duplicate or move the data.

       

      Does Teiid have any constructs to facilitate the cleansing process?

       

      Thank you!

        • 1. Re: Custom Translator for cleansing
          markaddleman

          What do you mean by cleansing?

           

          We often deal with "dirty" data using a combination of views, materialized views and caching.

          1 of 1 people found this helpful
          • 2. Re: Custom Translator for cleansing
            rareddy

            You can do that in your custom translator. Currently Teiid does not offer any data quality tools. Possibly you can inject like Drools rules engine and write them in there.

            1 of 1 people found this helpful
            • 3. Re: Custom Translator for cleansing
              pinkstondevin

              Mark-

               

              By cleansing I mean mainly just removing duplicates.  By using views are you using joins/select distinct to generate unique values?  The problem we have is maybe there is a phone number column - "(888)-888-8888", "888-888-8888".  We need to remove that duplicate, and I was wondering if in the translator that could be done.  We have the tools to remove this duplicate data, I was just wondering if it was possible in Teiid.

               

              Does this setup efficiently cleanse data for you?

               

              Ramesh-

              Thank you for your response.  This sounds like maybe the best option, i appreciate your quick feedback!

              • 4. Re: Custom Translator for cleansing
                markaddleman

                By using views are you using joins/select distinct to generate unique values?  The problem we have is maybe there is a phone number column - "(888)-888-8888", "888-888-8888".  We need to remove that duplicate, and I was wondering if in the translator that could be done.

                We have format normalization problems with our data, too.  We take this approach:

                1. Write a custom scalar function to normalize the data.  This scalar function would be written as a custom translator but the translator would have no direct connection to the data source.  In your case, if would be named normalize_phone_number(x) that takes any phone number string and returns the preferred format.
                2. Define a view that uses the normalization function.  In your case, I imagine the view would be something like SELECT DISTINCT name, normalize_phone_number(dirty_data.phone_number) as phone_number FROM dirty_data
                3. If necessary, materialize the view or  use query caching
                • 5. Re: Custom Translator for cleansing
                  pinkstondevin

                  Mark,

                   

                  Thank you very much for taking the time to explain this, it's very helpful. 

                   

                  I really like the idea here and how it can stay inside of Teiid.

                   

                  Thanks

                  • 6. Re: Custom Translator for cleansing
                    pinkstondevin

                    Would you recommend using a Delegating Translator or a regular custom translator since i'm not connecting to the data source? I want to take advantage of the existing translator to talk to the physical data source, I am just trying to modify the results.

                     

                    Thank you

                    • 7. Re: Custom Translator for cleansing
                      rareddy

                      You can either extend current translator class and give it different name, or use delegating translator. If the changes are intrusive, first one may be preferred, otherwise I would say second one is easier.

                      • 8. Re: Custom Translator for cleansing
                        pinkstondevin

                        We are trying to modify the results using a delegating translator we have created.  In the translator development documentation, there is an example of extending the BaseDelegatingExecutionFactory, however I was wondering if there were any more detailed examples available? 

                         

                        In our delegating translator, we are trying to get the next row, but frankly, I am little confused on how exactly we can pull in that row, and modify the data, then push it back to be returned to the user.

                        public ResultSetExecution createResultSetExecution(QueryExpression command, ExecutionContext executionContext, RuntimeMetadata metadata, Object connection) throws TranslatorException{

                         

                              //Get the next row of data from "List<?> next() throws TranslatorException, DataNotAvailableException;"

                         

                                return super.createResultSetExecution(command, executionContext, metadata, connection);

                            }

                         

                        Thank you very much!

                        • 9. Re: Custom Translator for cleansing
                          shawkins

                          Create a wrapping ResultSetExecution where you can override next:

                           

                          public ResultSetExecution createResultSetExecution(QueryExpression command, ExecutionContext executionContext, RuntimeMetadata metadata, Object connection) throws TranslatorException{

                                              final ResultSetExecution rse = super.createResultSetExecution(command, executionContext, metadata, connection);

                                              return new ResultSetExecution() {

                                             

                                              @Override

                                              public void execute() throws TranslatorException {

                                                  rse.execute();

                                              }

                                             

                                              @Override

                                              public void close() {

                                                  rse.close();

                                              }

                                             

                                              @Override

                                              public void cancel() throws TranslatorException {

                                                  rse.cancel();

                                              }

                                             

                                              @Override

                                              public List<?> next() throws TranslatorException, DataNotAvailableException {

                                                  List<?> row = rse.next();

                                                  //modify row

                                                  return row;

                                              }

                                          };

                              

                              }

                          • 10. Re: Custom Translator for cleansing
                            pinkstondevin

                            Thank you Steven!

                            • 11. Re: Custom Translator for cleansing
                              pinkstondevin

                              I've deployed my delegating translator and implemented it in my VDB, however I'm not seeing results that the translator is being used.  I've thrown in a few sysout to see if I'm entereing to no avail.  I also tried returning null for the next row to see if my query would reutrn a blank dataset, but i received data back instead.  I created a small mysql table with 4 columns, all based off the spelling of new york.  "New York", "nyc", "newyork", "ny" - eventually planning to implement rules for cleaning.  Right now I'm just trying to make sure I am actually pulling in that row of data and analyzing but, but it seems I'm not even getting into the next().

                               

                              I implemented my delegator in my VDB like:

                              <property name="UseConnectorMetadata" value="true"/>

                              <model name="MySQLCleaner"><source name="cleansingMySQL-connector" translator-name="custom-delegator" connection-jndi-name="java:/cleansingMySQL"/></model>

                              <translator name="custom-delegator" type="cleaner"><property name="delegateName" value="mysql5"/></translator>

                               

                               

                              @Translator(name="cleaner")

                              public class cleaner extends BaseDelegatingExecutionFactory<Object, Object> {

                               

                                  @Override

                                  public ResultSetExecution createResultSetExecution(QueryExpression command, ExecutionContext executionContext, RuntimeMetadata metadata, Object connection) throws TranslatorException {

                               

                                      final ResultSetExecution rse = super.createResultSetExecution(command, executionContext, metadata, connection);

                               

                                      return new ResultSetExecution() {

                               

                               

                                          @Override

                                          public List<?> next() throws TranslatorException, DataNotAvailableException {

                               

                                              List<?> row = rse.next();

                                              int index = row.indexOf("New York"); 

                                              System.out.println("index of new york: " + index); 

                                              return row;     //tried using return null for a test as well

                               

                                          }

                               

                                          @Override

                                          public void close() {

                                              rse.close();

                                          }

                               

                                          @Override

                                          public void cancel() throws TranslatorException {

                                              rse.cancel();

                               

                                          }

                               

                                          @Override

                                          public void execute() throws TranslatorException {

                                              rse.execute();

                                          }

                                      };

                               

                               

                               

                                  }

                              }

                              If i issue a select query (select * from newyork) "newyork is name of table", I receive correct results, however my translator did not print anything out.  I had tried returning NULL as well as i have in a comment, but that didn't work.

                               

                              Thank you

                              • 12. Re: Custom Translator for cleansing
                                shawkins

                                > I created a small mysql table with 4 columns, all based off the spelling of new york.

                                 

                                Do you mean the column names have the different spellings or the data does?

                                 

                                > however my translator did not print anything out

                                 

                                From the above you should at least see "index of new york: " with at the very least -1 for the index.  If that is not occuring then something else is wrong with the setup.  Note that hot deployment of a translator does not currently affect running vdbs as they retain references to whatever instance of a translator they had when they first started.

                                 

                                > I had tried returning NULL as well as i have in a comment, but that didn't work.

                                 

                                If you just return null, then you should see no results at all.

                                 

                                Depending upon the details of your scenario, you may want to be more targeted about the cleansing as a linear search of every row returned through the system may be a significant overhead.  For example you can search the command for projected columns that you want to modify - but that of course can get complicated when the source has broad pushdown support. 

                                 

                                And not to complicate things, but another possiblity to consider for Teiid 8.4+ would be to use an any-authenticated role that specifies column masks using cleansing expressions:

                                 

                                {code}

                                ...

                                     <model name="MySQLCleaner"><source name="cleansingMySQL-connector" translator-name="mysql5" connection-jndi-name="java:/cleansingMySQL"/></model>

                                 

                                    <data-role name="Cleanser" any-authenticated="true">

                                        <description>Default cleansing</description>

                                 

                                        <!- allow read - and any other applicable persmissions against the whole schema -->         

                                        <permission>

                                            <resource-name>MySQLCleaner</resource-name>

                                            <allow-read>true</allow-read>

                                        </permission>

                                 

                                       <!-- cleanse with a mask - which could be any expression including a udf -->

                                       <permission>

                                            <resource-name>MySQLCleaner.newyork.col</resource-name>

                                            <mask>case when col LIKE_REGEX 'ny|nyc' then 'New York' else col end</mask>

                                        </permission>

                                 

                                        ....

                                 

                                    </data-role>

                                ...

                                {code}

                                 

                                 

                                see https://github.com/teiid/teiid-quickstarts/tree/master/dynamicvdb-dataroles and the Reference for more.  The additional consideration here is that masking is logically applied before values are used, so if the masking expression cannot be pushed down then there is generally a performance impact.

                                • 13. Re: Custom Translator for cleansing
                                  pinkstondevin

                                  > I created a small mysql table with 4 columns, all based off the spelling of new york.

                                   

                                  Do you mean the column names have the different spellings or the data does?

                                  Here is a snippit below.  I just copied it ontop of itself for testing purposes.  The first row is just column names.

                                   

                                  nyOne, nyTwo, nyThree, nyFour

                                  New York, nyc, newyork, NY

                                  New York, nyc, newyork, NY

                                  New York, nyc, newyork, NY

                                  New York, nyc, newyork, NY

                                   

                                  > however my translator did not print anything out

                                   

                                  From the above you should at least see "index of new york: " with at the very least -1 for the index.  If that is not occuring then something else is wrong with the setup.  Note that hot deployment of a translator does not currently affect running vdbs as they retain references to whatever instance of a translator they had when they first started.

                                  I actually return the full dataset right now.  My translator is deployed successfuly, no build/compile errors or deployment erros.  It looks like it doesn't really get into my delegating translator.

                                   

                                  I assume I have something wrong with my setup/deployment then with my jar.  It took me a while to successfully deploy the delegating translator.  What do you mean by a "hot" deployment?  I currently have it deployed under the modules directory, and created an entry in my config file for deployment.

                                   

                                  When I used "return null", i was still receiving results. 

                                   

                                  Thank you for the suggestion, I will look into that.  Using the mask could be very benefecial, I am expecting after this first iteration to have a huge overhead that will have to be corrected. 

                                   

                                  Thank you!

                                  • 14. Re: Custom Translator for cleansing
                                    shawkins

                                    > What do you mean by a "hot" deployment?

                                     

                                    Using a cli or admin api jar deployment containing a translator.

                                     

                                    > When I used "return null", i was still receiving results.

                                     

                                    Yes, that means that something still isn't correct with the setup.  Make sure the vdb you're working with is the correct one (perhaps purposely use a bogus translator name, such as cleaner1 and make sure there is a deployment failure) and make sure that the custom translator instance is being picked up in any fashion - perhaps by adding a sys out in a default constructor.  Otherwise you'll want to post what you have to see if we can reproduce or spot something that is amiss.

                                     

                                    Steve

                                    1 2 Previous Next