9 Replies Latest reply on Apr 20, 2018 12:42 PM by Steven Hawkins

    Distinct keyword and like_regex performance is very slow

    Avinash Chandwani Newbie

      I am using TEIID to connect to an nosql service, which cannot take care of distinct, so we are getting lot of records, and distinct is performed at TEIID layer.

       

      The distinct keyword and like_regex performance of TEIID is found to be very slow. Can any improvement be achieved in this? By extending or changing the logic how distinct and like_regex would be taken care by TEIID?

       

      Message was edited by: Avinash Chandwani

        • 1. Re: Distinct keyword and like_regex performance is very slow
          Steven Hawkins Master

          > Can any improvement be achieved in this?

           

          Can you provide the query plan?

          • 2. Re: Distinct keyword and like_regex performance is very slow
            Avinash Chandwani Newbie

            You mean explain plan for no-sql query?

             

            We have only one Backed DB Table from no-sql database, a view created in TEIID to call the table.

             

            Query is like:

            select distinct a from table t where <some conditions>

             

            We get duplicated records from no-sql DB to TEIID and then TEIID does the distinct task.

            • 3. Re: Distinct keyword and like_regex performance is very slow
              Steven Hawkins Master

              > You mean explain plan for no-sql query?

               

              Yes.  If you get the query plan after running the query it will not only show the processing nodes, but the time taken at each of the nodes: Query Plans · Teiid Documentation

               

              > select distinct a from table t where <some conditions>

               

              More than likely this will just be a simple plan with an access node, select node, project, and dup removal.

               

              Also what version of Teiid are you using?

              • 4. Re: Distinct keyword and like_regex performance is very slow
                Avinash Chandwani Newbie

                Thank you Steven for your quick responses.

                 

                We are using TEIID 9.0.1.

                 

                Here is the explain plan for the query am trying to execute:

                DupRemoveNode

                  + Relational Node ID:0

                  + Output Columns:agent_name (string)

                  + Statistics:

                    0: Node Output Rows: 3

                    1: Node Next Batch Process Time: 2433

                    2: Node Cumulative Next Batch Process Time: 2663

                    3: Node Cumulative Process Time: 34128

                    4: Node Next Batch Calls: 1009

                    5: Node Blocks: 1008

                  + Cost Estimates:Estimated Node Cardinality: 4.77218624E8

                  + Child 0:

                    AccessNode

                      + Relational Node ID:1

                      + Output Columns:agent_name (string)

                      + Statistics:

                        0: Node Output Rows: 2068765

                        1: Node Next Batch Process Time: 230

                        2: Node Cumulative Next Batch Process Time: 230

                        3: Node Cumulative Process Time: 34127

                        4: Node Next Batch Calls: 3029

                        5: Node Blocks: 1008

                      + Cost Estimates:Estimated Node Cardinality: 4.77218624E8

                      + Query:SELECT wsmodel.metric_data.agent_name FROM wsmodel.metric_data WHERE (wsmodel.metric_data.ts >= {ts'2018-04-07 00:00:00.0'}) AND (wsmodel.metric_data.ts <= {ts'2018-04-07 05:00:00.0'})

                      + Model Name:wsmodel

                  + Data Bytes Sent:106

                  + Planning Time:11

                • 5. Re: Distinct keyword and like_regex performance is very slow
                  Steven Hawkins Master

                  like_regex does not appear in the above.  It does show most of time is spent in retrieval - the cumulative time spent in the Teiid nodes is 2663 milliseconds.  Since we use a tree structure to remove duplicates, you should start seeing results shortly after the first batch is pulled from the source.

                  • 6. Re: Distinct keyword and like_regex performance is very slow
                    Avinash Chandwani Newbie

                    Our No-SQL source doesn't take any where clause except time range, its a time series database.

                     

                    We send the sql query with only time range in the where clause, and in the response returned, we reply upon TEIID to apply distinct, like_regex and other where clauses.

                     

                    This is because our no-sql source, is not designed to support more number of where clauses.

                     

                    Apologies, i had missed like_regex query's plan.

                    SQL Query: select * from numerical_metric_data where ts between '2018-04-07 00:00:00' and '2018-04-07 05:00:00' and metric_path like_regex '.*thieves.*'

                    Explain Plan :

                    ProjectNode

                      + Relational Node ID:0

                      + Output Columns:

                        0: attribute_type (long)

                        1: source_name (string)

                        2: agent_host (string)

                        3: agent_process (string)

                        4: agent_name (string)

                        5: domain_name (string)

                        6: metric_path (string)

                        7: metric_attribute (string)

                        8: frequency (long)

                        9: ts (timestamp)

                        10: min_value (long)

                        11: max_value (long)

                        12: agg_value (long)

                        13: value_count (long)

                      + Statistics:

                        0: Node Output Rows: 256

                        1: Node Next Batch Process Time: 0

                        2: Node Cumulative Next Batch Process Time: 8

                        3: Node Cumulative Process Time: 0

                        4: Node Next Batch Calls: 2

                        5: Node Blocks: 1

                      + Cost Estimates:Estimated Node Cardinality: 1.5907288E8

                      + Child 0:

                        SelectNode

                          + Relational Node ID:1

                          + Output Columns:

                            0: attribute_type (long)

                            1: source_name (string)

                            2: agent_host (string)

                            3: agent_process (string)

                            4: agent_name (string)

                            5: domain_name (string)

                            6: metric_path (string)

                            7: metric_attribute (string)

                            8: frequency (long)

                            9: ts (timestamp)

                            10: min_value (long)

                            11: max_value (long)

                            12: agg_value (long)

                            13: value_count (long)

                          + Statistics:

                            0: Node Output Rows: 256

                            1: Node Next Batch Process Time: 0

                            2: Node Cumulative Next Batch Process Time: 5

                            3: Node Cumulative Process Time: 0

                            4: Node Next Batch Calls: 2

                            5: Node Blocks: 1

                          + Cost Estimates:Estimated Node Cardinality: 1.5907288E8

                          + Child 0:

                            AccessNode

                              + Relational Node ID:2

                              + Output Columns:

                                0: metric_path (string)

                                1: attribute_type (long)

                                2: source_name (string)

                                3: agent_host (string)

                                4: agent_process (string)

                                5: agent_name (string)

                                6: domain_name (string)

                                7: metric_attribute (string)

                                8: frequency (long)

                                9: ts (timestamp)

                                10: min_value (long)

                                11: max_value (long)

                                12: agg_value (long)

                                13: value_count (long)

                              + Statistics:

                                0: Node Output Rows: 256

                                1: Node Next Batch Process Time: 0

                                2: Node Cumulative Next Batch Process Time: 0

                                3: Node Cumulative Process Time: 0

                                4: Node Next Batch Calls: 2

                                5: Node Blocks: 1

                              + Cost Estimates:Estimated Node Cardinality: 4.77218624E8

                              + Query:SELECT wsmodel.metric_data.metric_path, wsmodel.metric_data.attribute_type, wsmodel.metric_data.source_name, wsmodel.metric_data.agent_host, wsmodel.metric_data.agent_process, wsmodel.metric_data.agent_name, wsmodel.metric_data.domain_name, wsmodel.metric_data.metric_attribute, wsmodel.metric_data.frequency, wsmodel.metric_data.ts, wsmodel.metric_data.min_value, wsmodel.metric_data.max_value, wsmodel.metric_data.agg_value, wsmodel.metric_data.value_count FROM wsmodel.metric_data WHERE (wsmodel.metric_data.ts >= {ts'2018-04-07 00:00:00.0'}) AND (wsmodel.metric_data.ts <= {ts'2018-04-07 05:00:00.0'})

                              + Model Name:wsmodel

                          + Criteria:wsdb.metric_path LIKE_REGEX '.*thieves.*'

                      + Select Columns:

                        0: wsdb.attribute_type

                        1: wsdb.source_name

                        2: wsdb.agent_host

                        3: wsdb.agent_process

                        4: wsdb.agent_name

                        5: wsdb.domain_name

                        6: wsdb.metric_path

                        7: wsdb.metric_attribute

                        8: wsdb.frequency

                        9: wsdb.ts

                        10: wsdb.min_value

                        11: wsdb.max_value

                        12: wsdb.agg_value

                        13: wsdb.value_count

                      + Data Bytes Sent:25131

                      + Planning Time:5

                    • 7. Re: Distinct keyword and like_regex performance is very slow
                      Steven Hawkins Master

                      > This is because our no-sql source, is not designed to support more number of where clauses.

                       

                      Understood, that seems like a perfectly fine usage scenario.

                       

                      From the first query - I would expect that the dup removal operation to have a more minimal cost based upon there only being 3 final rows - are your timing consistent, and have you used a later server version to compare?  In any event the performance gain will be minimal even if the dup removal cost were lower as plan we see that over 90% of the time is spent in getting results from the source. 

                       

                      Can you provide the plan for the second query above after all rows have been processed?  That appears to from only after the first batch has been delivered, so the timings are incomplete.

                      • 8. Re: Distinct keyword and like_regex performance is very slow
                        Avinash Chandwani Newbie

                        We are streaming the results from source, its a REST API, so is that the reason to take most time to fetch the data from source?

                         

                        We have implemented ProcedureExecution for interfacing with source Webservice Nosql DB. Implemented

                        ==============================================

                        public List<?> next() throws TranslatorException, DataNotAvailableException method as:

                        if(returnValue.getInputStream() != null) {

                            if (this.jsonParser == null) {

                            recordPublishStartTime = System.currentTimeMillis();

                            this.jsonParser = this.parseJson.apply(this.returnValue

                            .getInputStream());

                            }

                         

                            if (jsonParser.nextToken() != JsonToken.END_ARRAY) {

                            // read the record into a tree model,

                            // this moves the parsing position to the end of it

                            JsonNode node = jsonParser.readValueAsTree();

                            return this.rowFactory.apply((JsonElement) node);

                         

                            }

                        ===============================

                        So do you think this is optimum way to do processing? Please let me know if anything better can be done.

                         

                        Major concern is with distinct and like_regex keyword, so here is the query for query containing both.

                         

                        When i had 3 distinct column values in the source, the explain plan is like :

                         

                        DupRemoveNode

                          + Relational Node ID:0

                          + Output Columns:agent_name (string)

                          + Statistics:

                            0: Node Output Rows: 3

                            1: Node Next Batch Process Time: 13

                            2: Node Cumulative Next Batch Process Time: 13

                            3: Node Cumulative Process Time: 10746

                            4: Node Next Batch Calls: 7

                            5: Node Blocks: 6

                          + Cost Estimates:Estimated Node Cardinality: 4.77218624E8

                          + Child 0:

                            AccessNode

                              + Relational Node ID:1

                              + Output Columns:agent_name (string)

                              + Statistics:

                                0: Node Output Rows: 11273

                                1: Node Next Batch Process Time: 0

                                2: Node Cumulative Next Batch Process Time: 0

                                3: Node Cumulative Process Time: 10746

                                4: Node Next Batch Calls: 18

                                5: Node Blocks: 6

                              + Cost Estimates:Estimated Node Cardinality: 4.77218624E8

                              + Query:SELECT wsmodel.metric_data.agent_name FROM wsmodel.metric_data WHERE (wsmodel.metric_data.ts >= {ts'2018-04-16 19:13:00.0'}) AND (wsmodel.metric_data.ts <= {ts'2018-04-16 19:14:00.0'})

                              + Model Name:wsmodel

                          + Data Bytes Sent:106

                          + Planning Time:18

                         

                         

                        12292

                         

                        At customer end there are lot of records: So here is the explain plan for dupremoval node processing:

                        DupRemoveNode

                        + Relational Node ID:0

                        + Output Columns:agent_name (string)

                        + Statistics:

                        0: Node Output Rows: 158

                        1: Node Next Batch Process Time: 274

                        2: Node Cumulative Next Batch Process Time: 322

                        3: Node Cumulative Process Time: 10209

                        4: Node Next Batch Calls: 782

                        5: Node Blocks: 781

                        + Cost Estimates:Estimated Node Cardinality: 4.77218624E8

                        + Child 0:

                        AccessNode

                        + Relational Node ID:1

                        + Output Columns:agent_name (string)

                        + Statistics:

                        0: Node Output Rows: 1600011

                        1: Node Next Batch Process Time: 48

                        2: Node Cumulative Next Batch Process Time: 48

                        3: Node Cumulative Process Time: 10209

                        4: Node Next Batch Calls: 2344

                        5: Node Blocks: 781

                        + Cost Estimates:Estimated Node Cardinality: 4.77218624E8

                        + Query:SELECT wsmodel.metric_data.agent_name FROM wsmodel.metric_data WHERE (wsmodel.metric_data.ts >= {ts'2018-04-18 07:24:00.0'}) AND (wsmodel.metric_data.ts <= {ts'2018-04-18 07:26:00.0'})

                        + Model Name:wsmodel

                        + Data Bytes Sent:4742

                        + Planning Time:5

                         

                        Also i was looking at the TEIID libraries, and find that compiled regex patterns are cached in the memory, cache size being 100. So pre way to execute the compilation of regex patterns, which can help executing query bit faster?

                        • 9. Re: Distinct keyword and like_regex performance is very slow
                          Steven Hawkins Master

                          > So do you think this is optimum way to do processing? Please let me know if anything better can be done.

                           

                          There's nothing wrong from a Teiid perspective, but it's hard to say if there is something better from just that snippet.  You should do some profiling to see how much time is spent getting the stream and how much time is spent in the row processing.

                           

                          > At customer end there are lot of records: So here is the explain plan for dupremoval node processing:

                           

                          Both of these plans show very little of the overall time being spent in dup remove processing.

                           

                          > So pre way to execute the compilation of regex patterns, which can help executing query bit faster?

                           

                          Yes the engine will use a compiled form of the expression if possible.  Are you asking how much of an impact that has on performance?