10 Replies Latest reply on May 2, 2014 8:39 PM by shawkins

    Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4

    anilallewar

      I had some confusion regarding the various parameters that affect the movement of data from the source systems => Teiid => Teiid client. Can you please verify whether my understanding is correct and clarify the questions below?

       

      1. processor-batch-size - The documentation says "The max row count of a batch sent internally within the query processor. Should be <= the connectorBatchSize. (default 512)". So is this the tuple size for batches stored in the buffer manager and the number of rows sent back in each batch to the client? When I used the CLI to get the runtime values, it indicated that the default value is 256. So is the "jboss-teiid.xsd" document out of sync?
      2. connector-batch-size - The documentation says "The max row count of a batch from a connector. Should be even multiple of processorBatchSize. (default 1024)". So does this get turned into the fetch size on the connector, i.e. if I am using Oracle, is this the fetch size for the Oracle JDBC driver?
      3. max-row-fetch-size - The documentation says "Maximum allowed fetch size, set via JDBC. User requested value ignored above this value. (default 20480)". If this is the fetch size used by the Teiid JDBC driver, then isn't the default value 2048 (from browsing the source code)? Is 20480 the limit above which the client can't set the fetch size?
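
      For reference, the fetch size in question is the standard JDBC hint set on the statement. A minimal client-side sketch (the VDB name, host, table, and credentials below are hypothetical):

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class TeiidFetchSizeExample {
          public static void main(String[] args) throws Exception {
              // hypothetical VDB, host, and credentials; standard Teiid JDBC URL form
              try (Connection conn = DriverManager.getConnection(
                      "jdbc:teiid:myVDB@mm://localhost:31000", "user", "password");
                   Statement stmt = conn.createStatement()) {
                  // a hint, not a contract; values above max-row-fetch-size
                  // (default 20480) are ignored by the server
                  stmt.setFetchSize(512);
                  try (ResultSet rs = stmt.executeQuery("SELECT * FROM my_table")) {
                      while (rs.next()) {
                          // process each row
                      }
                  }
              }
          }
      }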

       

      Thanks,

      Anil

        • 1. Re: Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4
          anilallewar

          When I debugged through the code, the translator is passed the fetch size of the request message (which appears to be 2 * BufferManager.DEFAULT_PROCESSOR_BATCH_SIZE) in the org.teiid.dqp.internal.datamgr.ConnectorWorkItem class. Eventually this fetch size gets set on the source statement via the "org.teiid.translator.jdbc.JDBCExecutionFactory.setFetchSize(Command command, ExecutionContext context, Statement statement, int fetchSize) throws SQLException" method.
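
          To illustrate the hook, here is a minimal sketch of how a custom translator could override it, assuming it is overridable as its signature suggests (the class name is hypothetical; the base implementation presumably just forwards the value to the source driver):

          import java.sql.SQLException;
          import java.sql.Statement;

          import org.teiid.language.Command;
          import org.teiid.translator.ExecutionContext;
          import org.teiid.translator.jdbc.JDBCExecutionFactory;

          public class CustomJDBCExecutionFactory extends JDBCExecutionFactory {
              @Override
              public void setFetchSize(Command command, ExecutionContext context,
                      Statement statement, int fetchSize) throws SQLException {
                  // fetchSize arrives from the engine (e.g. 2 * the processor
                  // batch size) and is forwarded to the source JDBC driver
                  statement.setFetchSize(fetchSize);
              }
          }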


          So it looks like the connector-batch-size is not really used to gather data from the source; can you please clarify its significance?

          • 2. Re: Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4
            rareddy

            It has been removed as part of [TEIID-2316] Give translators a better batch size - JBoss Issue Tracker; I believe we need to correct the documentation.

             

            The processor batch size is not the client fetch size; it is the internal tuple batch size used for processing.

             

            max-row-fetch-size: yes, I believe that exists to avoid memory exhaustion. In any case, the JDBC fetch size is only what the user recommends; it does not mean all batches are sent at that size all the time. If it falls in between internal batch sizes, the server will only return up to that batch.

            • 3. Re: Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4
              anilallewar


              Ramesh,

               

              I debugged through the code and found that you set the source fetch size to twice the processor batch size. However, I am curious to know how you determine the working batch size from the processor batch size and the estimated data size. As stated in the JIRA issue:


              "With the default processor batch size is 256, which means that the connector will be asked for batch sizes between 64 and 4096 rows."

               

              I am seeing that Teiid now returns results faster than Oracle when the number of rows to be returned is large (large data size), and this was not happening with the previous version of the code.

               

              Anil

              • 4. Re: Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4
                shawkins

                > However, I am curious to know how you determine the working batch size from the processor batch size and the estimated data size.

                 

                You can see the routine in BufferManagerImpl getSizeEstimates - teiid/engine/src/main/java/org/teiid/common/buffer/impl/BufferManagerImpl.java at master · teiid/teiid · GitHub. Basically we assume a nominal data width of 2k per row, then adjust the working batch size up or down when the data seems much larger or much smaller than that value.
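
                To give a feel for the idea, here is a minimal sketch of that adjustment - not the actual getSizeEstimates code - assuming the nominal 2k row width mentioned above and the up-to-3-powers-of-2 scaling range described later in this thread:

                public class WorkingBatchSizeSketch {

                    static final int NOMINAL_ROW_WIDTH = 2048; // ~2k bytes per row

                    // scale the configured processor batch size by powers of two,
                    // within +/- 3 powers, based on the estimated row width
                    static int workingBatchSize(int processorBatchSize, int estimatedRowWidth) {
                        int size = processorBatchSize;
                        int width = estimatedRowWidth;
                        int shifts = 0;
                        // narrow rows: double the batch size, up to 3 times
                        while (width * 2 <= NOMINAL_ROW_WIDTH && shifts < 3) {
                            size <<= 1;
                            width <<= 1;
                            shifts++;
                        }
                        // wide rows: halve the batch size, up to 3 times
                        while (width >= NOMINAL_ROW_WIDTH * 2 && shifts > -3) {
                            size >>= 1;
                            width >>= 1;
                            shifts--;
                        }
                        return size;
                    }

                    public static void main(String[] args) {
                        System.out.println(workingBatchSize(256, 64));    // narrow rows -> 2048
                        System.out.println(workingBatchSize(256, 2048));  // nominal rows -> 256
                        System.out.println(workingBatchSize(256, 32768)); // wide rows -> 32
                    }
                }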

                 

                > I am seeing that Teiid now returns results faster than Oracle when the number of rows to be returned is large (large data size), and this was not happening with the previous version of the code.

                 

                That sounds like a positive result.

                • 5. Re: Re: Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4
                  anilallewar

                  Steve,

                   

                  I tried to follow the code but couldn't find exactly where the batch size is changed; my guess is the org.teiid.dqp.internal.process.RequestWorkItem.sendResultsIfNeeded(TupleBatch batch) method, in the code snippet below.

                   

                  // determines how many working batches may be sent to the client
                  // per fetch; when the actual row size estimate is much smaller
                  // than the schema-based estimate, more batches are allowed
                  int rowSize = resultsBuffer.getRowSizeEstimate();
                  int batches = CLIENT_FETCH_MAX_BATCHES;
                  if (rowSize > 0) {
                      int totalSize = rowSize * resultsBuffer.getBatchSize();
                      if (schemaSize == 0) {
                          schemaSize = this.dqpCore.getBufferManager().getSchemaSize(this.originalCommand.getProjectedSymbols());
                      }
                      int multiplier = schemaSize / totalSize;
                      if (multiplier > 1) {
                          batches *= multiplier;
                      }
                  }
                  

                   

                  So, continuing in my ignorance, the assumption is that the number of rows in a batch returned to the client would be higher if the row size is smaller. Again, this impacts the size of batches stored in the BufferManager and consequently the number of rows sent per batch to the client. The fetch size of the source statement (in the case of JDBC access) still remains (processor-batch-size * 2), since that was set when we executed the query against the source.

                   

                  How is the client fetch size related to this? Say my fetch size (on the Teiid statement) is 512 and the processor-batch-size is 256; are 2 batches then sent to the client for each batch request from the client?

                   

                  Anil

                  • 6. Re: Re: Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4
                    shawkins

                    > I tried to follow the code but couldn't find exactly where the batch size is changed; my guess is the org.teiid.dqp.internal.process.RequestWorkItem.sendResultsIfNeeded(TupleBatch batch) method, in the code snippet below.

                     

                    That code is responsible for allowing more than 1 working batch to be sent to the client as part of a fetch.

                     

                    > How is the client fetch size related to this? Say my fetch size (on the Teiid statement) is 512 and the processor-batch-size is 256; are 2 batches then sent to the client for each batch request from the client?

                     

                    From the processor-batch-size we first compute the working batch size - for smaller data widths it will be larger, for larger widths it will be smaller. The place where this counts is the output buffer, but we actually compute a working batch size for all tuple buffers used in query processing.

                     

                    We then consider the fetch size on the statement as the maximal number of rows to send to the client (as described by JDBC, it's a hint, not a contract). We can send less than the fetch size. Typically that happens when the fetch size is larger than the working batch size and the client is effectively ahead of the server - that is, processing results as fast as they can be fed. If the fetch size is large enough, then the logic above will allow more than one working batch (up to CLIENT_FETCH_MAX_BATCHES) to be sent together.
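
                    To make that concrete with your numbers: with a working batch size of 256 and a client fetch size of 512, the server may send up to 2 working batches (512 rows) per fetch when both are ready, but if the client is ahead of the server it may instead receive a single 256-row batch as soon as it is available rather than waiting for the full 512.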

                    • 7. Re: Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4
                      anilallewar

                      > From the processor-batch-size we first compute the working batch size

                       

                      So if I am aware of the data characteristics, would setting a higher/lower processor-batch-size have any impact on the performance other than capping the working batch size? From the description above, it looks like the processor-batch-size only caps the max rows sent within the query processor while the working batch size is used for tuple buffers.

                       

                      > (up to CLIENT_FETCH_MAX_BATCHES) to be sent together

                       

                      The number of batches seems to be CLIENT_FETCH_MAX_BATCHES * multiplier, meaning it can be more than CLIENT_FETCH_MAX_BATCHES.

                       

                      > The fetch size of the source statement (in the case of JDBC access) still remains (processor-batch-size * 2), since that was set when we executed the query against the source.

                       

                      I am assuming this is correct and we read at least one processor batch ahead.

                      • 8. Re: Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4
                        shawkins

                        > From the description above, it looks like the processor-batch-size only caps the max rows sent within the query processor while the working batch size is used for tuple buffers.

                         

                        The processor-batch-size is not a cap. It is a base value around which the working batch size is computed - for both processing nodes and tuple buffers.

                         

                        > The number of batches seems to be CLIENT_FETCH_MAX_BATCHES * multiplier, meaning it can be more than CLIENT_FETCH_MAX_BATCHES.

                         

                        Yes, it can. Things are slightly more complicated than the first-level description, in that we deal with both an a priori estimate of the data width (just looking at the type information) and a more accurate sample based upon the data that is actually read. The logic there looks at the sampled data width to further adjust how many working batches seem like a good idea to send to the client at once.

                         

                        There are two reasons why we simply don't honor the client fetch size:

                        1. We don't want to unnecessarily introduce latency by holding results until we reach the fetch size.

                        2. The memory usage is effectively untracked. The results messages are being handed off to the netty nio layer to be sent to the client. Unless we introduced phantom reference tracking, we wouldn't reliably know when the data had been sent such that the results were gc eligible.

                         

                        > I am assuming this is correct and we read at least one processor batch ahead.

                         

                        It will actually be 2 * (working batch size), where the working batch size is computed for the data width of the source query. But yes, we are effectively trying to read two batches at once and perform a read-ahead.
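
                        For example, with the default processor batch size of 256, a wide source query whose working batch size scales down to 64 would get a source fetch size of 128, while a narrow one scaling up to 2048 would get 4096 - consistent with the 64 to 4096 row range quoted from the JIRA issue above.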

                        • 9. Re: Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4
                          anilallewar

                          > The processor-batch-size is not a cap.


                          The documentation kind of states it's a cap. I am curious to know if it's really useful for a Teiid server admin to tweak its value.

                          "buffer-service-processor-batch-size" => {

                                          "type" => INT,

                                          "description" => "The max row count of a batch sent internally within the query processor. (default 256)",

                           

                          This is my set of instructions for an administrator trying to tune Teiid, and it would help if you could validate whether the information is correct for someone trying to tweak Teiid performance.

                           

                          1. processor-batch-size (default 256)

                            1. Determines the maximum size of a batch in terms of the number of rows sent to the client.
                            2. If your tables have less data per row, you can increase the processor batch size to reduce the number of round trips to the server.
                            3. Conversely, if your tables have a large number of columns (and hence a large row size), consider decreasing the processor batch size.
                            4. Teiid will read at least one batch ahead on the source; hence the source fetch size in the case of JDBC will be (2 * processor-batch-size).
                          • 10. Re: Clarification on parameters for batch size sent to client/handled internally by Teiid 8.4
                            shawkins

                            > The documentation kind of states it's a cap


                            Yes, that needs to be updated.

                             

                            > I am curious to know if it's really useful for a Teiid server admin to tweak its value.


                            Yes it can be.  See Memory Management - Teiid 8.8 (draft) - Project Documentation Editor


                            > Determines the maximum size of a batch in terms of the number of rows sent to the client.

                            > If your tables have less data per row, you can increase the processor batch size to reduce the number of round trips to the server.


                            Ideally you don't want to start with these considerations. One of the reasons for the server to use a working batch size is so that we'll use a larger batch size for queries with smaller data widths. The client does perform a concurrent prefetch (you are processing one fetch while the next one is being retrieved), and a 256-row batch with something typical like 2kb / row is still a 512kb+ results message. So unless you are dealing only with million-row queries, with narrow results, and a slow network, that won't be an issue - and even then, if you set the batch size too high you can cause memory issues, especially under heavy concurrent load.


                            > Conversely, if your tables have a large number of columns (and hence a large row size), consider decreasing the processor batch size.


                            Again, that is why we compute a working batch size per operation, so that you don't have to attempt to find a one-size-fits-all value.


                            > Teiid will read at least one batch ahead on the source; hence the source fetch size in the case of JDBC will be (2 * processor-batch-size).


                            To be pedantic, it will be twice the working batch size, which can scale up or down 3 powers of 2 from the processor batch size (for example, with 256 the working batch size could be between 32 and 2048 rows).