I've got some extra time on my hands lately and I thought I'd contribute some performance instrumentation.
Lately, our team has been hit by a few different performance problems that aren't directly attributable to Teiid, but our lack of visibility into how the engine and translators process queries makes getting to the root of each problem harder than it needs to be. Fundamentally, we need to surface a few key pieces of information and correlate them with user activity:
- The query plan
- The translators involved in the plan
- The work that the translators are doing:
  - CPU & latency of ExecutionFactory.execute() and ExecutionFactory.getConnection()
  - CPU & latency of Execution.execute(), cancel() and close()
  - Cumulative CPU & latency of ResultSetExecution.next()
  - Total number of rows returned from ResultSetExecution.next()
I see two parts to this: more convenient plan logging and engine performance logging.
To improve plan logging, I'd add a new log option that, when enabled, logs each unique, normalized user query and its plan. I was thinking of using a Bloom filter to track the set of normalized user queries and keep the memory requirements low: if the query isn't already in the set, log the plan and the user query. If I can find the right point in the code where the normalized query is already represented as a string, the CPU hit should be pretty small.
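To make the Bloom filter idea concrete, here is a rough sketch of the "have we logged this query yet?" check. None of this is Teiid code: SeenQueryFilter and markIfNew are made-up names, and the FNV-1a hashing and sizing are just assumptions for illustration. One caveat worth keeping in mind is that a Bloom filter can report a false positive, so once in a while a genuinely new query would be treated as already seen and its plan would not be logged.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical helper (not part of Teiid): remembers which normalized queries were already logged. */
final class SeenQueryFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SeenQueryFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /**
     * Records the query and returns true if it was definitely not seen before.
     * A false return means "probably seen"; false positives are possible.
     */
    synchronized boolean markIfNew(String normalizedQuery) {
        byte[] data = normalizedQuery.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // odd stride for double hashing
        boolean isNew = false;
        for (int i = 0; i < numHashes; i++) {
            // Derive the i-th bit position from the two base hashes.
            int index = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(index)) {
                isNew = true;      // at least one bit was unset, so the query is new
                bits.set(index);
            }
        }
        return isNew;
    }

    /** 32-bit FNV-1a over the bytes, starting from the given seed. */
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x01000193;
        }
        return hash;
    }
}
```

The logging call site would then be something like "if (filter.markIfNew(normalizedSql)) log the query and its plan", with the filter sized to the number of distinct queries we expect to see.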
Logging the engine performance is a bit more involved. Inside the engine, instrument DataTierManager.registerRequest() to log the user, user query and request id. To capture translator performance, it looks like the leverage point is ConnectorWorkItem, mostly in the execute() and handleBatch() methods. From these two, log the request id, the part identifier, and the CPU and latency consumed by the translator.
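As a sketch of what the CPU and latency measurement could look like, the snippet below wraps a unit of translator work and accumulates both numbers using the JDK's ThreadMXBean and System.nanoTime(). TranslatorTimer and its methods are hypothetical names, not existing Teiid classes, and it assumes the wrapped call runs entirely on the calling thread and that one timer instance is used per request part (it is not thread-safe). Because the totals accumulate across calls, the same object could cover the cumulative ResultSetExecution.next() numbers.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Teiid): accumulates CPU and wall-clock time of wrapped calls. */
final class TranslatorTimer {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    private long cpuNanos;
    private long wallNanos;
    private long calls;

    /** Runs the work on the current thread and adds its CPU time and latency to the totals. */
    <T> T time(Callable<T> work) throws Exception {
        // CPU time is optional on some JVMs, so check before relying on it.
        boolean cpuAvailable = THREADS.isCurrentThreadCpuTimeSupported()
                && THREADS.isThreadCpuTimeEnabled();
        long cpuStart = cpuAvailable ? THREADS.getCurrentThreadCpuTime() : 0L;
        long wallStart = System.nanoTime();
        try {
            return work.call();
        } finally {
            wallNanos += System.nanoTime() - wallStart;
            if (cpuAvailable) {
                cpuNanos += THREADS.getCurrentThreadCpuTime() - cpuStart;
            }
            calls++;
        }
    }

    long cpuNanos() { return cpuNanos; }
    long wallNanos() { return wallNanos; }
    long calls() { return calls; }

    /** Formats one log line; the field names here are made up for illustration. */
    String summary(String requestId, String partId) {
        return "request=" + requestId + " part=" + partId
                + " calls=" + calls + " cpuNanos=" + cpuNanos + " wallNanos=" + wallNanos;
    }
}
```

In ConnectorWorkItem this would amount to wrapping the translator calls, for example timer.time(() -> someTranslatorCall()), and emitting timer.summary(requestId, partId) when the part closes.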
The obvious missing piece is the work that Teiid itself does to fulfill the user query. I haven't found a good leverage point for that, but I'll keep looking (hints gladly accepted!). Fortunately, Teiid itself hasn't been the source of our performance problems, so I consider it a lower priority.
What do people think? Am I on the right track or do you have suggestions or alternatives?
cpu.patch.zip 3.3 KB