VDB metadata problem during DQP request processing
hungrykoala May 15, 2017 11:48 AMHello Teiid folks,
Our application has been intermittently showing some very weird behavior, and I am starting this discussion with the hope of getting some hints, further to what we have already managed to analyze.
Unfortunately, we are still not able to reproduce the problem, and it only appeared on one of our environments, which makes us think that it might be related to some race condition.
We have a standalone application, which upon start-up, creates a certain VDB based on certain data sources, and then starts an Embedded Teiid server, that deploys the VDB and exposes a JDBC endpoint for external clients. Perhaps a pretty much standard setup. We use JBoss Teiid 8.10.0.final
The issue we saw happening twice, was that our application's JDBC endpoint became fully unresponsive (even attempts to connect to it failed) because the metadata for the VDB was null:
{code}
2017-02-14 11:18:12,250 ERROR org.teiid.PROCESSOR: TEIID30019 Unexpected exception for request <request_id>
org.teiid.core.TeiidComponentException: TEIID30489 Unable to load metadata for VDB <vdb_name>.
at org.teiid.dqp.internal.process.Request.initMetadata(Request.java:191)
at org.teiid.dqp.internal.process.Request.processRequest(Request.java:436)
at org.teiid.dqp.internal.process.RequestWorkItem.processNew(RequestWorkItem.java:627)
at org.teiid.dqp.internal.process.RequestWorkItem.process(RequestWorkItem.java:326)
at org.teiid.dqp.internal.process.AbstractWorkItem.run(AbstractWorkItem.java:51)
at org.teiid.dqp.internal.process.RequestWorkItem.run(RequestWorkItem.java:265)
at org.teiid.dqp.internal.process.DQPWorkContext.runInContext(DQPWorkContext.java:276)
at org.teiid.dqp.internal.process.ThreadReuseExecutor$RunnableWrapper.run(ThreadReuseExecutor.java:119)
at org.teiid.dqp.internal.process.ThreadReuseExecutor$3.run(ThreadReuseExecutor.java:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
This stack-trace got logged on each incoming JDBC request, so all DQP threads could not process the incoming requests, as VDB was corrupted. However, since ThreadReuseExecutor was used, these exceptions were caught by the thread-management layer, logged, and then the threads were ready to be reused. In other words, the exception never made it as far as to exit the thread. On the contrary, threads were alive, but requests that they served just never made it past this initMetadata() step.
Restarting our application fixed the issue.
VDB with null metadata is probably a bad/corrupted state. It should rather mean that this VDB should had never gotten successfully deployed to Teiid in the first place. Indeed, taking a closer look into our logs revealed that few minutes before this happened, there was a problem with the VDB deployment:
{code}
....
at org.teiid.query.metadata.NativeMetadataRepository.getMetadata(NativeMetadataRepository.java:83)
at org.teiid.query.metadata.NativeMetadataRepository.loadMetadata(NativeMetadataRepository.java:60)
at org.teiid.query.metadata.ChainingMetadataRepository.loadMetadata(ChainingMetadataRepository.java:55)
at org.teiid.runtime.EmbeddedServer.loadMetadata(EmbeddedServer.java:762)
at org.teiid.runtime.AbstractVDBDeployer.loadMetadata(AbstractVDBDeployer.java:168)
at org.teiid.runtime.EmbeddedServer.deployVDB(EmbeddedServer.java:712)
...
{code}
Our application handles this case: upon hitting a failed VDB deployment, our application exits its main thread. However, we are not sure about DQP's threads: if they are daemon threads, they won't be waited from the JVM to exit. So the question here is: when using Embedded Teiid what is the daemon flag of the threads it starts/controls? Could a JVM exit be delayed due to lingering Teiid threads?
The DQP thread management layer leverages ThreadReuseExecutor for managing the request worker threads, and taking a look at some of the JavaDoc there, I get a feeling that certain race conditions are possible in this class.
So our main analysis points for the issue's root cause so far are:
- lingering Teiid threads: the DQP worker pool remained alive, although our application's main thread exited due to the failed VDB deployment. If this was the root cause, how come we cannot reproduce the problem locally? - every time we have a failed VDB deployment locally, the whole JVM exits successfully, and no traces from Teiid are being left whatsoever;
- some weird race condition in DQP thread management layer (maybe the DQPWorkContext ThreadLocals are not correctly cleaned up between reusing threads?)
What do you think of this situation? Ever hit a similar problem?
All my best,
Krum.