Teiid client 8.10.0.Final intermittently hangs when processing multi-batch query to Teiid server 9.3.4
hungrykoala Dec 15, 2017 3:10 PM

Hello Teiid community,
Allow me to bring to your attention a problem we have recently started facing between two applications, one acting as a Teiid client (version 8.10.0.Final) and the other being a Teiid server (version 9.3.4). The client connects to the server's JDBC endpoint through an Apache DBCP connection pool and fires off SQL requests. In case it is relevant: the server instances actually sit behind a VIP load balancer, and the client connects to the VIP rather than directly to any particular server instance.
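For context, the client-side pool is configured roughly along the following lines. This is an illustrative commons-dbcp2 fragment, not our exact configuration; host, port, VDB name, and values are placeholders:

```
# Illustrative commons-dbcp2 settings (placeholder values)
url=jdbc:teiid:MyVDB@mm://teiid-vip.example.com:31000
driverClassName=org.teiid.jdbc.TeiidDriver
maxTotal=20
# Validate pooled connections, since they go through the VIP
# and could in principle be silently dropped by it
validationQuery=SELECT 1
testOnBorrow=true
testWhileIdle=true
timeBetweenEvictionRunsMillis=30000
```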
The major source of complexity here is that the problem is very difficult to reproduce: it takes hours of persistently triggered requests to hit a single occurrence of the issue. Moreover, the described setup has been in place for a long time, while these problems started happening only recently.
The problem is that certain queries, usually served within a couple of seconds, intermittently time out after 1 hour. We took several thread dumps from the client process during that one-hour period, and here is the client thread's call stack:
"Worker16045_QueryProcessorQueue1316447" - Thread t@108468
   java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:170)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
	- locked <78696d53> (a org.teiid.core.util.AccessibleBufferedInputStream)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.teiid.netty.handler.codec.serialization.ObjectDecoderInputStream.readObjectOverride(ObjectDecoderInputStream.java:97)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:365)
	at org.teiid.net.socket.OioOjbectChannelFactory$OioObjectChannel.read(OioOjbectChannelFactory.java:118)
	- locked (a java.lang.Object)
	at org.teiid.net.socket.SocketServerInstanceImpl.read(SocketServerInstanceImpl.java:307)
	at sun.reflect.GeneratedMethodAccessor75.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.teiid.net.socket.SocketServerConnectionFactory$ShutdownHandler.invoke(SocketServerConnectionFactory.java:98)
	at com.sun.proxy.$Proxy75.read(Unknown Source)
	at org.teiid.net.socket.SocketServerInstanceImpl$RemoteInvocationHandler$1.get(SocketServerInstanceImpl.java:406)
	at org.teiid.jdbc.ResultSetImpl.getResults(ResultSetImpl.java:439)
	at org.teiid.jdbc.ResultSetImpl.requestBatch(ResultSetImpl.java:373)
	at org.teiid.jdbc.BatchResults.requestBatchAndWait(BatchResults.java:227)
	at org.teiid.jdbc.BatchResults.requestNextBatch(BatchResults.java:142)
	at org.teiid.jdbc.BatchResults.hasNext(BatchResults.java:253)
	at org.teiid.jdbc.ResultSetImpl.hasNext(ResultSetImpl.java:500)
	at org.teiid.jdbc.ResultSetImpl.next(ResultSetImpl.java:254)
	at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:191)
	at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:191)
	at org.teiid.dqp.internal.datamgr.ConnectorWorkItem.handleBatch(ConnectorWorkItem.java:396)
	at org.teiid.dqp.internal.datamgr.ConnectorWorkItem.more(ConnectorWorkItem.java:214)
	- locked <61e500d> (a org.teiid.dqp.internal.datamgr.ConnectorWorkItem)
	at sun.reflect.GeneratedMethodAccessor79.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.teiid.dqp.internal.datamgr.ConnectorManager$1.invoke(ConnectorManager.java:209)
	at com.sun.proxy.$Proxy73.more(Unknown Source)
	at org.teiid.dqp.internal.process.DataTierTupleSource.getResults(DataTierTupleSource.java:301)
	at org.teiid.dqp.internal.process.DataTierTupleSource$1.call(DataTierTupleSource.java:110)
	at org.teiid.dqp.internal.process.DataTierTupleSource$1.call(DataTierTupleSource.java:107)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.teiid.dqp.internal.process.FutureWork.run(FutureWork.java:58)
	at org.teiid.dqp.internal.process.DQPWorkContext.runInContext(DQPWorkContext.java:276)
	at org.teiid.dqp.internal.process.ThreadReuseExecutor$RunnableWrapper.run(ThreadReuseExecutor.java:119)
	at org.teiid.dqp.internal.process.ThreadReuseExecutor$3.run(ThreadReuseExecutor.java:210)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
	- locked <4465cde1> (a java.util.concurrent.ThreadPoolExecutor$Worker)
(The reason this call stack is somewhat involved and runs through Teiid DQP logic is that our client application also acts as a Teiid server and simply delegates incoming queries to the Teiid server. However, this is not relevant to the problem, as we have managed to reproduce the issue with a simple JMeter-powered JDBC client against the Teiid server.)
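For reference, the standalone reproducer is essentially equivalent to the following sketch. Host, port, VDB name, credentials, and the query are placeholders, and the Teiid 8.10.0.Final client jar is assumed to be on the classpath; any query returning more rows than one batch (2048 rows, matching the rowCount seen in the server log below) triggers the multi-batch fetch path:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TeiidHangReproducer {

    // Builds a Teiid JDBC URL of the form jdbc:teiid:<vdb>@mm://<host>:<port>
    static String url(String vdb, String host, int port) {
        return "jdbc:teiid:" + vdb + "@mm://" + host + ":" + port;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder coordinates: in our setup the host is the VIP in front of the Teiid servers.
        // The Teiid driver (org.teiid.jdbc.TeiidDriver) must be on the classpath.
        try (Connection c = DriverManager.getConnection(
                 url("MyVDB", "teiid-vip.example.com", 31000), "user", "password");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery("SELECT * FROM some_large_view")) {
            long rows = 0;
            while (rs.next()) {   // intermittently blocks here while fetching the next batch
                rows++;
            }
            System.out.println("fetched " + rows + " rows");
        }
    }
}
```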
The stack stays like that for the whole hour, from which it is evident that while asking the server for the next batch of the result set, the client blocks on reading data from the network socket, hangs, and eventually times out.
We raised the server loggers to DEBUG level; here is a snippet of the activity of the server-side thread that handled this request:
2017-12-15 12:00:45.625 [NIO36 ] DEBUG - processing message:MessageHolder: key=550774 contents=Invoke interface org.teiid.client.DQP.executeRequest
2017-12-15 12:00:47.730 [ocessorQueue14017479] DEBUG - send message: MessageHolder: key=550774 contents=ResultsMessage rowCount=256 finalRow=-1
2017-12-15 12:00:47.737 [NIO36 ] DEBUG - processing message:MessageHolder: key=550776 contents=Invoke interface org.teiid.client.DQP.processCursorRequest
2017-12-15 12:00:47.738 [ocessorQueue14017553] DEBUG - send message: MessageHolder: key=550776 contents=ResultsMessage rowCount=2048 finalRow=-1
2017-12-15 12:00:47.755 [NIO36 ] DEBUG - processing message:MessageHolder: key=550777 contents=Invoke interface org.teiid.client.DQP.processCursorRequest
2017-12-15 12:00:47.756 [ocessorQueue14017554] DEBUG - send message: MessageHolder: key=550777 contents=ResultsMessage rowCount=2048 finalRow=-1
2017-12-15 12:00:48.389 [NIO36 ] DEBUG - processing message:MessageHolder: key=550778 contents=Invoke interface org.teiid.client.DQP.processCursorRequest
2017-12-15 12:00:48.389 [ocessorQueue14017556] DEBUG - send message: MessageHolder: key=550778 contents=ResultsMessage rowCount=2048 finalRow=-1
2017-12-15 12:00:48.825 [NIO36 ] DEBUG - processing message:MessageHolder: key=550779 contents=Invoke interface org.teiid.client.DQP.processCursorRequest
2017-12-15 12:00:48.826 [ocessorQueue14017557] DEBUG - send message: MessageHolder: key=550779 contents=ResultsMessage rowCount=2048 finalRow=-1
2017-12-15 12:00:48.932 [NIO36 ] DEBUG - processing message:MessageHolder: key=550780 contents=Invoke interface org.teiid.client.DQP.processCursorRequest
2017-12-15 12:00:48.933 [ocessorQueue14017558] DEBUG - send message: MessageHolder: key=550780 contents=ResultsMessage rowCount=2048 finalRow=-1
2017-12-15 12:02:18.386 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:05:55.327 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:09:29.545 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:13:08.903 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:16:41.864 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:20:16.875 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:23:57.860 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:27:30.273 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:31:07.795 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:34:46.439 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:38:23.325 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:41:59.227 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:45:35.526 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:49:14.088 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:52:46.177 [NIO36 ] DEBUG - Channel closed
2017-12-15 12:56:24.407 [NIO36 ] DEBUG - Channel closed
2017-12-15 13:00:03.109 [NIO36 ] DEBUG - Channel closed
2017-12-15 13:00:52.143 [NIO36 ] DEBUG - TEIID40114 Unhandled exception, closing client instance: Connection reset by peer
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:880)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
	at java.lang.Thread.run(Thread.java:745)
2017-12-15 13:00:52.143 [NIO36 ] DEBUG - Channel closed
2017-12-15 13:03:34.811 [NIO36 ] DEBUG - Channel closed
Due to the specifics of the logging format, the classes these messages originated from are not logged. However, I believe that:
- 'Channel closed' is logged from org.teiid.transport.SSLAwareChannelHandler class, and its channelInactive method;
- 'MessageHolder' logs are from org.teiid.net.socket.Message;
- the log lines that contain rowCount, finalRow, etc. are coming from org.teiid.transport.SocketClientInstance.
In the beginning the NIO36 thread successfully served batches to the client, between 12:00:45.625 and 12:00:48.933 (approximately 3 seconds), after which it started logging only the following line, roughly once every 3.5 minutes:
2017-12-15 12:02:18.386 [NIO36 ] DEBUG - Channel closed
which continued all the way up to 13:00:52.143, when it hit the exception. As there is no record in the logs with a positive finalRow (only finalRow=-1), it is evident that the last batch was never served to the client, and hence the result set never got fully exhausted on the server side.
Finally, the exception logged on the client side after 1 hour is attached. Basically the connection timeout expires, and the client likely resets the connection, which is what the server logged.
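As a stop-gap, we are considering bounding the client-side wait with the standard JDBC query timeout so that the hang surfaces after seconds rather than an hour. A minimal sketch follows (the class and method names are ours for illustration; we are assuming the Teiid driver enforces the timeout even while blocked mid-fetch, which is part of what we would like to confirm):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class QueryTimeoutExample {

    // Standard JDBC: ask the driver to abort the statement if it runs
    // longer than 'seconds'. Assumption: the Teiid client honors this
    // while waiting for the next result-set batch.
    static void applyTimeout(Statement s, int seconds) throws SQLException {
        s.setQueryTimeout(seconds);
    }

    // Usage: the Connection comes from elsewhere, e.g. the DBCP pool.
    static void runBounded(Connection c, String sql) throws SQLException {
        try (Statement s = c.createStatement()) {
            applyTimeout(s, 60); // fail fast after 60s instead of hanging for an hour
            try (ResultSet rs = s.executeQuery(sql)) {
                while (rs.next()) {
                    // consume rows
                }
            }
        }
    }
}
```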
Could you please let us know whether this behaviour rings any bells? Have you observed anything like it before?
Is the 9.3.4 server compatible with the 8.10.0.Final client?
Is this a matter of configuring the client/server so that the multi-batch query gets successfully served?
Any hints where/what to look at?
All my best,
Krum.
-
ClientSideExceptions.txt.zip 2.4 KB