6 Replies Latest reply on Jun 24, 2019 11:54 AM by galder.zamarreno

    Intermittent SocketTimeoutException during GetOperation

    prateekk.in

Cross-posted on Stack Overflow: "Infinispan throwing intermittent SocketTimeoutException during GetOperation"

       

We're using jboss/infinispan-server:9.4.0.Final (the Docker Hub image) in standalone server mode with replication, and the clients connect via Hot Rod.

       

Here's the relevant part of our Gradle file:

      infinispan_version = "9.4.0.Final"

      ...

      'infinispan': [

      "org.infinispan:infinispan-core:$infinispan_version",

      "org.infinispan:infinispan-client-hotrod:$infinispan_version"

      ],

       

Here is how each client (Java) connects to the cache over Hot Rod:

       

import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;
import org.infinispan.commons.api.BasicCache;
import org.infinispan.commons.api.BasicCacheContainer;

ConfigurationBuilder builder = new ConfigurationBuilder();
builder.forceReturnValues(false).addServer().host("infinispan_hostname").port(11222);
BasicCacheContainer manager = new RemoteCacheManager(builder.build(), true); // true = start immediately
BasicCache<Object, Object> ourCache = manager.getCache("OUR_CACHE");
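For reference, the 60000 ms in the exception below matches the Hot Rod client's default socketTimeout. These builder knobs would control that behaviour (a sketch with illustrative values; we currently run with the defaults):

ConfigurationBuilder tuned = new ConfigurationBuilder();
tuned.addServer().host("infinispan_hostname").port(11222);
tuned.socketTimeout(15000);     // default is 60000 ms, the timeout seen in the error below
tuned.connectionTimeout(5000);  // fail fast if the TCP connect itself stalls
tuned.maxRetries(3);            // retry the operation before surfacing the exception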

       

Here's the relevant server XML:

       

<subsystem xmlns="urn:infinispan:server:core:9.3" default-cache-container="local">
    <cache-container name="local" default-cache="default" statistics="true">
        <global-state/>
        <local-cache name="OUR_CACHE">
            <expiration lifespan="-1"/>
            <locking isolation="SERIALIZABLE" acquire-timeout="30000" concurrency-level="1000" striping="false"/>
            <file-store relative-to="cachestore.root" path="server" max-entries="-1" purge="false" passivation="false" preload="true" fetch-state="true"/>
            <memory>
                <binary size="100000000" eviction="MEMORY"/>
            </memory>
        </local-cache>
    </cache-container>
</subsystem>

       

Occasionally (roughly every 30 minutes) we get the following exception, seen at DEBUG level, but most of the time everything works as expected. The client seems to wait synchronously in the ourCache.get("key") call until the 60 s timeout. However, if we retry the operation immediately afterwards, it works correctly and there's no error.

       

      2019-03-06 07:15:47,668 ERROR:Error occurred when .."our app data message". : java.net.SocketTimeoutException: GetOperation{OUR_CACHE, key=[B0x033E22742D663039..[37], flags=0} timed out after 60000 ms

      org.infinispan.client.hotrod.exceptions.TransportException: java.net.SocketTimeoutException: GetOperation{OUR_CACHE, key=[B0x033E22742D663039..[37], flags=0} timed out after 60000 ms

      at org.infinispan.client.hotrod.impl.Util.rewrap(Util.java:54) ~[infinispan-client-hotrod-9.4.0.Final.jar:9.4.0.Final]

      at org.infinispan.client.hotrod.impl.Util.await(Util.java:27) ~[infinispan-client-hotrod-9.4.0.Final.jar:9.4.0.Final]

      at org.infinispan.client.hotrod.impl.RemoteCacheImpl.get(RemoteCacheImpl.java:418) ~[infinispan-client-hotrod-9.4.0.Final.jar:9.4.0.Final]

      at ... our app jar's stacktrace

      at javax.servlet.http.HttpServlet.service(HttpServlet.java:661) [servlet-api.jar:?]

      at javax.servlet.http.HttpServlet.service(HttpServlet.java:742) [servlet-api.jar:?]

      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:231) [catalina.jar:8.5.35]

      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) [catalina.jar:8.5.35]

      at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52) [tomcat-websocket.jar:8.5.35]

      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) [catalina.jar:8.5.35]

      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) [catalina.jar:8.5.35]

      at ... our app jar's stacktrace

      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) [catalina.jar:8.5.35]

      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) [catalina.jar:8.5.35]

      at ... our app jar's stacktrace

      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) [catalina.jar:8.5.35]

      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) [catalina.jar:8.5.35]

      at org.apache.logging.log4j.web.Log4jServletFilter.doFilter(Log4jServletFilter.java:71) [log4j-web-2.10.0.jar:2.10.0]

      at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:193) [catalina.jar:8.5.35]

      at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:166) [catalina.jar:8.5.35]

      at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:198) [catalina.jar:8.5.35]

      at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:96) [catalina.jar:8.5.35]

      at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:493) [catalina.jar:8.5.35]

      at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:140) [catalina.jar:8.5.35]

      at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:81) [catalina.jar:8.5.35]

      at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:87) [catalina.jar:8.5.35]

      at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:342) [catalina.jar:8.5.35]

      at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:800) [tomcat-coyote.jar:8.5.35]

      at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:66) [tomcat-coyote.jar:8.5.35]

      at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:806) [tomcat-coyote.jar:8.5.35]

      at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1498) [tomcat-coyote.jar:8.5.35]

      at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) [tomcat-coyote.jar:8.5.35]

      at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]

      at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]

      at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) [tomcat-util.jar:8.5.35]

      at java.lang.Thread.run(Unknown Source) [?:?]

      Caused by: java.net.SocketTimeoutException: GetOperation{OUR_CACHE, key=[B0x033E22742D663039..[37], flags=0} timed out after 60000 ms

      at org.infinispan.client.hotrod.impl.operations.HotRodOperation.run(HotRodOperation.java:172) ~[infinispan-client-hotrod-9.4.0.Final.jar:9.4.0.Final]

      at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-common-4.1.28.Final.jar:4.1.28.Final]

      at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127) ~[netty-common-4.1.28.Final.jar:4.1.28.Final]

      at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[netty-common-4.1.28.Final.jar:4.1.28.Final]

      at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404) ~[netty-common-4.1.28.Final.jar:4.1.28.Final]

      at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:322) ~[netty-transport-native-epoll-4.1.28.Final-linux-x86_64.jar:4.1.28.Final]

      at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) ~[netty-common-4.1.28.Final.jar:4.1.28.Final]

      at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]

      at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]

      ... 1 more

       

Hoping someone could please advise on the root cause and a potential fix.

      Thanks,

      _Prateek

        • 1. Re: Intermittent SocketTimeoutException during GetOperation
          galder.zamarreno

You should check what's going on in the server. Is the server somehow hanging, e.g. on a deadlock or a full GC? Thread dumps and GC logs from the server might give you some clues.
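If jstack isn't easy to run inside the container, a thread dump can also be captured from within the JVM itself. A minimal sketch using only JDK APIs (the ThreadDumper class name is hypothetical):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public final class ThreadDumper {
    public static void dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // true, true = include locked monitors and ownable synchronizers
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.err.print(info); // note: ThreadInfo.toString() truncates very deep stacks
        }
        long[] deadlocked = threads.findDeadlockedThreads();
        System.err.println(deadlocked == null ? "No deadlocked threads"
                : "Deadlocked thread ids: " + java.util.Arrays.toString(deadlocked));
    }
}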

          • 2. Re: Intermittent SocketTimeoutException during GetOperation
            prateekk.in

The REST API call on our app server that triggers this GetOperation does seem to hang as a consequence of the hanging GetOperation (and since the REST call doesn't return, the UI hangs as well).

             

I'm not sure if the server itself is hanging at that point, but the issue is only seen intermittently, about 30 minutes apart, and repeating the operation from the UI right after noticing the stall succeeds right away.

             

I will try to generate more logs like the above, but this is a Docker container (with the JDK and Tomcat copied in at build time).

            • 3. Re: Intermittent SocketTimeoutException during GetOperation
              prateekk.in

I tried taking thread dumps when catching the TransportException, and the analysis tools suggest there's no deadlock or other issue that I can see.

               

I also tried clientIntelligence(ClientIntelligence.BASIC).tcpNoDelay(true).tcpKeepAlive(true).connectionPool().timeBetweenEvictionRuns(120000).minEvictableIdleTime(1800000 / 3).testWhileIdle(true).testOnReturn(true).testOnBorrow(true) on the client, and set external-host to my Docker host in <hotrod-connector><topology-state-transfer> on the server; the issue still persists. I'm not sure whether this is a config issue or something deeper at the transport layer in the Docker networking environment that results in the TransportException after some minutes (even though it initially works fine).
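Spelled out (host and port as in my first snippet), that client configuration reads:

ConfigurationBuilder builder = new ConfigurationBuilder();
builder.addServer().host("infinispan_hostname").port(11222);
builder.clientIntelligence(ClientIntelligence.BASIC) // no topology-aware routing
       .tcpNoDelay(true)
       .tcpKeepAlive(true)
       .connectionPool()
          .timeBetweenEvictionRuns(120000)     // sweep the pool every 2 minutes
          .minEvictableIdleTime(1800000 / 3)   // evict connections idle for 10+ minutes
          .testWhileIdle(true)                 // validate idle pooled connections
          .testOnReturn(true)
          .testOnBorrow(true);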

               

I'm also wondering if replacing Hot Rod with the memcached or REST connectors might help.

              • 4. Re: Intermittent SocketTimeoutException during GetOperation
                galder.zamarreno

Apart from checking that GC is not holding up the server, and checking that there are no threads waiting on something for a long time, did you check the server logs? Did you enable TRACE logging? You should be able to grep by server-side threads and see what they're doing; maybe you can try to correlate client-side requests with server-side responses. In other words, you need to figure out why the server is not replying in time.
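On the client side (which uses Log4j 2, per the stack trace above), the Hot Rod categories can be raised to TRACE programmatically; a minimal sketch, assuming log4j-core is on the classpath:

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

// Early in client startup: raise the Hot Rod client's logging to TRACE so
// client-side requests can be correlated with server-side log entries.
Configurator.setLevel("org.infinispan.client.hotrod", Level.TRACE);

On the server, the equivalent would be a <logger category="org.infinispan"> entry with level TRACE in the logging subsystem of standalone.xml.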

                 

Replacing Hot Rod with another connector would be a blind way to address this, and it has its penalties, since routing via memcached/REST is not as intelligent.

                • 5. Re: Intermittent SocketTimeoutException during GetOperation
                  prateekk.in

I did get the GC logs and ran them through some online tools (they showed no major issues and no full GCs), and our server logs were at the ALL level. The TransportException is a RuntimeException raised within the Hot Rod client, but even reading infinispan-client-hotrod-9.4.0.Final-sources.jar I couldn't find what was causing the issue itself.

                   

I am seeing that after switching to the REST connector it works fine; no socket timeouts there. Could this indicate that Hot Rod is doing some rebalancing, or maintaining some "persistent" connection to the Infinispan server, and that this fails after some time in a Docker Swarm overlay network? We aren't using Infinispan as a cluster or data grid, but mostly as a distributed cache to store unstructured data that we don't want to keep in a database. So there's just one instance, and it runs as a Docker service without replicas.

                  • 6. Re: Intermittent SocketTimeoutException during GetOperation
                    galder.zamarreno

The clues here would be in what happens on the server. You should also inspect thread dumps, etc. I don't have experience with Docker Swarm, so no idea there.