14 Replies Latest reply on Sep 6, 2013 5:23 PM by mackerman

    Hotrod resource leak - too many open files

    mackerman

      We are running into a problem with a custom cache loader where we have a resource leak that looks like it is due to Hotrod.  We are running Infinispan 5.3.0.Final on CentOS 6.3.  After running for a few days we have ended up with a few thousand open files (around 2222 at the time this log was snapshotted).  Attached is a log of all file/socket opens and closes (obtained via http://file-leak-detector.kohsuke.org/).  Basically we see a few thousand of these traces:

       

      Opened socket channel by thread:Timer-0 on Sat Aug 10 00:08:25 UTC 2013

                at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:87)

                at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:37)

                at java.nio.channels.SocketChannel.open(SocketChannel.java:105)

                at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.<init>(TcpTransport.java:84)

                at org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory.makeObject(TransportObjectFactory.java:57)

                at org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory.makeObject(TransportObjectFactory.java:38)

                at org.apache.commons.pool.impl.GenericKeyedObjectPool.addObject(GenericKeyedObjectPool.java:1729)

                at org.apache.commons.pool.impl.GenericKeyedObjectPool.ensureMinIdle(GenericKeyedObjectPool.java:2095)

                at org.apache.commons.pool.impl.GenericKeyedObjectPool.ensureMinIdle(GenericKeyedObjectPool.java:2060)

                at org.apache.commons.pool.impl.GenericKeyedObjectPool.access$1600(GenericKeyedObjectPool.java:204)

                at org.apache.commons.pool.impl.GenericKeyedObjectPool$Evictor.run(GenericKeyedObjectPool.java:2360)

                at java.util.TimerThread.mainLoop(Timer.java:512)

                at java.util.TimerThread.run(Timer.java:462)

       

      without corresponding closes.

       

      If you look at the log file you will see that there are other sockets being opened and closed, which I believe can be ignored.  These are the "Opened socket to null" entries, which relate to database connections and REST messaging, and they have mostly been closed by "Closed socket to cache.dev.norcal.pgilab.net".

       

      At one point the server crashed due to a "too many open files" exception, which looks like it will eventually happen again given the current resource leak.  Our ulimit is 2^15 (32768).

       

      Any insight would be appreciated.

       

      thanks, Mitchell

        • 1. Re: Hotrod resource leak - too many open files
          mackerman

          After turning on trace level logging for specific Infinispan classes I have more information.  For a run of about a day I have 935 socket opens and only 105 socket closes:

           

          I.e., 935 log entries like this:

           

          2013-08-14 13:03:24,757 [pool-3-thread-150] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.225.12,port=11222,localport=36368], serverAddress=cache.dev.norcal.pgilab.net/10.81.225.12:11222, id =829}

           

          Opened socket channel by thread:pool-3-thread-70 on Wed Aug 14 13:03:24 UTC 2013

                    at sun.nio.ch.SocketChannelImpl.<init>(Unknown Source)

                    at sun.nio.ch.SelectorProviderImpl.openSocketChannel(Unknown Source)

                    at java.nio.channels.SocketChannel.open(Unknown Source)

                    at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.<init>(TcpTransport.java:84)

           

          and only 105 like:

           

          2013-08-14 13:36:17,928 [Timer-0] TRACE [TcpTransport.destroy() 293] - Successfully closed socket: Socket[unconnected]

           

          Unless I'm reading this incorrectly, it looks like there is a socket leak.

           

          I've attached additional log files.

           

          Mitchell

          • 2. Re: Hotrod resource leak - too many open files
            mackerman

            Well, it's pretty clear from debugging the connection pooling code (org.apache.commons.pool.impl.GenericKeyedObjectPool) that there is a resource leak, and it's due to the idle eviction thread evicting the only idle connection, which then results in a new one being created.  The thread runs every 2 minutes, so we are consuming a file handle every 2 minutes without releasing it.  It wouldn't be so bad if the eviction strategy destroyed the connection, but it doesn't.  For whatever reason, the method resetEvictionObjectCursor(), called by evict(), ensures that the only idle connection will be evicted.  Here is an example of new connections being created; you can see that it occurs every 2 minutes, though I'm not sure why it is occurring twice each time.

             

            Line 102110: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:47:45 UTC 2013
            Line 102124: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:47:45 UTC 2013
            Line 102338: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:49:45 UTC 2013
            Line 102352: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:49:45 UTC 2013
            Line 102566: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:51:45 UTC 2013
            Line 102580: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:51:45 UTC 2013
            Line 102794: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:53:45 UTC 2013
            Line 102808: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:53:45 UTC 2013

             

            Two different things I'm going to try are disabling the idle eviction thread and setting minIdle to 0.
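
            For reference, here is a minimal sketch of those two settings.  This assumes the 5.3 client still forwards the plain commons-pool keys (minIdle, timeBetweenEvictionRunsMillis) from the RemoteCacheManager properties through to its GenericKeyedObjectPool, as listed in the RemoteCacheManager javadoc - treat the property names as an assumption to verify:

            // Sketch only, not verified yet - pool settings to try.
            Properties props = new Properties();
            props.put("infinispan.client.hotrod.server_list", "cache.dev.norcal.pgilab.net:11222");
            // Don't ask the pool to keep an idle connection around for the evictor to churn.
            props.put("minIdle", "0");
            // In commons-pool a non-positive interval disables the idle-eviction timer entirely.
            props.put("timeBetweenEvictionRunsMillis", "-1");
            RemoteCacheManager cacheManager = new RemoteCacheManager(props);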

             

            I can't help but notice that there has been no comment on this issue - is no one really interested that Hotrod has a critical resource leak?

             

            Mitchell

            • 3. Re: Hotrod resource leak - too many open files
              mlinhard

              Hello Mitchell,

               

              Sorry about the delayed response; I'm looking into this and trying to reproduce it on 5.3.0.Final.

               

              Michal.

              • 4. Re: Hotrod resource leak - too many open files
                mlinhard

                Could you please describe your setup?

                 

                You've written that you use a 5.3.0.Final server + a custom cache loader. How is this all connected?

                 

                An app using in-VM Infinispan + a cache loader that uses a remote Hot Rod client to access the Infinispan server?

                An app using a Hot Rod client to connect to a remote Infinispan server that uses the custom loader?

                 

                Where in that setup does the leak happen?

                • 5. Re: Hotrod resource leak - too many open files
                  mlinhard

                  OK, from what I can see in the trace logs, you have a custom Java SE tool using the Hot Rod client (Infinispan 5.3.0.Final) to connect to a remote Hot Rod server, and the leak happens inside the JVM of that tool.

                  This way you have complete control over the connection pooling properties of the RemoteCacheManager (see RemoteCacheManager (Infinispan Distribution 5.3.0.Final API)). Can you please share the ones you're using?

                  • 6. Re: Hotrod resource leak - too many open files
                    mlinhard

                    This is indeed a problem with leaking sockets in the Hot Rod client's connection pool: [#ISPN-3452] Hot Rod client socket leak - JBoss Issue Tracker

                    It happens when a Hot Rod server stops responding and subsequent pings from the eviction thread in the connection pool fail without cleaning up the socket.

                    In your trace logs this is the case for the server cache.dev.norcal.pgilab.net/10.81.225.12; after approximately this point:

                    2013-08-14 01:01:02,639 [Thread-3] TRACE [TcpTransportFactory.releaseTransport() 217] - Dropping connection as it is no longer valid: TcpTransport{socket=Socket[unconnected], serverAddress=cache.dev.norcal.pgilab.net/10.81.225.12:11222, id =4}

                    Every 2 minutes a new unclosed socket (to 10.81.225.12) is created.
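
                    To make the failure mode concrete, here is an illustrative sketch of the missing cleanup - this is not the actual TransportObjectFactory code, and the ping helper is hypothetical; the point is just that when the validation ping fails, the transport's socket has to be closed explicitly or its file descriptor stays open:

                    // Illustrative sketch only - the shape of the missing cleanup, not Infinispan source.
                    @Override
                    public boolean validateObject(Object key, Object obj) {
                        TcpTransport transport = (TcpTransport) obj;
                        try {
                            ping(transport);        // hypothetical helper wrapping the Hot Rod PING operation
                            return true;
                        } catch (Exception e) {
                            transport.destroy();    // without an explicit destroy the dead socket stays open
                            return false;
                        }
                    }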

                     

                    BTW your trace logs were very helpful.

                    Thanks for reporting!

                    • 7. Re: Re: Hotrod resource leak - too many open files
                      mackerman

                      Thanks for looking at the problem Michal.

                       

                      I've been testing a number of different scenarios, one of which included the Hot Rod server terminating and not being available for a while.  It looks like you've found that issue in the 8-14 log, as described above.  That scenario related to a custom cache loader loading a remote Hot Rod server (which would then be replicated to the other Hot Rod server in the cluster).  The server would become unavailable but the cache loader would continue to try to communicate with it.

                       

                      I've also found another issue, where a Hot Rod server is available, but the eviction algorithm evicts a valid idle connection, does not destroy it, and then creates a new connection to replace it.  This is described in my post from yesterday, 8-28.  In this instance, a different application only attempted to get data from Hot Rod initially and thereafter did nothing with Hot Rod, but the RemoteCacheManager remained connected.  I don't have trace level logging for this scenario (though I can generate some if needed), but the attached log shows the connections being opened every 2 minutes and never closed.

                       

                      In both cases, the RemoteCacheManager is created by:

                       

                      properties = new Properties();
                      // looks like the timeout is being ignored btw
                      properties.put("infinispan.client.hotrod.socket_timeout",
                              BobConfiguration.getConfiguration().getInt("infinispan.client.hotrod.socket_timeout", 2500));
                      properties.put("infinispan.client.hotrod.transport_factory", RetryTcpTransportFactory.class.getName());

                      cacheManager = new RemoteCacheManager(properties);
                      cacheManager.start();

                      /**
                       * The purpose of overriding the transport factory is to limit the number of retries when connecting
                       * to Hot Rod.  As of Infinispan 5.3 the only purpose of the transport count is to determine the
                       * number of retries, which by default is 10x the number of Hot Rod cluster members; for us,
                       * with 2 Hot Rod servers per region, that would come out to 20 retries, which is too high.
                       * Setting the transport count to 2 makes the retry count 2.
                       */
                      public static class RetryTcpTransportFactory extends TcpTransportFactory {
                          @Override
                          public int getTransportCount() {
                              return 2;
                          }
                      }
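
                      A side note on the "timeout is being ignored" comment: one possible explanation (an assumption on my part, worth verifying) is that the value is stored as an Integer, while java.util.Properties.getProperty() only returns values that are Strings, so the client would silently fall back to its default.  A hedged tweak would be to store it as a String:

                      properties.put("infinispan.client.hotrod.socket_timeout",
                              String.valueOf(BobConfiguration.getConfiguration().getInt("infinispan.client.hotrod.socket_timeout", 2500)));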


                      Let me know if you need any further information.

                       

                      Mitchell

                      • 8. Re: Re: Hotrod resource leak - too many open files
                        mlinhard

                        Hi Mitchell,

                         

                        I can't reproduce the scenario where the evictor thread evicts a valid connection. I created a scenario similar to the one in ISPN-3452 that lets the evictor thread run on an idle connection pool while the server responds normally (so all idle connections should be valid).

                        With Infinispan 5.3.0.Final and current master (6.0.0-SNAPSHOT) I always see the same single socket staying in the pool.
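
                        For completeness, this is roughly the shape of the idle-client test I'm running (the class name, server address and cache name below are placeholders, not my actual test code):

                        import java.util.Properties;
                        import org.infinispan.client.hotrod.RemoteCache;
                        import org.infinispan.client.hotrod.RemoteCacheManager;

                        public class IdlePoolCheck {
                            public static void main(String[] args) throws Exception {
                                Properties props = new Properties();
                                props.put("infinispan.client.hotrod.server_list", "localhost:11222"); // placeholder address
                                RemoteCacheManager rcm = new RemoteCacheManager(props);
                                RemoteCache<String, String> cache = rcm.getCache("testCache");        // placeholder cache name
                                cache.get("someKey");            // one request, then the client stays idle
                                Thread.sleep(30 * 60 * 1000L);   // let the pool's evictor run ~15 times (every 2 minutes)
                                rcm.stop();                      // open sockets are counted (e.g. with lsof) during the sleep
                            }
                        }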

                         

                        m.

                        • 9. Re: Re: Re: Hotrod resource leak - too many open files
                          mackerman

                          Michal, here are the logs from the scenario where there is an idle Hotrod connection, with an eviction thread running.  You can see that there have been 76 sockets created:

                           

                          [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=48987], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =5}

                           

                          Opened socket channel by thread:Timer-0 on Fri Aug 30 14:58:48 UTC 2013

                            at sun.nio.ch.SocketChannelImpl.<init>(Unknown Source)

                            at sun.nio.ch.SelectorProviderImpl.openSocketChannel(Unknown Source)

                            at java.nio.channels.SocketChannel.open(Unknown Source)

                            at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.<init>(TcpTransport.java:84)

                            at org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory.makeObject(TransportObjectFactory.java:57)

                            at org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory.makeObject(TransportObjectFactory.java:38)

                            at org.apache.commons.pool.impl.GenericKeyedObjectPool.addObject(GenericKeyedObjectPool.java:1729)

                            at org.apache.commons.pool.impl.GenericKeyedObjectPool.ensureMinIdle(GenericKeyedObjectPool.java:2095)

                            at org.apache.commons.pool.impl.GenericKeyedObjectPool.ensureMinIdle(GenericKeyedObjectPool.java:2060)

                            at org.apache.commons.pool.impl.GenericKeyedObjectPool.access$1600(GenericKeyedObjectPool.java:204)

                            at org.apache.commons.pool.impl.GenericKeyedObjectPool$Evictor.run(GenericKeyedObjectPool.java:2360)

                            at java.util.TimerThread.mainLoop(Unknown Source)

                            at java.util.TimerThread.run(Unknown Source)

                           

                          but no corresponding transport destroys.  You can also see that this is occurring twice every 2 minutes:

                           

                          Line 198: 2013-08-30 14:58:48,486 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=48987], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =5}

                          Line 202: 2013-08-30 14:58:49,008 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=48988], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =6}

                          Line 238: 2013-08-30 15:00:48,467 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=48996], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =7}

                          Line 242: 2013-08-30 15:00:49,008 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=48997], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =8}

                          Line 279: 2013-08-30 15:02:48,474 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=49005], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =9}

                          Line 283: 2013-08-30 15:02:49,008 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=49006], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =10}

                           

                          The application is creating a hotrod connection as shown in an earlier post.  We actually create connections to multiple remote caches:

                           

                          2013-08-30 14:56:49,087 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache PropertiesCache

                          2013-08-30 14:56:49,090 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache ReservationCache

                          2013-08-30 14:56:49,093 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache PhoneGroupProductCache

                          2013-08-30 14:56:49,095 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache PasscodeCache

                          2013-08-30 14:56:49,098 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache DeviceCache

                          2013-08-30 14:56:49,100 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache CallBrandingCache

                          2013-08-30 14:56:49,126 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache CallBrandingLookupCache

                           

                          All this application is doing is getting properties from the PropertiesCache, and then there is no further application communication with hotrod, though the connection remains active.

                           

                          Let me know if you need any further information.

                           

                          Mitchell

                          • 10. Re: Hotrod resource leak - too many open files
                            mlinhard

                            In your trace logs I can't see whether the validation of the connection was successful or not; previously your logs looked like this (with trace enabled on TcpTransport):

                            2013-08-13 15:30:25,517 [Timer-0] TRACE [TransportObjectFactory.validateObject() 83] - About to validate(ping) connection to server cache.dev.norcal.pgilab.net/10.81.225.12:11222. TcpTransport is TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net..

                            2013-08-13 15:30:25,517 [Timer-0] TRACE [TcpTransport.writeByte() 159] - Wrote byte 160

                            2013-08-13 15:30:25,522 [Timer-0] TRACE [TcpTransport.writeByte() 159] - Wrote byte 12

                            2013-08-13 15:30:25,523 [Timer-0] TRACE [TcpTransport.writeByte() 159] - Wrote byte 23

                            2013-08-13 15:30:25,523 [Timer-0] TRACE [TcpTransport.writeBytes() 145] - Wrote 0 bytes

                            2013-08-13 15:30:25,527 [Timer-0] TRACE [TcpTransport.writeByte() 159] - Wrote byte 3

                            2013-08-13 15:30:25,527 [Timer-0] TRACE [TcpTransport.writeByte() 159] - Wrote byte 0

                            2013-08-13 15:30:25,528 [Timer-0] TRACE [TcpTransport.flush() 174] - Flushed socket: Socket[addr=cache.dev.norcal.pgilab.net/10.81.225.12,port=11222,localport=53852]

                            2013-08-13 15:30:25,528 [Timer-0] TRACE [TcpTransport.readByte() 189] - Read byte -1 from socket input in Socket[addr=cache.dev.norcal.pgilab.net/10.81.225.12,port=11222,localport=53852]

                            2013-08-13 15:30:25,530 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.225.12,port=11222,localport=53869], serverAddress=cache.dev.norcal.pgilab.net/10.81..

                            and from that I could tell that the ping fails and ISPN-3452 happens.

                            Could you produce trace logs with trace on TcpTransport enabled (or another category that would let us see the result of the ping operation)?  If it turns out that the ping fails, then it's ISPN-3452.

                            • 11. Re: Re: Re: Hotrod resource leak - too many open files
                              mackerman

                              Michal, TRACE level has been set on that class:

                               

                              2013-08-30 14:56:49,464 [main] INFO  [LogUtil.assignLogLevel() 55] - TRACE for 'org.infinispan.client.hotrod.impl.transport'

                              2013-08-30 14:56:49,464 [main] INFO  [LogUtil.assignLogLevel() 55] - TRACE for 'org.infinispan.client.hotrod.impl.transport.Transport'

                              2013-08-30 14:56:49,465 [main] INFO  [LogUtil.assignLogLevel() 55] - initial to TRACE for 'org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport'

                              2013-08-30 14:56:49,465 [main] INFO  [LogUtil.assignLogLevel() 55] - initial to TRACE for 'org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory'

                              2013-08-30 14:56:49,465 [main] INFO  [LogUtil.assignLogLevel() 55] - initial to TRACE for 'org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory'

                              2013-08-30 14:56:49,466 [main] INFO  [LogUtil.assignLogLevel() 55] - TRACE for 'org.infinispan.client.hotrod.impl.transport.TransportFactory'

                              • 12. Re: Hotrod resource leak - too many open files
                                mlinhard

                                Well, I'm not sure how the logging in your system works, but there is no output from the TcpTransport class in the file 10.81.228.155-blobapp.log.

                                I've noticed that there is a difference in the log-level output between the two trace logs:

                                2013-08-13 15:12:24,921 [main] INFO  [LogUtil.assignLogLevel() 55] - TRACE for 'org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport'

                                2013-08-30 14:56:49,465 [main] INFO  [LogUtil.assignLogLevel() 55] - initial to TRACE for 'org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport'

                                 

                                I've no idea what that means, but that might be the configuration difference that disables the logging in the second file.

                                • 13. Re: Hotrod resource leak - too many open files
                                  mackerman

                                  Michal, I've fixed the logging in the app - the problem was that the trace flag was being cached in a static variable at class-loading time, before the logging level was set by the application.  I've changed it so the level is set in log4j.xml, so it's now being set up correctly.

                                   

                                  After running yesterday for a while, I did not observe the large file handle leak that I did before.  I'm going to watch it today and will report back.

                                   

                                  Mitchell

                                  • 14. Re: Hotrod resource leak - too many open files
                                    mackerman

                                    Michal, I've been unable to reproduce the problem with file handle leaks when the cache is not being used.  I'll keep an eye on it and let you know should it recur.  BTW, I'll be out of the office for a while and won't be updating this thread.


                                    Mitchell