14 Replies Latest reply on Sep 6, 2013 5:23 PM by mackerman

    Hotrod resource leak - too many open files

    mackerman

      We are running into a problem with a custom cache loader where we have a resource leak that looks like it is due to Hotrod.  We are running Infinispan 5.3.0.Final on CentOS 6.3.  After running for a few days we have ended up with a few thousand open files (around 2222 at the time this log was snapshotted).  Attached is a log of all file/socket opens and closes (obtained via http://file-leak-detector.kohsuke.org/).  Basically we see a few thousand of these traces:

       

      Opened socket channel by thread:Timer-0 on Sat Aug 10 00:08:25 UTC 2013

                at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:87)

                at sun.nio.ch.SelectorProviderImpl.openSocketChannel(SelectorProviderImpl.java:37)

                at java.nio.channels.SocketChannel.open(SocketChannel.java:105)

                at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.<init>(TcpTransport.java:84)

                at org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory.makeObject(TransportObjectFactory.java:57)

                at org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory.makeObject(TransportObjectFactory.java:38)

                at org.apache.commons.pool.impl.GenericKeyedObjectPool.addObject(GenericKeyedObjectPool.java:1729)

                at org.apache.commons.pool.impl.GenericKeyedObjectPool.ensureMinIdle(GenericKeyedObjectPool.java:2095)

                at org.apache.commons.pool.impl.GenericKeyedObjectPool.ensureMinIdle(GenericKeyedObjectPool.java:2060)

                at org.apache.commons.pool.impl.GenericKeyedObjectPool.access$1600(GenericKeyedObjectPool.java:204)

                at org.apache.commons.pool.impl.GenericKeyedObjectPool$Evictor.run(GenericKeyedObjectPool.java:2360)

                at java.util.TimerThread.mainLoop(Timer.java:512)

                at java.util.TimerThread.run(Timer.java:462)

       

      without corresponding closes.

       

      If you look at the log file you will see that there are other sockets being opened and closed, which I believe can be ignored.  These are the "Opened socket to null" entries, which relate to database connections and REST messaging, and they have mostly been closed by "Closed socket to cache.dev.norcal.pgilab.net".

       

      At one point the server crashed due to a "too many open files" exception, which looks like it will eventually happen again given the current resource leak.  Our ulimit is 2^15 (32768).

       

      Any insight would be appreciated.

       

      thanks, Mitchell

        • 1. Re: Hotrod resource leak - too many open files
          mackerman

          After turning on trace level logging for specific Infinispan classes I have more information.  For a run of about a day I have 935 socket opens and only 105 socket closes:

           

          I.e., 935 log entries like this:

           

          2013-08-14 13:03:24,757 [pool-3-thread-150] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.225.12,port=11222,localport=36368], serverAddress=cache.dev.norcal.pgilab.net/10.81.225.12:11222, id =829}

           

          Opened socket channel by thread:pool-3-thread-70 on Wed Aug 14 13:03:24 UTC 2013

                    at sun.nio.ch.SocketChannelImpl.<init>(Unknown Source)

                    at sun.nio.ch.SelectorProviderImpl.openSocketChannel(Unknown Source)

                    at java.nio.channels.SocketChannel.open(Unknown Source)

                    at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.<init>(TcpTransport.java:84)

           

          and only 105 like:

           

          2013-08-14 13:36:17,928 [Timer-0] TRACE [TcpTransport.destroy() 293] - Successfully closed socket: Socket[unconnected]

           

          Unless I'm reading this incorrectly, it looks like there is a socket leak.

           

          I've attached additional log files.

           

          Mitchell

          • 2. Re: Hotrod resource leak - too many open files
            mackerman

            Well, it's pretty clear from debugging the connection pooling code (org.apache.commons.pool.impl.GenericKeyedObjectPool) that there is a resource leak, and it's due to the idle eviction thread evicting the only idle connection, which then results in a new one being created.  The thread runs every 2 minutes, so we are consuming a file handle every 2 minutes without releasing it.  It wouldn't be so bad if the eviction strategy destroyed the connection, but it doesn't.  For whatever reason, the method resetEvictionObjectCursor(), called by evict(), ensures that the only idle connection will be evicted.  Here is an example of new connections being created; you can see that it occurs every 2 minutes, though I'm not sure why it is occurring twice each time.

             

            Line 102110: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:47:45 UTC 2013
            Line 102124: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:47:45 UTC 2013
            Line 102338: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:49:45 UTC 2013
            Line 102352: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:49:45 UTC 2013
            Line 102566: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:51:45 UTC 2013
            Line 102580: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:51:45 UTC 2013
            Line 102794: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:53:45 UTC 2013
            Line 102808: Opened socket channel by thread:Timer-0 on Wed Aug 28 10:53:45 UTC 2013

             

            Two different things I'm going to try are disabling the idle eviction thread and setting minIdle to 0.
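
            For reference, here is a minimal sketch of those two settings.  This assumes the 5.3 client still forwards the plain commons-pool keys (minIdle, timeBetweenEvictionRunsMillis) from the RemoteCacheManager properties through to its GenericKeyedObjectPool, as listed in the RemoteCacheManager javadoc - treat the property names as an assumption to verify:

            // Sketch only, not verified yet - pool settings to try.
            Properties props = new Properties();
            props.put("infinispan.client.hotrod.server_list", "cache.dev.norcal.pgilab.net:11222");
            // Don't ask the pool to keep an idle connection around for the evictor to churn.
            props.put("minIdle", "0");
            // In commons-pool a non-positive interval disables the idle-eviction timer entirely.
            props.put("timeBetweenEvictionRunsMillis", "-1");
            RemoteCacheManager cacheManager = new RemoteCacheManager(props);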

             

            I can't help but notice that there has been no comment on this issue - is no one really interested that Hotrod has a critical resource leak?

             

            Mitchell

            • 3. Re: Hotrod resource leak - too many open files
              mlinhard

              Hello Mitchell,

               

              Sorry about the delayed response; I'm looking into this and trying to reproduce it on 5.3.0.Final.

               

              Michal.

              • 4. Re: Hotrod resource leak - too many open files
                mlinhard

                Could you please describe your setup?

                 

                You've written that you use a 5.3.0.Final server + a custom cache loader. How is this all connected?

                 

                An app using in-VM Infinispan + a cache loader that uses a remote Hot Rod client to access the Infinispan server?

                An app using a Hot Rod client to connect to a remote Infinispan server that uses the custom loader?

                 

                Where in that setup does the leak happen?

                • 5. Re: Hotrod resource leak - too many open files
                  mlinhard

                  OK, from what I can see in the trace logs, you have a custom Java SE tool using the Hot Rod client (Infinispan 5.3.0.Final) to connect to a remote Hot Rod server, and the leak happens inside the JVM of that tool.

                  This way you have complete control over the connection pooling properties of the RemoteCacheManager (see RemoteCacheManager (Infinispan Distribution 5.3.0.Final API)). Can you please share the ones you're using?

                  • 6. Re: Hotrod resource leak - too many open files
                    mlinhard

                    This is indeed a problem with leaking sockets in the Hot Rod client's connection pool: [#ISPN-3452] Hot Rod client socket leak - JBoss Issue Tracker

                    It happens when a Hot Rod server stops responding and subsequent pings from the eviction thread in the connection pool fail without cleaning up the socket.

                    In your trace logs this is the case for the server cache.dev.norcal.pgilab.net/10.81.225.12; after approximately this point:

                    2013-08-14 01:01:02,639 [Thread-3] TRACE [TcpTransportFactory.releaseTransport() 217] - Dropping connection as it is no longer valid: TcpTransport{socket=Socket[unconnected], serverAddress=cache.dev.norcal.pgilab.net/10.81.225.12:11222, id =4}

                    Every 2 minutes a new unclosed socket (to 10.81.225.12) is created.
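
                    To make the failure mode concrete, here is an illustrative sketch of the missing cleanup - this is not the actual TransportObjectFactory code, and the ping helper is hypothetical; the point is just that when the validation ping fails, the transport's socket has to be closed explicitly or its file descriptor stays open:

                    // Illustrative sketch only - the shape of the missing cleanup, not Infinispan source.
                    @Override
                    public boolean validateObject(Object key, Object obj) {
                        TcpTransport transport = (TcpTransport) obj;
                        try {
                            ping(transport);        // hypothetical helper wrapping the Hot Rod PING operation
                            return true;
                        } catch (Exception e) {
                            transport.destroy();    // without an explicit destroy the dead socket stays open
                            return false;
                        }
                    }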

                     

                    BTW your trace logs were very helpful.

                    Thanks for reporting!

                    • 7. Re: Re: Hotrod resource leak - too many open files
                      mackerman

                      Thanks for looking at the problem Michal.

                       

                      I've been testing a number of different scenarios, one of which included the Hot Rod server terminating and not being available for a while.  It looks like you've found that issue in the 8-14 log, as described above.  That scenario related to a custom cache loader loading a remote Hot Rod server (which would then be replicated to the other Hot Rod server in the cluster).  The server would become unavailable but the cache loader would continue to try to communicate with it.

                       

                      I've also found another issue, where a Hot Rod server is available, but the eviction algorithm evicts a valid idle connection, does not destroy it, and then creates a new connection to replace it.  This is described in my post from yesterday, 8-28.  In this instance, a different application only attempted to get data from Hot Rod initially and thereafter did nothing with Hot Rod, but the RemoteCacheManager remained connected.  I don't have trace level logging for this scenario (though I can generate some if needed), but the attached log shows the connections being opened every 2 minutes and never closed.

                       

                      In both cases, the RemoteCacheManager is created by:

                       

                      properties = new Properties();
                      // looks like the timeout is being ignored btw
                      properties.put("infinispan.client.hotrod.socket_timeout",
                              BobConfiguration.getConfiguration().getInt("infinispan.client.hotrod.socket_timeout", 2500));
                      properties.put("infinispan.client.hotrod.transport_factory", RetryTcpTransportFactory.class.getName());

                      cacheManager = new RemoteCacheManager(properties);
                      cacheManager.start();

                      /**
                       * The purpose of overriding the transport factory is to limit the number of retries when connecting
                       * to Hot Rod.  As of Infinispan 5.3 the only purpose of the transport count is to determine the
                       * number of retries, which by default is 10x the number of Hot Rod cluster members; for us,
                       * with 2 Hot Rod servers per region, that would come out to 20 retries, which is too high.
                       * Setting the transport count to 2 makes the retry count 2.
                       */
                      public static class RetryTcpTransportFactory extends TcpTransportFactory {
                          @Override
                          public int getTransportCount() {
                              return 2;
                          }
                      }
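
                      A side note on the "timeout is being ignored" comment: one possible explanation (an assumption on my part, worth verifying) is that the value is stored as an Integer, while java.util.Properties.getProperty() only returns values that are Strings, so the client would silently fall back to its default.  A hedged tweak would be to store it as a String:

                      properties.put("infinispan.client.hotrod.socket_timeout",
                              String.valueOf(BobConfiguration.getConfiguration().getInt("infinispan.client.hotrod.socket_timeout", 2500)));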


                      Let me know if you need any further information.

                       

                      Mitchell

                      • 8. Re: Re: Hotrod resource leak - too many open files
                        mlinhard

                        Hi Mitchell,

                         

                        I can't reproduce the scenario where the evictor thread evicts a valid connection. I created a scenario similar to the one in ISPN-3452 that lets the evictor thread run on an idle connection pool while the server responds normally (so all idle connections should be valid).

                        With Infinispan 5.3.0.Final and current master (6.0.0-SNAPSHOT) I always see the same single socket staying in the pool.
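
                        For completeness, this is roughly the shape of the idle-client test I'm running (the class name, server address and cache name below are placeholders, not my actual test code):

                        import java.util.Properties;
                        import org.infinispan.client.hotrod.RemoteCache;
                        import org.infinispan.client.hotrod.RemoteCacheManager;

                        public class IdlePoolCheck {
                            public static void main(String[] args) throws Exception {
                                Properties props = new Properties();
                                props.put("infinispan.client.hotrod.server_list", "localhost:11222"); // placeholder address
                                RemoteCacheManager rcm = new RemoteCacheManager(props);
                                RemoteCache<String, String> cache = rcm.getCache("testCache");        // placeholder cache name
                                cache.get("someKey");            // one request, then the client stays idle
                                Thread.sleep(30 * 60 * 1000L);   // let the pool's evictor run ~15 times (every 2 minutes)
                                rcm.stop();                      // open sockets are counted (e.g. with lsof) during the sleep
                            }
                        }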

                         

                        m.

                        • 9. Re: Re: Re: Hotrod resource leak - too many open files
                          mackerman

                          Michal, here are the logs from the scenario where there is an idle Hotrod connection, with an eviction thread running.  You can see that there have been 76 sockets created:

                           

                          [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=48987], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =5}

                           

                          Opened socket channel by thread:Timer-0 on Fri Aug 30 14:58:48 UTC 2013

                            at sun.nio.ch.SocketChannelImpl.<init>(Unknown Source)

                            at sun.nio.ch.SelectorProviderImpl.openSocketChannel(Unknown Source)

                            at java.nio.channels.SocketChannel.open(Unknown Source)

                            at org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport.<init>(TcpTransport.java:84)

                            at org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory.makeObject(TransportObjectFactory.java:57)

                            at org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory.makeObject(TransportObjectFactory.java:38)

                            at org.apache.commons.pool.impl.GenericKeyedObjectPool.addObject(GenericKeyedObjectPool.java:1729)

                            at org.apache.commons.pool.impl.GenericKeyedObjectPool.ensureMinIdle(GenericKeyedObjectPool.java:2095)

                            at org.apache.commons.pool.impl.GenericKeyedObjectPool.ensureMinIdle(GenericKeyedObjectPool.java:2060)

                            at org.apache.commons.pool.impl.GenericKeyedObjectPool.access$1600(GenericKeyedObjectPool.java:204)

                            at org.apache.commons.pool.impl.GenericKeyedObjectPool$Evictor.run(GenericKeyedObjectPool.java:2360)

                            at java.util.TimerThread.mainLoop(Unknown Source)

                            at java.util.TimerThread.run(Unknown Source)

                           

                          but no corresponding transport destroys.  You can also see that this is occurring twice every 2 minutes:

                           

                          Line 198: 2013-08-30 14:58:48,486 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=48987], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =5}

                          Line 202: 2013-08-30 14:58:49,008 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=48988], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =6}

                          Line 238: 2013-08-30 15:00:48,467 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=48996], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =7}

                          Line 242: 2013-08-30 15:00:49,008 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=48997], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =8}

                          Line 279: 2013-08-30 15:02:48,474 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=49005], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =9}

                          Line 283: 2013-08-30 15:02:49,008 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.224.166,port=11222,localport=49006], serverAddress=cache.dev.norcal.pgilab.net/10.81.224.166:11222, id =10}

                           

                          The application is creating a hotrod connection as shown in an earlier post.  We actually create connections to multiple remote caches:

                           

                          2013-08-30 14:56:49,087 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache PropertiesCache

                          2013-08-30 14:56:49,090 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache ReservationCache

                          2013-08-30 14:56:49,093 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache PhoneGroupProductCache

                          2013-08-30 14:56:49,095 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache PasscodeCache

                          2013-08-30 14:56:49,098 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache DeviceCache

                          2013-08-30 14:56:49,100 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache CallBrandingCache

                          2013-08-30 14:56:49,126 [main] INFO  [DatabaseCacheManager$RemoteCacheContainer.getRemoteCache() 87] - Connected to RemoteCache CallBrandingLookupCache

                           

                          All this application is doing is getting properties from the PropertiesCache, and then there is no further application communication with hotrod, though the connection remains active.

                           

                          Let me know if you need any further information.

                           

                          Mitchell

                          • 10. Re: Hotrod resource leak - too many open files
                            mlinhard

                            In your trace logs I can't see whether the validation of the connection was successful or not; previously your logs looked like this (with trace enabled on TcpTransport):

                            2013-08-13 15:30:25,517 [Timer-0] TRACE [TransportObjectFactory.validateObject() 83] - About to validate(ping) connection to server cache.dev.norcal.pgilab.net/10.81.225.12:11222. TcpTransport is TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net..

                            2013-08-13 15:30:25,517 [Timer-0] TRACE [TcpTransport.writeByte() 159] - Wrote byte 160

                            2013-08-13 15:30:25,522 [Timer-0] TRACE [TcpTransport.writeByte() 159] - Wrote byte 12

                            2013-08-13 15:30:25,523 [Timer-0] TRACE [TcpTransport.writeByte() 159] - Wrote byte 23

                            2013-08-13 15:30:25,523 [Timer-0] TRACE [TcpTransport.writeBytes() 145] - Wrote 0 bytes

                            2013-08-13 15:30:25,527 [Timer-0] TRACE [TcpTransport.writeByte() 159] - Wrote byte 3

                            2013-08-13 15:30:25,527 [Timer-0] TRACE [TcpTransport.writeByte() 159] - Wrote byte 0

                            2013-08-13 15:30:25,528 [Timer-0] TRACE [TcpTransport.flush() 174] - Flushed socket: Socket[addr=cache.dev.norcal.pgilab.net/10.81.225.12,port=11222,localport=53852]

                            2013-08-13 15:30:25,528 [Timer-0] TRACE [TcpTransport.readByte() 189] - Read byte -1 from socket input in Socket[addr=cache.dev.norcal.pgilab.net/10.81.225.12,port=11222,localport=53852]

                            2013-08-13 15:30:25,530 [Timer-0] TRACE [TransportObjectFactory.makeObject() 59] - Created tcp transport: TcpTransport{socket=Socket[addr=cache.dev.norcal.pgilab.net/10.81.225.12,port=11222,localport=53869], serverAddress=cache.dev.norcal.pgilab.net/10.81..

                            and from that I could tell that the ping fails and ISPN-3452 happens.

                            Could you produce trace logs with trace on TcpTransport enabled (or another category that would let us see the result of the ping operation)?  If it turns out that the ping fails, then it's ISPN-3452.

                            • 11. Re: Re: Re: Hotrod resource leak - too many open files
                              mackerman

                              Michal, TRACE level has been set on that class:

                               

                              2013-08-30 14:56:49,464 [main] INFO  [LogUtil.assignLogLevel() 55] - TRACE for 'org.infinispan.client.hotrod.impl.transport'

                              2013-08-30 14:56:49,464 [main] INFO  [LogUtil.assignLogLevel() 55] - TRACE for 'org.infinispan.client.hotrod.impl.transport.Transport'

                              2013-08-30 14:56:49,465 [main] INFO  [LogUtil.assignLogLevel() 55] - initial to TRACE for 'org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport'

                              2013-08-30 14:56:49,465 [main] INFO  [LogUtil.assignLogLevel() 55] - initial to TRACE for 'org.infinispan.client.hotrod.impl.transport.tcp.TransportObjectFactory'

                              2013-08-30 14:56:49,465 [main] INFO  [LogUtil.assignLogLevel() 55] - initial to TRACE for 'org.infinispan.client.hotrod.impl.transport.tcp.TcpTransportFactory'

                              2013-08-30 14:56:49,466 [main] INFO  [LogUtil.assignLogLevel() 55] - TRACE for 'org.infinispan.client.hotrod.impl.transport.TransportFactory'

                              • 12. Re: Hotrod resource leak - too many open files
                                mlinhard

                                Well, I'm not sure how the logging in your system works, but there is no output from the TcpTransport class in the file 10.81.228.155-blobapp.log.

                                I've noticed that there is a difference in the log-level output between the two trace logs:

                                2013-08-13 15:12:24,921 [main] INFO  [LogUtil.assignLogLevel() 55] - TRACE for 'org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport'

                                2013-08-30 14:56:49,465 [main] INFO  [LogUtil.assignLogLevel() 55] - initial to TRACE for 'org.infinispan.client.hotrod.impl.transport.tcp.TcpTransport'

                                 

                                I've no idea what that means, but that might be the configuration difference that disables the logging in the second file.

                                • 13. Re: Hotrod resource leak - too many open files
                                  mackerman

                                  Michal, I've fixed the logging in the app - the problem was that the trace flag was being cached in a static variable at class-loading time, before the logging level was set by the application.  I've changed it so the level is set in log4j.xml, so it's now being set up correctly.

                                   

                                  After running yesterday for a while, I did not observe the large file handle leak that I did before.  I'm going to watch it today and will report back.

                                   

                                  Mitchell

                                  • 14. Re: Hotrod resource leak - too many open files
                                    mackerman

                                    Michal, I've been unable to reproduce the problem with file handle leaks when the cache is not being used.  I'll keep an eye on it and let you know should it recur.  BTW, I'll be out of the office for a while and won't be updating this thread.


                                    Mitchell