Statement of problem:
The retry logic in SocketClientInvoker.transport() attempts to ensure that the invoker does not get a stale socket on the last attempt by flushing the pool. However, this only discards sockets that are currently unused; a second invoker could return a stale socket to the pool after the flush, causing the first invoker to fail again on the last retry (with an EOFException). While this seems like an edge case, we can hit it reliably under load with many connections to the same destination. I've patched this class so that the last call to getConnection() will never return a pooled connection. We never saw the EOFException after applying this fix.
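To illustrate the race being reported, here is a minimal sketch; the Pool and Conn classes and their method names are illustrative stand-ins for the Remoting internals, not the actual SocketClientInvoker API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the reported race: flushing the pool only discards sockets
// that are unused at that moment, so a stale socket returned afterwards
// can still be handed out on the final retry.
public class StalePoolRace {

    // Stand-in for a pooled socket connection.
    static class Conn {
        final boolean stale;
        Conn(boolean stale) { this.stale = stale; }
    }

    // Minimal stand-in for the invoker's connection pool.
    static class Pool {
        private final Deque<Conn> pool = new ArrayDeque<>();
        void returnConnection(Conn c) { pool.push(c); }
        void flushConnectionPool() { pool.clear(); } // discards only currently unused sockets
        Conn getConnection() {
            return pool.isEmpty() ? new Conn(false) : pool.pop();
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        pool.flushConnectionPool();            // invoker A flushes before its last retry
        pool.returnConnection(new Conn(true)); // invoker B returns a near-dead socket post-flush
        Conn c = pool.getConnection();         // invoker A's last retry receives the stale socket
        System.out.println(c.stale ? "stale socket on last retry" : "fresh socket");
    }
}
```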
My first comment:
I'm not sure I completely understand this issue.
I suspect that the reason EOFExceptions are appearing is related to some odd behavior in ObjectInputStream. It seems that once an ObjectInputStream times out, all subsequent readXXX() calls will throw an EOFException, even if the other side is alive and well.
It's certainly possible for one thread to return a socket to the pool while another thread is trying to extract one. My question is how timed-out sockets are getting back into the pool. A timeout in versionedRead() should result in a SocketTimeoutException, which would be caught by the catch (Exception ex) clause in transport(), which would throw the socket away.
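The discard path described here can be sketched as follows; this is a simplified stand-in for the structure of transport(), with FakeSocket and transportOnce() as hypothetical names, not the actual Remoting source:

```java
import java.net.SocketTimeoutException;

// Simplified sketch of the transport() catch path: a read timeout surfaces
// as a SocketTimeoutException, lands in the broad catch (Exception ex)
// clause, and the socket is closed rather than returned to the pool.
public class TimeoutDiscardSketch {

    static class FakeSocket {
        boolean closed = false;
        void close() { closed = true; }
        // Stand-in for versionedRead(): simulate a read timeout.
        Object versionedRead() throws SocketTimeoutException {
            throw new SocketTimeoutException("read timed out");
        }
    }

    // Returns true if the socket would go back to the pool, false if discarded.
    static boolean transportOnce(FakeSocket socket) {
        try {
            socket.versionedRead();
            return true;                 // success: socket is returned to the pool
        } catch (Exception ex) {         // SocketTimeoutException is caught here
            socket.close();              // timed-out socket is thrown away
            return false;
        }
    }

    public static void main(String[] args) {
        FakeSocket s = new FakeSocket();
        boolean pooled = transportOnce(s);
        System.out.println("pooled=" + pooled + " closed=" + s.closed);
    }
}
```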
Also, the MicroSocketClientInvoker.flushConnectionPool() method first appears in Remoting 2.0.0, but 1.4.6.GA is given as the "affects version".
James, if you're listening, could you supply more information?
Ah, my bad. I forgot that our library already includes patches backported from the 2.0.0 release, one of which is the flushConnectionPool() addition to SocketClientInvoker (ported from MicroSocketClientInvoker).
Anyway, a little more background on our use case. We are running a service that is hit by multiple threads on many different boxes, easily maxing out the connection pool under peak load. This reveals a number of problems in the socket pooling algorithm, which led us to backport code that fixed some of the issues. This patch applies to that backported code, so it probably really applies to the 2.0.0 release series. What I think is happening under load is that a socket near the end of its useful life is added to the pool after the flush has been called (there's heavy turnover in the pool because of the high load). The socket did not fail for the previous caller, so it is not closed, but it fails when retrieved from the pool post-flush. This causes the client to run out of retries and fail the call with an EOFException. The change I made forces the client to ignore any pooled connections on the final retry, which solved the problem in our application.
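The fix described above can be sketched like this; again, the Pool and Conn classes and the lastAttempt parameter are illustrative names, not the actual patched SocketClientInvoker code:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the patched behavior: on the final retry the pool is bypassed
// entirely, so a stale socket returned after flushConnectionPool() cannot
// be handed out on the last attempt.
public class RetrySketch {

    // Stand-in for a pooled socket connection.
    static class Conn {
        final boolean fresh;
        Conn(boolean fresh) { this.fresh = fresh; }
    }

    // Minimal pool stand-in with the patched getConnection() logic.
    static class Pool {
        private final Deque<Conn> pool = new ArrayDeque<>();
        void ret(Conn c) { pool.push(c); }
        void flush() { pool.clear(); }
        // On the last attempt, ignore pooled connections and create a new one.
        Conn getConnection(boolean lastAttempt) {
            if (!lastAttempt && !pool.isEmpty()) {
                return pool.pop();
            }
            return new Conn(true); // always a brand-new connection
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool();
        pool.flush();                      // flush before the last retry
        pool.ret(new Conn(false));         // stale socket returned post-flush by another invoker
        Conn c = pool.getConnection(true); // last attempt: pool is ignored
        System.out.println(c.fresh ? "fresh connection" : "stale connection");
    }
}
```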
James, thanks for responding. I'm not sure I understand what you mean by "a socket which is near the end of its useful life". Here's a conjecture. In release 1.4.5 we fixed a problem that allowed ServerThreads to leak, in the sense that they could end up in limbo in a non-running state. But on the client side the connection would look usable, so it would get picked for an invocation, which would eventually time out. It sounds like you have a mixture of old and new code, so I'm wondering if you have an older version of ServerThread.run().