2 Replies Latest reply on Dec 20, 2007 1:42 AM by ron_sigal

    JBREM-786:

    ron_sigal

      Statement of problem:


      The retry logic in SocketClientInvoker.transport attempts to ensure that the invoker does not get a stale socket on the last attempt by flushing the pool. However, this only discards any currently unused sockets. A second invoker could return a stale socket after the flush, causing the first invoker to fail again on the last retry (with an EOFException). While this seems like an edge case, we can hit it reliably under load with many connections to the same destination. I've patched this class so that the last call to getConnection will never return a pooled connection. We never got the EOFException after applying this fix.
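
      For reference, here is a rough sketch of the retry loop being described; it is not the actual JBossRemoting source, and MAX_RETRIES, getConnection(), and invoke() are illustrative stand-ins:

      import java.io.EOFException;
      import java.net.Socket;

      public class RetrySketch {

          static final int MAX_RETRIES = 3;

          Object transport(Object request) throws Exception {
              for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                  if (attempt == MAX_RETRIES - 1) {
                      // The flush only discards sockets that are idle in the pool at
                      // this instant; a second invoker can return a stale socket a
                      // moment later, and getConnection() below may hand it back.
                      flushConnectionPool();
                  }
                  Socket socket = getConnection();
                  try {
                      return invoke(socket, request);
                  } catch (Exception e) {
                      socket.close();   // throw the bad socket away and retry
                  }
              }
              throw new EOFException("all retries exhausted");
          }

          // --- stubs standing in for the real pooling/invocation code ---
          void flushConnectionPool() { /* discard currently idle pooled sockets */ }
          Socket getConnection() throws Exception { return new Socket(); }
          Object invoke(Socket socket, Object request) throws Exception { return request; }
      }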


      My first comment:


      I'm not sure I completely understand this issue.

      I suspect that the reason EOFExceptions are appearing is related to some odd behavior in ObjectInputStream. It seems that once an ObjectInputStream times out, all subsequent readXXX() calls will throw an EOFException, even if the other side is alive and well.

      It's certainly possible for one thread to return a socket to the pool while another thread is trying to extract one. My question is how timed-out sockets are going back into the pool. A timeout in versionedRead() should result in a SocketTimeoutException, which would be caught by the catch (Exception ex) clause in transport(), which would throw the socket away.
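
      For illustration, here is a minimal sketch of the pattern I'd expect, assuming a plain ObjectInputStream over a socket with SO_TIMEOUT set; this is not the actual versionedRead() implementation:

      import java.io.ObjectInputStream;
      import java.net.Socket;
      import java.net.SocketTimeoutException;

      public class TimeoutSketch {

          Object readResponse(Socket socket, int timeoutMillis) throws Exception {
              socket.setSoTimeout(timeoutMillis);
              ObjectInputStream in = new ObjectInputStream(socket.getInputStream());
              try {
                  return in.readObject();
              } catch (SocketTimeoutException e) {
                  // Once a read has timed out, the stream can no longer be trusted:
                  // later readXXX() calls may fail with EOFException even though
                  // the peer is fine. Close the socket so it never re-enters the pool.
                  socket.close();
                  throw e;
              }
          }
      }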

      Also, the MicroSocketClientInvoker.flushConnectionPool() method first appears in Remoting 2.0.0, but 1.4.6.GA is given as the "affects version".

      James, if you're listening, could you supply more information?


      James's reply


      Ah, my bad. I forgot that our library already includes patches backported from the 2.0.0 release, one of which is the flushConnectionPool() addition to SocketClientInvoker (ported from MicroSocketClientInvoker).

      Anyway, a little more background on our use case. We are running a service that is getting hit by multiple threads on many different boxes, easily maxing out the connection pool under peak load. This reveals a number of problems in the socket pooling algorithm, which led us to backport code that fixed some of the issues. This patch applies to that backported code, so it probably really belongs in the 2.0.0 release series.

      What I think is happening under load, however, is that a socket which is near the end of its useful life is added to the pool after the flush has been called (there is heavy turnover in the pool because of the high load). The socket did not fail for the previous caller, so it is not closed, but it fails when retrieved from the pool post-flush. This causes the client to run out of retries and fail the call with an EOFException. The changes I made force the client to ignore any pooled connections on the final retry, which solved the problem in our application.


      James, thanks for responding. I'm not sure I understand what you mean by "a socket which is near the end of its useful life". Here's a conjecture. In release 1.4.5 we fixed a problem that allowed ServerThreads to leak, in the sense that they could end up in limbo in a non-running state. But on the client side the connection would look usable, so it would get picked for an invocation, which would eventually time out. It sounds like you have a mixture of old and new code, so I'm wondering if you have an older version of ServerThread.run().

        • 1. Re: JBREM-786:
          jahlborn

          Hi,
          Sorry I haven't replied in a while. I kind of got away from this for a bit. Anyway, we certainly have a mix of code. Just FYI, we are using JBoss 4.0.5.GA in production, so my changes are built on top of the Remoting version in that release, which I believe is actually JBossRemoting 1.4.3.

          To respond to your question about the leaky connections, I believe we have that fix: that is the change which extends the "synchronized(this)" block up around the clientpool updates in ServerThread.run, correct?

          In our application, we have a high number of relatively quick remote calls happening on ~60 threads in each client app server. Since we had ~10 client servers running ~60 threads all banging on the same remote server, we easily hit the 300-thread maximum in the server thread pool. This means that new requests to the server can cause "evict()" to be called on the clientpool (in SocketServerInvoker). This eviction will cause a socket to be closed after an in-flight call ends. On the client side, this will cause a pooled socket to be fine for one call, and then throw an EOFException the next time it is used.

          With the retry logic backported, it would seem highly unlikely that a single request would see three closed client sockets in a row (especially with the "flush" call before the last attempt). However, when we fired up the system at full speed, we would see at least one of these complete failures per "job" (which could be ~100k remote calls to the remote server among all the client boxes).
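
          To illustrate the client-side symptom, here is a minimal sketch; it is not JBossRemoting code, and the call() helper and wire format are hypothetical:

          import java.io.EOFException;
          import java.io.ObjectInputStream;
          import java.io.ObjectOutputStream;
          import java.net.Socket;

          public class StaleSocketSymptom {

              // 'pooled' is a socket taken from the client-side pool. If the server
              // evicted and closed this connection after the previous call, the write
              // may still appear to succeed, but the read hits end-of-stream and an
              // EOFException surfaces here; that is the failure described above.
              Object call(Socket pooled, Object request) throws Exception {
                  ObjectOutputStream out = new ObjectOutputStream(pooled.getOutputStream());
                  out.writeObject(request);
                  out.flush();
                  ObjectInputStream in = new ObjectInputStream(pooled.getInputStream());
                  try {
                      return in.readObject();
                  } catch (EOFException dead) {
                      pooled.close();   // stale connection: discard it instead of re-pooling
                      throw dead;
                  }
              }
          }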

          The flushConnectionPool() call on the last attempt was obviously meant to mitigate this problem; however, it still does not guarantee that the client will get a fresh connection on the next attempt, it only makes it more likely. Apparently, in our application, that was not good enough, as we still saw the issue (albeit infrequently). When I implemented the changes included in the patch, however, the problem stopped occurring at all.
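
          For reference, a rough sketch of the change, assuming a hypothetical getConnection(boolean bypassPool) variant; the real patch differs in its details:

          import java.net.Socket;

          public class LastAttemptConnection {

              // Hypothetical variant of getConnection(): when bypassPool is true the
              // pool is ignored entirely and a brand-new socket is opened, so the
              // final retry can never pick up a connection that went stale after
              // the flush.
              Socket getConnection(boolean bypassPool, String host, int port) throws Exception {
                  if (bypassPool) {
                      return new Socket(host, port);   // always fresh on the last attempt
                  }
                  return takeFromPoolOrCreate(host, port);
              }

              // stub standing in for the normal pooled lookup
              Socket takeFromPoolOrCreate(String host, int port) throws Exception {
                  return new Socket(host, port);
              }
          }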

          It seems clear, though, that this patch would be more appropriate for the 2.0 code base, not the 1.4.x code base (unless there is more backporting planned).

          • 2. Re: JBREM-786:
            ron_sigal

            Hi James,

            Thanks for sticking around, and thanks for the elucidation. Yes. I get it. I'll do it. Yes.

            "Molly Bloom" wrote:

            yes I said yes I will Yes.