I am trying to understand the routing behavior of mod_cluster 1.1.0.Final (Tomcat 6.0.20 with the AJP connector, Apache/2.2.15) in the presence of requests that take a very long time.
I am seeing that requests time out after 5 minutes of no response. I am not sure where this timeout is set (I found no references to a 300-second default anywhere in the docs), but it is not a big deal for me, as 5 minutes is a reasonable (even generous) limit in my opinion.
What happens after the request times out is more curious: the whole worker becomes unavailable for subsequent request routing for 20-30 seconds. This is easy to demonstrate in a configuration with a single worker: when it starts, up to 5 idle AJP connections are created to the worker. When I submit a long-running request, the worker responds to short requests normally while it is pending (after all, only one AJP connection is tied up). As I continue to ping the application with short requests, they succeed until the long request times out. From that point on, for a period of 20-30 seconds (the time varies quite a bit, in some experiments from 5 up to 60 seconds), every request to this context returns "503 Service Temporarily Unavailable", as if the worker were down.
- At the point of timeout, the worker is still up and there are 4 remaining AJP connections from httpd to tomcat that could continue serving requests. What is the rationale behind marking the whole worker in "error" state?
- What is the recovery mechanism in this case? Why is it taking tens of seconds?
- I see no related messages in error_log at "debug" level. Are there any other ways to track this behavior?
I am attaching a trivial war file with a "sleep" jsp and a shell script to trigger it; hopefully it will help to reproduce the condition easily. The client script submits a long wait and, after that returns, prints a timestamp and submits a series of small pings until the first successful one. What I would hope to see is that the very first ping succeeds very shortly (tens of milliseconds) after the long request has timed out with a 500. What I see instead is that a number of subsequent pings return 503s one after another, and by the time a ping finally succeeds, the final timestamp is about 30 seconds after the first 500 response.
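For anyone who cannot open the attachment, the client logic described above can be sketched roughly as follows in Python. This is not the attached shell script, only an illustration of the same steps. The URLs, the `seconds` parameter, and the names `http_status`, `measure_recovery`, and `main` are hypothetical and would need to be adapted to the actual war file.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=900):
    """Return the HTTP status code of a GET, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on error statuses; the code is what we want to record.
        return err.code


def measure_recovery(fetch, long_url, ping_url, clock=time.time,
                     pause=time.sleep, interval=0.5, max_pings=1000):
    """Send one long request, then ping until the first 200 response.

    Returns (long_status, failed_ping_statuses, seconds_until_first_success).
    The last element is None if no ping succeeded within max_pings attempts.
    """
    long_status = fetch(long_url)   # expected to come back as 500 after the timeout
    started = clock()               # timestamp taken right after the long request returns
    failures = []
    for _ in range(max_pings):
        status = fetch(ping_url)
        if status == 200:
            return long_status, failures, clock() - started
        failures.append(status)     # e.g. 503 while the worker is marked unavailable
        pause(interval)
    return long_status, failures, None


def main():
    # Hypothetical URLs: adjust host, context path, and JSP names to the deployment.
    base = "http://localhost/sleep"
    long_status, failures, elapsed = measure_recovery(
        http_status, base + "/sleep.jsp?seconds=600", base + "/sleep.jsp?seconds=0")
    print("long request status:", long_status)
    print("failed pings:", failures)
    print("seconds until first successful ping:", elapsed)
```

Calling `main()` against a running setup prints the status of the long request, the statuses of the failed pings, and the time until the first successful ping. The request function, clock, and sleep are passed in as arguments so the loop can be exercised without a live server.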
Many thanks for any help.