Ping timeout acting as worker timeout
akarl Feb 11, 2010 3:43 PMI have a transaction which can take longer than 10 seconds to process and yes this is part of the design and yes to me this is ok. Now what I'm seeing is that there is a potential problem with the http://www.jboss.org/mod_cluster/java/properties.html#ping setting and how it is applied. When my transactions take longer than 10 seconds I begin to see the following when using 1.0.3.GA.
{code}[Thu Feb 11 13:20:38 2010] [error] ajp_cping_cpong: apr_socket_recv failed
[Thu Feb 11 13:23:11 2010] [error] proxy: dialog to 10.182.13.58:8009 (10.182.13.58) failed [Thu Feb 11 13:23:16 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state [Thu Feb 11 13:23:20 2010] [error] proxy: ajp: disabled connection for (10.182.13.58){code}
on 1.0.0.GA the error is similar but there is one significant difference.
{code}[Thu Feb 11 12:56:41 2010] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[Thu Feb 11 12:56:41 2010] [error] ajp_read_header: ajp_ilink_receive failed [Thu Feb 11 12:56:41 2010] [error] (120006)APR does not understand this error code: proxy: read response failed from 10.182.13.58:8009 (10.182.13.58){code}
Note that in the 1.0.0 error the node is NOT disabled.
There is a lot going on here. I think that the fundamental problem is that the Ping/Pong functionality seems to be tied to the length of the transaction with the node. I believe that this is not inline with the design which specifies that the Ping/Pong happens prior to the transaction with the node and that is further backed up by the existence of a workerTimeout setting.
Now if I change the ping setting to 30000 (by the way the documentation is misleading in that it claims a default of 10 which is not true, the default is in milliseconds and so should be 10000) my transactions can last up to 30 seconds. I've tested this rather extensively using all sorts of different transaction lengths (I wrote a web service which lasts a variable length depending on a passed parameter) and everything points to the fact that ping timing is actually more of a transaction timeout.
The really nasty piece here is that in the 1.0.3 log you can see that not only did my transaction fail but the node is disabled as well. This means that a single transaction can bring down the entire node simply by taking too long.
I didn't see this description anywhere in current issues or posts but I think this could easily be related to some of the other issues with 500 and 503 errors being tossed by mod_cluster.