4 Replies Latest reply on Apr 2, 2010 8:47 PM by gnome

What can cause - All workers are in error state

akarl Feb 24, 2010 12:39 PM

[Wed Feb 24 06:54:47 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state

I'm currently running mod_cluster 1.0GA, but this is more of a general question. I see the error message above from time to time. Specifically, it happened last night for about 15 seconds and affected all incoming connections. What are the known reasons for this to happen?

More detail: I have 2 app servers connecting via the HA AJP method. During that same period I see STATUS messages succeeding from the master app server node, so connectivity should not have been an issue. I also don't see any red flags on the app servers themselves. There are no errors in their logs at that time, and their CPUs were not loaded (I am using average load as my balancing metric).

1. Re: What can cause - All workers are in error state

jfclere Feb 25, 2010 6:15 AM (in response to akarl)

Any other error messages before those "All workers are in error state" messages?
2. Re: What can cause - All workers are in error state

akarl Feb 25, 2010 10:54 AM (in response to jfclere)

Nothing else in the error log. There were 9 instances of the message above, which correspond to 9 HTTP calls made at that time, all of which failed with error code 503. They were the only errors in the log. In the access log I saw STATUS messages succeeding during this time period.
3. Re: What can cause - All workers are in error state

altes-kind Mar 24, 2010 3:44 PM (in response to jfclere)

I'm having the same issue with the JBoss CirrAS Images:

[Wed Mar 24 15:36:23 2010] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[Wed Mar 24 15:36:23 2010] [error] ajp_handle_cping_cpong: ajp_ilink_receive failed
[Wed Mar 24 15:36:23 2010] [error] (120006)APR does not understand this error code: proxy: AJP: cping/cpong failed to 10.192.178.166:8009 (10.192.178.166)
[Wed Mar 24 15:36:23 2010] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[Wed Mar 24 15:36:23 2010] [error] ajp_handle_cping_cpong: ajp_ilink_receive failed
[Wed Mar 24 15:36:23 2010] [error] (120006)APR does not understand this error code: proxy: AJP: cping/cpong failed to 10.192.178.166:8009 (10.192.178.166)
[Wed Mar 24 15:36:23 2010] [error] (70007)The timeout specified has expired: ajp_ilink_receive() can't receive header
[Wed Mar 24 15:36:23 2010] [error] ajp_handle_cping_cpong: ajp_ilink_receive failed
[Wed Mar 24 15:36:23 2010] [error] (120006)APR does not understand this error code: proxy: AJP: cping/cpong failed to 10.215.18.195:8009 (10.215.18.195)
[Wed Mar 24 15:36:23 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state
[Wed Mar 24 15:36:24 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state
[Wed Mar 24 15:36:24 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state
[Wed Mar 24 15:36:24 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state
[Wed Mar 24 15:36:24 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state

Any ideas?
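
A log excerpt like the one above can be summarized with a short script, which makes it easier to see which backend failed its cping/cpong check and how many requests were rejected per second. This is only a rough sketch: the regular expressions assume the Apache error-log layout quoted in this thread, and `summarize` and `SAMPLE` are hypothetical names made up for illustration.

```python
import re
from collections import Counter

# Assumes the error-log layout quoted in this thread:
# "[<timestamp>] [error] <message>"
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] \[error\] (?P<msg>.*)$")
CPING_RE = re.compile(r"cping/cpong failed to (?P<backend>\S+:\d+)")
ALL_DOWN = "All workers are in error state"


def summarize(lines):
    """Count cping/cpong failures per backend and 'all workers' errors per timestamp."""
    cping_failures = Counter()
    all_down = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not an [error] line
        msg = m.group("msg")
        hit = CPING_RE.search(msg)
        if hit:
            cping_failures[hit.group("backend")] += 1
        elif ALL_DOWN in msg:
            all_down[m.group("ts")] += 1
    return cping_failures, all_down


# A subset of the log lines quoted above, used as sample input.
SAMPLE = """\
[Wed Mar 24 15:36:23 2010] [error] (120006)APR does not understand this error code: proxy: AJP: cping/cpong failed to 10.192.178.166:8009 (10.192.178.166)
[Wed Mar 24 15:36:23 2010] [error] (120006)APR does not understand this error code: proxy: AJP: cping/cpong failed to 10.192.178.166:8009 (10.192.178.166)
[Wed Mar 24 15:36:23 2010] [error] (120006)APR does not understand this error code: proxy: AJP: cping/cpong failed to 10.215.18.195:8009 (10.215.18.195)
[Wed Mar 24 15:36:23 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state
[Wed Mar 24 15:36:24 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state
[Wed Mar 24 15:36:24 2010] [error] proxy: CLUSTER: (balancer://mycluster). All workers are in error state
""".splitlines()

if __name__ == "__main__":
    failures, rejected = summarize(SAMPLE)
    for backend, count in failures.items():
        print(f"cping/cpong failures for {backend}: {count}")
    for ts, count in rejected.items():
        print(f"{ts}: {count} request(s) hit 'All workers are in error state'")
```

If every backend shows cping/cpong failures immediately before the "All workers" lines, the balancer had already marked all workers as failed when those requests arrived.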
4. Re: What can cause - All workers are in error state

gnome Apr 2, 2010 8:47 PM (in response to altes-kind)

Hi

Did it cause your application to fail? I see the same problem when httpd starts, but after roughly 20-30 seconds the error no longer appears. It has never crashed my application, so I just ignore it.

This is what I know about the above issue (courtesy of Paul Ferraro):

"This is probably due to the short period of time (i.e. 17 seconds) between when the CONFIG command is sent to the proxy (after the WebServer is started), and when the AJP connector (over which ajp_cping_cpong operates) is itself started. In the AS, the connectors are the very last things to start - this is evident in the AS log. So, for the time being - ignore these messages. A workaround for this would be to defer all mod_cluster startup until the connectors are started."
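
One way to check from the outside whether the AJP connector is already answering is to send it the same kind of CPing that httpd sends. The sketch below is not how mod_cluster does it and is not the workaround mentioned in the quote; it is only a diagnostic, assuming the standard AJP13 CPing/CPong packets. The function names, timeouts, and polling interval are hypothetical choices for illustration.

```python
import socket
import time

# AJP13 CPing sent to the container: magic 0x1234, payload length 1, type 10.
CPING = bytes([0x12, 0x34, 0x00, 0x01, 0x0A])
# AJP13 CPong sent back by the container: "AB", payload length 1, type 9.
CPONG = bytes([0x41, 0x42, 0x00, 0x01, 0x09])


def ajp_cping(host, port, timeout=2.0):
    """Return True if the AJP connector answers a CPing with a CPong."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(CPING)
            reply = b""
            # Read until the full 5-byte CPong has arrived or the peer closes.
            while len(reply) < len(CPONG):
                chunk = sock.recv(len(CPONG) - len(reply))
                if not chunk:
                    break
                reply += chunk
            return reply == CPONG
    except OSError:
        # Covers refused connections and timeouts (connector not started yet).
        return False


def wait_for_ajp(host, port, deadline_s=60.0, interval_s=1.0):
    """Poll until the connector answers a CPing or the deadline passes."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if ajp_cping(host, port):
            return True
        time.sleep(interval_s)
    return False
```

For example, calling `wait_for_ajp("10.192.178.166", 8009)` right after starting the AS would show roughly how long the connector takes to come up. If that delay matches the window in which the errors appear, the startup gap is a plausible explanation.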

Maybe the same reason is behind your error as well.

Hope this helps.

Neeraj