10 Replies Latest reply on Sep 24, 2009 12:24 PM by kilgoretroutjr

404 and

kilgoretroutjr Sep 10, 2009 1:41 PM

Hi,

I'm evaluating mod_cluster with apache 2.2 and JBoss EAP 4.3, and I'm getting some strange behavior.
After configuring everything, all seems to work fine at first, my requests are properly balanced. But as soon as I start to stress-test the config with the included demo-client, I find that after a while some request are failing with 404 errors. The failures occur in a bursty style (all failures happen at the very same moment - or at least in the same second).
After thousands of correct responses I get some 404-s to a load-balanced url, and one second later everything gets back to normal. This behavior occurs for one second every 1-3 minutes. There is no sign of any overload anywhere (OS, network, etc.).
After turning on the debug level logs in apache, I've found some strange messages, that seems to be related to the problem, and which I can't explain:

[Thu Sep 10 19:12:49 2009] [debug] mod_proxy_cluster.c(933): get_balancer_by_node balancer NOT found

Before and after these burst the behavior is as expected and the log says

[Thu Sep 10 19:11:42 2009] [debug] mod_proxy_cluster.c(914): get_balancer_by_node found context /load-demo

My config is the minimal config found on the wiki. I've tried prefork and worker mpm modes in apache, it doesn't seem to make a difference.

Any clues on this?

Thanks,
Gabor

1. Re: 404 and

jfclere Sep 11, 2009 2:50 AM (in response to kilgoretroutjr)

It seems you loose the connection between the nodes and httpd.
Looks for CONFIG in the access_log file when the error happens.

If that the case, in the VirtualHost that receives the MCMP messages.
Add the following:
KeepAliveTimeout 60
MaxKeepAliveRequests 0
Actions

2. Re: 404 and

kilgoretroutjr Sep 15, 2009 2:18 PM (in response to kilgoretroutjr)

Thanks for the tip, it didn't work though :-(
Further filtering the debug level log messages shows an interesting symptom.It seems like at time of the 404 errors the "update_workers_node" process works differently than other times...

Here is what i think is normal:

[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(575): update_workers_node starting
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(390): add_workers_node: Create balancer balancer://mycluster
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(228): Created: worker for ajp://0.0.0.0:8209
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(323): proxy: initialized single connection worker 1 in child 10616 for (0.0.0.0)
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(364): Created: worker for ajp://0.0.0.0:8209 1 (status): 1
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(446): add_workers_node done
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(407): add_workers_node: Using balancer balancer://mycluster
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(228): Created: worker for ajp://0.0.0.0:8009
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(323): proxy: initialized single connection worker 2 in child 10616 for (0.0.0.0)
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(364): Created: worker for ajp://0.0.0.0:8009 2 (status): 1
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(446): add_workers_node done
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(596): update_workers_node done
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(575): update_workers_node starting
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(407): add_workers_node: Using balancer balancer://mycluster
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(446): add_workers_node done
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(407): add_workers_node: Using balancer balancer://mycluster
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(446): add_workers_node done
[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(596): update_workers_node done

and these runs seem faulty to me:

[Tue Sep 15 19:50:00 2009] [debug] proxy_util.c(1806): proxy: grabbed scoreboard slot 0 in child 10639 for worker proxy:reverse
[Tue Sep 15 19:50:00 2009] [debug] proxy_util.c(1825): proxy: worker proxy:reverse already initialized
[Tue Sep 15 19:50:00 2009] [debug] proxy_util.c(1922): proxy: initialized single connection worker 0 in child 10639 for (*)
[Tue Sep 15 19:50:00 2009] [debug] mod_proxy_cluster.c(575): update_workers_node starting
[Tue Sep 15 19:50:00 2009] [debug] mod_proxy_cluster.c(596): update_workers_node done
[Tue Sep 15 19:50:00 2009] [error] [client 127.0.0.1] File does not exist: /htdocs
[Tue Sep 15 19:50:00 2009] [debug] mod_proxy_cluster.c(575): update_workers_node starting
[Tue Sep 15 19:50:00 2009] [debug] mod_proxy_cluster.c(596): update_workers_node done
[Tue Sep 15 19:50:00 2009] [error] [client 127.0.0.1] File does not exist: /htdocs
[Tue Sep 15 19:50:00 2009] [debug] mod_proxy_cluster.c(575): update_workers_node starting
[Tue Sep 15 19:50:00 2009] [debug] mod_proxy_cluster.c(596): update_workers_node done

You can see the reason of the 404s as well (instead of sending the request to one of the workers, apache is trying to find the files on the local file system...)

And this is the filter I've used to find the supposedly relevant parts:

egrep -v "ajp_marshal_into_msgb|ajp_unmarshal_response|got 0 bytes of data|APR_BUCKET_IS_EOS|ajp_ilink_received|ajp_parse_type: got|get_balancer_by_node found context /load-demo|proxy: data to read|proxy: got response from 0.0.0.0:8009|for stickysession JSESSIONID|ajp_handle_cping_cpong|AJP: has released connection for|proxy: byrequests balancer DONE|Running scheme balancer handler \(attempt 0\)|proxy: got response from 0.0.0.0:8209|get_balancer_by_node found context"

I guess that the reason for the 404s might be this misbehavior of update_workers_node, but I can't see the reason for it...
Any thoughts?

3. Re: 404 and

jfclere Sep 16, 2009 2:29 AM (in response to kilgoretroutjr)

"[Tue Sep 15 19:49:59 2009] [debug] mod_proxy_cluster.c(364): Created: worker for ajp://0.0.0.0:8209" that is MODCLUSTER-91
You need to have a valid address in the Connector of server.xml
Actions
4. Re: 404 and

kilgoretroutjr Sep 16, 2009 8:29 AM (in response to kilgoretroutjr)

Thank for the tip, but that didn't help either. The behavior is just the same.

What we have found is that setting the backgroundProcessorDelay parameter in server.xml to 10 seems to make the error less common but still existent. As we're using JBoss EAP 4.3, the default value for this was -1 according to the documentation. We did not explicitly define this before.
The pattern is all the same: normal behavior, then lots of 404-s for a second, then everything gets back to normal for minutes again. The length of the normal period seems to be random...

We've already tested lots of different configs (RHEL/Ubuntu, Apache 2.2.7/Apache 2.2.13, Apache prefork/worker mode, 1/2 working nodes, VMWare/KVM/raw metal, etc.), but all of these configs show the same error.

Any more ideas, or debugging methods?

Thanks in advance,
Gabor
Actions
5. Re: 404 and

jfclere Sep 22, 2009 10:03 AM (in response to kilgoretroutjr)

I will just ask you answer the question I asked some time ago:
"It seems you loose the connection between the nodes and httpd.
Looks for CONFIG in the access_log file when the error happens."

Do you see CONFIG in the access_log file at the time of error?
Actions
6. Re: 404 and

kilgoretroutjr Sep 22, 2009 10:40 AM (in response to kilgoretroutjr)

Sorry I've skipped that question: no, I don't see any CONFIG-s around these times.
Actions
7. Re: 404 and

jfclere Sep 23, 2009 5:37 AM (in response to kilgoretroutjr)

Do you use MaxRequestsPerChild in httpd.conf?

An interesting try would be to use different value of MaxRequestsPerChild to see if lowering its value increase the 404 errors (note that MaxRequestsPerChild default is 10000).
Actions
8. Re: 404 and

kilgoretroutjr Sep 23, 2009 1:56 PM (in response to kilgoretroutjr)

The MaxRequestsPerChild setting was explicitly 0 (no max requests) in my config, but changing this to an extreme low (1), "normal" (10000) or extreme high but not infinite (1000000) value did not change the behavior. Using the demo client the number of live client threads stabilizes at around 20-30, all others fall out with 404 within the first few minutes of the stress testing. After some time (talking about hours) each of the clients get a 404 and go dead.
My client settings are: 80 clients, 40 second startup time, 100ms sleep time.

The probability of the occurrences seems to be rising with the load, so it seems like a threading or performance issue(?)... the strange thing is that I don't see any bottlenecks with any of my monitoring tools (JBoss, Apache, OS and so on), so it's probably threading.
The other unambiguous characteristic of the symptom is that it's bursty. Lots of 404's come in the very same second.

Thanks for the help so far, I'd be glad if you had more suggestions. I'd really like to put mod_cluster to production (my company is adopting lots of JBoss products), but this issue needs to be resolved first.
Actions
9. Re: 404 and

jfclere Sep 24, 2009 2:35 AM (in response to kilgoretroutjr)

If the problem also occurs on prefork it isn't a threading issue but something else...

Could you please open a mod_cluster JIRA and attach the error_log and access_log files to the JIRA..
Actions
10. Re: 404 and

kilgoretroutjr Sep 24, 2009 12:24 PM (in response to kilgoretroutjr)

This is MODCLUSTER-102 now: https://jira.jboss.org/jira/browse/MODCLUSTER-102
Please let me know if you need more information.
Actions

Go to original post