
    WildFly 8.2 is intermittently terminating connections to Apache

    mchughtai

      We recently upgraded our production environment from JBoss 4.2.2 to WildFly 8.2.0. We have several instances (nodes) of WildFly, running in standalone mode, spread across two Windows servers. We are using Apache 2.2 with mod_jk as a load balancer to manage traffic between the nodes. Minutes after a deployment completes, we begin to see the following errors in the Apache/mod_jk logs:
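
      For context, our load-balancer definition looks roughly like this (the worker names match the logs below; the hosts, ports, and second node are illustrative placeholders, and the real topology has several nodes per server):

        # workers.properties (sketch; values are placeholders)
        worker.list=loadbalancer_A
        worker.loadbalancer_A.type=lb
        worker.loadbalancer_A.balance_workers=node1,node2

        worker.node1.type=ajp13
        worker.node1.host=x.x.x.x
        worker.node1.port=8009

        worker.node2.type=ajp13
        worker.node2.host=x.x.x.x
        worker.node2.port=8010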

       

      [Wed Apr 15 15:59:05 2015][10820:2940] [info] mod_jk.c (2615): Service error=0 for worker=loadbalancer_A

      [Wed Apr 15 15:59:05 2015][10820:2940] [info] jk_lb_worker.c (1436): Forcing recovery once for 1 workers

      [Wed Apr 15 15:59:05 2015][10820:2940] [info] jk_ajp_common.c (1143): (node1) can't receive the response header message from tomcat, tomcat (x.x.x.x:8009) has forced a connection close for socket 4048

      [Wed Apr 15 15:59:05 2015][10820:2940] [error] jk_ajp_common.c (1962): (node1) Tomcat is down or refused connection. No response has been sent to the client (yet)

      [Wed Apr 15 15:59:05 2015][10820:2940] [info] jk_ajp_common.c (2447): (node1) sending request to tomcat failed (recoverable),  (attempt=1)

      [Wed Apr 15 15:59:05 2015][10820:2940] [error] jk_ajp_common.c (2466): (node1) connecting to tomcat failed.

      [Wed Apr 15 15:59:05 2015][10820:2940] [info] jk_lb_worker.c (1384): service failed, worker node1 is in error state

      [Wed Apr 15 15:59:05 2015][10820:2940] [info] jk_lb_worker.c (1453): All tomcat instances failed, no more workers left (attempt=1, retry=0)

      [Wed Apr 15 15:59:05 2015][10820:2940] [info] jk_lb_worker.c (1453): All tomcat instances failed, no more workers left (attempt=0, retry=1)

      [Wed Apr 15 15:59:05 2015][10820:2940] [info] jk_lb_worker.c (1453): All tomcat instances failed, no more workers left (attempt=1, retry=1)

      [Wed Apr 15 15:59:05 2015][10820:2940] [info] jk_lb_worker.c (1464): All tomcat instances are busy or in error state

      [Wed Apr 15 15:59:05 2015][10820:2940] [error] jk_lb_worker.c (1469): All tomcat instances failed, no more workers left

       

      When the connection is closed, Apache puts the WildFly node in an error state and redirects the request to one of the other nodes. The original node recovers and begins processing requests again. The problem is that even though the connection was closed, WildFly itself logged no errors, and the original request was processed by our system. When Apache redirects the original request to a second node, that request is also processed, so we end up processing the same request twice. The redirect from Apache happens within milliseconds, so it's difficult for us to catch the duplicate (a crude detection filter we're considering is sketched below). We have only seen this issue in our production environment; we haven't been able to replicate it in our staging/sandbox environments.
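
      To at least catch the duplicates when they happen, we're considering a logging filter along these lines. This is only a sketch: the duplicate signature (method + URI + query string + client address within a two-second window) is our assumption about what a mod_jk failover retry would look like from inside the application, and it will false-positive on legitimate rapid repeats:

        import java.io.IOException;
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        import javax.servlet.Filter;
        import javax.servlet.FilterChain;
        import javax.servlet.FilterConfig;
        import javax.servlet.ServletException;
        import javax.servlet.ServletRequest;
        import javax.servlet.ServletResponse;
        import javax.servlet.http.HttpServletRequest;

        // Sketch: log a warning when the "same" request arrives twice within a
        // short window, which is what a mod_jk failover retry should look like.
        public class DuplicateRequestLogFilter implements Filter {

            private static final long WINDOW_MS = 2000; // retries arrive within ms
            private final Map<String, Long> recent = new ConcurrentHashMap<String, Long>();

            public void init(FilterConfig config) { }

            public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                    throws IOException, ServletException {
                HttpServletRequest http = (HttpServletRequest) req;
                // Assumed duplicate signature: method + URI + query + client address.
                // getQueryString() may be null; "null" in the key is harmless here.
                String key = http.getMethod() + ' ' + http.getRequestURI()
                        + '?' + http.getQueryString() + '@' + req.getRemoteAddr();
                long now = System.currentTimeMillis();
                Long prev = recent.put(key, now);
                if (prev != null && now - prev < WINDOW_MS) {
                    System.err.println("Possible mod_jk retry (seen " + (now - prev)
                            + "ms ago): " + key);
                }
                // Note: entries are never evicted in this sketch; a real filter
                // would purge old keys periodically.
                chain.doFilter(req, res);
            }

            public void destroy() { }
        }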

       

      We've tried the following configuration changes to address the issue, but none have helped (the settings we touched are sketched after the list):

      - increase the mod_jk ping_timeout

      - increase the mod_jk socket_timeout

      - set socket_keep_alive=true (even though there is no firewall between Apache and WildFly)

      - increase the database connection pool size

      - increase the thread count on the io subsystem in WildFly
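
      For reference, the mod_jk worker settings we changed look roughly like this (the values shown are the ones we tried, not recommendations):

        # workers.properties fragment (illustrative values)
        worker.node1.ping_mode=A            # probe connections; ping_timeout only
                                            # applies when ping_mode is enabled
        worker.node1.ping_timeout=20000     # ms; raised from the 10000 default
        worker.node1.socket_timeout=60      # seconds
        worker.node1.socket_keep_alive=true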

       

      I've tried monitoring the WildFly nodes through VisualVM, reviewed thread dumps, and increased logging, but I haven't been able to find anything that would explain why WildFly is terminating the connection. The CLI reads I've been using to check the listener and worker settings are below.
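
      For completeness, this is roughly how I've been reading the AJP listener and io worker settings per node with jboss-cli (this assumes the default standalone configuration and an AJP listener named "ajp"; adjust the names to your setup):

        # jboss-cli.bat --connect, then:
        /subsystem=undertow/server=default-server/ajp-listener=ajp:read-resource(include-runtime=true)
        /subsystem=io/worker=default:read-attribute(name=io-threads)
        /subsystem=io/worker=default:read-attribute(name=task-max-threads)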

       

      Has anyone else experienced a similar issue? If so, how did you go about resolving it? Any help would be greatly appreciated.