
    mod_cluster not dropping workers after abrupt stop.

    mriv

      During a graceful shutdown, mod_cluster removes dead workers relatively quickly, as expected. But when JBoss EAP 5 is abruptly killed (kill -9 or an unexpected error), the host and its workers remain in mod_cluster-manager for 5 minutes.

       

      I am running:

      EAP 5.2.0 with mod_cluster-1.0.10.GA_CP04

      httpd-2.2.22-14.ep6.el6.x86_64 with mod_cluster-native-1.2.3-2.Final.ep6.el6.x86_64 (the RHEL-native httpd EP mod_cluster libraries)

       

       

      Here are the httpd logs during the abrupt EAP kill:

      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(1949): proxy: ajp: retrying the worker for (10.1.1.10)
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(1955): proxy: ajp: worker for (10.1.1.10) has been marked for retry
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(2011): proxy: ajp: has acquired connection for (10.1.1.10)
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(2067): proxy: connecting ajp://10.1.1.10:8109/ to 10.1.1.10:8109
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(2193): proxy: connected / to 10.1.1.10:8109
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(2444): proxy: ajp: fam 2 socket created to connect to 10.1.1.10
      [Fri Jun 21 12:26:27 2013] [error] (111)Connection refused: proxy: ajp: attempt to connect to 10.1.1.10:8109 (10.1.1.10) failed
      [Fri Jun 21 12:26:27 2013] [error] ap_proxy_connect_backend disabling worker for (10.1.1.10)
      [Fri Jun 21 12:26:27 2013] [debug] mod_proxy_cluster.c(1568): proxy_cluster_try_pingpong: can't connect to backend
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(2029): proxy: ajp: has released connection for (10.1.1.10)

      (The same ten-line retry cycle repeats every 5 seconds: 12:26:32, 12:26:37, 12:26:42, 12:26:47, 12:26:52, and so on...)

      [Fri Jun 21 12:31:22 2013] [error] (111)Connection refused: proxy: ajp: attempt to connect to 10.158.246.54:8109 (10.158.246.54) failed
      [Fri Jun 21 12:31:22 2013] [error] ap_proxy_connect_backend disabling worker for (10.158.246.54)
      [Fri Jun 21 12:31:22 2013] [debug] mod_proxy_cluster.c(1568): proxy_cluster_try_pingpong: can't connect to backend
      [Fri Jun 21 12:31:22 2013] [debug] proxy_util.c(2029): proxy: ajp: has released connection for (10.158.246.54)
      [Fri Jun 21 12:31:23 2013] [debug] mod_proxy_cluster.c(626): update_workers_node starting
      [Fri Jun 21 12:31:23 2013] [debug] mod_proxy_cluster.c(645): update_workers_node done

       

      Apache mod_cluster.conf configuration:

        Listen 10.1.2.5:6666
        <VirtualHost 10.1.2.5:6666>
          <Directory />
            Order deny,allow
            Deny from all
            Allow from all
          </Directory>
          ServerAdvertise on
          AdvertiseFrequency 5
          EnableMCPMReceive
          AdvertiseGroup 239.255.200.101
          CreateBalancers 0
          KeepAliveTimeout 30
          MaxKeepAliveRequests 0
          <Location /mod_cluster_manager>
            SetHandler mod_cluster-manager
            Order deny,allow
            Deny from all
            Allow from all
          </Location>
        </VirtualHost>

       

      JBoss mod_cluster server-side config ("HAModCluster", excerpt):

        <property name="proxyList">${jboss.modcluster.proxyList:}</property>
        <property name="advertise">true</property>
        <property name="advertiseGroupAddress">${jboss.modcluster.advertise.address,jboss.partition.udpGroup:224.0.1.105}</property>
        <property name="advertisePort">${jboss.modcluster.advertise.port:23364}</property>
        <property name="stickySession">true</property>
        <property name="stickySessionForce">true</property>
        <property name="stickySessionRemove">true</property>
        <property name="maxAttempts">1</property>
        <property name="workerTimeout">-1</property>
        <property name="balancer">${jboss.modcluster.balancer}</property>

       

      I am looking for a way to have mod_cluster drop the workers faster, instead of having to restart all the Apache instances.

       

      Thanks

        • 1. Re: mod_cluster not dropping workers after abrupt stop.
          mriv

          I should also mention that I added <property name="nodeTimeout">5</property> (in seconds); mod_cluster-manager now shows:

          node: [3:3],Balancer: dev1,JVMRoute: jbodl02_estoredev1,Domain: [],Host: 10.1.2.5,Port: 8109,Type: ajp,flushpackets: 0,flushwait: 10,ping: 10,smax: 61,ttl: 60,timeout: 5
          node: [4:4],Balancer: dev1,JVMRoute: jbodl01_estoredev1,Domain: [],Host: 10.1.2.4,Port: 8109,Type: ajp,flushpackets: 0,flushwait: 10,ping: 10,smax: 61,ttl: 60,timeout: 5

          And I am still seeing the 5-minute timeout.

          • 2. Re: mod_cluster not dropping workers after abrupt stop.
            jfclere
            (jfclere replied with a link to the mod_cluster documentation; see the WaitForRemove discussion below.)
            • 3. Re: mod_cluster not dropping workers after abrupt stop.
              mriv

              The link takes me to 3.4.4. WaitForRemove, which has a default of 10 seconds. (That part seems to work properly: once all contexts are gone, it takes 10 seconds for a host to be removed.) I tried searching the documentation for any 5-minute default timeout related to httpd, but unfortunately I cannot find it. Which parameter will let me override the 5-minute default for timed-out hosts?

               

              Thanks

              • 4. Re: mod_cluster not dropping workers after abrupt stop.
                mbabacek

                Jean once wrote: Well, it is LBstatusRecalTime * 60 (and the LBstatusRecalTime default value is 5 seconds), so the default is 5 minutes.

                 

                Let me explain. The logic looks like this (trying to ping the node, incrementing num_failure_idle, etc.):


                if (ou->mess.num_failure_idle > 60) {
                    /* Failing for 5 minutes: time to mark it removed */
                    ou->mess.remove = 1;
                    ou->updatetime = now;
                }

                 

                This piece of code runs on every call to the update_workers_lbstatus function, which happens every LBstatusRecalTime seconds. So if it has run every 5 seconds at least 60 times, clearly at least 5 minutes have passed.
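
                For concreteness, here is a tiny standalone C sketch of that arithmetic (the variable names are mine, not mod_cluster's):

                #include <stdio.h>

                int main(void)
                {
                    const int lbstatus_recal_time = 5;  /* LBstatusRecalTime default, in seconds */
                    const int failure_threshold   = 60; /* num_failure_idle limit from the check above */

                    /* The idle-failure counter is bumped once per recalculation pass,
                     * so a node that never answers is marked removed after roughly: */
                    printf("removed after ~%d seconds\n",
                           lbstatus_recal_time * failure_threshold);
                    return 0; /* prints: removed after ~300 seconds, i.e. 5 minutes */
                }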

                 

                Furthermore, there is the WaitForRemove logic in the remove_removed_node function:

                if (ou->mess.remove && (now - ou->updatetime) >= wait_for_remove &&
                    (now - ou->mess.lastcleantry) >= wait_for_remove) {

                This introduces a delay between the state where the node is marked removed and the state where it is actually forgotten. It is noteworthy that the balancer might decide to switch the worker to removed much earlier, e.g. if it fails to respond to a request...

                 

                It might finally become clear if you consider the whole loop:

                 


                /* Create new workers if the shared memory changes */
                update_workers_node(conf, pool, s, 0);
                /* removed nodes: check for workers */
                remove_workers_nodes(conf, pool, s);     /* REMOVED */
                /* cleanup removed node in shared memory */
                remove_removed_node(pool, s);            /* FORGOTTEN */
                /* Calculate the lbstatus for each node */
                update_workers_lbstatus(conf, pool, s);
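
                So, to shorten the five-minute window, lowering LBstatusRecalTime on the httpd side should help. A minimal sketch for mod_cluster.conf, assuming your mod_cluster-native build exposes these mod_manager directives (please verify against your version's documentation):

                # Recalculate node status every 2 seconds, so a dead node is
                # marked removed after roughly 2 * 60 = 120 seconds instead of
                # the default 5 * 60 = 300 seconds.
                LBstatusRecalTime 2
                # Delay between "removed" and "forgotten"; 10 seconds is the default.
                WaitForRemove 10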

                 

                @Jean: I hope I didn't write anything misleading.

                @Matt: HTH