
    mod_cluster not dropping workers after abrupt stop.

    mriv

      During a graceful shutdown, mod_cluster removes dead workers relatively quickly, as expected. But when JBoss EAP 5 is abruptly killed (kill -9 or an unexpected error), the host and its workers remain in mod_cluster-manager for 5 minutes.

       

      I am running:

      EAP 5.2.0 with mod_cluster-1.0.10.GA_CP04

      httpd-2.2.22-14.ep6.el6.x86_64 with mod_cluster-native-1.2.3-2.Final.ep6.el6.x86_64 (the RHEL-native httpd EP mod_cluster libraries)

       

       

      Here are the httpd logs during the abrupt EAP kill:

      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(1949): proxy: ajp: retrying the worker for (10.1.1.10)
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(1955): proxy: ajp: worker for (10.1.1.10) has been marked for retry
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(2011): proxy: ajp: has acquired connection for (10.1.1.10)
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(2067): proxy: connecting ajp://10.1.1.10:8109/ to 10.1.1.10:8109
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(2193): proxy: connected / to 10.1.1.10:8109
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(2444): proxy: ajp: fam 2 socket created to connect to 10.1.1.10
      [Fri Jun 21 12:26:27 2013] [error] (111)Connection refused: proxy: ajp: attempt to connect to 10.1.1.10:8109 (10.1.1.10) failed
      [Fri Jun 21 12:26:27 2013] [error] ap_proxy_connect_backend disabling worker for (10.1.1.10)
      [Fri Jun 21 12:26:27 2013] [debug] mod_proxy_cluster.c(1568): proxy_cluster_try_pingpong: can't connect to backend
      [Fri Jun 21 12:26:27 2013] [debug] proxy_util.c(2029): proxy: ajp: has released connection for (10.1.1.10)

      (The same ten-line retry cycle repeats every 5 seconds: 12:26:32, 12:26:37, 12:26:42, 12:26:47, 12:26:52, and so on...)

      [Fri Jun 21 12:31:22 2013] [error] (111)Connection refused: proxy: ajp: attempt to connect to 10.158.246.54:8109 (10.158.246.54) failed
      [Fri Jun 21 12:31:22 2013] [error] ap_proxy_connect_backend disabling worker for (10.158.246.54)
      [Fri Jun 21 12:31:22 2013] [debug] mod_proxy_cluster.c(1568): proxy_cluster_try_pingpong: can't connect to backend
      [Fri Jun 21 12:31:22 2013] [debug] proxy_util.c(2029): proxy: ajp: has released connection for (10.158.246.54)
      [Fri Jun 21 12:31:23 2013] [debug] mod_proxy_cluster.c(626): update_workers_node starting
      [Fri Jun 21 12:31:23 2013] [debug] mod_proxy_cluster.c(645): update_workers_node done

       

      Apache mod_cluster.conf configuration:

        Listen 10.1.2.5:6666
        <VirtualHost 10.1.2.5:6666>
          <Directory />
            Order deny,allow
            Deny from all
            Allow from all
          </Directory>
          ServerAdvertise on
          AdvertiseFrequency 5
          EnableMCPMReceive
          AdvertiseGroup 239.255.200.101
          CreateBalancers 0
          KeepAliveTimeout 30
          MaxKeepAliveRequests 0
          <Location /mod_cluster_manager>
            SetHandler mod_cluster-manager
            Order deny,allow
            Deny from all
            Allow from all
          </Location>
        </VirtualHost>

       

      JBoss mod_cluster server-side config ("HAModCluster", excerpt):

        <property name="proxyList">${jboss.modcluster.proxyList:}</property>
        <property name="advertise">true</property>
        <property name="advertiseGroupAddress">${jboss.modcluster.advertise.address,jboss.partition.udpGroup:224.0.1.105}</property>
        <property name="advertisePort">${jboss.modcluster.advertise.port:23364}</property>
        <property name="stickySession">true</property>
        <property name="stickySessionForce">true</property>
        <property name="stickySessionRemove">true</property>
        <property name="maxAttempts">1</property>
        <property name="workerTimeout">-1</property>
        <property name="balancer">${jboss.modcluster.balancer}</property>

       

      I am looking for a way to have mod_cluster drop the workers faster, instead of having to restart all the Apache instances.

       

      Thanks

        • 1. Re: mod_cluster not dropping workers after abrupt stop.
          mriv

          I should also mention that I added <property name="nodeTimeout">5</property> (in seconds); mod_cluster-manager now shows:

          node: [3:3],Balancer: dev1,JVMRoute: jbodl02_estoredev1,Domain: [],Host: 10.1.2.5,Port: 8109,Type: ajp,flushpackets: 0,flushwait: 10,ping: 10,smax: 61,ttl: 60,timeout: 5
          node: [4:4],Balancer: dev1,JVMRoute: jbodl01_estoredev1,Domain: [],Host: 10.1.2.4,Port: 8109,Type: ajp,flushpackets: 0,flushwait: 10,ping: 10,smax: 61,ttl: 60,timeout: 5

          And I am still seeing the 5-minute timeout.

          • 2. Re: mod_cluster not dropping workers after abrupt stop.
            jfclere
            (jfclere replied with a link to the mod_cluster documentation; see the WaitForRemove discussion below.)
            • 3. Re: mod_cluster not dropping workers after abrupt stop.
              mriv

              The link takes me to 3.4.4. WaitForRemove, which has a default of 10 seconds. (That part seems to work properly: once all contexts are gone, it takes 10 seconds for a host to be removed.) I tried searching the documentation for any 5-minute default timeout related to httpd, but unfortunately I cannot find it. Which parameter will let me override the 5-minute default for timed-out hosts?

               

              Thanks

              • 4. Re: mod_cluster not dropping workers after abrupt stop.
                mbabacek

                Jean once wrote: Well, it is LBstatusRecalTime * 60 (and the LBstatusRecalTime default value is 5 seconds), so the default is 5 minutes.

                 

                Let me explain. The logic looks like this (trying to ping the node, incrementing num_failure_idle, etc.):


                if (ou->mess.num_failure_idle > 60) {
                    /* Failing for 5 minutes: time to mark it removed */
                    ou->mess.remove = 1;
                    ou->updatetime = now;
                }

                 

                This piece of code runs on every call to the update_workers_lbstatus function, which happens every LBstatusRecalTime seconds. So if it has run every 5 seconds at least 60 times, clearly at least 5 minutes have passed.
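
                For concreteness, here is a tiny standalone C sketch of that arithmetic (the variable names are mine, not mod_cluster's):

                #include <stdio.h>

                int main(void)
                {
                    const int lbstatus_recal_time = 5;  /* LBstatusRecalTime default, in seconds */
                    const int failure_threshold   = 60; /* num_failure_idle limit from the check above */

                    /* The idle-failure counter is bumped once per recalculation pass,
                     * so a node that never answers is marked removed after roughly: */
                    printf("removed after ~%d seconds\n",
                           lbstatus_recal_time * failure_threshold);
                    return 0; /* prints: removed after ~300 seconds, i.e. 5 minutes */
                }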

                 

                Furthermore, there is the WaitForRemove logic in the remove_removed_node function:

                if (ou->mess.remove && (now - ou->updatetime) >= wait_for_remove &&
                    (now - ou->mess.lastcleantry) >= wait_for_remove) {

                This introduces a delay between the state where the node is marked removed and the state where it is actually forgotten. It is noteworthy that the balancer might decide to switch the worker to removed much earlier, e.g. if it fails to respond to a request...

                 

                It might finally become clear if you consider the whole loop:

                 


                /* Create new workers if the shared memory changes */
                update_workers_node(conf, pool, s, 0);
                /* removed nodes: check for workers */
                remove_workers_nodes(conf, pool, s);     /* REMOVED */
                /* cleanup removed node in shared memory */
                remove_removed_node(pool, s);            /* FORGOTTEN */
                /* Calculate the lbstatus for each node */
                update_workers_lbstatus(conf, pool, s);
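
                So, to shorten the five-minute window, lowering LBstatusRecalTime on the httpd side should help. A minimal sketch for mod_cluster.conf, assuming your mod_cluster-native build exposes these mod_manager directives (please verify against your version's documentation):

                # Recalculate node status every 2 seconds, so a dead node is
                # marked removed after roughly 2 * 60 = 120 seconds instead of
                # the default 5 * 60 = 300 seconds.
                LBstatusRecalTime 2
                # Delay between "removed" and "forgotten"; 10 seconds is the default.
                WaitForRemove 10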

                 

                @Jean: I hope I didn't write anything misleading.

                @Matt: HTH