1 Reply Latest reply on Jan 9, 2012 3:23 AM by jfclere

    Routing information corruption after an ungraceful worker shutdown

    asyomichev

      If a worker is killed with SIGKILL before it can complete a graceful shutdown (so MCMP DISABLE-APP/REMOVE-APP are never sent) and is then restarted on the same machine but on a different port, the normal behaviour seems to be to mark the existing node “REMOVED” and create a new one on the next CONFIG/ENABLE-APP cycle. Sometimes this does not work as expected: the REMOVED node still appears to be considered for routing. When that happens, a failed backend connection attempt to the old, dead port puts the worker into the “ERROR” state, and a “503 Service Temporarily Unavailable” error is returned for the application context until httpd is restarted.

       

      This happens intermittently and seems more likely when httpd is heavily loaded, so I suspect a race condition between one httpd worker servicing the CONFIG MCMP request with the new port and another picking up the old node, with its now-incorrect port, before it is labeled “REMOVED”, or something along those lines.
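
      To make the suspected interleaving concrete, here is a toy Python model of it. This is only a sketch of the hypothesis, not mod_cluster code: the node table, the `pick_node` and `handle_config` helpers, the route name `worker1` and the ports 8009/8019 are all made up for illustration, and the real module may well order these steps differently.

```python
from dataclasses import dataclass


@dataclass
class Node:
    jvm_route: str
    port: int
    status: str = "OK"  # "OK", "REMOVED" or "ERROR"


def pick_node(table, jvm_route):
    """Hypothetical routing lookup: first entry for the route not marked REMOVED."""
    for node in table:
        if node.jvm_route == jvm_route and node.status != "REMOVED":
            return node
    return None


def handle_config(table, jvm_route, new_port):
    """Hypothetical CONFIG handling: retire stale entries, then add the new one."""
    for node in table:
        if node.jvm_route == jvm_route and node.port != new_port:
            node.status = "REMOVED"
    table.append(Node(jvm_route, new_port))


def simulate(request_first):
    """Return the HTTP status a client would see for one interleaving."""
    table = [Node("worker1", 8009)]  # stale entry; its process was SIGKILLed
    live_port = 8019                 # port the restarted worker listens on

    if request_first:
        # The request picks its node before CONFIG has retired the stale entry.
        chosen = pick_node(table, "worker1")
        handle_config(table, "worker1", live_port)
    else:
        # CONFIG is fully applied before the request picks its node.
        handle_config(table, "worker1", live_port)
        chosen = pick_node(table, "worker1")

    if chosen.port != live_port:
        # Connecting to the dead port fails; the entry is flagged ERROR.
        chosen.status = "ERROR"
        return 503
    return 200
```

      In this model, `simulate(False)` routes to the new port, while `simulate(True)` reproduces the symptom: the request is sent to the dead port and the client gets a 503.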

       

      How is a dirty restart of a worker on a new port supposed to be handled by design? Has anyone seen this before?

       

      Environment: Apache/2.2.15, mod_cluster-1.1.0.Final-src-ssl, Tomcat 6.0.20