Design Notes for Graceful Shutdown upon receipt of a SIGTERM

Version 3

    Issue analysis for the request to invoke the WildFly graceful shutdown logic when the server process is terminated via an operating system signal and not by a management operation.

     

    Overview

     

    WildFly 10 introduced the "graceful shutdown" feature: as part of shutdown the server moves into a suspended mode in which new requests are rejected but in-process requests are given a configurable amount of time to complete before the server shuts down. See the Admin Guide in the latest WildFly documentation for details on this feature. However, graceful shutdown currently occurs only if the server is stopped via a management operation. If shutdown is triggered by an OS signal (e.g. from a kill -15 or a Ctrl-C), the shutdown is not graceful. This makes managing shutdown of WildFly servers more difficult for tooling, e.g. Kubernetes, that needs to use general-purpose facilities like OS signals to manage a wide variety of process types. The idea here is to enable graceful shutdown when the server process is terminated via a signal.

     

    Issue Metadata

     

    EAP ISSUE: https://issues.jboss.org/browse/EAP7-732

     

    RELATED ISSUES:

    [WFCORE-3073] Handle TERM gracefully - JBoss Issue Tracker

     

    DEV CONTACTS: Brian Stansberry

     

    QE CONTACTS: Jan Stourac

     

    AFFECTED PROJECTS OR COMPONENTS: WildFly Core kernel, OpenShift

     

    OTHER INTERESTED PARTIES:

     

    Requirements

     

    Hard Requirements

     

    • Perform graceful shutdown in a manner consistent with what occurs when the 'shutdown' management operation is performed if the following are true:
      • The JVM has received an OS signal that causes shutdown hooks to run.
      • The state of the server is RUNNING; i.e. the server has completed boot and has not begun any sort of shutdown or reload.
    • Management operations like 'shutdown' allow configuration via an operation parameter of the maximum time to wait for in-process requests to complete before proceeding with shutdown. This configuration option of course isn't available in the OS signal case, so instead this value can be configured via a new org.wildfly.sigterm.suspend.timeout system property.
      • The value of the property is an integer that represents the number of seconds to wait for requests to complete.
      • The default value is 0.
      • There is no value (or lack of value) that means "wait indefinitely".
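
    The hard requirements above can be sketched roughly as follows. This is a minimal illustration, not the WildFly Core implementation: the class SigtermSuspendSketch, the State enum and the GracefulShutdown interface are hypothetical stand-ins for the server's real lifecycle state and suspend logic. Only the property name and the default of 0 seconds come from the requirements. Treating negative or unparseable values as 0 is an assumption made here so that no value can mean "wait indefinitely".

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of the signal-triggered graceful shutdown; not WildFly code. */
final class SigtermSuspendSketch {

    /** Property name taken from the requirements. */
    static final String TIMEOUT_PROPERTY = "org.wildfly.sigterm.suspend.timeout";

    /** Hypothetical stand-in for the server's lifecycle state. */
    enum State { STARTING, RUNNING, STOPPING }

    /** Hypothetical stand-in for the logic shared with the 'shutdown' management operation. */
    interface GracefulShutdown {
        void suspendAndStop(int timeoutSeconds);
    }

    /**
     * Parses the timeout in seconds. An unset or blank property yields the default of 0.
     * Assumption: negative or malformed values are also treated as 0.
     */
    static int parseTimeoutSeconds(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return 0;
        }
        try {
            return Math.max(0, Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    /**
     * Builds a JVM shutdown hook that performs graceful shutdown only when the
     * server is RUNNING. The compare-and-set means the hook does nothing if the
     * server is still booting or a shutdown or reload is already in progress.
     */
    static Thread newShutdownHook(AtomicReference<State> state, GracefulShutdown shutdown) {
        return new Thread(() -> {
            if (state.compareAndSet(State.RUNNING, State.STOPPING)) {
                int timeout = parseTimeoutSeconds(System.getProperty(TIMEOUT_PROPERTY));
                shutdown.suspendAndStop(timeout);
            }
        }, "sigterm-graceful-shutdown-sketch");
    }
}
```

    Such a hook would be registered once during boot with Runtime.getRuntime().addShutdownHook(...), and the timeout would be supplied on the command line, e.g. -Dorg.wildfly.sigterm.suspend.timeout=30. Because the JVM runs shutdown hooks for both SIGTERM and SIGINT, the sketch makes no attempt to distinguish between signals.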

     

    Nice-to-Have Requirements

     

    • Disable this new behavior if the org.wildfly.sigterm.suspend.timeout system property is not set. This isn't really a nice-to-have; it's more a "do this if necessary". The reason to do this would be to be conservative about the contexts in which this behavior is enabled, e.g. to give it more bake time in situations where it is wanted before making it the default behavior. The ideal behavior is to always do the graceful shutdown, but with a timeout of 0 if the property is not configured. This is how shutdown is done when a management op is used. If we did initially require setting the system property, we could always remove that requirement in a later release without introducing an incompatibility.
    • Have a proper attribute in the management model (e.g. in subsystem=core-management) instead of or in addition to using a system property for configuration. This is a nice-to-have as my expectation is the target audience for this is oriented toward generic management mechanisms rather than having to ensure an attribute in the user-controlled WildFly xml config is set. Requiring configuration of this value via a management attribute and disallowing config via system property, which is a standard requirement for most WildFly configuration settings, would actually be an anti-requirement in this case.
    • Support this behavior for a domain mode server. It's my expectation that we will support this, as the code involved is the same for a standalone or domain mode server. But I'm describing it as a nice-to-have just in case some unforeseen issue arises. The key use case here is in the cloud and in cloud scenarios our emphasis is on standalone servers.

     

    Non-Requirements

     

    • Determine what sort of signal (e.g. SIGTERM vs SIGINT) is causing the JVM to exit and vary the behavior based on this information.
    • Do anything different from current behavior if the server is not in RUNNING state when the signal is handled or if the server kernel has indicated the process needs to terminate due to a failure.
    • Implement any special graceful shutdown behavior when shutdown is triggered via this mechanism that is different from what occurs when shutdown is triggered by a management operation.
    • Perform graceful shutdown in response to a signal if the VM is configured not to run shutdown hooks (e.g. with the -Xrs command line argument to java).
    • Perform graceful shutdown in response to signals that do not result in shutdown hooks being run (e.g. SIGKILL).
    • Implement some variant of this feature for a domain mode Process Controller or Host Controller processes.