Analysis of "Convenient declaration of server suspend timeout for all operations that take down a server"

Version 8

    Analysis and design document for work on making graceful shutdown more convenient for users of management operations other than the "shutdown" op.

     

    Overview

     

    In WildFly 10 we introduced the server suspend / graceful shutdown feature. This included a new "suspend" management operation, plus a "timeout" param to the existing "shutdown" operation. If the timeout is set, the server waits up to that long for the suspend to complete before continuing with shutdown, thus providing a 'graceful shutdown'.

    However, there are other ways to trigger a server to stop handling requests besides the low level "shutdown" operation and the high level CLI "shutdown" command. This RFE is to add a "suspend-timeout" parameter to those, thus making it easier to get a graceful shutdown.

    Specifically, users can tell servers to "reload" in various ways, with a reload functioning effectively like a shutdown + restart. And then in a managed domain there are "restart-servers" ops at the domain and server group levels.

    For all of these it is possible for the user to get graceful behavior by first doing a "suspend" on the relevant servers before doing the reload / restart. But it's more user friendly to allow the user to provide a timeout similarly to what we do with shutdown.

     

    Issue Metadata

     

    EAP ISSUE: https://issues.jboss.org/browse/EAP7-601

     

    RELATED ISSUES: [WFCORE-1427] Add a timeout param to reload op and use it for "graceful reload"

     

    DEV CONTACTS:  Brian Stansberry, Yeray Santana Borges (community contributor), Jean-Francois Denise

     

    QE CONTACTS:

     

    AFFECTED PROJECTS OR COMPONENTS: WildFly Core kernel, WildFly Core CLI, HAL

     

    OTHER INTERESTED PARTIES: N/A

     

    Requirements

     

    Hard Requirements

     

    • All low level management operations that result in a server stopping handling of requests should accept a 'suspend-timeout' parameter that controls how long the server should wait for in-flight requests to complete before proceeding with the stop.
      • The meaning of 'suspend-timeout' should be the same as that of the 'timeout' param currently available for the 'shutdown' operation. The value is in seconds: a value less than 0 means wait indefinitely, a value of 0 means don't wait, and a value greater than 0 means the server will wait up to that many seconds for all active requests to finish. The default value is 0.
      • Operations like 'shutdown' that already have a 'timeout' param are excepted from this requirement; no need to add a semantically equivalent 'suspend-timeout'.
    • Relevant operations include 'reload' for a server or HC and 'restart-servers' on the domain root resource and on the server group resource. An audit should be performed to check for others as well.
    • All high level CLI commands that result in a server stopping handling of requests should also accept a 'suspend-timeout' parameter.
    • For a Host Controller 'reload' operation or high level CLI command, the suspend-timeout applies to the servers, not to the HC process. HC suspend is not supported. Any 'suspend-timeout' is irrelevant if the 'restart-servers' param is set to 'false' since in that case no server will stop normal handling of requests.
    • The 'suspend-timeout' is a per-server timeout, not an aggregate overall timeout for operations that affect multiple servers.
    • The 'suspend-timeout' only covers a server suspending before it stops normal request handling, not the time taken for any other aspect of the operation, e.g. the time to restart/reload once the server is suspended.
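
    The value semantics in the requirements above can be sketched as follows. This is a minimal illustration, not WildFly code: the function names suspend_wait_seconds and per_server_waits are hypothetical, and "wait indefinitely" is modelled as an infinite wait.

```python
import math


def suspend_wait_seconds(suspend_timeout: int = 0) -> float:
    """Map a 'suspend-timeout' value (seconds) to the maximum suspend wait.

    Hypothetical helper used only to illustrate the documented semantics.
    """
    if suspend_timeout < 0:
        # Less than 0: wait indefinitely for in-flight requests.
        return math.inf
    # 0 (the default): don't wait. Greater than 0: wait up to that many seconds.
    return float(suspend_timeout)


def per_server_waits(servers, suspend_timeout: int = 0) -> dict:
    """Give every affected server its own full wait budget.

    The timeout is per server, so it is not divided or shared across servers.
    """
    wait = suspend_wait_seconds(suspend_timeout)
    return {name: wait for name in servers}
```

    For example, per_server_waits(["server-one", "server-two"], 30) allows each of the two servers up to 30 seconds to suspend; it does not split a single 30-second budget between them.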

     

    Nice-to-Have Requirements

     

    • HAL support. This should be a hard requirement for a subsequent release but is a nice to have initially.
    • Reject the use of the 'suspend-timeout' param for an HC reload if the 'restart-servers' param is set to 'false'. If servers are not being restarted there is no 'suspend' involved, so the combination is nonsensical.
      • If this is done it should be done consistently in low level op handling on the server side and in CLI high level command processing on the client side. In other words, the high level command should not accept parameter combinations that are then rejected on the server.
    • Parallelization of multi-server suspend where not already done and where feasible and not an incompatible behavior change.
      • This requirement is not meant to imply that the suspend part of task execution should be separated from other parts (e.g. in a reload case, suspend all servers in parallel and then proceed on to the rest of reloading the servers).
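
    The nice-to-have rejection rule for an HC reload could look roughly like this. It is a sketch under assumptions: validate_hc_reload is a hypothetical name, the parameters are modelled as a plain dict, and the error type and message are illustrative only.

```python
def validate_hc_reload(params: dict) -> None:
    """Reject 'suspend-timeout' when 'restart-servers' is explicitly false.

    Hypothetical check; if servers are not restarted, no suspend happens,
    so a suspend timeout has nothing to apply to.
    """
    restart_servers_false = params.get("restart-servers") is False
    if restart_servers_false and "suspend-timeout" in params:
        raise ValueError(
            "'suspend-timeout' cannot be used when 'restart-servers' is false"
        )
```

    Applying one and the same rule in both the CLI high level command and the low level op handler is one way to meet the consistency requirement: any combination the client accepts is then also accepted by the server.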

     

    Non-Requirements

     

    • For an operation affecting multiple servers, treating the 'suspend-timeout' as being some overall timeout. The timeout applies to each server.

     

    Design Notes

     

    • List of affected low level management operations: [not complete yet]

    [domain@localhost:9990 /] /host=master:reload

    [domain@localhost:9990 /] reload --host=master

    [domain@localhost:9990 /] /host=master/server-config=server-one:reload

    [domain@localhost:9990 /] /host=master/server-config=server-one:restart

    [domain@localhost:9990 /] :restart-servers

    [domain@localhost:9990 /] /server-group=main-server-group:reload-servers

    [domain@localhost:9990 /] /server-group=main-server-group:restart-servers

     

    [standalone@localhost:9990 /] reload

    [standalone@localhost:9990 /] :reload

     

    • Connection timeout configuration for CLI reload operation:

    When the CLI reload operation is used, the CLI waits for the server to come back up, subject to a connection timeout. This timeout has to be taken into account when suspend-timeout is used together with the CLI reload operation. If suspend-timeout is greater than 0, the server reboot can take additional time to finish and the configured connection timeout may not be enough. The proposed solution is to use, in this case, the sum of the CLI reload connection timeout and the suspend-timeout as the connection timeout for this reload operation. When the suspend-timeout is less than or equal to 0, the connection timeout for awaiting the server reboot is not affected.
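
    The proposed adjustment amounts to a small calculation, sketched below. The function name effective_connection_timeout is hypothetical, and the sketch assumes both values are expressed in seconds; if the CLI connection timeout uses a different unit, a conversion would be needed first.

```python
def effective_connection_timeout(cli_timeout: int, suspend_timeout: int = 0) -> int:
    """Return the connection timeout to use while awaiting a reloaded server.

    Hypothetical helper; both arguments are assumed to be in seconds.
    """
    if suspend_timeout > 0:
        # The server may spend up to suspend_timeout suspending before it
        # reloads, so extend the wait by that amount.
        return cli_timeout + suspend_timeout
    # suspend_timeout <= 0: the configured connection timeout is unchanged.
    return cli_timeout
```

    With a configured connection timeout of 5 and a suspend-timeout of 30, the CLI would wait up to 35 before giving up; with a suspend-timeout of 0 or -1 it would still wait 5.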

     

    • Operation blocking time [TBD]