Analysis for Graceful Shutdown and Quiescing for Messaging

Version 3

    User Story

     

    As an administrator, I can shut down, suspend, or resume my server in a controlled manner, so that ongoing client requests such as in-flight transactions are either serviced before the shutdown or suspended temporarily until the server is resumed.

    Issue Metadata

     

    EAP issue: https://issues.jboss.org/browse/EAP7-458

    Related issues:

    Dev Contacts: Jeff Mesnil, Artemis (TBD)
    QE Contacts: TBD
    Affected projects or components: Artemis, messaging-activemq subsystem

    Requirements

    • During the quiesce period, Artemis will not accept new external connections.
    • During the quiesce period, Artemis will still accept new internal connections.
    • During the quiesce period, JMS resources involved in in-flight transactions will remain available so that those transactions can complete.
    • At the end of the quiesce period, Artemis will be shut down immediately.
    • It applies to servers that are started with all supported server profiles (i.e. default, default-ha, full and full-ha) in both standalone and domain modes.
    • In a clustered environment, if a node is in graceful shutdown, the other node(s) in the cluster should not log warnings for it; they should log at info or debug level, or not at all.
    • Take verbosity into consideration: avoid verbose errors and warnings from the remaining nodes in the cluster. In a clustered environment, if a node starts a clean/graceful shutdown while clients are talking to the cluster via JMS, no warnings or errors should be logged.
    • After suspension, Artemis can resume its work and function normally (new external connections are accepted).

    Design Details

     

    This feature leverages the framework for suspend/resume that is already available in WildFly and used by other subsystems and resources (Servlets, MDBs).

    Artemis Design

    • When suspend is performed, Artemis will refuse to create new external connections (NettyAcceptorFactory will reject new connections)
    • When suspend is performed, Artemis will continue to create internal connections (InVmAcceptorFactory will create new connections)
    • When suspend is performed, existing external connections will remain open but JMS clients will not be able to produce/consume messages (exceptions will be raised)
    • When suspend is performed on a cluster node, the node must leave the cluster
    • In a cluster, during the quiesce period, external clients trying to connect to the suspended server should be able to connect transparently to another node
    • In a cluster, when an Artemis server is suspended, it should not scale down to another node in the cluster. Scaling down only occurs when the server is shut down
    • When resume is performed on a node in the cluster, the node must rejoin the cluster
    • When resume is performed, existing external connections will be able to produce/consume messages again


    Artemis already supports pausing its acceptors, but it cannot pause only its external acceptors (those based on the NettyAcceptorFactory implementation). Artemis should provide an API to suspend/resume its activity at the highest level (the ActiveMQServer interface), since this will involve many of its inner components (RemotingService, ReplicationManager, HAPolicy).
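
    The split between external and internal acceptors can be illustrated with a minimal sketch. All types below (SketchAcceptor, SketchServer) are hypothetical stand-ins written for this page; they are not the Artemis Acceptor or ActiveMQServer interfaces, and the eventual Artemis API may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an acceptor; NOT the Artemis Acceptor interface.
final class SketchAcceptor {
    final String name;
    final boolean external; // true for Netty-style remote acceptors, false for in-VM ones
    private boolean paused;

    SketchAcceptor(String name, boolean external) {
        this.name = name;
        this.external = external;
    }

    void pause() { paused = true; }

    void resume() { paused = false; }

    // Returns true if a new connection would be accepted right now.
    boolean acceptsNewConnections() { return !paused; }
}

// Hypothetical server-level entry point for suspend/resume.
final class SketchServer {
    private final List<SketchAcceptor> acceptors = new ArrayList<>();
    private boolean suspended;

    void addAcceptor(SketchAcceptor acceptor) { acceptors.add(acceptor); }

    // Pause only the external acceptors; in-VM acceptors keep accepting.
    void suspend() {
        for (SketchAcceptor acceptor : acceptors) {
            if (acceptor.external) {
                acceptor.pause();
            }
        }
        suspended = true;
    }

    // Reopen the external acceptors so that new remote connections are accepted again.
    void resume() {
        for (SketchAcceptor acceptor : acceptors) {
            if (acceptor.external) {
                acceptor.resume();
            }
        }
        suspended = false;
    }

    boolean isSuspended() { return suspended; }
}
```

    Placing suspend() and resume() on the server object rather than on each acceptor mirrors the intent above: a single call can later also coordinate clustering and replication, which this sketch leaves out.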

     

    A new kind of exception (or error code) will be added, to be raised on clients when they try to produce messages to, or consume messages from, a suspended server.
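
    As a rough illustration of that behavior, the sketch below keeps an existing session open across a suspend but refuses sends until the server is resumed. The names ServerSuspendedException and SketchSession are assumptions made for this example only; the real exception type or error code has not been decided here, and a real implementation would more likely surface it as a JMS exception than as an unchecked one.

```java
// Hypothetical exception type; the real name and error code are not defined by this document.
class ServerSuspendedException extends RuntimeException {
    ServerSuspendedException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for an existing external session on the server.
final class SketchSession {
    private volatile boolean serverSuspended;
    private int delivered;

    void onServerSuspended() { serverSuspended = true; }

    void onServerResumed() { serverSuspended = false; }

    // The connection stays open, but sends are refused while the server is suspended.
    void send(String message) {
        if (serverSuspended) {
            throw new ServerSuspendedException("server is suspended, cannot send: " + message);
        }
        delivered++;
    }

    int deliveredCount() { return delivered; }
}
```

    A client that catches the dedicated exception can tell a temporary suspension apart from a broken connection and retry after the server is resumed.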

    WildFly Design

    The messaging-activemq subsystem will leverage the org.jboss.as.server.suspend.SuspendController to notify its Artemis servers when they should suspend/resume their work. This feature will not add new management operations to the messaging-activemq subsystem. Graceful shutdown is executed globally on WildFly and cannot be limited to specific resources.
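
    The notification flow could look roughly like the sketch below. The interfaces ActivityCallback, SuspendableActivity and SuspendableBroker are local stand-ins defined for this example; they are only loosely modelled on the callback style of WildFly's suspend framework and are not the real WildFly or Artemis types. Whether the brokers are suspended in the pre-suspend phase or in the suspended phase is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for a completion callback; NOT a WildFly type.
interface ActivityCallback {
    void done();
}

// Local stand-in for an activity that takes part in suspend/resume; NOT a WildFly type.
interface SuspendableActivity {
    void preSuspend(ActivityCallback callback);

    void suspended(ActivityCallback callback);

    void resume();
}

// Hypothetical handle on one Artemis server managed by the subsystem.
interface SuspendableBroker {
    void suspend();

    void resume();
}

// Forwards suspend/resume notifications to every Artemis server of the subsystem.
final class MessagingSuspendActivity implements SuspendableActivity {
    private final List<SuspendableBroker> brokers;

    MessagingSuspendActivity(List<SuspendableBroker> brokers) {
        this.brokers = new ArrayList<>(brokers);
    }

    @Override
    public void preSuspend(ActivityCallback callback) {
        // Assumption: brokers stop accepting external work as early as possible.
        for (SuspendableBroker broker : brokers) {
            broker.suspend();
        }
        callback.done();
    }

    @Override
    public void suspended(ActivityCallback callback) {
        // Nothing left to do in this sketch; report completion immediately.
        callback.done();
    }

    @Override
    public void resume() {
        for (SuspendableBroker broker : brokers) {
            broker.resume();
        }
    }
}
```

    Because the subsystem only reacts to notifications, no new management operation is needed: the existing server-wide suspend, resume and shutdown operations drive the activity.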

    Work Decomposition

     

    This work requires the feature to be developed in Artemis first. Artemis must also provide an API that can be called from WildFly’s messaging-activemq subsystem in order to suspend/resume its Artemis servers. Once the feature is in Artemis, the messaging-activemq subsystem can support it internally (no new management operations will be exposed by the subsystem).

    QE

     

    Tests can be written before the feature is ready, as they will rely on existing APIs (JMS) and operations (the shutdown operation).

    Tests will need to identify the exceptions raised when producing/consuming messages on a suspended server.
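
    One way a test could recognize such failures is to walk the cause chain of whatever exception the client API throws. The sketch below assumes a marker exception type, here called SuspendedServerSignal; that name is hypothetical, and if the feature ends up using an error code instead, the check inside the loop would compare codes rather than types.

```java
// Hypothetical marker exception; the real type or error code is still to be defined.
class SuspendedServerSignal extends Exception {
    SuspendedServerSignal(String message) {
        super(message);
    }
}

final class SuspendedFailureDetector {
    // Upper bound on the walk, so that a cyclic cause chain cannot loop forever.
    private static final int MAX_DEPTH = 32;

    private SuspendedFailureDetector() {
    }

    // Returns true if the failure, or one of its causes, signals a suspended server.
    static boolean causedBySuspendedServer(Throwable failure) {
        Throwable current = failure;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (current instanceof SuspendedServerSignal) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }
}
```

    Checking the whole chain keeps the tests stable even if the client library wraps the original failure in a more generic exception.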

     

    Tests should focus on validating the use cases:

    • Gracefully shut down a standalone server
      • Validate that new external connections are rejected
      • Validate that new internal connections are accepted
      • Validate that in-flight transactions can complete
    • Gracefully shut down a node in a cluster
      • Validate that new external connections are opened on another node
      • Validate that new internal connections are accepted
      • Validate that in-flight transactions can complete
      • Validate that the node is scaled down to another node
    • Suspend/resume a standalone server
      • Suspend the server
      • Validate that new external connections are rejected
      • Validate that new internal connections are accepted
      • Validate that in-flight transactions can complete
      • Validate that existing external connections cannot produce/consume messages
      • Resume the server
      • Validate that new external connections are accepted
      • Validate that existing external connections can produce/consume messages
    • Suspend/resume a cluster node
      • Suspend the server
      • Validate that new external connections are opened on another node
      • Validate that new internal connections are accepted
      • Validate that in-flight transactions can complete
      • Validate that existing external connections are moved to another node
      • Validate that the node left the cluster
      • Resume the server
      • Validate that new external connections are accepted
      • Validate that the node joined the cluster