Domain Topology and Domain API Versioning Mini-Session -- Brno Jan 13, 2014

Version 1

    Part of the EAP7 / WildFly 9 Developer Meeting in Brno, January 2014.


    This was intended primarily as a small-group meeting to go over some technical details, but attendance was large, so there was quite a bit of general discussion around the basic topics.


    • Discussed issue of domain process (Host Controller, server) lifecycle being bound to the parent ProcessController process
      • Binding is due to need for PC to access/manage the stdio streams of the child process
      • Impacts domain-coordinated patching or possibly any sort of "provisioning" activity where the PC needs to be restarted while child processes live on
        • Patch installation, restart PC/HC, then restart servers according to some rollout plan
    • Evaluated use of ProcessBuilder.Redirect as an option for loosening this coupling
      • Direct child process streams to normal files, which remain accessible to a future incarnation of the PC and aren't subject to OS buffer issues while the PC is unavailable
      • Decided this was a valid approach that we'll pursue
      • Need to work out the protocol details
    • There was some general discussion re: the overall patching approach vs what might be ideal once the unified platform work is further along. This is more closely related to the unified platform discussion.
    • Second main topic was the issue of including management API version info in the payload of a management request (i.e. as a header)
      • WFLY-1918, Introduce "version" to all operations, in the JBoss Issue Tracker has details
      • There was opposition (primarily from Tomaz) to the notion that, when a request arrives without a version, it should be assumed to target the API version that existed before versions were introduced. The downside of that approach is that clients must send an API version in order to get the latest version. See comments on WFLY-1918. We did not fully resolve this issue.
        • In subsequent discussion with David, we explored the notion that the version for an existing root element/resource can act as the default for any operations involving children.
      • Heiko Braun raised concern that the API versioning system provides license for developers to break compatibility regularly, relying on the internal translation layer that would come with the versioning to ensure compatibility. He had strong doubts this would be reliable, and I agree.
        • In subsequent discussion between David, Tomaz and myself, we noted that the point of the API versions is primarily to handle issues between major releases. So, within a major release, incompatible changes should not be introduced, even if the versioning system would make supporting such changes technically feasible. I believe enforcing this will go far toward addressing Heiko's concern.
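
    The ProcessBuilder.Redirect idea from the first topic can be illustrated with a minimal sketch. It is not the ProcessController implementation: the class name DetachedStdioLauncher, the launch method and the temp log file are hypothetical, and stdin is left at its default. It only shows a child's stdout/stderr going to a regular file instead of a pipe that the parent must keep draining.

```java
import java.io.File;
import java.io.IOException;
import java.lang.ProcessBuilder.Redirect;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not WildFly's ProcessController code. */
final class DetachedStdioLauncher {

    /**
     * Starts a child whose stdout and stderr are appended to a regular file
     * rather than piped to the parent, so no parent has to drain a pipe
     * for the child to keep making progress.
     */
    static Process launch(List<String> command, File log) throws IOException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);               // merge stderr into stdout
        builder.redirectOutput(Redirect.appendTo(log));  // append, so the same file can be reused later
        return builder.start();
    }

    public static void main(String[] args) throws Exception {
        File log = File.createTempFile("server-one", ".log");
        // A JVM printing its version stands in for a managed server process.
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        Process child = launch(Arrays.asList(java, "-version"), log);
        child.waitFor();
        // Anything that can read the file sees the output, not only the launching process.
        System.out.println(new String(Files.readAllBytes(log.toPath())));
    }
}
```

    Because the output lands in a file, a restarted parent could in principle pick up reading where the previous one stopped. How the PC and its children would agree on file locations and reconnection is the protocol detail the notes leave open.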


    Other discussions during the overall meeting that were not part of this mini-session but are related to the same general topics:


    • Emanuel Muckenhuber, Alexey Loubyansky, Brian Stansberry met to discuss domain-controller-coordinated patching
      • Biggest problem is coordinating the various stages involved in patch application while dealing with the fact that processes (host controllers, process controllers) involved will need to be restarted.
        • When a process is restarted, state associated with where it stands needs to be preserved
          • on disk
          • in a separate special process tasked with assisting in patch coordination
        • There needs to be assurance that in the event of patch failure, the patch can be rolled back, even if the master HC cannot restart
          • again, perhaps relying on a separate special process tasked with assisting in patch coordination
      • One consideration was, at least initially, requiring rollout on a host-by-host basis, with no rollout approaches that patched some processes on a host but not others
        • Assume a topology of 2 hosts, 2 server groups, 4 servers, 1 server from each group on each host
        • Rolling out by host works for this, as each server group is left with 1 running member. For higher QoS, increase the number of hosts/servers
        • This doesn't work so well for a rolling upgrade scenario where a new server group is created to hold the next-version servers, with separate routing of requests by any LB (so failover occurs within a group) and usually separate state replication groups, with old version servers being shut down as new version servers come on line. Doing rollout on a host-by-host basis would require complete new hosts for the new servers.
          • It could work if part of the patch rollout process was to spawn a new server and to disable an old one
            • New server uses new server group configs
        • Various algorithms related to the above were explored; no real conclusions
          • Other than what we already knew -- this is hard
      • Alexey Loubyansky is interested in working on patching, so that's good new information from the meeting
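
    The host-by-host reasoning above (2 hosts, 2 server groups, 4 servers) can be checked mechanically. The sketch below is only an illustration of that reasoning, not a rollout algorithm from the meeting. The class HostByHostRollout, the Server type and the host/group names are hypothetical. For each host, it reports which server groups would have no running member while that host is down for patching.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical model of a domain topology; names are for illustration only. */
final class HostByHostRollout {

    static final class Server {
        final String name, host, group;
        Server(String name, String host, String group) {
            this.name = name; this.host = host; this.group = group;
        }
    }

    /** For each host, the server groups left with no running member while that host is down. */
    static Map<String, Set<String>> groupsLeftEmpty(List<Server> servers) {
        Set<String> groups = new TreeSet<>();
        Set<String> hosts = new TreeSet<>();
        for (Server s : servers) {
            groups.add(s.group);
            hosts.add(s.host);
        }
        Map<String, Set<String>> result = new TreeMap<>();
        for (String down : hosts) {
            Set<String> empty = new TreeSet<>(groups);
            for (Server s : servers) {
                if (!s.host.equals(down)) {
                    empty.remove(s.group); // group still has a member on another host
                }
            }
            result.put(down, empty);
        }
        return result;
    }

    public static void main(String[] args) {
        // The topology from the notes: 1 server from each group on each host.
        List<Server> topology = Arrays.asList(
                new Server("server-1", "host-a", "group-1"),
                new Server("server-2", "host-a", "group-2"),
                new Server("server-3", "host-b", "group-1"),
                new Server("server-4", "host-b", "group-2"));
        System.out.println(groupsLeftEmpty(topology)); // prints {host-a=[], host-b=[]}
    }
}
```

    An empty set for every host means each group keeps at least one running member throughout the rollout. A group whose servers all live on one host would show up in that host's set.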


    • Emanuel Muckenhuber proposed some changes in how managed domain coordination works
      • Extract the coordination work from the standard operation execution workflow.
        • Currently we have
          • A distinct, managed-domain-specific "prepare step handler" which is an OperationStepHandler that replaces the standalone server equivalent with special logic to determine what sort of domain-wide handling is needed for the request (straight local delegation, remote delegation, two-phase rollout to multiple processes)
          • A separate DOMAIN operation execution Stage that runs after the operation passes the VERIFY Stage locally, in which different OperationStepHandlers deal with the various tasks of propagating the operation to other hosts and servers
        • This logic would no longer be performed by OperationStepHandlers
      • Standard operation execution becomes a single-process concern
        • which may dovetail much more nicely with the work David Lloyd is driving
      • Domain coordination becomes a separate concern within the HC processes that uses the transactional APIs exposed by the standard operation execution to control commit/rollback on each process
        • API endpoints in a host controller interact with this separate layer, not directly with the local process ModelController
      • This should allow cleaner, more maintainable code
      • We will pursue this direction for WF 9 or WF 10 (which one depends on other priorities)
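
    The proposed split can be sketched roughly as follows: each process executes the operation locally and exposes only commit/rollback, and a separate coordination layer drives those across processes. This is a simplified illustration under assumptions, not the design that was agreed. PreparedOperation, ProcessEndpoint and DomainCoordinationSketch are hypothetical stand-ins for whatever transactional API the standard operation execution ends up exposing, and operations are plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical coordination layer; all types here are illustrative stand-ins. */
final class DomainCoordinationSketch {

    /** An operation that one process has executed locally but not yet committed. */
    interface PreparedOperation {
        void commit();
        void rollback();
    }

    /** One process (host controller or server) as seen by the coordinator. */
    interface ProcessEndpoint {
        PreparedOperation prepare(String operation) throws Exception;
    }

    /** Prepares on every process, then commits all, or rolls back the prepared ones on failure. */
    static boolean execute(String operation, List<ProcessEndpoint> processes) {
        List<PreparedOperation> prepared = new ArrayList<>();
        try {
            for (ProcessEndpoint process : processes) {
                prepared.add(process.prepare(operation));
            }
        } catch (Exception e) {
            for (PreparedOperation op : prepared) {
                op.rollback();
            }
            return false;
        }
        for (PreparedOperation op : prepared) {
            op.commit();
        }
        return true;
    }

    /** In-memory endpoint that records what happened to it, for demonstration. */
    static ProcessEndpoint recording(String name, boolean failOnPrepare, List<String> log) {
        return operation -> {
            if (failOnPrepare) {
                throw new IllegalStateException(name + " rejected " + operation);
            }
            log.add(name + ":prepared");
            return new PreparedOperation() {
                public void commit() { log.add(name + ":committed"); }
                public void rollback() { log.add(name + ":rolled-back"); }
            };
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = execute("write-attribute", Arrays.asList(
                recording("host-a", false, log),
                recording("host-b", true, log)));
        System.out.println(ok + " " + log); // prints false [host-a:prepared, host-a:rolled-back]
    }
}
```

    In this shape the coordinator contains no OperationStepHandler logic: it only sequences prepare, commit and rollback, which is the separation of concerns the proposal aims for.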