There has not been a lot of discussion about a multi-node architecture for RHQ Metrics, so I thought it would be good to start a discussion to share some ideas and to solicit some feedback. Unless stated otherwise, the term node refers to an instance of RHQ Metrics for the purposes of this discussion.
First, let's consider why a multi-node architecture is important. By running multiple nodes we can provide high availability. If a particular node goes down, other nodes can continue servicing requests. Provided we are not maintaining client state on the server, offering HA should just be a matter of running multiple nodes.
Load balancing is another concern. If a single node cannot handle the current load, then we can deploy additional nodes and distribute the load amongst them. Load balancing may be provided by some service other than RHQ Metrics.
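To make the load-balancing and HA points concrete, here is a minimal sketch of round-robin distribution that skips nodes known to be down. It is only an illustration: the `RoundRobinBalancer` class and the node names are hypothetical, not part of RHQ Metrics, and in practice this role may well be filled by an external load balancer.

```python
from typing import List, Set


class RoundRobinBalancer:
    """Hypothetical balancer: rotates requests across nodes, skipping down ones."""

    def __init__(self, nodes: List[str]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = list(nodes)
        self._down: Set[str] = set()
        self._next = 0  # index of the next node to try

    def mark_down(self, node: str) -> None:
        """Record that a node failed a health check."""
        self._down.add(node)

    def mark_up(self, node: str) -> None:
        """Record that a node is healthy again."""
        self._down.discard(node)

    def pick(self) -> str:
        """Return the next healthy node, or raise if none are available."""
        for _ in range(len(self._nodes)):
            node = self._nodes[self._next]
            self._next = (self._next + 1) % len(self._nodes)
            if node not in self._down:
                return node
        raise RuntimeError("no healthy nodes available")
```

Because any healthy node can serve any request, this only works if no client state is kept on the server, which is the same condition noted above for HA.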
Coordination between nodes is another big concern. One use case that demonstrates the need for coordination is generating pre-computed aggregates. We need to determine which nodes perform which calculations, and when. We also need to be resilient to failures: if a node goes down before completing its tasks, we may want another node to finish the remaining work.
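One possible way to approach this kind of coordination is with time-limited leases: a node claims an aggregation task for a fixed period, and if it goes down before finishing, the lease expires and another node can take over. The sketch below is only an illustration of that idea, not a proposed design. `LeaseTable`, the task names, and the 30-second TTL are all hypothetical, and the in-memory dictionary stands in for a shared store that would need an atomic compare-and-set to be safe across real nodes.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Lease:
    owner: str          # id of the node that currently holds the task
    expires_at: float   # time after which another node may take over
    done: bool = False  # set once the aggregate has been computed


@dataclass
class LeaseTable:
    """Hypothetical stand-in for a store that every node can reach."""
    ttl: float = 30.0
    leases: Dict[str, Lease] = field(default_factory=dict)

    def try_claim(self, task: str, node: str, now: float) -> bool:
        """Claim (or renew) a task unless it is done or held by a live peer."""
        lease = self.leases.get(task)
        if lease is not None:
            held_by_live_peer = lease.owner != node and lease.expires_at > now
            if lease.done or held_by_live_peer:
                return False
        self.leases[task] = Lease(owner=node, expires_at=now + self.ttl)
        return True

    def complete(self, task: str, node: str, now: float) -> bool:
        """Mark a task finished; only the current, unexpired owner may do so."""
        lease = self.leases.get(task)
        if lease is None or lease.done:
            return False
        if lease.owner != node or lease.expires_at <= now:
            return False
        lease.done = True
        return True


# Example: node-a claims a task and then goes silent; node-b takes over.
table = LeaseTable(ttl=30.0)
table.try_claim("aggregate:hour-12", "node-a", now=0.0)   # node-a owns the task
table.try_claim("aggregate:hour-12", "node-b", now=10.0)  # rejected, lease is live
table.try_claim("aggregate:hour-12", "node-b", now=31.0)  # lease expired, node-b owns it
table.complete("aggregate:hour-12", "node-b", now=35.0)   # node-b finishes the work
```

The trade-off with leases is the choice of TTL: too short and a slow node loses its work to a peer, too long and a failed node's tasks sit idle before anyone picks them up.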
Discovery might be another important consideration. Newly deployed nodes might need the ability to discover the other nodes in the cluster for purposes of load balancing and/or coordination.
When thinking about possible solutions for things like load balancing, coordination, and discovery, we also need to think about the different deployment scenarios for RHQ Metrics. In discussions of the management.next architecture, a message bus is one of the central components. A message bus might provide many of the components necessary to implement the features being discussed. There may be scenarios, however, in which RHQ Metrics is deployed on its own, separately from the management.next stack. In those situations, the message bus might not be available to RHQ Metrics. Embedding RHQ Metrics into WildFly might be an example of a case where the full management.next stack is not available. Understanding the different deployment scenarios is critical in order to design effectively for a multi-node architecture.