8 Replies Latest reply on Jan 15, 2015 4:41 PM by john.sanda

    Thoughts on multi-node architecture

    john.sanda

      There has not been a lot of discussion about a multi-node architecture for RHQ Metrics, so I thought it would be good to start a discussion to share some ideas and to solicit some feedback. Unless stated otherwise, the term node refers to an instance of RHQ Metrics for the purposes of this discussion.

       

      First, let's consider why a multi-node architecture is important. By running multiple nodes we can provide high availability. If a particular node goes down, other nodes can continue servicing requests. Provided we are not maintaining client state on the server, offering HA should just be a matter of running multiple nodes.

       

      Load balancing is another concern. If a single node cannot handle the current load, then we can deploy additional nodes and distribute the load amongst them. Load balancing may be provided by some service other than RHQ Metrics.

       

      Coordination between nodes is another big concern. One of the use cases that demonstrates the need for coordination is generating pre-computed aggregates. We need to determine what node(s) perform what calculations and when. We also need to be resilient in the event of failures. If a node goes down before completing its tasks, then we may want another node to finish the remainder of that work.
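For concreteness, the pre-computed aggregates in question are rollups such as hourly min/max/avg over raw data points. A minimal, hypothetical sketch of that computation (plain Python, illustrative only; the function name and data shape are assumptions, not RHQ Metrics code):

```python
from collections import defaultdict

def rollup_hourly(data_points):
    """Roll raw (timestamp_seconds, value) pairs up into hourly
    min/max/avg buckets keyed by the start of the hour."""
    buckets = defaultdict(list)
    for ts, value in data_points:
        hour_start = ts - (ts % 3600)  # truncate the timestamp to the hour
        buckets[hour_start].append(value)
    return {
        hour: {"min": min(vs), "max": max(vs), "avg": sum(vs) / len(vs)}
        for hour, vs in buckets.items()
    }

# three raw points: two in the first hour, one in the second
hourly = rollup_hourly([(0, 10.0), (1800, 20.0), (3600, 5.0)])
assert hourly[0] == {"min": 10.0, "max": 20.0, "avg": 15.0}
assert hourly[3600] == {"min": 5.0, "max": 5.0, "avg": 5.0}
```

The coordination question is precisely who runs this rollup for which metrics and time windows, and who reruns it if the node doing so dies mid-way.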

       

      Discovery might also be another important consideration. When deploying additional nodes, they might need the ability to discover the other nodes in the cluster for purposes of load balancing and/or coordination.

       

      When thinking about possible solutions for things like load balancing, coordination, discovery, etc., we need to also think about the different deployment scenarios for RHQ Metrics. In discussions of management.next architecture, a message bus is one of the central components. A message bus might provide a lot of the components necessary to implement the features being discussed. There may be scenarios in which RHQ Metrics is deployed on its own, separately from the management.next stack. In those situations, the message bus might not be available to RHQ Metrics. Embedding RHQ Metrics into WildFly might be an example of when the full management.next stack may not be available. Understanding the different deployment scenarios is critical in order to effectively design for a multi-node architecture.

        • 1. Re: Thoughts on multi-node architecture
          juraci.costa

          John Sanda wrote:

           

          Load balancing is another concern. If a single node cannot handle the current load, then we can deploy additional nodes and distribute the load amongst them. Load balancing may be provided by some service other than RHQ Metrics.

           

          Discovery might also be another important consideration. When deploying additional nodes, they might need the ability to discover the other nodes in the cluster for purposes of load balancing and/or coordination.

           

          I understand this as "a load balancer can either be external or internal, at the discretion of whoever is setting it up". If so, "discovery of new nodes" is not relevant for RHQ Metrics for load-balancing purposes, right? About "coordination": I don't have a complete picture of the whole project, so, I might be missing something *big* here, but was Infinispan considered as the main data grid backend for the project? It provides a distributed map/reduce that could be used for the use-case you described ("generating pre-computed aggregates"). Cassandra can still be used as the cache store for Infinispan, if needed/required.

          • 2. Re: Thoughts on multi-node architecture
            juraci.costa

            Nevermind the Infinispan part :-) I'm just reading through older posts, and it's clear that direct Cassandra access is the desirable option here.

            • 3. Re: Thoughts on multi-node architecture
              tsegismont

              Load balancing is another concern. If a single node cannot handle the current load, then we can deploy additional nodes and distribute the load amongst them. Load balancing may be provided by some service other than RHQ Metrics.

              Why "may"? I would think that load balancing is an external concern (BigIP/HAProxy/nginx/Apache/... etc.)

              Coordination between nodes is another big concern. One of the use cases that demonstrates the need for coordination is generating pre-computed aggregates. We need to determine what node(s) perform what calculations and when. We also need to be resilient in the event of failures. If a node goes down before completing its tasks, then we may want another node to finish the remainder of that work.

              Pre-computed aggregates are for the C* backend only, correct? In this case, how about using it for leader election?

              http://www.datastax.com/dev/blog/consensus-on-cassandra

              I don't know if the current algorithm allows for it, but computing aggregates in parallel on all the Hawkular metric nodes available would be awesome.

              Discovery might also be another important consideration. When deploying additional nodes, they might need the ability to discover the other nodes in the cluster for purposes of load balancing and/or coordination.

              Load balancing put aside, why do you think we'd need discovery of other Hawkular metrics nodes?

              When thinking about possible solutions for things like load balancing, coordination, discovery, etc., we need to also think about the different deployment scenarios for RHQ Metrics. In discussions of management.next architecture, a message bus is one of the central components. A message bus might provide a lot of the components necessary to implement the features being discussed. There may be scenarios in which RHQ Metrics is deployed on its own, separately from the management.next stack. In those situations, the message bus might not be available to RHQ Metrics. Embedding RHQ Metrics into WildFly might be an example of when the full management.next stack may not be available. Understanding the different deployment scenarios is critical in order to effectively design for a multi-node architecture.

              Wouldn't the consensus on C* method work in any scenario?

              • 4. Re: Thoughts on multi-node architecture
                john.sanda

                Load balancing is another concern. If a single node cannot handle the current load, then we can deploy additional nodes and distribute the load amongst them. Load balancing may be provided by some service other than RHQ Metrics.

                Why "may"? I would think that load balancing is an external concern (BigIP/HAProxy/nginx/Apache/... etc.)

                I definitely agree that it's an external concern when we are dealing with clients using the REST API. Other clients, i.e., collectors/agents, might not use HTTP. They might push metric data via the message bus, and for that I think we might need to explore some other options.

                Pre-computed aggregates are for the C* backend only, correct? In this case, how about using it for leader election?

                http://www.datastax.com/dev/blog/consensus-on-cassandra

                I don't know if the current algorithm allows for it, but computing aggregates in parallel on all the Hawkular metric nodes available would be awesome.

                We definitely need to explore implementing leader election directly with Cassandra. There are other use cases for coordination, such as changing data retention and deleting a metric. I suspect that there will be more use cases. I think that an implementation based solely on Cassandra will require a lot of polling. If the message bus is available, we could eliminate a lot of that polling.
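The Cassandra-based approach in the linked post amounts to a lease acquired with a lightweight-transaction write (`INSERT ... IF NOT EXISTS USING TTL ...`), which expires automatically if the leader dies. A rough sketch of that logic, with an in-memory dict standing in for the Cassandra leases table (class and method names are hypothetical, not a real API):

```python
import time

class LeaseStore:
    """Stand-in for a Cassandra 'leases' table written with lightweight
    transactions; a real implementation would use INSERT ... IF NOT EXISTS
    USING TTL instead of this in-memory dict."""

    def __init__(self):
        self._leases = {}  # lease name -> (owner, expiry timestamp)

    def try_acquire(self, name, owner, ttl_seconds, now=None):
        """Try to take (or renew) the lease; True when 'owner' holds it."""
        now = time.time() if now is None else now
        current = self._leases.get(name)
        if current is not None and current[1] > now and current[0] != owner:
            return False  # an unexpired lease is held by another node
        self._leases[name] = (owner, now + ttl_seconds)  # acquire or renew
        return True

store = LeaseStore()
assert store.try_acquire("aggregator", "node-1", 30, now=100)
assert not store.try_acquire("aggregator", "node-2", 30, now=110)
# node-1 stops renewing; once its lease expires, node-2 can take over
assert store.try_acquire("aggregator", "node-2", 30, now=140)
```

The polling cost John mentions shows up here: followers must repeatedly call try_acquire to notice an expired lease, which is what a bus notification could avoid.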

                 

                I have also been investigating Spark. I do not know if something like Spark is an option, but I think it definitely merits investigation. The Spark Streaming library offers a very elegant way to calculate/update the pre-computed aggregates in real time with very little code.
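The streaming style is incremental: rather than recomputing buckets from raw data, per-window state is updated as each point arrives (as with Spark Streaming's updateStateByKey). A small, hypothetical plain-Python sketch of such running state (not Spark code):

```python
class RunningAggregate:
    """Incrementally updated aggregate: the kind of per-window state a
    streaming job maintains instead of re-reading raw data points."""

    def __init__(self):
        self.count = 0
        self.sum = 0.0
        self.min = None
        self.max = None

    def update(self, value):
        """Fold one new data point into the running state, O(1) per point."""
        self.count += 1
        self.sum += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    @property
    def avg(self):
        return self.sum / self.count

agg = RunningAggregate()
for v in (10.0, 20.0, 5.0):
    agg.update(v)
assert (agg.count, agg.min, agg.max) == (3, 5.0, 20.0)
```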

                • 5. Re: Thoughts on multi-node architecture
                  pilhuhn

                  I would like to keep the pre-computed aggregates out of the picture for now, as I think that it needs a lot more thinking.

                   

                  I think one could assume external load balancing. For HTTP this is just some (reverse) proxy in front, while for the bus the queue semantics could guarantee that a message is only delivered once(*), so that a node can pick a message and work on it; the next message would be consumed by the next available node.
                  But even if we do not use that, (JMS) selectors could be used so that one node only gets a certain tenant pattern and the other node gets the rest.

                   

                  *) So far I had been thinking of using a Topic for metric messages on the bus so that metrics can at the same time be processed by rhq-metrics and alerts.
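A set of mutually exclusive JMS selectors (e.g. on a hypothetical tenantId message property) effectively partitions the tenant space across consumers. The same partitioning can be expressed as a hash rule; a plain-Python sketch, illustrative only:

```python
import hashlib

def node_for_tenant(tenant_id, num_nodes):
    """Map a tenant to one of num_nodes consumers, the way a set of
    mutually exclusive selectors would partition the tenant space.
    md5 is used (not hash()) so the mapping is stable across processes."""
    digest = hashlib.md5(tenant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# every message for a given tenant lands on the same node,
# and the result is in range for any cluster size
assert node_for_tenant("acme", 3) == node_for_tenant("acme", 3)
assert 0 <= node_for_tenant("globex", 3) < 3
```

The trade-off against plain queue competition is the same as with static selectors: per-tenant ordering is preserved, but the partitioning must be recomputed when nodes join or leave.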

                  • 6. Re: Thoughts on multi-node architecture
                    nstefan

                    Heiko Rupp wrote:

                     

                    I would like to keep the pre-computed aggregates out of the picture for now, as I think that it needs a lot more thinking.

                     

                    The primary reason for this topic is to cover balancing of additional work aside from storing metrics. A Cassandra cluster takes care of balancing the storage; that is the main reason we chose Cassandra. So if aggregation is not a concern and will not be for the foreseeable future, then this discussion is moot. We would NOT need anything additional for load balancing in RHQ Metrics. Just point RHQ Metrics to a Cassandra cluster and done! Cassandra will do its magic from there.

                     

                    But there was discussion back in October (Do we need pre-computed aggregates?) that established the need for aggregates. I am confused ....

                    • 7. Re: Thoughts on multi-node architecture
                      heiko.braun

                      I agree with Stefan: Aggregation and retention are the concerns we should discuss before looking at load balancing. Or more generally speaking: Is there a need for coordination amongst nodes? In what cases? Does it imply certain constraints?

                       

                      I think we can ignore the work performed by Cassandra and the coordination at that level. This is sufficiently covered and well documented. The parts that matter to us are the nodes that run business logic.

                      • 8. Re: Thoughts on multi-node architecture
                        john.sanda

                        Heiko Braun wrote:

                         

                        I agree with Stefan: Aggregation and retention are the concerns we should discuss before looking at load balancing. Or more generally speaking: Is there a need for coordination amongst nodes? In what cases? Does it imply certain constraints?

                         

                        I think we can ignore the work performed by Cassandra and the coordination at that level. This is sufficiently covered and well documented. The parts that matter to us are the nodes that run business logic.

                        Yes, we do need coordination. For aggregation, we need to determine which nodes do what work. Whether we distribute the work among all the nodes or delegate to a single node, coordination is needed. We also need it for failure scenarios. Suppose a node goes down while performing some recurring or background task. There needs to be some form of coordination to ensure that the task does in fact get completed.
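One hypothetical shape for that guarantee: each recurring task is claimed with an expiring lease and marked complete in shared storage, so any node can re-claim work whose lease lapsed before completion. A plain-Python sketch (a dict stands in for a shared table such as one in Cassandra; all names are made up):

```python
class TaskTracker:
    """Shared record of recurring tasks: either claimed (with a deadline)
    or done. A dict stands in for a shared, cluster-visible table."""

    def __init__(self):
        self._claims = {}  # task -> (node, deadline)
        self._done = set()

    def claim(self, task, node, now, lease=60):
        """Try to claim a task; fails if done or held by a live node."""
        if task in self._done:
            return False
        current = self._claims.get(task)
        if current is not None and current[1] > now and current[0] != node:
            return False  # still held by another node whose lease is live
        self._claims[task] = (node, now + lease)
        return True

    def complete(self, task, node):
        """Mark a task done, but only if 'node' is the current claimant."""
        owner = self._claims.get(task)
        if owner is not None and owner[0] == node:
            self._done.add(task)

tracker = TaskTracker()
assert tracker.claim("rollup-hour-42", "node-1", now=0, lease=10)
# node-1 dies without completing; after its lease lapses, node-2 takes over
assert not tracker.claim("rollup-hour-42", "node-2", now=5)
assert tracker.claim("rollup-hour-42", "node-2", now=15)
tracker.complete("rollup-hour-42", "node-2")
assert not tracker.claim("rollup-hour-42", "node-3", now=20)
```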

                         

                        There are three situations that I can think of which will require coordination (there will likely be more):

                         

                        1. aggregation
                        2. changing data retention
                        3. deleting metrics

                         

                        I think it is too early to say much about constraints, but we obviously want to keep them to a minimum. Ideally, users should get the same functionality whether they run a single node or multiple nodes.