8 Replies Latest reply on Jan 15, 2015 4:41 PM by john.sanda

    Thoughts on multi-node architecture

    john.sanda

      There has not been a lot of discussion about a multi-node architecture for RHQ Metrics, so I thought it would be good to start a discussion to share some ideas and to solicit some feedback. Unless stated otherwise, the term node refers to an instance of RHQ Metrics for the purposes of this discussion.

       

      First, let's consider why a multi-node architecture is important. By running multiple nodes we can provide high availability. If a particular node goes down, other nodes can continue servicing requests. Provided we are not maintaining client state on the server, offering HA should just be a matter of running multiple nodes.

       

      Load balancing is another concern. If a single node cannot handle the current load, then we can deploy additional nodes and distribute the load amongst them. Load balancing may be provided by some service other than RHQ Metrics.

       

      Coordination between nodes is another big concern. One of the use cases that demonstrates the need for coordination is generating pre-computed aggregates. We need to determine what node(s) perform what calculations and when. We also need to be resilient in the event of failures. If a node goes down before completing its tasks, then we may want another node to finish the remainder of that work.
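For concreteness, the pre-computed aggregates in question are rollups such as hourly min/max/avg over raw data points. A minimal, hypothetical sketch of that computation (plain Python, illustrative only; the function name and data shape are assumptions, not RHQ Metrics code):

```python
from collections import defaultdict

def rollup_hourly(data_points):
    """Roll raw (timestamp_seconds, value) pairs up into hourly
    min/max/avg buckets keyed by the start of the hour."""
    buckets = defaultdict(list)
    for ts, value in data_points:
        hour_start = ts - (ts % 3600)  # truncate the timestamp to the hour
        buckets[hour_start].append(value)
    return {
        hour: {"min": min(vs), "max": max(vs), "avg": sum(vs) / len(vs)}
        for hour, vs in buckets.items()
    }

# three raw points: two in the first hour, one in the second
hourly = rollup_hourly([(0, 10.0), (1800, 20.0), (3600, 5.0)])
assert hourly[0] == {"min": 10.0, "max": 20.0, "avg": 15.0}
assert hourly[3600] == {"min": 5.0, "max": 5.0, "avg": 5.0}
```

The coordination question is precisely who runs this rollup for which metrics and time windows, and who reruns it if the node doing so dies mid-way.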

       

      Discovery might also be another important consideration. When deploying additional nodes, they might need the ability to discover the other nodes in the cluster for purposes of load balancing and/or coordination.

       

      When thinking about possible solutions for things like load balancing, coordination, discovery, etc., we need to also think about the different deployment scenarios for RHQ Metrics. In discussions of management.next architecture, a message bus is one of the central components. A message bus might provide a lot of the components necessary to implement the features being discussed. There may be scenarios in which RHQ Metrics is deployed on its own, separately from the management.next stack. In those situations, the message bus might not be available to RHQ Metrics. Embedding RHQ Metrics into WildFly might be an example of when the full management.next stack may not be available. Understanding the different deployment scenarios is critical in order to effectively design for a multi-node architecture.

        • 1. Re: Thoughts on multi-node architecture
          juraci.costa

          John Sanda wrote:

           

          Load balancing is another concern. If a single node cannot handle the current load, then we can deploy additional nodes and distribute the load amongst them. Load balancing may be provided by some service other than RHQ Metrics.

           

          Discovery might also be another important consideration. When deploying additional nodes, they might need the ability to discover the other nodes in the cluster for purposes of load balancing and/or coordination.

           

          I understand this as "a load balancer can either be external or internal, at the discretion of whoever is setting it up". If so, "discovery of new nodes" is not relevant for RHQ Metrics for load-balancing purposes, right? About "coordination": I don't have a complete picture of the whole project, so, I might be missing something *big* here, but was Infinispan considered as the main data grid backend for the project? It provides a distributed map/reduce that could be used for the use-case you described ("generating pre-computed aggregates"). Cassandra can still be used as the cache store for Infinispan, if needed/required.

          • 2. Re: Thoughts on multi-node architecture
            juraci.costa

            Nevermind the Infinispan part :-) I'm just reading through older posts, and it's clear that direct Cassandra access is the desirable option here.

            • 3. Re: Thoughts on multi-node architecture
              tsegismont

              Load balancing is another concern. If a single node cannot handle the current load, then we can deploy additional nodes and distribute the load amongst them. Load balancing may be provided by some service other than RHQ Metrics.

              Why "may"? I would think that load balancing is an external concern (BigIP/HAProxy/nginx/Apache/... etc.)

              Coordination between nodes is another big concern. One of the use cases that demonstrates the need for coordination is generating pre-computed aggregates. We need to determine what node(s) perform what calculations and when. We also need to be resilient in the event of failures. If a node goes down before completing its tasks, then we may want another node to finish the remainder of that work.

              Pre-computed aggregates are for the C* backend only, correct? In this case, how about using it for leader election?

              http://www.datastax.com/dev/blog/consensus-on-cassandra

              I don't know if the current algorithm allows for it, but computing aggregates in parallel on all the Hawkular metric nodes available would be awesome.

              Discovery might also be another important consideration. When deploying additional nodes, they might need the ability to discover the other nodes in the cluster for purposes of load balancing and/or coordination.

              Load balancing put aside, why do you think we'd need discovery of other Hawkular metrics nodes?

              When thinking about possible solutions for things like load balancing, coordination, discovery, etc., we need to also think about the different deployment scenarios for RHQ Metrics. In discussions of management.next architecture, a message bus is one of the central components. A message bus might provide a lot of the components necessary to implement the features being discussed. There may be scenarios in which RHQ Metrics is deployed on its own, separately from the management.next stack. In those situations, the message bus might not be available to RHQ Metrics. Embedding RHQ Metrics into WildFly might be an example of when the full management.next stack may not be available. Understanding the different deployment scenarios is critical in order to effectively design for a multi-node architecture.

              Wouldn't the consensus on C* method work in any scenario?

              • 4. Re: Thoughts on multi-node architecture
                john.sanda

                Load balancing is another concern. If a single node cannot handle the current load, then we can deploy additional nodes and distribute the load amongst them. Load balancing may be provided by some service other than RHQ Metrics.

                Why "may"? I would think that load balancing is an external concern (BigIP/HAProxy/nginx/Apache/... etc.)

                I definitely agree that it's an external concern when we are dealing with clients using the REST API. Other clients, i.e., collectors/agents, might not use HTTP. They might push metric data via the message bus, and for that I think we might need to explore some other options.

                Pre-computed aggregates are for the C* backend only, correct? In this case, how about using it for leader election?

                http://www.datastax.com/dev/blog/consensus-on-cassandra

                I don't know if the current algorithm allows for it, but computing aggregates in parallel on all the Hawkular metric nodes available would be awesome.

                We definitely need to explore implementing leader election directly with Cassandra. There are other use cases for coordination, such as changing data retention and deleting a metric. I suspect that there will be more use cases. I think that an implementation based solely on Cassandra will require a lot of polling. If the message bus is available, we could eliminate a lot of that polling.
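The Cassandra-based approach in the linked post amounts to a lease acquired with a lightweight-transaction write (`INSERT ... IF NOT EXISTS USING TTL ...`), which expires automatically if the leader dies. A rough sketch of that logic, with an in-memory dict standing in for the Cassandra leases table (class and method names are hypothetical, not a real API):

```python
import time

class LeaseStore:
    """Stand-in for a Cassandra 'leases' table written with lightweight
    transactions; a real implementation would use INSERT ... IF NOT EXISTS
    USING TTL instead of this in-memory dict."""

    def __init__(self):
        self._leases = {}  # lease name -> (owner, expiry timestamp)

    def try_acquire(self, name, owner, ttl_seconds, now=None):
        """Try to take (or renew) the lease; True when 'owner' holds it."""
        now = time.time() if now is None else now
        current = self._leases.get(name)
        if current is not None and current[1] > now and current[0] != owner:
            return False  # an unexpired lease is held by another node
        self._leases[name] = (owner, now + ttl_seconds)  # acquire or renew
        return True

store = LeaseStore()
assert store.try_acquire("aggregator", "node-1", 30, now=100)
assert not store.try_acquire("aggregator", "node-2", 30, now=110)
# node-1 stops renewing; once its lease expires, node-2 can take over
assert store.try_acquire("aggregator", "node-2", 30, now=140)
```

The polling cost John mentions shows up here: followers must repeatedly call try_acquire to notice an expired lease, which is what a bus notification could avoid.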

                 

                I have also been investigating Spark. I do not know if something like Spark is an option, but I think it definitely merits investigation. The Spark Streaming library offers a very elegant way to calculate/update the pre-computed aggregates in real time with very little code.
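The streaming style is incremental: rather than recomputing buckets from raw data, per-window state is updated as each point arrives (as with Spark Streaming's updateStateByKey). A small, hypothetical plain-Python sketch of such running state (not Spark code):

```python
class RunningAggregate:
    """Incrementally updated aggregate: the kind of per-window state a
    streaming job maintains instead of re-reading raw data points."""

    def __init__(self):
        self.count = 0
        self.sum = 0.0
        self.min = None
        self.max = None

    def update(self, value):
        """Fold one new data point into the running state, O(1) per point."""
        self.count += 1
        self.sum += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)

    @property
    def avg(self):
        return self.sum / self.count

agg = RunningAggregate()
for v in (10.0, 20.0, 5.0):
    agg.update(v)
assert (agg.count, agg.min, agg.max) == (3, 5.0, 20.0)
```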

                • 5. Re: Thoughts on multi-node architecture
                  pilhuhn

                  I would like to keep the pre-computed aggregates out of the picture for now, as I think that it needs a lot more thinking.

                   

                  I think one could assume external load balancing. For HTTP this is just some (reverse) proxy in front, while for the bus the queue semantics could guarantee that a message is only delivered once(*), so that a node can pick a message and work on it; the next message would be consumed by the next available node.
                  But even if we do not use that, (JMS) selectors could be used so that one node only gets a certain tenant pattern and the other node gets the rest.

                   

                  *) So far I had been thinking of using a Topic for metric messages on the bus so that metrics can at the same time be processed by rhq-metrics and alerts.
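A set of mutually exclusive JMS selectors (e.g. on a hypothetical tenantId message property) effectively partitions the tenant space across consumers. The same partitioning can be expressed as a hash rule; a plain-Python sketch, illustrative only:

```python
import hashlib

def node_for_tenant(tenant_id, num_nodes):
    """Map a tenant to one of num_nodes consumers, the way a set of
    mutually exclusive selectors would partition the tenant space.
    md5 is used (not hash()) so the mapping is stable across processes."""
    digest = hashlib.md5(tenant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# every message for a given tenant lands on the same node,
# and the result is in range for any cluster size
assert node_for_tenant("acme", 3) == node_for_tenant("acme", 3)
assert 0 <= node_for_tenant("globex", 3) < 3
```

The trade-off against plain queue competition is the same as with static selectors: per-tenant ordering is preserved, but the partitioning must be recomputed when nodes join or leave.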

                  • 6. Re: Thoughts on multi-node architecture
                    nstefan

                    Heiko Rupp wrote:

                     

                    I would like to keep the pre-computed aggregates out of the picture for now, as I think that it needs a lot more thinking.

                     

                    The primary reason for this topic is to cover balancing of additional work aside from storing metrics. A Cassandra cluster takes care of balancing the storage; that is the main reason we chose Cassandra. So if aggregation is not a concern and will not be for the foreseeable future, then this discussion is moot. We would NOT need anything additional for load balancing in RHQ Metrics. Just point RHQ Metrics to a Cassandra cluster and done! Cassandra will do its magic from there.

                     

                    But there was discussion back in October (Do we need pre-computed aggregates?) that established the need for aggregates. I am confused ....

                    • 7. Re: Thoughts on multi-node architecture
                      heiko.braun

                      I agree with Stefan: Aggregation and retention are the concerns we should discuss before looking at load balancing. Or more generally speaking: Is there a need for coordination amongst nodes? In what cases? Does it imply certain constraints?

                       

                      I think we can ignore the work performed by Cassandra and the coordination at that level. This is sufficiently covered and well documented. The parts that matter to us are the nodes that run business logic.

                      • 8. Re: Thoughts on multi-node architecture
                        john.sanda

                        Heiko Braun wrote:

                         

                        I agree with Stefan: Aggregation and retention are the concerns we should discuss before looking at load balancing. Or more generally speaking: Is there a need for coordination amongst nodes? In what cases? Does it imply certain constraints?

                         

                        I think we can ignore the work performed by Cassandra and the coordination at that level. This is sufficiently covered and well documented. The parts that matter to us are the nodes that run business logic.

                        Yes, we do need coordination. For aggregation, we need to determine which nodes do what work. Whether we distribute the work among all the nodes or delegate to a single node, coordination is needed. We also need it for failure scenarios. Suppose a node goes down while performing some recurring or background task. There needs to be some form of coordination to ensure that the task does in fact get completed.
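One hypothetical shape for that guarantee: each recurring task is claimed with an expiring lease and marked complete in shared storage, so any node can re-claim work whose lease lapsed before completion. A plain-Python sketch (a dict stands in for a shared table such as one in Cassandra; all names are made up):

```python
class TaskTracker:
    """Shared record of recurring tasks: either claimed (with a deadline)
    or done. A dict stands in for a shared, cluster-visible table."""

    def __init__(self):
        self._claims = {}  # task -> (node, deadline)
        self._done = set()

    def claim(self, task, node, now, lease=60):
        """Try to claim a task; fails if done or held by a live node."""
        if task in self._done:
            return False
        current = self._claims.get(task)
        if current is not None and current[1] > now and current[0] != node:
            return False  # still held by another node whose lease is live
        self._claims[task] = (node, now + lease)
        return True

    def complete(self, task, node):
        """Mark a task done, but only if 'node' is the current claimant."""
        owner = self._claims.get(task)
        if owner is not None and owner[0] == node:
            self._done.add(task)

tracker = TaskTracker()
assert tracker.claim("rollup-hour-42", "node-1", now=0, lease=10)
# node-1 dies without completing; after its lease lapses, node-2 takes over
assert not tracker.claim("rollup-hour-42", "node-2", now=5)
assert tracker.claim("rollup-hour-42", "node-2", now=15)
tracker.complete("rollup-hour-42", "node-2")
assert not tracker.claim("rollup-hour-42", "node-3", now=20)
```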

                         

                        There are three situations that I can think of which will require coordination (there will likely be more):

                         

                        1. aggregation
                        2. changing data retention
                        3. deleting metrics

                         

                        I think it is too early to say much about constraints, but we obviously want to keep them to a minimum. Ideally, users should get the same functionality whether they run a single node or multiple nodes.