Hawkular Alerts supports scalability/HA out of the box.
You can see an example by building the project with:
mvn clean install -Denv=cluster
Under hawkular-alerts-rest-tests/target there will be a cluster based on a WildFly domain, with 3 alerting nodes plus 1 node for Cassandra.
This should be enough to see how a cluster can be formed; basically, the main change is to point to the correct Infinispan (ispn) configuration.
In the new 2.x branch we are working on, we expect to simplify these steps even further and make Hawkular Alerting more lightweight.
Please let us know your feedback/plans and we will be happy to help.
You can also reach us in the #hawkular channel on Freenode IRC.
That sounds like a great use of Hawkular Alerting. As Lucas said, please keep us updated with your feedback. Also, I've added "Clustering" to the list of tutorial lessons to work on...
Big thanks for the fast answers!
Great news that it should work by design.
What I tried so far for setting up the cluster (2 VMs, each with a Hawkular Alerts instance, + 1 Cassandra node) was:
- installed a Cassandra instance
- downloaded the hawkular-services-dist-0.36.0.Final.tar.gz release package to two different VMs
- set the parameter in the .hawkular-metrics.properties file: "hawkular-metrics.cassandra-nodes=x.x.x.x" (I configured this one as well, since we also tested with alarms on Metrics)
- set the parameters in the hawkular-alerts.properties file: "hawkular-alerts.cassandra-nodes=x.x.x.x" and "hawkular.backend=external"
- launched hawkular services via standalone.sh on both VMs
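For reference, the two property edits above could be sketched like this (the configuration directory is an assumption; point it at wherever your hawkular-services install keeps its .properties files):

```shell
# Sketch of the property edits described above. CONF_DIR is an assumption;
# adjust it to your hawkular-services layout. x.x.x.x is the Cassandra node.
CONF_DIR=${CONF_DIR:-/tmp/hawkular-conf}
mkdir -p "$CONF_DIR"

# Point Metrics at the shared Cassandra node
echo "hawkular-metrics.cassandra-nodes=x.x.x.x" >> "$CONF_DIR/.hawkular-metrics.properties"

# Point Alerts at the same Cassandra and mark the backend as external
cat >> "$CONF_DIR/hawkular-alerts.properties" <<'EOF'
hawkular-alerts.cassandra-nodes=x.x.x.x
hawkular.backend=external
EOF

grep "cassandra-nodes" "$CONF_DIR/.hawkular-metrics.properties" "$CONF_DIR/hawkular-alerts.properties"
```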
This setup worked to the extent that the Alerts REST APIs on both VMs were responsive; however, triggers that I had defined before starting the second server were no longer generating alarms for some reason.
I have now run the build Lucas suggested and found the domain.xsl and host.xsl configuration files in the hawkular-alerts-rest-tests/target subdirectory, defining the 3 Hawkular servers.
To be honest, I haven't used WildFly enough to say it is clear to me how I should tune those configuration files from the hawkular-services package.
So I will need to dig into the WildFly HA documentation for more info...
Are there other important WildFly configuration files I need to change, besides the two mentioned above?
One additional question: do you have suggestions on how to monitor Hawkular Alerts' key metrics and health? (We planned to use another tool for 'monitoring the monitoring infra'.)
I noticed the status API in Alerts, but I suppose that's not enough to detect all issues.
A couple of things. First, if you just need Hawkular Alerting, you don't need the full hawkular-services zip. Instead you could use standalone alerting; it may be easier for you. You can grab the .war here: https://repository.jboss.org/nexus/service/local/repositories/releases/content/org/hawkular/alerts/hawkular-alerts-rest-standalone/1.7.0.Final/hawkular-alerts-rest-standalone-1.7.0.Final.war.
You can run it on a vanilla WildFly 10. As an aside, would a Docker image be helpful for you?
I think the problem you may have had above is that a clustered deployment needs some different configuration; specifically, you don't want to use local Infinispan caches. See here: https://github.com/hawkular/hawkular-alerts/blob/master/hawkular-alerts-rest-tests/src/test/resources/standalone-ha.xsl.
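If it helps, the standalone route could look roughly like this. This is a sketch only: the WAR URL is the one above, and the WildFly paths assume a default unpack of wildfly-10.0.0.Final (the environment-dependent commands are shown as comments).

```shell
# Sketch: running the standalone alerting WAR on a vanilla WildFly 10.
# WILDFLY_HOME is an assumption; the ha profile matters when clustering.
WAR="hawkular-alerts-rest-standalone-1.7.0.Final.war"
WILDFLY_HOME=${WILDFLY_HOME:-./wildfly-10.0.0.Final}

# wget "https://repository.jboss.org/nexus/service/local/repositories/releases/content/org/hawkular/alerts/hawkular-alerts-rest-standalone/1.7.0.Final/$WAR"
# cp "$WAR" "$WILDFLY_HOME/standalone/deployments/"
# "$WILDFLY_HOME/bin/standalone.sh" -c standalone-ha.xml

echo "deploy $WAR into $WILDFLY_HOME/standalone/deployments"
```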
As for supplying some health/stat metrics from hAlerting, that's still on the to-do list. The /status endpoint is useful as a ping mechanism. If you have ideas for metrics you'd like to see exposed, let us know. Some possibilities: datums-received, datums-evaluated, events-received, events-evaluated, events-fired, events-stored, num-triggers, num-enabled-triggers, alerts-fired, etc. Also, how would you like them made available? A /metrics endpoint, JMX, etc.?
I can summarize a couple of steps to simplify the creation of a cluster.
For example, on a CentOS machine I installed and ran a shared Cassandra instance using the ccm tool:
ccm create -v 3.0.12 -n 1 -s hawkular
I cloned the hawkular-alerts project and built it with
mvn clean install
Then I removed the embedded Cassandra .war, used for demos/testing, from the WildFly server prepared by the distribution.
Next, create two nodes by copying the WildFly server:
cp -R wildfly-10.0.0.Final node1
cp -R wildfly-10.0.0.Final node2
In a shell, we run the first node using standalone-ha.xml (to keep it simple and avoid dealing with domains):
bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node1
In a different shell, we can start the second node with a port offset:
bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node2 -Djboss.socket.binding.port-offset=150
And then you have a simple 2-node cluster connecting to your Cassandra instance.
Assuming your C* is not on localhost or on standard ports, you will need to point the nodes to the correct location, either by setting system variables before starting the servers or by passing Java system properties on the run command.
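For example (the property name hawkular-alerts.cassandra-nodes is taken from the properties file mentioned earlier in the thread; treat it as an assumption and verify it for your version):

```shell
# Sketch: pointing a node at a non-local Cassandra. 10.0.0.5 is a placeholder;
# the startup commands are comments because they need a live WildFly install.
CASSANDRA_NODES="10.0.0.5"

# Option 1: set a variable in the environment before starting the server
# export CASSANDRA_NODES
# node1/bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node1

# Option 2: pass a Java system property on the run command itself
# node1/bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node1 \
#   -Dhawkular-alerts.cassandra-nodes="$CASSANDRA_NODES"

echo "Cassandra contact points: $CASSANDRA_NODES"
```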
I have tested this on a clean CentOS 7 machine; hope it helps.
As mentioned, in our 2.x version we are planning to simplify these steps and offer an alternative backend installation, so the Cassandra dependency can become optional.
With the last answer I really got this working and started a bit more testing with it.
Is configuring mod_cluster still needed for load balancing, or do the nodes know about / interact with each other if started as above on the same VM?
If you write a how-to guide about clustering, it would also be nice to have a multi-VM setup covered, i.e. information (or pointers) on what it takes to get the JBoss servers, started in standalone-ha mode on two VMs, to talk to each other (ports needed, multicast enabled, etc.).
Jay, you asked about Docker images: well, we intentionally try to avoid running the monitoring services on Docker.
(Actually, we're building this setup to monitor our OpenShift cluster.)
If, let's say, we were targeting a message flow of 1000 events/sec with roughly 50 to 100 (simple) trigger rules, what size of cluster would you build to handle the event/alarm flow in and out?
In the above scenario the two nodes should cluster and the alerting work should get distributed automatically. HAlerting leverages Infinispan's distributed caching to do this. I think you would still need a load-balancer if you wanted to automatically distribute incoming HTTP/REST requests to the different nodes. Although this wouldn't be required unless you started to see a bottleneck.
As for your proposed load, I would start with a single node and see how that goes. That ingestion rate seems reasonable for a single node but there are always lots of factors around network/db latency, etc. One note about sending events into alerting; there are three options: evaluate and persist, evaluate only, persist only. So, depending on your need to persist the events you can opt to just discard them after the engine does its evaluation. Similarly, if you want to store events only for later query you can do that too, just leveraging alerting as an event store.
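A hedged sketch of what sending events could look like over REST. The endpoint paths and the Hawkular-Tenant header below are my assumptions based on the 1.x REST API; check the REST docs for your version, and note the "persist only" variant is omitted because I'm not sure of its path.

```shell
# Sketch: two of the three ingestion options as curl calls. The curls are
# commented out because they need a running server; paths are assumptions.
BASE="http://localhost:8080/hawkular/alerts"
TENANT="my-org"
EVENTS='[{"id":"ev-1","ctime":1500000000000,"dataId":"app.status","category":"status","text":"DOWN"}]'

# Evaluate and persist:
# curl -X POST "$BASE/events" -H "Hawkular-Tenant: $TENANT" \
#      -H "Content-Type: application/json" -d "$EVENTS"

# Evaluate only (events discarded after the engine runs):
# curl -X POST "$BASE/events/data" -H "Hawkular-Tenant: $TENANT" \
#      -H "Content-Type: application/json" -d "$EVENTS"

echo "$EVENTS"
```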
As for docker, yes, that makes sense to keep your monitoring system outside of the env it's monitoring...
Related to event persistence: if we choose the "evaluate only" option, how does that work with alarm triggers that have dampening or complex event-processing rules, i.e. rule types that are evaluated over a period of time? Does the Hawkular Alerts service buffer those in memory?
We expect that we will have such alert rules.
All of the evidence for a trigger firing will always be maintained and included in the "evalSets" on the resulting alert or event. So, yes, those will be maintained in memory as dampening is tracked. So, events sent in for "evaluation only" will not be persisted in a way that they could later be returned as part of a query for events. But, any event that contributes to a trigger firing will be held and persisted as part of the resulting alert so that the firing can be explained. For example, a trigger with a single EventCondition and strict-3 dampening would have all 3 contributing events in the evalSets of the resulting alert. In that way you can understand why the trigger fired.
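To make that last example concrete, a trigger like the one described could be defined roughly as follows. The field names reflect my reading of the 1.x trigger model and are assumptions; verify them against the API docs before use.

```shell
# Sketch: a trigger JSON with one EventCondition and strict(3) dampening,
# i.e. fire only after 3 consecutive matching events. All ids/names are
# placeholders; the curl is commented out because it needs a running server.
TRIGGER='{
  "trigger": { "id": "app-down", "name": "App down 3 times", "enabled": true },
  "conditions": [ { "type": "EVENT", "dataId": "app.status" } ],
  "dampenings": [ { "triggerMode": "FIRING", "type": "STRICT", "evalTrueSetting": 3 } ]
}'

# curl -X POST "http://localhost:8080/hawkular/alerts/triggers/trigger" \
#      -H "Hawkular-Tenant: my-org" -H "Content-Type: application/json" -d "$TRIGGER"

echo "$TRIGGER" | grep "STRICT"
```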