Hawkular Alerts supports scalability/HA out of the box.
You can see an example by building the project with:
mvn clean install -Denv=cluster
Under hawkular-alerts-rest-tests/target there will be a cluster based on a WildFly domain, with 3 alerting nodes plus 1 node for Cassandra.
This should be enough to see how a cluster can be formed; basically, the main change is to point to the correct Infinispan (ispn) configuration.
In the new 2.x branch we are working on, we expect to simplify these steps even further and make Hawkular Alerting more lightweight.
Please let us know your feedback/plans and we will be happy to help.
You can also reach us in the #hawkular channel on Freenode IRC.
That sounds like a great use of Hawkular Alerting. As Lucas said, please keep us updated with your feedback. Also, I've added "Clustering" to the list of tutorial lessons to work on...
Big thanks for the fast answers!
Great news that it should work by design.
What I tried so far for setting up the cluster (2 VMs, each with a Hawkular Alerts instance, + 1 Cassandra node) was:
- installed a Cassandra instance
- downloaded the hawkular-services-dist-0.36.0.Final.tar.gz release package to two different VMs
- set the parameter in the .hawkular-metrics.properties file: "hawkular-metrics.cassandra-nodes=x.x.x.x" (I configured this one as well, since we also tested with alarms on Metrics)
- set the parameters in the hawkular-alerts.properties file: "hawkular-alerts.cassandra-nodes=x.x.x.x" and "hawkular.backend=external"
- launched hawkular services via standalone.sh on both VMs
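For reference, the two property edits above could be sketched like this (the configuration directory is an assumption; point it at wherever your hawkular-services install keeps its .properties files):

```shell
# Sketch of the property edits described above. CONF_DIR is an assumption;
# adjust it to your hawkular-services layout. x.x.x.x is the Cassandra node.
CONF_DIR=${CONF_DIR:-/tmp/hawkular-conf}
mkdir -p "$CONF_DIR"

# Point Metrics at the shared Cassandra node
echo "hawkular-metrics.cassandra-nodes=x.x.x.x" >> "$CONF_DIR/.hawkular-metrics.properties"

# Point Alerts at the same Cassandra and mark the backend as external
cat >> "$CONF_DIR/hawkular-alerts.properties" <<'EOF'
hawkular-alerts.cassandra-nodes=x.x.x.x
hawkular.backend=external
EOF

grep "cassandra-nodes" "$CONF_DIR/.hawkular-metrics.properties" "$CONF_DIR/hawkular-alerts.properties"
```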
This setup worked to the extent that the Alerts REST APIs on both VMs were responsive; however, triggers that I had defined before starting the second server were no longer generating alarms for some reason.
I have now run the build Lucas suggested and found the domain.xsl and host.xsl configuration files in the hawkular-alerts-rest-tests/target subdirectory, defining the 3 Hawkular servers.
To be honest, I haven't used WildFly enough to say it is clear to me how I should tune those configuration files from the hawkular-services package.
So I will need to dig into the WildFly HA documentation for more info...
Are there other important WildFly configuration files I need to change, besides the two mentioned above?
One additional question: do you have suggestions on how to monitor Hawkular Alerts' key metrics and health? (We planned to use another tool for 'monitoring the monitoring infra'.)
I noticed the status API in Alerts, but I suppose that's not enough to detect all issues.
A couple of things. First, if you just need Hawkular Alerting, you don't need the full hawkular-services zip. Instead you could use standalone alerting; it may be easier for you. You can grab the .war here: https://repository.jboss.org/nexus/service/local/repositories/releases/content/org/hawkular/alerts/hawkular-alerts-rest-standalone/1.7.0.Final/hawkular-alerts-rest-standalone-1.7.0.Final.war.
You can run it on a vanilla WildFly 10. As an aside, would a Docker image be helpful for you?
I think the problem you may have had above is that a clustered deployment needs some different configuration; specifically, you don't want to use local Infinispan caches. See here: https://github.com/hawkular/hawkular-alerts/blob/master/hawkular-alerts-rest-tests/src/test/resources/standalone-ha.xsl.
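If it helps, the standalone route could look roughly like this. This is a sketch only: the WAR URL is the one above, and the WildFly paths assume a default unpack of wildfly-10.0.0.Final (the environment-dependent commands are shown as comments).

```shell
# Sketch: running the standalone alerting WAR on a vanilla WildFly 10.
# WILDFLY_HOME is an assumption; the ha profile matters when clustering.
WAR="hawkular-alerts-rest-standalone-1.7.0.Final.war"
WILDFLY_HOME=${WILDFLY_HOME:-./wildfly-10.0.0.Final}

# wget "https://repository.jboss.org/nexus/service/local/repositories/releases/content/org/hawkular/alerts/hawkular-alerts-rest-standalone/1.7.0.Final/$WAR"
# cp "$WAR" "$WILDFLY_HOME/standalone/deployments/"
# "$WILDFLY_HOME/bin/standalone.sh" -c standalone-ha.xml

echo "deploy $WAR into $WILDFLY_HOME/standalone/deployments"
```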
As for supplying some health/stat metrics from hAlerting, that's still on the to-do list. The /status endpoint is useful as a ping mechanism. If you have ideas for metrics you'd like to see exposed, let us know. Some possibilities: datums-received, datums-evaluated, events-received, events-evaluated, events-fired, events-stored, num-triggers, num-enabled-triggers, alerts-fired, etc. Also, how would you like them made available? A /metrics endpoint, JMX, etc.?
I can summarize a couple of steps to simplify the creation of a cluster.
For example, on a CentOS machine I installed and ran a shared Cassandra instance using the ccm tool:
ccm create -v 3.0.12 -n 1 -s hawkular
I cloned the hawkular-alerts project and built it with
mvn clean install
Then I removed the embedded Cassandra .war, used for demos/testing, from the WildFly server prepared by the distribution.
Next, create two nodes by copying the WildFly server:
cp -R wildfly-10.0.0.Final node1
cp -R wildfly-10.0.0.Final node2
In a shell, we run the first node using standalone-ha.xml (to keep it simple and avoid dealing with domains):
bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node1
In a different shell, we can start the second node with a port offset:
bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node2 -Djboss.socket.binding.port-offset=150
And then you have a simple 2-node cluster connecting to your Cassandra instance.
Assuming your C* is not on localhost or on standard ports, you will need to point the nodes to the correct location, either by setting system variables before starting the servers or by passing Java system properties on the run command.
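For example (the property name hawkular-alerts.cassandra-nodes is taken from the properties file mentioned earlier in the thread; treat it as an assumption and verify it for your version):

```shell
# Sketch: pointing a node at a non-local Cassandra. 10.0.0.5 is a placeholder;
# the startup commands are comments because they need a live WildFly install.
CASSANDRA_NODES="10.0.0.5"

# Option 1: set a variable in the environment before starting the server
# export CASSANDRA_NODES
# node1/bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node1

# Option 2: pass a Java system property on the run command itself
# node1/bin/standalone.sh -c standalone-ha.xml -Djboss.node.name=node1 \
#   -Dhawkular-alerts.cassandra-nodes="$CASSANDRA_NODES"

echo "Cassandra contact points: $CASSANDRA_NODES"
```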
I have tested this on a clean CentOS 7 machine; hope it helps.
As mentioned, in our 2.x version we are planning to simplify these steps and offer an alternative backend installation, so the Cassandra dependency can become optional.
With the last answer I really got this working and started a bit more testing with it.
Is configuring mod_cluster still needed for load balancing, or do the nodes know about / interact with each other if started as above on the same VM?
If you write a how-to guide about clustering, it would also be nice to have a multi-VM setup covered, i.e. information (or pointers) on what it takes to get the JBoss servers, started in standalone-ha mode on two VMs, to talk to each other (ports needed, multicast enabled, etc.).
Jay, you asked about Docker images: well, we intentionally try to avoid running the monitoring services on Docker.
(Actually, we're building this setup to monitor our OpenShift cluster.)
If, let's say, we were targeting a message flow of 1000 events/sec with roughly 50 to 100 (simple) trigger rules, what size of cluster would you build to handle the event/alarm flow in and out?
In the above scenario the two nodes should cluster and the alerting work should get distributed automatically. HAlerting leverages Infinispan's distributed caching to do this. I think you would still need a load-balancer if you wanted to automatically distribute incoming HTTP/REST requests to the different nodes. Although this wouldn't be required unless you started to see a bottleneck.
As for your proposed load, I would start with a single node and see how that goes. That ingestion rate seems reasonable for a single node but there are always lots of factors around network/db latency, etc. One note about sending events into alerting; there are three options: evaluate and persist, evaluate only, persist only. So, depending on your need to persist the events you can opt to just discard them after the engine does its evaluation. Similarly, if you want to store events only for later query you can do that too, just leveraging alerting as an event store.
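A hedged sketch of what sending events could look like over REST. The endpoint paths and the Hawkular-Tenant header below are my assumptions based on the 1.x REST API; check the REST docs for your version, and note the "persist only" variant is omitted because I'm not sure of its path.

```shell
# Sketch: two of the three ingestion options as curl calls. The curls are
# commented out because they need a running server; paths are assumptions.
BASE="http://localhost:8080/hawkular/alerts"
TENANT="my-org"
EVENTS='[{"id":"ev-1","ctime":1500000000000,"dataId":"app.status","category":"status","text":"DOWN"}]'

# Evaluate and persist:
# curl -X POST "$BASE/events" -H "Hawkular-Tenant: $TENANT" \
#      -H "Content-Type: application/json" -d "$EVENTS"

# Evaluate only (events discarded after the engine runs):
# curl -X POST "$BASE/events/data" -H "Hawkular-Tenant: $TENANT" \
#      -H "Content-Type: application/json" -d "$EVENTS"

echo "$EVENTS"
```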
As for docker, yes, that makes sense to keep your monitoring system outside of the env it's monitoring...
Related to event persistence: if we choose the "evaluate only" option, how does that work with alarm triggers that have dampening or complex event-processing rules, i.e. rule types that are evaluated over a period of time? Does the Hawkular Alerts service buffer those in memory?
We expect that we will have such alert rules.
All of the evidence for a trigger firing will always be maintained and included in the "evalSets" on the resulting alert or event. So, yes, those will be maintained in memory as dampening is tracked. So, events sent in for "evaluation only" will not be persisted in a way that they could later be returned as part of a query for events. But, any event that contributes to a trigger firing will be held and persisted as part of the resulting alert so that the firing can be explained. For example, a trigger with a single EventCondition and strict-3 dampening would have all 3 contributing events in the evalSets of the resulting alert. In that way you can understand why the trigger fired.
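To make that last example concrete, a trigger like the one described could be defined roughly as follows. The field names reflect my reading of the 1.x trigger model and are assumptions; verify them against the API docs before use.

```shell
# Sketch: a trigger JSON with one EventCondition and strict(3) dampening,
# i.e. fire only after 3 consecutive matching events. All ids/names are
# placeholders; the curl is commented out because it needs a running server.
TRIGGER='{
  "trigger": { "id": "app-down", "name": "App down 3 times", "enabled": true },
  "conditions": [ { "type": "EVENT", "dataId": "app.status" } ],
  "dampenings": [ { "triggerMode": "FIRING", "type": "STRICT", "evalTrueSetting": 3 } ]
}'

# curl -X POST "http://localhost:8080/hawkular/alerts/triggers/trigger" \
#      -H "Hawkular-Tenant: my-org" -H "Content-Type: application/json" -d "$TRIGGER"

echo "$TRIGGER" | grep "STRICT"
```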