You can set alerts on metric baselines. Not sure if it gets you exactly what you want, but it's something to look into.
I forgot to mention that I have also looked into using the baselines, but they did not seem to function as I had hoped either.
I think that to accurately trigger alerts on baselines when monitoring averages (15 minutes in my case), the baselines would have to be recalculated continuously, every time metrics were received from an agent (every 30 seconds in my use case). Also, the baseline window would have to be configurable down to 15 minutes, and currently it can only go as low as 1 day. This raises a couple of questions I still have about the baselines:
- Is the baseline average the average of the metrics in the time period set by the field 'Baseline Dataset' in the server administration?
- If my assumption in #1 is accurate, is it possible to set the 'Baseline Dataset' to 15 minutes and the 'Baseline Calculation Frequency' to 30-60 seconds? If so, would this cause significant performance issues for a server monitoring ~150 agents, even in a clustered blade environment?
Actually, there is a way in RHQ (> 4.3?) to achieve what you want, but with a little bit of external tooling (e.g. a shell script).
What you would need to do is (in a loop):
- via the REST API, obtain the metrics for the last 15 minutes (via the raw-data endpoint)
- calculate the baselines as you want them
- write them back for the schedule of the metric
- sleep some time
And then just use the alerting where the data point is x% above/below the baseline.
And if it works, write a blog post :-)
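A rough Python sketch of that loop. The server URL, schedule id, endpoint paths, and JSON field names below are assumptions for illustration, not the verified RHQ REST API; check them against your RHQ version's REST documentation before using this.

```python
# Sketch of an external job that recomputes a 15-minute "baseline" and
# writes it back via the RHQ REST interface. Endpoint paths, the schedule
# id, and the JSON shapes are assumptions -- verify against your RHQ docs.
import json
import time
import urllib.request

BASE = "http://rhq-server:7080/rest"   # hypothetical server URL
SCHEDULE_ID = 10001                    # hypothetical metric schedule id
WINDOW_MS = 15 * 60 * 1000             # 15-minute window

def fetch_raw(schedule_id, start, end):
    """Fetch raw datapoints for a schedule (assumed raw-data endpoint)."""
    url = f"{BASE}/metric/data/{schedule_id}/raw?startTime={start}&endTime={end}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)         # assumed: list of {"timeStamp": ..., "value": ...}

def compute_baseline(values):
    """Calculate the baseline as you want it: min/avg/max over the window."""
    return {"min": min(values),
            "avg": sum(values) / len(values),
            "max": max(values)}

def push_baseline(schedule_id, baseline):
    """Write the baseline back for the schedule (assumed endpoint)."""
    req = urllib.request.Request(
        f"{BASE}/metric/data/{schedule_id}/baseline",
        data=json.dumps(baseline).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT")
    urllib.request.urlopen(req)

def run(poll_seconds=60):
    """The loop: fetch last 15 min, recompute, write back, sleep, repeat."""
    while True:
        now = int(time.time() * 1000)
        points = fetch_raw(SCHEDULE_ID, now - WINDOW_MS, now)
        values = [p["value"] for p in points]
        if values:
            push_baseline(SCHEDULE_ID, compute_baseline(values))
        time.sleep(poll_seconds)
```

The alerting side stays untouched; this only replaces the server's own (daily) baseline computation with your own 15-minute one.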
Setting alerts on baselines like that will be triggered if one datapoint goes over x% of the specified baseline value, right?
For my situation I want to throw alerts when the average value exceeds a set threshold value. Using REST to update the baselines as described, I think it would require that the alert condition work like the conditions for metrics, so that you could define a threshold value, select a comparator (<, >, =), and then select which value from the baseline (min, max, or avg) to compare against. Then manipulating the baselines should work.
Yes, go to alert definitions and then add a condition on
'metric baseline threshold'.
In the popup you can then select the metric to compare (with its baseline),
the comparator, which can be <, =, >,
the "exceeds baseline" factor in %,
and the reference entry of the baseline (avg, min, max).
If you want to alert on "being outside the band", you need to add two conditions, one comparing with > and the other with <, and then use the "fire alert if ANY of the conditions match" case.
What I was describing with the REST interface was only how to compute the baselines with an external job. The alerting itself will stay as is (i.e. with the internal computation mechanism).
But an alert definition configured like that would be triggered if one datapoint met the condition, correct?
Whereas I am looking for a solution that alerts when the baseline average (that I set through REST) exceeds a set threshold value.
The reason I want to alert against an average of the datapoints over 15 minutes is to smooth out spikes in the metrics, and the current baseline implementation for alerts looks like it would still be triggered by a single high datapoint.
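To make the smoothing argument concrete (the numbers here are purely illustrative): with 30-second collection there are 30 datapoints in a 15-minute window, so a single spike barely moves the mean, while a raw-value alert would trip on it immediately.

```python
# Illustrative only: one spike among 30 samples in a 15-minute window.
# A raw-value alert with an 80-unit threshold would fire on the spike,
# but the window average stays far below it.
window = [10.0] * 29 + [100.0]   # 29 normal readings plus one spike
avg = sum(window) / len(window)
print(avg)                        # 13.0 -- well under an 80-unit threshold
```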
You can do something like: if a measurement exceeds the threshold N consecutive times, then trigger an alarm. (I don't know exactly what 'consecutive' means, though. Does it mean every time the measurement is scheduled to be taken, or something else?)
This is like checking an average.
Yes, I agree with Elias, it sounds like you could use dampening against the raw metrics being reported to make the alerting tolerant of spikes. The dampening feature is exactly for this purpose.
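As a rough illustration of why dampening tolerates spikes, here is a toy model of the "N consecutive occurrences" rule (this mimics the idea only; it is not RHQ's implementation):

```python
# Toy model of "N consecutive occurrences" dampening: the condition must
# hold for N metric reports in a row before the alert fires, so an
# isolated spike is ignored while sustained load still triggers.

def fires(values, threshold, n):
    """Return True if any run of n consecutive values all exceed threshold."""
    streak = 0
    for v in values:
        streak = streak + 1 if v > threshold else 0
        if streak >= n:
            return True
    return False

# One spike among normal readings: no alert with 3-occurrence dampening.
print(fires([10, 95, 12, 11], threshold=80, n=3))   # False
# Sustained high load: three consecutive breaches fire the alert.
print(fires([85, 90, 95, 12], threshold=80, n=3))   # True
```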