8 Replies Latest reply on Nov 7, 2012 9:38 AM by jayshaughnessy

Can alert conditions be created for metric averages?

josepho Oct 17, 2012 12:58 PM

Hi,

I am trying to create alerts that monitor metric averages over x minutes/hours/days. I have tried using the dampening conditions, but none of them can accurately account for averages, for the following reasons:

- The metric average can exceed the threshold without every measurement exceeding it, so the 'Consecutive' condition may never be satisfied. I also want the first occurrence, which makes the 'Last N Evaluations' condition equivalent to 'Consecutive'.
- Neither 'Time Period' nor 'Consecutive' will catch every instance, because an agent under high load does not have to send metrics at every interval. Counting a set number of measurements within a set time frame therefore cannot be guaranteed to be accurate unless the alert definition is flexible enough to watch the current average over that time period.
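
To make the first point concrete, here is a minimal Python sketch. It is not RHQ code: the function names and the sample readings are made up for illustration, and the 'Consecutive' check is only an approximation of how that dampening rule behaves.

```python
from statistics import mean

def consecutive_breach(values, threshold, n):
    """True if some run of n consecutive values all exceed threshold."""
    run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        if run >= n:
            return True
    return False

def average_breach(values, threshold):
    """True if the mean of all values exceeds threshold."""
    return mean(values) > threshold

# Hypothetical readings that alternate above and below the threshold.
samples = [95, 70, 98, 72, 99, 71]
threshold = 80

print(consecutive_breach(samples, threshold, 3))  # False: no run of 3
print(average_breach(samples, threshold))         # True: mean is about 84.2
```

The average is over the threshold, yet a "3 consecutive" rule never fires because every second reading dips below it.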

From what I can find, the solution would likely be to add an alert condition that monitors averages over a set time period. However, it also appears that RHQ does not store the averages displayed in the GUI or returned by REST, so this may also require changing the schema to hold these values. Can someone offer any insight into a simpler way to set alerts that accurately monitor a metric average over a custom time period?

Thanks,

Joseph

1. Re: Can alert conditions be created for metric averages?

mazz Oct 17, 2012 1:14 PM (in response to josepho)

You can set alerts on metric baselines. Not sure if it gets you exactly what you want, but it's something to look into.
2. Re: Can alert conditions be created for metric averages?

josepho Oct 17, 2012 1:52 PM (in response to mazz)
I forgot to mention that I have also looked into using the baselines, but they did not seem to function as I had hoped either.

I think that to accurately trigger alerts on baselines when monitoring averages (over 15 minutes in my case), the baselines would have to be recalculated every time metrics are received from an agent (every 30 seconds in my use case). The baseline window would also have to be settable to 15 minutes, and it currently can only go as low as 1 day. This raises a couple of questions I still have about the baselines:
1. Is the baseline average the average of the metrics in the time period set by the 'Baseline Dataset' field in the server administration?
2. If my assumption in #1 is accurate, is it possible to set the 'Baseline Dataset' to 15 minutes and the 'Baseline Calculation Frequency' to 30-60 seconds? If so, would this cause significant performance issues for a server monitoring ~150 agents, even in a clustered blade environment?

Thanks,
Joseph
3. Re: Can alert conditions be created for metric averages?

pilhuhn Oct 18, 2012 6:07 AM (in response to josepho)

Actually, there is a way in RHQ (> 4.3 ?) to achieve what you want, but it needs a little bit of external tooling (e.g. a shell script).

Have a look at http://pilhuhn.blogspot.de/2012/01/pushing-metrics-baselines-via-rest.html

What you would need to do is (in a loop):

- via the REST API, obtain the metrics for the last 15 mins (via the raw-data endpoint)
- calculate the baselines as you want them
- write them back for the schedule of the metric
- sleep some time
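
The loop above could be sketched roughly like this in Python. This is only an illustration under assumptions: `fetch_raw` and `push_baseline` are hypothetical callables that you would implement yourself on top of the raw-data and baseline REST endpoints (see the blog post for the actual URLs and payloads), and the 15-minute window and 30-second sleep are just the numbers from this thread.

```python
import time
from statistics import mean

WINDOW_MS = 15 * 60 * 1000  # 15-minute window, as discussed in the thread

def compute_baseline(points, now_ms, window_ms=WINDOW_MS):
    """Return (min, max, mean) of the values inside the window, or None.

    points is an iterable of (timestamp_ms, value) pairs.
    """
    recent = [v for ts, v in points if now_ms - window_ms <= ts <= now_ms]
    if not recent:
        return None
    return min(recent), max(recent), mean(recent)

def run_loop(fetch_raw, push_baseline, schedule_id, sleep_s=30,
             iterations=None, clock=time.time, sleep=time.sleep):
    """Fetch raw data, compute the baseline, write it back, sleep, repeat.

    fetch_raw(schedule_id, start_ms, end_ms) and
    push_baseline(schedule_id, min, max, mean) are hypothetical wrappers
    around the REST calls; they are NOT part of any RHQ client library.
    iterations=None loops forever.
    """
    done = 0
    while iterations is None or done < iterations:
        now_ms = int(clock() * 1000)
        points = fetch_raw(schedule_id, now_ms - WINDOW_MS, now_ms)
        baseline = compute_baseline(points, now_ms)
        if baseline is not None:  # skip the write if the agent sent nothing
            push_baseline(schedule_id, *baseline)
        sleep(sleep_s)
        done += 1
```

Passing the two REST wrappers in as arguments keeps the averaging logic testable without a running server. If an agent skipped some intervals, the baseline is simply computed from whatever points did arrive in the window.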

And then just use the alerting where the data point is x% above/below the baseline.

And if it works, write a blog post :-)
4. Re: Can alert conditions be created for metric averages?

josepho Oct 18, 2012 10:39 AM (in response to pilhuhn)

An alert set on baselines like that will be triggered if a single datapoint goes over x% of the specified baseline value, right?

In my situation I want to raise alerts when the average value exceeds a set threshold. If I use REST to update the baselines as described, I think the alert condition would need to work like the conditions for metrics: you define a threshold value, select a comparator (<, >, =), and then select which baseline value (min, max, or avg) to compare against. Then manipulating the baselines should work.
5. Re: Can alert conditions be created for metric averages?

pilhuhn Oct 18, 2012 10:58 AM (in response to josepho)

Yes, go to the alert definitions and then add a condition on
metric baseline threshold

In the popup you can then select
- the metric to compare (with its baseline),
- the comparator, which can be <, =, >
- the "exceeds baseline" factor in %
- and the reference entry of the baseline (avg, min, max)

If you want to alert on "being outside the band", you need to add two conditions, one comparing with > and the other with <, and then use the "fire alert if ANY of the conditions matches" case.
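
As a rough illustration of the two-condition band, here is a small Python sketch. It assumes that the percentage is applied to the chosen baseline value (e.g. 120 means "120% of the baseline average"); that reading, the function name, and the numbers are assumptions for illustration, not taken from the RHQ code.

```python
def outside_band(value, baseline_avg, upper_pct, lower_pct):
    """Mimic two baseline conditions combined with ANY.

    Condition 1: value > upper_pct % of the baseline average.
    Condition 2: value < lower_pct % of the baseline average.
    """
    above = value > baseline_avg * upper_pct / 100.0
    below = value < baseline_avg * lower_pct / 100.0
    return above or below

# Hypothetical baseline average of 100 with a band of 80% to 120%.
print(outside_band(130, 100, 120, 80))  # True, above the band
print(outside_band(100, 100, 120, 80))  # False, inside the band
print(outside_band(70, 100, 120, 80))   # True, below the band
```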

What I was describing with the REST interface was only how to compute the baselines with an external job. The alerting itself will stay as is (i.e. with the internal computation mechanism).
6. Re: Can alert conditions be created for metric averages?

josepho Oct 18, 2012 1:18 PM (in response to pilhuhn)

But an alert definition configured like that would be triggered if one datapoint met the condition, correct?
Whereas I am looking for a solution that alerts when the baseline average itself (which I set through REST) exceeds a set threshold value.

The reason I want to alert against an average of the datapoints over 15 minutes is to smooth out spikes in the metrics, and the current baseline implementation for alerts looks like it would still be triggered by a single high datapoint.
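
To illustrate the behaviour I am after, here is a minimal Python sketch of a time-window average check that fires only on the first crossing. It is not an RHQ feature: the class name is made up, timestamps are assumed to arrive in non-decreasing order, and the window is based on time rather than on a count of measurements so that skipped collection intervals do not matter.

```python
from collections import deque

class WindowAverageAlert:
    """Alert when the average over the last window_s seconds exceeds threshold."""

    def __init__(self, threshold, window_s=15 * 60):
        self.threshold = threshold
        self.window_s = window_s
        self.points = deque()  # (timestamp_s, value), oldest first
        self.total = 0.0       # running sum of the values in the window
        self.firing = False

    def add(self, ts, value):
        """Record a datapoint; return True only when the average first crosses."""
        self.points.append((ts, value))
        self.total += value
        # Drop points that have fallen out of the time window.
        while self.points[0][0] <= ts - self.window_s:
            _, old = self.points.popleft()
            self.total -= old
        avg = self.total / len(self.points)
        was_firing = self.firing
        self.firing = avg > self.threshold
        return self.firing and not was_firing
```

With a threshold of 80, feeding in 70, 75, 85 and then 95 fires on the fourth point, when the window average reaches 81.25, and stays quiet on later points until the average has dropped back below the threshold and crossed it again.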
7. Re: Can alert conditions be created for metric averages?

genman Oct 30, 2012 4:28 PM (in response to josepho)

You can do something like: if a measurement exceeds the threshold N times in a row, trigger an alarm. (I don't know exactly what 'consecutive' means, though. Does it mean every time the measurement is scheduled to be taken, or something else?)

This is like checking an average.
8. Re: Can alert conditions be created for metric averages?

jayshaughnessy Nov 7, 2012 9:38 AM (in response to genman)

Yes, I agree with Elias. It sounds like you could use dampening against the raw metrics being reported to make the alerting tolerant of spikes. The dampening feature is exactly for this purpose.