I guess "once every 1 time" means on every occurrence.
Try once every two or three times instead.
I think that the rule "Once every 1 times condition set is true within a time period of 10 minutes" means that as soon as 1 occurrence happens an alert is created (and e-mailed) and that it won't do that again for the next 10 minutes. Otherwise, what would be the use of being able to set a time limit?
Do I understand the rule correctly?
mzeijen, I think your interpretation of that is correct. I'll ask Joseph to chime in here, he knows the alert subsystem the best.
In short, "Once every 1 times condition set is true within a time period of 10 minutes" actually has no effect.
Alert dampening rules come into play when all necessary conditions (under some alert definition) are known to be true. At that point, it would check if we knew the condition set was true *at least* 1 time within the last 10 minutes. Well, by virtue of the fact that we're processing dampening rules, we know it's true *now*, which satisfies the count of 1. Thus, dampening rules with "every 1 times" are actually a no-op.
Alert dampening becomes relevant when that count increases. Let's say you want to alert if some metric crosses the threshold, but you reasonably expect it to spike every so often. You don't want to be alerted during the short spikes (which may happen often), but you do want to be alerted if the metric sustains the value over your threshold. Here is where you might consider using an alert dampening mechanism of "Once every 5 times within a time period of 10 minutes". Each time the condition is true (each time the value is known to be higher than the threshold), it would look back over the last 10 minutes and see if it could find at least 4 other timestamps when it was also true - if so, it fires the alert.
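To make the "look back over the last 10 minutes" idea concrete, here is a minimal sketch in Python. It is not Jopr's actual implementation: the class name `CountInWindowDampener` and its method are made up for illustration, timestamps are plain seconds, and clearing the history after an alert fires is an assumption on my part.

```python
from collections import deque


class CountInWindowDampener:
    """Hypothetical model of "once every N times within a time period"."""

    def __init__(self, count, window_seconds):
        self.count = count                    # e.g. 5 ("once every 5 times")
        self.window_seconds = window_seconds  # e.g. 600 (10 minutes)
        self._hits = deque()                  # timestamps when the condition set was true

    def condition_true(self, timestamp):
        """Record that the condition set is true now; return True if an alert fires."""
        self._hits.append(timestamp)
        # Forget occurrences that fall outside the look-back window.
        while self._hits and timestamp - self._hits[0] > self.window_seconds:
            self._hits.popleft()
        if len(self._hits) >= self.count:
            # Assumption: start counting afresh once an alert has fired.
            self._hits.clear()
            return True
        return False


sustained = CountInWindowDampener(count=5, window_seconds=600)
print([sustained.condition_true(t) for t in (0, 60, 120, 180, 240)])
# [False, False, False, False, True]
```

With `count=1` the very first call already returns True, which is the no-op case described above. Isolated spikes more than 10 minutes apart never accumulate enough hits to fire.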
Alert dampening works well in this scenario because of the existence of measurement schedules. Schedules enable us to get a consistent "heartbeat" of metrics over time. This regularity helps add determinism to when the data points will arrive, and how many data points will be in each batch. Events, on the other hand, are an entirely different beast. You might have one event report with only 10 items in it, another with 500. And there's no limit to how many events could match the same regex within each batch.
For your case, I suggest you use Jopr's alert recovery feature. In the section labeled "Action Filters" you'll see an option "Disable alert until re-enabled manually or by recovery alert". If you set this to true, the alert will fire and then automatically disable itself. You then have an unlimited amount of time to respond without fear that the alert will go off again. Once you've resolved the issue, you can navigate to the alert definition and manually re-enable it.
Thanks for the good explanation of the alert dampening rule. I clearly misunderstood what was meant with that rule.
Thanks for the tip that I can set that option "Disable alert until re-enabled manually or by recovery alert". That is close to what I would like but I fear that it could happen that the alert doesn't get re-enabled manually.
If I understand your explanation correctly then I could create the dampening rule "Once every 500 times condition set is true within a time period of 10 minutes". I would get that one alert as soon as an event is reported but I wouldn't be bothered again for at least 10 minutes unless more than 500 events are reported within those 10 minutes.
If that isn't correct then maybe this is a case to create a feature request for? There must be more users who would like such a feature.
Actually, there *is* a way to automatically re-enable the alert. Once you create an alert definition "to be recovered", go create another alert definition on the same resource. You'll notice that the "Recovery for" drop-down / combo-box is now enabled. The mechanism works as follows:
* Alert Definition ABC is "to be recovered" (something bad happened in the system)
* Alert Definition XYZ is "recovery" for ABC (system auto-detected problem was resolved)
Initially, ABC will be active and XYZ will be suppressed by the system. If ABC fires an alert (assuming all conditions AND dampening rules have been satisfied), this indicates something bad happened. ABC becomes automatically disabled (because of the action filter), and XYZ is automatically enabled by the system.
XYZ then takes over checking for the "good" condition - the thing that indicates that the system has automatically resolved the problem and is back in steady state once again. This "good" condition could be some other log message, a metric on the resource, etc. Then, when XYZ is triggered and fires an alert (note that XYZ can have its own set of dampening rules too), it automatically re-enables ABC.
Thus, using the recovery mechanism, you could have a pseudo-automated system. As long as you think the system will return to a steady state eventually, it's possible to use recovery alerts to automate that "baby-sitting" task for you and eliminate some of that manual intervention.
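The ABC/XYZ hand-off can be sketched as a tiny state machine. This is only a toy model in Python, not Jopr code: the `AlertDefinition` class and its fields are hypothetical, and having XYZ go back to being suppressed after it fires is my assumption (it matches the initial state described above).

```python
class AlertDefinition:
    """Hypothetical stand-in for an alert definition; not Jopr's real class."""

    def __init__(self, name, disable_on_fire=False, recovery_for=None):
        self.name = name
        self.disable_on_fire = disable_on_fire  # the "disable until re-enabled" action filter
        self.recovery_for = recovery_for        # the definition this one recovers, if any
        self.recovery_definitions = []          # definitions that act as recovery for this one
        self.enabled = True
        if recovery_for is not None:
            recovery_for.recovery_definitions.append(self)
            self.enabled = False                # recovery definitions start out suppressed

    def fire(self):
        """Fire an alert if enabled; return True if an alert was actually created."""
        if not self.enabled:
            return False
        if self.recovery_for is not None:
            # The "good" condition was seen: re-enable the original definition.
            self.recovery_for.enabled = True
            self.enabled = False                # assumption: suppressed again afterwards
        elif self.disable_on_fire:
            # Something bad happened: disable ourselves and wake the recovery definitions.
            self.enabled = False
            for recovery in self.recovery_definitions:
                recovery.enabled = True
        return True


abc = AlertDefinition("ABC", disable_on_fire=True)
xyz = AlertDefinition("XYZ", recovery_for=abc)
abc.fire()
print(abc.enabled, xyz.enabled)  # False True
xyz.fire()
print(abc.enabled, xyz.enabled)  # True False
```

While ABC is disabled, further calls to `abc.fire()` return False, which is the "no repeat alerts until the problem is resolved" behaviour asked for earlier in the thread.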
As for your question about the dampening rules, you are mostly correct. If you did "Once every 500 times condition set is true within a time period of 10 minutes" it will *lower* the number of alerts you'll get, but it does not prevent you from getting multiple, dampened alerts within a 10 minute period.
Think of it this way: if you had 4200 instances of the condition set being met within a 10 minute period, you would only get 8 alerts (one alert for each full batch of 500 times the condition set was met). It's an oversimplification of how things actually work under the covers, but is more or less accurate.
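As a quick check of that figure, using the same simplified "one alert per batch" model (the variable names are just for illustration):

```python
matches_in_window = 4200  # times the condition set was met within the 10 minutes
dampening_count = 500     # the "once every 500 times" setting

# Whole batches produce alerts; the leftover matches do not complete another batch.
alerts, leftover = divmod(matches_in_window, dampening_count)
print(alerts, leftover)  # 8 200
```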
Thanks for all the information. I will certainly be able to put it to good use :).