3 Replies Latest reply on Apr 12, 2013 1:26 PM by jayshaughnessy

Can you fire recovery alerts multiple times for one event?

josepho Apr 11, 2013 1:07 PM

I am trying to create complete alert events with: a start alert, subsequent alerts every metric cycle if the condition still exists, and an end alert when the condition is corrected.

I have been able to set up a recovery alert to act as the originating alert: it monitors a measurement and generates an alert when the measurement goes out of its bounds. I then set up another alert that starts out disabled and is recovered by the recovery alert; it watches for the measurement to go back into its desired range, then disables itself so that it does not keep generating alerts saying the condition is normal. This method generates a start and a stop alert for a condition, but the recovery alert only generates one alert when the condition starts. Is it possible to have the recovery alert that reports the abnormal condition keep generating alerts for as long as the condition exists, so that a history of the condition is generated?

I also need to be able to associate the alerts with each other using the REST interface. I am currently doing so using the alertDefinition ids and the recoveryId.

Thanks,

Joseph

1. Re: Can you fire recovery alerts multiple times for one event?

jayshaughnessy Apr 11, 2013 3:59 PM (in response to josepho)

I'm not sure why you need a recovery alert. It sounds like you just need a single Alert Definition. By default (with no special dampening) it will fire each time the metric is reported in the problem range. If the metric is reported with a normal value it will stop firing.

In general, recovery alerts are used when you want an alert to fire once and not again until the situation is corrected. In that case you set the standard alert to disable after firing, and you associate a recovery alert definition with it. The recovery alert just serves to re-enable the alert when the situation has been corrected.
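To make the difference concrete, here is a minimal sketch in plain Python. It is only a toy model of the two behaviours described above, not RHQ code: the function names, the readings, and the threshold are made up for illustration, and the condition is assumed to be "value > threshold".

```python
def fire_every_report(readings, threshold):
    """Single definition, no dampening: one alert per out-of-range report."""
    return [i for i, value in enumerate(readings) if value > threshold]


def fire_once_until_recovered(readings, threshold):
    """Problem definition disables itself after firing; a recovery
    definition re-enables it once the metric is back in range."""
    enabled = True
    fired_at = []
    for i, value in enumerate(readings):
        if value > threshold and enabled:
            fired_at.append(i)
            enabled = False   # "disable after firing"
        elif value <= threshold and not enabled:
            enabled = True    # recovery alert fires and re-enables the definition
    return fired_at


readings = [5, 12, 15, 11, 7, 13]               # hypothetical metric reports
print(fire_every_report(readings, 10))          # [1, 2, 3, 5]
print(fire_once_until_recovered(readings, 10))  # [1, 5]
```

The first list has one entry per report in the problem range; the second has one entry per problem episode.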

I'm not sure about the REST interface; it may not yet support what you are looking for.
2. Re: Can you fire recovery alerts multiple times for one event?

josepho Apr 11, 2013 4:35 PM (in response to jayshaughnessy)

The reason for the recovery alert is to try to replicate an existing monitoring system that RHQ is replacing, where an alert event has a start and an end alert. The existing system generates a 'cleared alert' to signal to the user that a condition has ended or been cleared.

I also wanted to add RHQ's ability to generate alerts every time the metric is reported, to provide the end user with a history of the alert event so trends could be analyzed. Is this type of combined functionality possible with RHQ?

Thanks,
Joseph
3. Re: Can you fire recovery alerts multiple times for one event?

jayshaughnessy Apr 12, 2013 1:26 PM (in response to josepho)

OK, I think I get what you are saying. You are looking to record an "Alert Event" in some way: something that demarcates the beginning and end of related alerts for a particular issue. Is that right?

So what you are looking for is sort of a way to fire a "Problem Start" alert, then 0 or more "Problem Not Solved" alerts, then a "Problem Solved" alert.

We don't really have that concept, and recovery alerts don't do quite what you want. You need a little more manual control over what is going on. The only thing I can think of that could help you out would be to incorporate Alert Notification scripts.

For example (and this is just off the top of my head), say you want to do this for Resource R where metric M > value V. For R, create 3 alert defs:

AD-1: PROBLEM FIXED!          Condition: M <= V
   This will be the recovery alert for AD-2

AD-2: PROBLEM START!          Condition: M > V
   Set AD-1 as recovery alert

AD-3: PROBLEM NOT SOLVED!     Condition: M > V
   Initially Enabled=False

Now, on AD-1 and AD-2 you also have Alert Notification scripts:
- For AD-1 it disables AD-3 (using the AlertDefinitionManagerRemote method to do so).
- For AD-2 it enables AD-3 (using the AlertDefinitionManagerRemote method to do so).

So then, if M exceeds V, AD-2 will fire. It will then disable itself and wait for the recovery alert AD-1 to fire. This is the standard recovery alert feature. But AD-2 will also enable AD-3 via its script. AD-3 will keep firing until the problem is fixed and AD-1 fires; AD-1 will then execute its script to disable AD-3.

Or, at least I think it would behave that way.
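As a sanity check on the expected sequence, here is a rough simulation in plain Python. It is a toy model only, not RHQ code and not the notification scripts themselves. It assumes that conditions are evaluated against the enabled/disabled state at the start of each metric report (so AD-3 first fires on the report after AD-2 enabled it), and that AD-1 only fires while AD-2 is waiting to be recovered. The function name and readings are made up.

```python
def simulate(readings, threshold):
    """Return (report index, alert name) pairs for the AD-1/AD-2/AD-3 scheme."""
    ad2_enabled = True    # PROBLEM START, disables itself after firing
    ad3_enabled = False   # PROBLEM NOT SOLVED, toggled by the notification scripts
    alerts = []
    for i, value in enumerate(readings):
        in_problem = value > threshold
        if in_problem and ad3_enabled:
            alerts.append((i, "PROBLEM NOT SOLVED"))
        elif in_problem and ad2_enabled:
            alerts.append((i, "PROBLEM START"))
            ad2_enabled = False   # standard "disable after firing"
            ad3_enabled = True    # AD-2's notification script enables AD-3
        elif not in_problem and not ad2_enabled:
            alerts.append((i, "PROBLEM FIXED"))
            ad2_enabled = True    # recovery alert re-enables AD-2
            ad3_enabled = False   # AD-1's notification script disables AD-3
    return alerts


for index, name in simulate([5, 12, 15, 11, 7, 6, 13], threshold=10):
    print(index, name)
# 1 PROBLEM START
# 2 PROBLEM NOT SOLVED
# 3 PROBLEM NOT SOLVED
# 4 PROBLEM FIXED
# 6 PROBLEM START
```

If the real alert engine evaluates AD-3 in the same report in which AD-2 enables it, the first "PROBLEM NOT SOLVED" would show up one report earlier than in this model.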

As for grouping these alerts in a report, there is no way to link alerts together by default. You may be able to come up with some sort of script to figure it out, based on naming and such. See AlertManagerRemote.findAlertsByCriteria() for various query options.
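A naming-based grouping could look something like the sketch below. It is plain Python over hypothetical (timestamp, name) pairs that are assumed to be already fetched and sorted by time for a single resource; how you get them out of RHQ is up to you, and nothing here calls the RHQ API. The alert names are the ones from the three definitions above.

```python
def group_into_events(alerts):
    """Group chronologically ordered (timestamp, name) alerts into events.

    An event opens at a "PROBLEM START!" alert and closes at the next
    "PROBLEM FIXED!" alert; alerts outside any open event are ignored.
    """
    events = []
    current = None
    for timestamp, name in alerts:
        if name == "PROBLEM START!":
            if current is not None:
                events.append(current)   # previous event never saw a fix
            current = [(timestamp, name)]
        elif current is not None:
            current.append((timestamp, name))
            if name == "PROBLEM FIXED!":
                events.append(current)
                current = None
    if current is not None:
        events.append(current)           # problem still open at the end of the data
    return events


# Hypothetical alert history; timestamps are just report numbers.
history = [
    (1, "PROBLEM START!"),
    (2, "PROBLEM NOT SOLVED!"),
    (3, "PROBLEM FIXED!"),
    (6, "PROBLEM START!"),
]
for event in group_into_events(history):
    print(event)
```

Each inner list is one "alert event"; the last one in this example has no "PROBLEM FIXED!" entry because the problem is still open.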
