3 Replies Latest reply on Jan 15, 2015 2:02 PM by jayshaughnessy

Availability

jayshaughnessy Jan 8, 2015 3:56 PM

I was thinking about alerting and we had just been talking about alerting on availability changes, like in RHQ. And then I realized that we hadn't really even thought about what availability might look like in Hawkular. There are fundamental differences between Hawkular and RHQ, as well as definite pros and cons of how we did things in the past. After a short while it became clear that things would not really carry over from RHQ. So I started some less public discussions around availability for Hawkular. I've tried to consolidate those conversations and come up with something for more public discussion. Thanks to Heiko, Lukas and Mazz for their input thus far.

This is a rough proposal for how Availability could work in Hawkular. It's totally flexible! Please COMMENT AND ASK QUESTIONS!

Availability in RHQ

RHQ had the following pros and cons with respect to Availability:

Pros:

Scheduled checks performed by an external agent.

The external agent gave us a way to proactively detect avail changes. The schedules gave us predictable intervals for performing the tests and detecting changes.

Cons:

Performance
Timeliness
Untargeted

Our agent used a lot of cycles repeatedly testing availability. In response we were forced to check availability infrequently. This sometimes led to slow reporting or complete misses of "cycled" avail changes. Furthermore, we tracked avail for every resource. That is a lot of wasted work when users are typically interested in a few, critical resources, and often only for alerting purposes.

Availability in Hawkular

Part 1: Don't Require Availability Reporting

Availability is basically a metric. Primarily UP|DOWN. The fundamental difference between avail and other metrics is that a down resource can't report itself down. That needs to come from an external entity. But it can report itself up (like a ping). In fact, it may be able to report its upTime (a more informative ping). But basically, it could be that it doesn't care to report avail at all, and is happy reporting its other metrics (which is another sign of live-ness), or possibly happy reporting none at all.

We need to provide a mechanism to report availability, but we don't need to force it to be used. Let the feed writers decide what is important for their product.

Part 2: Allow for External Availability Reporting

Because a resource can't report itself down, it's important to allow an external source to report down on any resource. This should be easy: it would just be sending a metric report like any other, as long as it has the resource id.

Part 3: Provide a Light External Agent for Reporting Availability

I don't have the specifics here but the main idea is that we have something that can be deployed that can perform frequent, "light" avail checks for specific resources that have been configured. I'm talking about pid detection, url pings (like what the netservices plugin does in RHQ), generic ways to quickly check avail on historically the most important resources, namely servers (processes) and apps. We could provide various, generic mechanisms that require only some seed data (a pid) or simple configuration. The frequency of checks could be fairly high to catch down situations quickly.
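To make the "light" checks a bit more concrete, here is a minimal sketch of a pid check and a URL ping. It is not a design for the agent. The function names, the report shape and the resource id "eap-1" are all made up for illustration, and the pid check as written only makes sense on POSIX systems.

```python
import os
import urllib.error
import urllib.request
from typing import Callable, Dict


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        # Signal 0 sends nothing; it only checks that the pid can be signalled.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True


def url_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False


def check(resource_id: str, probe: Callable[[], bool]) -> Dict[str, str]:
    """Run one probe and shape the result like an avail report (hypothetical shape)."""
    return {"id": resource_id, "avail": "UP" if probe() else "DOWN"}


if __name__ == "__main__":
    # Hypothetical resource id; the current process stands in for a server pid.
    print(check("eap-1", lambda: pid_is_alive(os.getpid())))
```

Both probes are cheap enough to run every few seconds, which is the point of keeping the external agent light.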

An Example:

Let's take a theoretical EAP feed. This feed cannot report the EAP server itself DOWN but it can report UP/DOWN avail for any of its descendants, as it deems fit or has been configured to do. And possibly quite efficiently if somehow tied in with, say, lifecycle functions for its deployments. Perhaps the EAP feed could optionally "register" with the External Agent (if it is available), thus setting up perhaps some ability to watch the process. Or, maybe the External Agent has some level of process discovery, like in RHQ.

So, in this model we have an external agent monitoring the server and the embedded EAP feed doing the rest.

Part 4: Alerting

Alerting (status-based):

So, given that availability is being reported as a sort of metric, it should be possible to perform simple alerting of avail status changes, like GOES_DOWN. And, also given some sort of alerting-side trickery, duration alerting like STAYS_DOWN X Minutes.
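As a rough sketch of what the two status-based conditions could mean over a stream of avail data points (this is not how alerting will actually be implemented), assume each point is a (timestamp in seconds, avail) pair and the points arrive in time order. The function names are placeholders.

```python
from typing import Iterable, List, Optional, Tuple

Point = Tuple[float, str]  # (timestamp in seconds, "UP" or "DOWN")


def goes_down(points: Iterable[Point]) -> List[float]:
    """Return the timestamps at which avail changed from UP to DOWN."""
    fired: List[float] = []
    prev: Optional[str] = None
    for ts, avail in points:
        if prev == "UP" and avail == "DOWN":
            fired.append(ts)
        prev = avail
    return fired


def stays_down(points: Iterable[Point], minutes: float, now: float) -> bool:
    """Return True if the resource has been continuously DOWN for at least `minutes`."""
    down_since: Optional[float] = None
    for ts, avail in points:
        if avail == "DOWN":
            if down_since is None:
                down_since = ts  # start of the current DOWN run
        else:
            down_since = None  # any other report ends the run
    return down_since is not None and now - down_since >= minutes * 60


if __name__ == "__main__":
    history = [(0, "UP"), (60, "DOWN"), (120, "DOWN"), (180, "UP"), (240, "DOWN")]
    print(goes_down(history))
    print(stays_down(history, minutes=5, now=540))
```

The "trickery" for the duration case is that something has to supply `now`: the condition becomes true through the passing of time, not through the arrival of a new data point.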

Alerting (activity-based)

So what if the external agent is not deployed? In that case a down avail will not be reported. Instead, only a series of up avails (like pings) may be reported, and then they will stop. There will begin a period of inactivity for the resource. It should be possible to alert on a period of inactivity (no avail metrics received).
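One way to picture inactivity alerting is to remember the last time each resource reported anything and then ask which ones have been silent too long. This is only a sketch. The class name, the threshold and the resource ids are invented for the example.

```python
from typing import Dict, List


class InactivityWatcher:
    """Tracks the most recent avail report per resource (hypothetical helper)."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence = max_silence_seconds
        self.last_seen: Dict[str, float] = {}

    def record(self, resource_id: str, timestamp: float) -> None:
        """Remember the newest report time, ignoring out-of-order reports."""
        prev = self.last_seen.get(resource_id)
        if prev is None or timestamp > prev:
            self.last_seen[resource_id] = timestamp

    def inactive(self, now: float) -> List[str]:
        """Return ids of resources with no report for longer than the threshold."""
        return sorted(
            rid for rid, ts in self.last_seen.items()
            if now - ts > self.max_silence
        )


if __name__ == "__main__":
    watcher = InactivityWatcher(max_silence_seconds=120)
    watcher.record("eap-1", 0)
    watcher.record("eap-1", 60)
    watcher.record("db-1", 200)
    print(watcher.inactive(now=250))
```

Note that only resources that have reported at least once can ever be flagged, which fits the idea that avail reporting is optional.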

Part 5: Persistence

Unlike RHQ and the RDB, using Hawk-Metrics (Cassandra) we want to avoid updates and read-then-write logic. Instead, the general approach is to just write, write, write. So, for avail the prior run-length-encoded storage model will likely be replaced with simple time-availType data points. So, it's quite possible to have UP, UP, UP... Not reporting avail for every resource will reduce the number of unnecessary writes. When fetched for a time interval, we can aggregate into an RLE form as desired, by combining consecutive data points of the same avail type.
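The read-side aggregation could look something like the following sketch, assuming the fetched data points are (timestamp, availType) pairs already sorted by time. The function name and the tuple layout of the result are assumptions, not anything Hawk-Metrics defines.

```python
from itertools import groupby
from typing import Iterable, List, Tuple

Point = Tuple[float, str]      # (timestamp, avail type)
Run = Tuple[float, float, str]  # (first timestamp, last timestamp, avail type)


def to_rle(points: Iterable[Point]) -> List[Run]:
    """Collapse consecutive data points of the same avail type into runs."""
    runs: List[Run] = []
    for avail, group in groupby(points, key=lambda p: p[1]):
        members = list(group)
        # A run spans from its first data point to its last data point.
        runs.append((members[0][0], members[-1][0], avail))
    return runs


if __name__ == "__main__":
    fetched = [(0, "UP"), (10, "UP"), (20, "UP"), (30, "DOWN"), (40, "UP")]
    for run in to_rle(fetched):
        print(run)
```

Here a run ends at its last data point. Whether a run should instead extend to the start of the next one is a presentation choice this sketch leaves open.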

As in RHQ, if a resource is set to DOWN, its descendants should likely also be set to DOWN.

Other: Caching

If we can implement some in-memory caching we may be able to do some more interesting things. For example, if we cache the current avail we may be able to limit writes to just changes in avail. Maybe more interesting, and not something we do in RHQ, if we have the resource hierarchy cached we may be able to efficiently do the following: when receiving an UP avail for a resource then we can set its ancestry to UP. In this way, a single UP from a leaf resource can implicitly set avail for several resources higher in the ancestry tree. This has the potential to save a lot of cycles on the feed side. But to do it efficiently we'd likely need both the hierarchy cached as well as the current avail.
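Here is a small sketch of both caching ideas together: writing only on change, and pushing an UP from a leaf up through its ancestors. It assumes the hierarchy is a tree available as a child-to-parent map. The class name and the resource ids are made up.

```python
from typing import Dict, List, Optional, Tuple


class AvailCache:
    """Caches current avail plus the resource hierarchy (hypothetical helper)."""

    def __init__(self, parent_of: Dict[str, Optional[str]]) -> None:
        self.parent_of = parent_of          # child id -> parent id (None at the root)
        self.current: Dict[str, str] = {}   # resource id -> last known avail

    def _set(self, resource_id: str, avail: str) -> bool:
        """Update the cache and return True only if the avail actually changed."""
        if self.current.get(resource_id) == avail:
            return False
        self.current[resource_id] = avail
        return True

    def report(self, resource_id: str, avail: str) -> List[Tuple[str, str]]:
        """Return the (resource id, avail) writes needed for this report."""
        writes: List[Tuple[str, str]] = []
        if self._set(resource_id, avail):
            writes.append((resource_id, avail))
        if avail == "UP":
            # An UP from a descendant implies every ancestor is UP as well.
            parent = self.parent_of.get(resource_id)
            while parent is not None:
                if self._set(parent, "UP"):
                    writes.append((parent, "UP"))
                parent = self.parent_of.get(parent)
        return writes


if __name__ == "__main__":
    cache = AvailCache({"server": None, "deployment": "server", "ejb": "deployment"})
    print(cache.report("ejb", "UP"))   # first report: three writes
    print(cache.report("ejb", "UP"))   # nothing changed: no writes
```

One thing the sketch makes visible: an UP from a leaf will also overwrite a DOWN that was reported on an ancestor, so we would need to decide whether that is what we want.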

Other: Proxy Metric

It may be useful to allow a different metric to "double" as an avail reporter. If metric X is being reported at a decent interval, it could possibly be flagged as an avail ping, basically saying UP in addition to whatever metric value it carries.
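A proxy metric could be as simple as a lookup that says "metric X vouches for resource Y". In this sketch the samples are assumed to be (metric id, timestamp, value) tuples, and the metric and resource ids are invented for the example.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, float, float]   # (metric id, timestamp, value)
AvailPoint = Tuple[str, float, str]  # (resource id, timestamp, avail)


def derive_avail(samples: Iterable[Sample], proxy_for: Dict[str, str]) -> List[AvailPoint]:
    """Emit an implicit UP for every sample of a metric flagged as an avail proxy."""
    return [
        (proxy_for[metric_id], ts, "UP")
        for metric_id, ts, _value in samples
        if metric_id in proxy_for
    ]


if __name__ == "__main__":
    incoming = [("heap.used", 10, 512.0), ("threads", 10, 42.0), ("heap.used", 20, 530.0)]
    # Only heap.used has been flagged as a proxy, for the hypothetical resource eap-1.
    print(derive_avail(incoming, {"heap.used": "eap-1"}))
```

The metric value itself is ignored. Only the fact that it arrived counts as the ping.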

Other: Inactivity

For those involved in the avail discussions you'll note we had discussed the idea of GREEN/YELLOW/RED status as periods of inactivity for a resource grew larger. I have decided to get away from this as something we'd ubiquitously track. The amount of work potentially involved in maintaining this as compared to its perceived usefulness didn't seem favorable. Instead, as you saw above, I basically redefined inactivity solely as a concept for alerting, and solely as the lack of an avail metric. This, I think, distills the prior idea to its essence, and should be doable in a much more efficient way.

Diagram:

I stole this from Heiko, it's basically the architecture for the example and discussion above. PAgent would be the light, external agent that performed DOWN avail reporting for the EAP server as a whole, for example. EAgent would be a feed embedded in EAP that reported server-level and descendant metrics and avail. The Server is basically where the avail would be processed/persisted/alerted on.

Summary:

So, there are some fundamental changes in approach here. And some questions to answer. Maybe most importantly, what do people think about not forcing/providing explicit availability on all (or probably the majority) of resources?

Any and all feedback appreciated.

1. Re: Availability

tsegismont Jan 14, 2015 12:44 PM (in response to jayshaughnessy)

The fundamental difference between avail and other metrics is that a down resource can't report itself down.

In RHQ, it was possible that a resource was reported down if it was present but misconfigured. DOWN could also mean "the server is running but one of the health checks reports a problem"

As in RHQ, if a resource is set to DOWN, its descendants should likely also be set to DOWN.

Sorry I'm late in the , what's a resource in hawkular alerts? How is it modeled? Is it a concept applicable even when alerts are not used in conjunction with inventory?

If we can implement some in-memory caching we may be able to do some more interesting things. For example, if we cache the current avail we may be able to limit writes to just changes in avail.

Isn't the way of storing availability an implementation detail of hkmetrics?
2. Re: Availability

rutlucas Jan 15, 2015 5:29 AM (in response to tsegismont)

In case it adds some relevant info: for the alerts PoC we are working on, we are following the design described in the thread "Thoughts on RHQ-Alerts aka Alerts 2.0 aka Wintermute".
It is high level and there are a lot of things to discuss, but I hope it can also help in the context of this thread.
3. Re: Availability

jayshaughnessy Jan 15, 2015 2:02 PM (in response to tsegismont)

In RHQ, it was possible that a resource was reported down if it was present but misconfigured. DOWN could also mean "the server is running but one of the health checks reports a problem"

You are right, of course. In RHQ it is totally possible to report yourself DOWN. I was considering only the case where the RHQ Agent's plugin container could not reach (i.e. connect to) the resource, meaning when it was physically DOWN. But if the resource was reachable, and getAvailability() was invoked, it could return UP or DOWN. And as you said, DOWN could mean that it determined it was functionally DOWN, as opposed to physically unreachable.

The good news is that I think this is still fine in the proposed model. A feed could report any resource as DOWN, including the top level (i.e. server) resource. The external avail agent would report DOWN only on unreachable resources. In other words, the two approaches would still complement each other, one for physically down and the other for functionally down. A proxy metric is also still OK; if configured, it acts only like an UP avail metric.

In the example, a running EAP server's embedded feed could report the server itself as UP or DOWN. If the EAP server crashed then only an external agent could report it DOWN.

Sorry I'm late in the , what's a resource in hawkular alerts? How is it modeled? Is it a concept applicable even when alerts are not used in conjunction with inventory?

Good question. A resource is not a concept in H-alerts at the moment and I don't think it will be. I think Alert Triggers will be abstracted away from resources, caring more about IDs without semantics. For example, a metric condition would care about an id for the metric, and a value for the metric, but not necessarily know what the metric is. At some point in the process, say after an alert is generated, it will be necessary to perform intelligent notifications, and that is maybe where we'll need to provide good context about the alert. We still need to think about that.

As for availability, from an alerting perspective we again would care only about the IDs and the values, and durations. And not necessarily about what the avail represents. That would be the divider between Inventory and Alerting. Inventory would know about availability and alerting but alerting would know only about availability, and not care what was actually being monitored for avail. In that way Avail may actually be a more specific condition for a general mechanism that may simply deal with String metric values.

Isn't the way of storing availability an implementation detail of hkmetrics?

Well, I don't know exactly. I guess H-metrics can define how it wants to store availability and again, leave the semantics to the clients, like Inventory and possibly Alerting. But given that Inventory is likely the primary user, I think it should accommodate it as best as possible and not be the "tail wagging the dog". Having said that, I expect that H-metrics would just want to write avail data points as I described, no? And let the pusher or puller deal with it.