2014

Yesterday I blogged about RHQ-Alerts, aka Alerts 2.0 (and who got the Wintermute in the subject?). We also have RHQ-Metrics, and Mazz wrote about RHQ-Audit and messaging. So let me show you the bigger picture that I have in mind and open it up for discussion.

 

But before I do this, let me first explain some motivation:

 

  • RHQ has some pretty good pieces that we want to open up for more general consumption by other projects. The current incarnation of RHQ itself is pretty monolithic and does not easily allow for separation.
  • Functionality is very synchronous; this is often not that visible because the GWT UI has all those async callbacks. This synchronicity often makes clients wait on e.g. database work and thus consumes resources that could otherwise be used differently. (An agent delivering metrics only needs to know that the server has received the data and will process it. It gains nothing from waiting until the data is finally stored.)
  • With all those new components bringing new APIs (REST, Java), how do we wire them together? And how can we make sure that data that is e.g. fed into RHQ-Metrics will also be available for alerting?
  • How can we (especially for testing purposes) reduce the system complexity and

 

rhq-next-server.png

 

The approach that I am depicting here loosely adheres to the reactive manifesto. It has a central message bus as the backbone and then a set of (remote) APIs feeding into this bus. In the illustration above, the dashed box depicts the "server" as opposed to more external things like the UI, agents or 3rd-party apps talking to the server.

On the bus we can have several queues and topics for various tasks. The individual components would connect to the bus "via their Java api" (*) to receive messages for processing but also to submit messages on their own for further processing.

  • Incoming metrics need to be processed by rhq-metrics for storage and rhq-alerts for alerting
  • RHQ-metrics could compute aggregates at regular points in time, store them, and also put them on the bus for further consumption by rhq-alerts
  • Audit messages would be stored away by RHQ-Audit
  • Alerts could be pushed via Websockets or Errai directly into the UI without the need for polling
  • As written in the RHQ-Alerts post, the CEP engine could create computed availability which is then put on the bus for further processing by a possible RHQ-Availability component
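
To make the flow in the list above a bit more concrete, here is a minimal in-process sketch in Python. It is not RHQ code: the `MessageBus` class, the topic names `metrics.raw` and `alerts`, and the 0.8 threshold are all made up for illustration, and a real setup would use an actual message broker instead of a dictionary of callbacks.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class MessageBus:
    """Tiny in-process stand-in for a real broker (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Any) -> None:
        # Topic semantics: every subscriber sees every message.
        for handler in list(self._subscribers[topic]):
            handler(message)


bus = MessageBus()
stored = []   # stands in for the rhq-metrics storage
alerts = []   # stands in for alerts pushed on to the UI


def check_threshold(metric: dict) -> None:
    # The alerting side puts its result back on the bus for other consumers.
    if metric["value"] > 0.8:  # threshold is an assumption
        bus.publish("alerts", {"metric": metric["name"], "value": metric["value"]})


# One incoming metric is consumed by both the storage and the alerting side.
bus.subscribe("metrics.raw", stored.append)
bus.subscribe("metrics.raw", check_threshold)
bus.subscribe("alerts", alerts.append)

bus.publish("metrics.raw", {"name": "mem.used.ratio", "value": 0.9})
bus.publish("metrics.raw", {"name": "mem.used.ratio", "value": 0.5})
```

Both metrics end up in `stored`, while only the first one produces an entry in `alerts`.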

 

A message bus also helps with scalability: it allows us to have e.g. multiple consumers on a queue that then process messages in turn. Those consumers can live on different hosts and thus spread the load. With multiple consumers on different hosts (and an HA message broker) a certain resilience can be achieved: if one consumer fails, the other consumers can take over (at reduced speed). This implies that there will be no need to run all the components (rhq-metrics, ...) in the same VM, but they still can do so in smaller setups.
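
The "multiple consumers on a queue" idea can be sketched with Python's standard `queue` and `threading` modules. This is only an illustration under simplifying assumptions: the consumers are threads in one process instead of processes on different hosts, `run_consumers` and `crash_after` are hypothetical names, and the simulated failure happens between two messages (a real broker would additionally redeliver a message whose consumer died while processing it).

```python
import queue
import threading


def run_consumers(messages, crash_after=None, consumers=2):
    """Let several consumers drain one queue; consumer-0 may 'fail' early."""
    work = queue.Queue()
    for message in messages:
        work.put(message)
    for _ in range(consumers):
        work.put(None)  # one stop marker per consumer, queued after the messages

    handled = []
    lock = threading.Lock()

    def consume(name, limit):
        count = 0
        while True:
            item = work.get()
            if item is None:
                return  # regular shutdown
            with lock:
                handled.append((name, item))
            count += 1
            if limit is not None and count >= limit:
                return  # simulated failure of this consumer

    threads = [
        threading.Thread(
            target=consume,
            args=(f"consumer-{i}", crash_after if i == 0 else None),
        )
        for i in range(consumers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled


handled = run_consumers(list(range(10)), crash_after=1)
```

Even though `consumer-0` stops after at most one message, every message is handled exactly once, because the remaining consumer keeps taking work from the same queue.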

 

Like the REST APIs, a message bus also allows clients to be written in languages other than Java, which is good for further adoption and integration.

 

I briefly mentioned the UI already. For the normal display parts, each component will provide AngularJS widgets that will need to be combined. Those widgets can still talk to their respective REST endpoints.

 

As you can see, this is so far only a very rough design. The box labelled "inventory" is not connected to any of the other boxes, and we certainly need to think hard about how to connect it and what a future inventory should look like (for example, we need relations between resources that go beyond parent-child; we should probably also introduce the notion of an "Application" as a set of such related resources, which would also play into alerting and SLAs).

 

*) What is meant here is that there will be tiny adapters from the bus to the respective Java API of the component in order to make the Java API the "standard". I wrote about this a few days ago in the light of rhq-metrics.
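
Such an adapter could look roughly like the following. The sketch is written in Python only for brevity (the real thing would wrap the component's Java API), and `MetricsApi`, `add_data`, `MetricsBusAdapter` and the JSON message layout are invented names, not the actual rhq-metrics interface.

```python
import json


class MetricsApi:
    """Hypothetical stand-in for a component's programmatic API."""

    def __init__(self) -> None:
        self.points = []

    def add_data(self, metric_id: str, timestamp: int, value: float) -> None:
        self.points.append((metric_id, timestamp, value))


class MetricsBusAdapter:
    """Translates a bus message into a call on the component API."""

    def __init__(self, api: MetricsApi) -> None:
        self._api = api

    def on_message(self, payload: str) -> None:
        # The adapter only unpacks the message; all logic stays behind the API.
        data = json.loads(payload)
        self._api.add_data(data["id"], int(data["timestamp"]), float(data["value"]))


api = MetricsApi()
adapter = MetricsBusAdapter(api)
adapter.on_message('{"id": "cpu.load", "timestamp": 1400000000, "value": 0.75}')
```

The point of keeping the adapter this thin is that the API stays the single place where the behaviour is defined, no matter whether a call arrives via the bus, REST or directly.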

We have been thinking for a long time about how to bring the alerting from RHQ

  • to a new level
  • to other projects

 

For those who are less familiar with RHQ, I'll quickly describe the alerting possibilities: RHQ can alert on incoming metrics from monitoring by comparing the values with thresholds, possibly combining multiple metrics of a managed resource. Similarly, it is possible to trigger alerts on the outcome of resource operations, matched text in events (SNMP traps, logfiles, ...), configuration changes and also changes in availability. Alerting can happen on single resources or groups of resources, and it is possible to define alerts on a template level, so that freshly added resources are directly enrolled into alerting. RHQ alerts have a few sorts of dampening to suppress duplicates when a problem persists, and so on. For more information, you best consult the RHQ wiki.
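
As a rough illustration of thresholds combined with dampening, here is a small Python sketch. It is not how RHQ implements it: the `ThresholdRule` class and the limits are hypothetical, and the dampening shown (fire once, then stay quiet until the condition clears) is just one simple variant picked for the example.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """Fires when both metrics exceed their limit and the rule is not dampened."""

    cpu_limit: float
    mem_limit: float
    dampened: bool = False

    def evaluate(self, cpu: float, mem: float) -> bool:
        violated = cpu > self.cpu_limit and mem > self.mem_limit
        if not violated:
            self.dampened = False  # problem cleared, re-arm the rule
            return False
        if self.dampened:
            return False  # problem persists, suppress the duplicate
        self.dampened = True
        return True


rule = ThresholdRule(cpu_limit=0.9, mem_limit=0.8)
samples = [(0.95, 0.85), (0.97, 0.90), (0.50, 0.90), (0.95, 0.85)]
fired = [rule.evaluate(cpu, mem) for cpu, mem in samples]
```

The second sample is suppressed because the problem persists; after the condition clears with the third sample, the fourth one fires again.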

 

Now while this is already quite powerful, there are numerous possibilities for enhancement like:

 

  • comparing two metrics with each other (used mem is more than 80% of total mem)
  • comparing metrics from different resources (if the load balancer is down or one of my 3 app servers in the cluster)
  • a temporal component in reasoning (if it is night and only 1/3 of the servers is down, send email; if it is day, always send a text message; default: send email and a text message and make a phone call)
  • finding an outlier in a group (if the load on one of my 3 boxes in the cluster is higher than on the others).

The RHQ-Wiki has a much bigger list.
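
Two of the items above, comparing two metrics and finding an outlier in a group, can be sketched in a few lines of Python. The function names are made up, and "higher than the others" is interpreted here as "more than twice the group median", which is purely an assumption for the example.

```python
from statistics import median


def memory_pressure(used: float, total: float, ratio: float = 0.8) -> bool:
    """Compare two metrics with each other: used mem > 80% of total mem."""
    return used > ratio * total


def outliers(loads: dict, factor: float = 2.0) -> list:
    """Return hosts whose load is far above the rest of the group."""
    mid = median(loads.values())
    return sorted(host for host, load in loads.items() if load > factor * mid)


cluster = {"app1": 1.0, "app2": 1.2, "app3": 4.0}
suspicious = outliers(cluster)
```

With the loads above, only `app3` is reported, because its load is more than twice the median of 1.2.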

 

As we have seen with RHQ-Metrics, there is a similar need for alerting in other projects, which brings the same questions and thoughts on how to make alerting available for them as well. Currently the thoughts go in the direction of also extracting alerting into its own project, which can then be pulled into the next generation of RHQ and/or be used independently of RHQ.

 

The following graphic shows a possible architecture of this Alerting 2.0.

 

rhq-alert-arch.png

The core would be a CEP (Complex Event Processing) engine that offers a Java and a REST API (and perhaps messaging/JMS, feeding into the Java API). The engine has a rule store, which could be a place in the Cassandra database of RHQ-Metrics (each rule being a document) or, for example, just text files in git.

Modifications of the rules would work via the Engine and its APIs and would also be made available for auditing.

 

As with the current RHQ, there would be pluggable notification senders that do the sending of notification emails, text messages, SNMP traps and so on; we should in fact investigate how to best keep the existing senders or re-use them with minimal effort.

 

Like in RHQ-Metrics, we will create and provide AngularJS directives for alert definition editors to be re-used. Those directives may directly or indirectly also use other Angular directives, e.g. from RHQ-Metrics, to show a mini-graph of the metric so that the user can see the past values while defining the alert.

 

The trickiest part is probably a good composition of rules. We would support what we have today, which we could perhaps name "1-level" rules: they follow an "if (x and y) and not dampened then fire" pattern. In the future we would probably have more levels of rules like in this next (simple) diagram:

 

rhq-alert-rules.png

Input events would be correlated and then further processed into an alert and also into (resource) state. Resource state is similar to availability. We could even use this as the computed availability, where the incoming availability report from the resource is not taken literally, but processed by the engine. This makes it possible to define computed availability as down when either the resource cannot be pinged or requests take a minute to be processed.
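
The computed availability from the last sentence could be expressed like this. It is a sketch only: the function name, the `"UP"`/`"DOWN"` strings and the use of `None` for "no response time known" are assumptions, and in the real design this logic would live in a rule of the engine, not in a hand-written function.

```python
from typing import Optional


def computed_availability(
    pingable: bool,
    response_time_s: Optional[float],
    slow_after_s: float = 60.0,
) -> str:
    """Derive availability from several inputs instead of one raw report."""
    if not pingable:
        return "DOWN"
    if response_time_s is not None and response_time_s >= slow_after_s:
        return "DOWN"  # reachable, but too slow to count as available
    return "UP"


states = [
    computed_availability(True, 0.2),
    computed_availability(True, 75.0),
    computed_availability(False, None),
]
```

A resource that answers pings but needs 75 seconds per request is treated as down here, which a plain availability report would not capture.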

 

A further rule can then take the time of day into account and decide whether the new state should result in a notification or not. The same applies to on-duty plans, where the notification is sent to the person on duty and not "randomly" to the whole operations group.
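
Using the night/day example from the list of possible enhancements further up, such a time-of-day rule might be sketched as follows. The boundaries of "night" (22:00 to 06:00) and the function name are assumptions, and the "default" case is read here as "night, and more than a third of the servers down".

```python
def notification_channels(hour: int, servers_down: int, servers_total: int) -> list:
    """Pick notification channels based on time of day and the extent of the outage."""
    is_night = hour >= 22 or hour < 6  # assumed definition of "night"
    if not is_night:
        return ["text"]
    if servers_down * 3 <= servers_total:
        return ["email"]  # at night, a small outage can wait until morning
    return ["email", "text", "phone"]


night_small = notification_channels(hour=3, servers_down=1, servers_total=3)
day_small = notification_channels(hour=14, servers_down=1, servers_total=3)
night_large = notification_channels(hour=3, servers_down=2, servers_total=3)
```

An on-duty plan would plug in at the same place: instead of returning only channels, the rule would also look up who is on duty at that hour.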

 

Depending on the severity, another rule can define escalation handling: if the alert sent is not acknowledged within a certain amount of time, the engine would re-send the alert to the operations person or to a fallback person.
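
A possible shape of such an escalation check, again only as a hedged Python sketch: `SentAlert`, `escalate` and the policy "first remind the same person, then go to the fallback" are my assumptions for the example, not a decided design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SentAlert:
    sent_at: float          # seconds, e.g. a Unix timestamp
    recipient: str
    acknowledged: bool = False
    escalated: bool = False


def escalate(alert: SentAlert, now: float, timeout_s: float, fallback: str) -> Optional[str]:
    """Return who should be notified again, or None if nothing is due yet."""
    if alert.acknowledged or now - alert.sent_at < timeout_s:
        return None
    # First remind the original recipient, after that go to the fallback person.
    target = fallback if alert.escalated else alert.recipient
    alert.escalated = True
    alert.sent_at = now
    return target


alert = SentAlert(sent_at=0.0, recipient="ops-on-duty")
steps = [
    escalate(alert, now=100.0, timeout_s=300.0, fallback="team-lead"),
    escalate(alert, now=300.0, timeout_s=300.0, fallback="team-lead"),
    escalate(alert, now=600.0, timeout_s=300.0, fallback="team-lead"),
]
alert.acknowledged = True
after_ack = escalate(alert, now=2000.0, timeout_s=300.0, fallback="team-lead")
```

Nothing happens before the timeout, then the on-duty person is reminded, then the fallback person is notified; once the alert is acknowledged, no further notifications are due.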

 

Ah and if you ask what this "Wintermute" in the subject is: go and read https://en.wikipedia.org/wiki/Neuromancer :-)