Thoughts on Inventory for RHQ.next

Posted by pilhuhn Sep 25, 2014

[ We already had some discussions in the last few days, but also in the past - have a look at the "Relationship service", "Dependent Resources" and also "Design agentless management" in the RHQ wiki. ]

A resource in the following is regarded in the RHQ way as a managed entity. This may be represented by 0..n resources in other systems.

RHQ

Current RHQ uses a hierarchical inventory model, which is basically a forest of trees whose roots are resources of category "platform" (= host).

Each agent that runs on a host to manage the resources on it maps to such a platform resource.

A resource is defined by a static resource type inside a plugin. The resource type defines the capabilities of a resource via facets (metrics, operations, ...).

This obviously has its drawbacks:

RHQ assumes in many places that an agent is managing the resource, neglecting the fact that resources could also be created via the REST-api without any agent.
The model has no real way of expressing a host with many virtual guests that are in fact also platforms. The host and the guests can all appear in the forest, but there is no way to express that if the host goes down, all guests go down as well.
The hierarchical model does not allow grouping things that logically belong together but have different parents.
Take the example of an application that consists of a load balancer, three application servers and a database.
In this example the database is also shared with another application.
Resources are relatively static. The developer needs to know at development time which metrics or operations a resource will have. This makes it impossible to just import an arbitrary MBean or SNMP entity, as the names and numbers of its properties are not known at plugin design time.

A strong trait of the current model is its metadata, which allows describing a resource and its properties with textual descriptions, but also with e.g. units, default values or lists of allowed values.

Requirements for a future model would be:

allow to provide metadata for resource and
allow to express parent child relations as today
allow to group (similar) resources together as today
allow to group resources together into applications. Note that this is different from what we see today as a mixed group, as there are dependencies, e.g. in the area of availability, where the application can still be available if one application server fails. The resource model must be able to show those dependencies
allow to statically define resource types with metadata
allow to define resource (types) on the fly from an editor or by querying e.g. an mbean server
provide out of the box support for multiple tenants. Tenants can be completely different users / customers, but could also be different organizational units

The above basically consists of two larger areas:

defining a resource
defining relationships between resources
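
The application requirement above (an application can still be available if one application server fails) can be illustrated with a minimal sketch. The function name, the role names and the "minimum number of members that must be up" rule are assumptions made for illustration, not part of any existing RHQ model.

```python
def application_available(components, availability):
    """Return True if every role has enough members that are up.

    components:   role name -> (list of member resources, minimum that must be up)
    availability: resource name -> True (up) / False (down); unknown counts as down
    """
    for role, (members, required_up) in components.items():
        up = sum(1 for member in members if availability.get(member, False))
        if up < required_up:
            return False
    return True


# Hypothetical application: one load balancer, three app servers, one database.
shop = {
    "load balancer": (["lb"], 1),
    "app servers": (["as1", "as2", "as3"], 1),
    "database": (["db"], 1),
}

all_up = {"lb": True, "as1": True, "as2": True, "as3": True, "db": True}
one_server_down = dict(all_up, as2=False)
database_down = dict(all_up, db=False)

print(application_available(shop, all_up))
print(application_available(shop, one_server_down))
print(application_available(shop, database_down))
```

A plain mixed group cannot express this, because it carries no information about which members are redundant and which are not.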

The next question is which properties a resource must have. As so often, we cannot know the answer in advance. Most probably these attributes should be present:

(technical) unique identifier(s) like a UUID, with a namespace (e.g. "{rhq}dead-beef" or "{fabric8}c0de-cafe"). This allows interacting with other systems and linking a resource in RHQ with the resource in the other system
display name
human readable description
owner / tenant: This is to limit access to resources by users of a tenant.
list of key-value pairs, that provide additional properties
resource URL that specifies how to reach the resource to operate on it. We need to define custom URL schemes like dmr:// if they do not yet exist.
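
To make the attribute list more tangible, here is a minimal sketch of how such a resource could be modelled. All class and field names are hypothetical, and the id syntax is assumed to be "{namespace}local-id" as in the "{rhq}dead-beef" example; nothing here is an agreed-upon format.

```python
import re
from dataclasses import dataclass, field

# Assumed id syntax: "{namespace}local-id", e.g. "{rhq}dead-beef".
_ID_PATTERN = re.compile(r"^\{(?P<namespace>[^{}]+)\}(?P<local>.+)$")


@dataclass(frozen=True)
class ResourceId:
    namespace: str
    local_id: str

    @classmethod
    def parse(cls, text):
        match = _ID_PATTERN.match(text)
        if match is None:
            raise ValueError(f"not a namespaced id: {text!r}")
        return cls(match.group("namespace"), match.group("local"))

    def __str__(self):
        return "{" + self.namespace + "}" + self.local_id


@dataclass
class Resource:
    ids: list                # one ResourceId per system that knows the resource
    display_name: str
    description: str
    tenant: str              # owner, used to limit access
    url: str                 # how to reach the resource, e.g. a dmr:// URL
    properties: dict = field(default_factory=dict)

    def id_in(self, namespace):
        """Return the id this resource has in the given system, or None."""
        for resource_id in self.ids:
            if resource_id.namespace == namespace:
                return resource_id
        return None


eap = Resource(
    ids=[ResourceId.parse("{rhq}dead-beef")],
    display_name="EAP6 on host A",
    description="Application server",
    tenant="acme",
    url="dmr://hosta.example.com:9999",   # made-up address
    properties={"version": "6.3"},
)
print(eap.id_in("rhq"))
```

Keeping the ids as a list is what allows the same resource to be linked to its counterpart in another system.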

Relationships between resources

Relationships in the model can be seen as a graph with the resources being nodes and links pointing to the other resource. Links will get names or tags identifying the relationship to the other resource. Take the following image as an example:

Here we have an application that consists of 4 components. We have links denoting the application (is "application" itself a resource?) with relationships "talks to". Then there are "runs on" relations as well as "includes". Depending on what is needed, we need to filter on the tags/names of the links.

There are already some relations depicted above. Another one could be "starts before" to indicate that the DB must be up before starting the application servers.

It must be possible to insert new resources into the graph. E.g. in the above drawing it must be possible to insert a third application server, or an Apache httpd between the load balancer and the application servers.

While discovery should find out as much as possible for those relations, we need to provide a manual way of manipulating the links.
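
A minimal sketch of such a graph with named links, including filtering by link name and inserting a node between existing ones. The class, its methods and the node names are made up for illustration and only loosely follow the example application described above.

```python
from collections import defaultdict


class ResourceGraph:
    """Directed graph whose links carry a relationship name."""

    def __init__(self):
        # source -> set of (relation, target)
        self._links = defaultdict(set)

    def link(self, source, relation, target):
        self._links[source].add((relation, target))

    def unlink(self, source, relation, target):
        self._links[source].discard((relation, target))

    def targets(self, source, relation):
        """Filter the links of a node by relationship name."""
        return sorted(t for r, t in self._links[source] if r == relation)

    def insert_between(self, source, relation, new_node, targets):
        """Re-route the links source -> targets through new_node."""
        for target in targets:
            self.unlink(source, relation, target)
            self.link(new_node, relation, target)
        self.link(source, relation, new_node)


graph = ResourceGraph()
graph.link("load balancer", "talks to", "app server 1")
graph.link("load balancer", "talks to", "app server 2")
graph.link("app server 1", "talks to", "database")
graph.link("app server 2", "talks to", "database")
graph.link("app server 1", "runs on", "host A")

# Manually put an httpd between the load balancer and the app servers.
graph.insert_between("load balancer", "talks to", "httpd",
                     ["app server 1", "app server 2"])

print(graph.targets("load balancer", "talks to"))
print(graph.targets("httpd", "talks to"))
```

The same `link` and `unlink` calls would serve discovery as well as manual editing of the relations.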

Relationship of the resource / inventory model to other management systems

RHQ.next will not (always) live alone; users may have other systems deployed, like Fabric8, Red Hat Satellite, ManageIQ or others. In such a scenario, there may be a need to query the other system to gather data about a resource or to ask the other system to provision new data onto the resource.

As stated above, we need a way to identify the resource in the other system and mark it as "the same". We probably also need an indicator which system "owns" the resource and if multiple systems have data, an idea what data we can get from where.

With this information at hand, it will then be possible to e.g. tell Satellite to deploy a new RPM on what is known as "Platform A" inside RHQ and then to tell Fabric8 to deploy an additional web application onto the EAP6 resource.

Model elements

What model elements do we need? Above we have identified the Resource and the Application.

Groups are a strong point in RHQ as well. Do we need to have explicit Groups, or can they just be seen as a special graph where all resources have a "member of" link pointing to what? The current DynaGroup language is certainly a plus that we should keep (and extend).

Additional (meta) data

The above only talks about resources and how they interact with each other. In reality resources do not exist inside management systems just for the joy of it, but because there are actions to be performed on them: run operations, provision data, take measurements and show their availability status. All of this needs additional (meta) data.

Now we have projects like RHQ Metrics that are used to store (and display) metrics. The inventory needs to know what metrics are supposed to go there and may also need to specify collection intervals ("schedules") for metric taking (this could also go into RHQ Metrics though). I guess individual metrics also need to have a URL (REST endpoint of RHQ Metrics) to identify and address them. The metadata of the metric itself (units, trendsup/dynamic, derived metrics) needs to live in RHQ Metrics though, as e.g. display code needs to access it from there.

The same holds for alerting. Inventory needs to have a link to the alert definitions per resource, but the individual alert definition will live in the alert engine. The engine will work on individual incoming data items (e.g. metrics, logfile lines, availability records). When creating the final alert object it may or may not need input from the Inventory, e.g. to include the resource name in the alert message.

Security / Access control

Access to resources and to facets like operations or recorded metrics may need to be protected.

Current RHQ already has a pretty good RBAC system, EAP another one.

Access roles of a user are per tenant id. A user can have access to multiple tenants though. Imagine individual OpenShift users as individual tenants and then an administrator that is monitoring the OpenShift platform itself which should be able to see metrics from multiple tenants.

For operations we need to make sure that the access control is per operation, as there are more dangerous ones ("reboot") that only users with elevated rights should be able to trigger. We need to investigate if such differentiation in rights should be applied to other areas as well (e.g. user can only see certain metrics of a resource).

I guess that a relatively generic setup can be used here

{namespace}:item-id:(tenant,access level)+

where the namespace defines the item class (metrics, alert, alert def, operation, ...), the item-id identifies the item, and the rest is the list of access rights.
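
A minimal sketch of how such an entry could be parsed and checked. It assumes that the braces around "namespace" in the pattern above are only placeholder markers, that ":" separates the parts, and that each grant is written as "(tenant,level)"; the level names are made up as well.

```python
import re
from typing import NamedTuple


class AccessEntry(NamedTuple):
    namespace: str   # item class, e.g. "operation"
    item_id: str
    grants: tuple    # of (tenant, access level) pairs


# Assumed concrete syntax: namespace:item-id:(tenant,level)(tenant,level)...
_ENTRY = re.compile(r"^([^:]+):([^:]+):((?:\([^,()]+,[^,()]+\))+)$")
_GRANT = re.compile(r"\(([^,()]+),([^,()]+)\)")


def parse_access_entry(text):
    match = _ENTRY.match(text)
    if match is None:
        raise ValueError(f"malformed access entry: {text!r}")
    grants = tuple((tenant.strip(), level.strip())
                   for tenant, level in _GRANT.findall(match.group(3)))
    return AccessEntry(match.group(1), match.group(2), grants)


def is_allowed(entry, tenant, level):
    """True if the tenant has been granted exactly this access level."""
    return (tenant, level) in entry.grants


reboot = parse_access_entry("operation:reboot:(acme,admin)(ops,admin)")
print(is_allowed(reboot, "acme", "admin"))
print(is_allowed(reboot, "acme", "view"))
```

A real implementation would probably also want an ordering of access levels, so that a higher level implies the lower ones.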

Inventory storage

While it is probably a bit too early to talk about future storage, it is clear that the current hierarchical model inside a relational database with recursive queries has its drawbacks.

Thoughts on RHQ.next (server) architecture

Posted by pilhuhn Aug 14, 2014

Yesterday I blogged about RHQ-Alerts aka Alerts 2.0 (and who got the Wintermute in the subject?), we have RHQ-Metrics and Mazz wrote about RHQ-Audit and messaging. So let me show you a bigger picture that I have in mind and open it up for discussion.

But before I do this, let me first explain some motivation:

RHQ has some pretty good pieces that we want to open up for more general consumption by other projects. The current incarnation of RHQ itself is pretty monolithic and does not easily allow for separation.
Functionality is very synchronous - this is often not that visible because the GWT-UI has all those async callbacks. This synchronicity often makes clients wait on e.g. database work, thus consuming resources that could otherwise be used differently (an agent delivering metrics only needs to know that the server has received the data and will process it. It gains nothing from waiting until they are finally stored).
With all those new components all bringing new apis (REST, Java), how do we wire them together? And how can we make sure that data that is e.g. fed into RHQ-Metrics will also be available for alerting?
How can we (especially for testing purposes) reduce the system complexity and

The approach that I am depicting here loosely adheres to the reactive manifesto. It has a central message bus as the backbone and then a set of (remote) APIs feeding into this bus. In the above illustration, the dashed box depicts the "server" as opposed to more external things like UI, agents or 3rd-party apps talking to the server.

On the bus we can have several queues and topics for various tasks. The individual components would connect to the bus "via their Java api" (*) to receive messages for processing but also to submit messages on their own for further processing.

Incoming metrics need to be processed by rhq-metrics for storage and rhq-alerts for alerting
RHQ-metrics could compute aggregates at regular points in time, store them, and also put them on the bus for further consumption by rhq-alerts
Audit messages would be stored away by RHQ-Audit
Alerts could be pushed via Websockets or Errai directly into the UI without the need for polling
As written in the RHQ-Alerts post, the CEP engine could create computed availability which is then put on the bus for further processing by a possible RHQ-Availability component
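
The fan-out described in the list above can be sketched with a tiny in-process publish/subscribe bus. This is only a synchronous stand-in to show the wiring; a real setup would use an actual message broker, and the topic names, handler names and the 0.8 threshold are made up.

```python
from collections import defaultdict


class MessageBus:
    """Synchronous in-process stand-in for a message bus with topics."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Every subscriber of the topic gets its own copy of the message.
        for handler in list(self._subscribers[topic]):
            handler(dict(message))


bus = MessageBus()
stored = []      # stands in for the rhq-metrics storage
alerts = []      # stands in for whatever consumes alerts, e.g. the UI


def store_metric(metric):
    stored.append(metric)


def check_threshold(metric):
    # A component can also submit messages of its own for further processing.
    if metric["value"] > 0.8:
        bus.publish("alerts", {"resource": metric["resource"],
                               "reason": "value above 0.8"})


bus.subscribe("metrics", store_metric)
bus.subscribe("metrics", check_threshold)
bus.subscribe("alerts", alerts.append)

bus.publish("metrics", {"resource": "as1", "name": "load", "value": 0.5})
bus.publish("metrics", {"resource": "as1", "name": "load", "value": 0.93})

print(len(stored), len(alerts))
```

The sender of a metric does not need to know which components consume it, which is what makes it possible to add further consumers later.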

A messaging bus also helps with scalability, as it allows e.g. having multiple consumers on a queue that then process messages in turn. Those consumers can live on different hosts and thus spread the load. With multiple consumers on different hosts (and an HA message broker) a certain resilience can be achieved: if one consumer fails, the other consumers can take over (at reduced speed). This implies that there will be no need to run all the components (rhq-metrics, ... ) in the same VM, but they can still do so in smaller setups.

Like REST APIs, a message bus also allows clients to be written in languages other than Java, which is good for further adoption and integration.
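
The idea of several consumers sharing one queue can be sketched with threads and an in-memory queue. This only illustrates the pattern inside one process; with a real broker the consumers would live in separate processes or on separate hosts. The function and consumer names are made up.

```python
import queue
import threading


def run_consumers(messages, consumer_count):
    """Let several consumers drain one queue; return (consumer, message) pairs."""
    work = queue.Queue()
    for message in messages:
        work.put(message)

    processed = []
    lock = threading.Lock()

    def consume(name):
        while True:
            try:
                message = work.get_nowait()
            except queue.Empty:
                return          # nothing left to do
            with lock:
                processed.append((name, message))

    threads = [threading.Thread(target=consume, args=(f"consumer-{i}",))
               for i in range(consumer_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


result = run_consumers(range(20), consumer_count=3)
print(len(result))
```

Each message is handled by exactly one consumer; which consumer that is depends on scheduling and is not predictable.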

I briefly mentioned the UI already. For the normal display things each component will provide AngularJS widgets that will need to be combined. Those widgets can still talk to their respective Rest-Endpoints.

As you can see, this is so far only a very rough design - the box labelled "inventory" is not connected to any of those other boxes, and we certainly need to think hard about how to connect it and what a future inventory should look like (for example we need relations between resources that go beyond parent-child; we should also probably introduce the notion of "Application" as a set of such related resources; this would also play into alerting and SLAs).

*) What is meant here is that there will be tiny adapters from the bus to the respective java-api of the component in order to make the java api the "standard". I've written about this a few days ago in the light of rhq-metrics.

Thoughts on RHQ-Alerts aka Alerts 2.0 aka Wintermute

Posted by pilhuhn Aug 13, 2014

We have been thinking for a long time how to bring the Alerting from RHQ

to a new level
to other projects

For those who are less familiar with RHQ, I'll quickly describe the alerting possibilities: RHQ allows alerting on incoming metrics from monitoring and comparing the values with thresholds, possibly combining multiple metrics of a managed resource. Similarly, it is possible to trigger alerts on the outcome of resource operations, matched text in events (SNMP traps, logfiles, ..), configuration changes and also changes in availability. Alerting can happen on single resources or groups of resources, and it is possible to define alerts on a template level, so that freshly added resources are directly enrolled into alerting. RHQ alerts have a few sorts of dampening to suppress duplicates when a problem persists and so on. For more information, you best consult the RHQ wiki.

Now while this is already quite powerful, there are numerous possibilities for enhancement like:

comparing two metrics with each other (used mem is greater than 80% of total mem)
comparing metrics from different resources (if the load balancer is down or one of my 3 app servers in the cluster)
temporal component in reasoning (if night and only 1/3 servers is down send email; if day always send text message; default: send email and text message and make a phone call)
finding an outlier in a group (if the load on one of my 3 boxes in the cluster is higher than on the others).

The RHQ-Wiki has a much bigger list.
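
As an illustration of the outlier case from the list above, here is a minimal sketch. The rule used (a value counts as an outlier if it is more than `factor` times the median of the other group members) and the function name are assumptions chosen for the example, not an existing RHQ feature.

```python
from statistics import median


def find_outliers(loads, factor=2.0):
    """Return the names whose value exceeds factor * median of the others."""
    outliers = []
    for name, value in loads.items():
        others = [v for n, v in loads.items() if n != name]
        if others and value > factor * median(others):
            outliers.append(name)
    return sorted(outliers)


cluster = {"box1": 0.9, "box2": 1.1, "box3": 4.0}
print(find_outliers(cluster))
```

Such a rule needs the values of all group members at the same time, which is exactly what the current per-resource conditions cannot provide.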

As we have seen with RHQ-Metrics, there is a similar need for alerting in other projects, which brings up the same questions and thoughts on how to make Alerting available for them as well. Currently the thoughts go in the direction of also extracting alerting into its own project, which can then be pulled into the next generation of RHQ and/or be used independently of RHQ.

The following graphic shows a possible architecture of this Alerting 2.0.

The core would be a CEP (Complex Event Processing) engine that offers a Java and REST-api (and perhaps messaging/JMS, feeding into the Java-api). The engine has a rule store, which could be a place in the Cassandra database of RHQ-metrics (each rule being a document) or for example just text files in git.

Modifications of the rules would work via the Engine and its APIs and would also be made available for auditing.

As with the current RHQ, there would be pluggable notification senders that do the sending of notification emails, text messages, SNMP traps and so on; we should in fact investigate how to best keep the existing senders or re-use them with minimal effort.

Like in RHQ-metrics, we will create and provide AngularJS directives for Alert-Definition-Editors to be re-used. Those directives may directly or indirectly also use other Angular directives from e.g. RHQ-metrics to show a mini-graph of the metric, so that the user can see the past values while defining the alert.

The most tricky part is probably a good composition of rules. We would support what we have today, which we could perhaps name "1-level" rules: they follow a "if (x and y) and not dampened then fire" pattern. In the future we would probably have more levels of rules like in this next (simple) diagram:

Input events would be correlated and then further processed into an alert and also into (resource) state. Resource state is similar to availability. We could even use this as the computed availability, where the incoming availability report from the resource is not taken literally, but processed by the engine. This allows defining computed availability as down when either the resource cannot be pinged or requests take a minute to be processed.
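
The computed availability example can be written down as a small function. This is only a sketch: the function name, the "UP"/"DOWN" strings and the way the inputs are passed in are assumptions; in the envisioned design this logic would be a rule inside the engine.

```python
def computed_availability(ping_ok, response_seconds, slow_threshold=60.0):
    """Derive availability from several inputs instead of one report.

    ping_ok:          True if the resource answered a ping
    response_seconds: duration of the last request, or None if not measured
    slow_threshold:   response time from which the resource counts as down
    """
    if not ping_ok:
        return "DOWN"
    if response_seconds is not None and response_seconds >= slow_threshold:
        return "DOWN"
    return "UP"


print(computed_availability(True, 0.2))
print(computed_availability(True, 75.0))
print(computed_availability(False, None))
```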

A further rule can then take the time of day into account and see if the new state should end up in a notification to be created or not. Similar for any on-duty plans, where the notification is sent to the person on duty and not "randomly" to the whole operations group.

Depending on the severity another rule can define escalation handling: if the alert sent is not acknowledged within a certain amount of time, the engine would re-send the alert to the operation person or to a fallback person.
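
A minimal sketch of such an escalation rule, assuming a simple record per sent notification. The class, the field names and the recipients are hypothetical, and time is passed in as plain seconds to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SentAlert:
    recipient: str
    sent_at: float            # seconds, on any monotonic scale
    acknowledged: bool = False


def escalate(alert, now, timeout, fallback) -> Optional[SentAlert]:
    """Return a re-sent alert for the fallback person, or None if not needed."""
    if alert.acknowledged or now - alert.sent_at < timeout:
        return None
    return SentAlert(recipient=fallback, sent_at=now)


first = SentAlert(recipient="on-duty", sent_at=0.0)

print(escalate(first, now=300.0, timeout=900.0, fallback="team-lead"))
print(escalate(first, now=1200.0, timeout=900.0, fallback="team-lead"))
```

The severity mentioned above could be mapped to different timeout values, so that critical alerts escalate faster.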

Ah and if you ask what this "Wintermute" in the subject is: go and read https://en.wikipedia.org/wiki/Neuromancer :-)

Announcing rhq-metrics

Posted by pilhuhn Apr 7, 2014

Over the past several weeks there have been discussions with other projects including Fabric8 and AeroGear who have a common need for metrics (storage, retrieval, aggregation, etc.) as well as graphs and charts. The monolithic structure and build process of RHQ makes it problematic for other projects to consume and reuse the metrics and graphing features of RHQ.

For this reason we have decided to break out the metrics backend into a separate project in an effort to make it more easily consumable by other projects. This project will provide APIs for storing, retrieving, and aggregating metric data. It will also include AngularJS directives to access and graph the data.

This new project is called RHQ-Metrics for now. RHQ itself will also be a consumer of this project, just as it is of other 3rd party libraries. RHQ-Metrics will allow you to use an existing Cassandra cluster or one provided by the project for getting started quickly. For small embedded applications and for testing, we also envision offering a purely in-memory data store that keeps data for a limited time series.

To make this effort a success we have created a forum at https://community.jboss.org/en/rhq/rhq-metrics where we will start gathering requirements and post additional project information. The project source code will live at https://github.com/rhq-project/rhq-metrics.

RHQ source is now on GitHub

Posted by pilhuhn Feb 28, 2014

It is my pleasure to announce that the source code of RHQ has been migrated over to GitHub as promised in the last "state of the union" posting.

This will now allow the community to more easily fork the repository and submit pull-requests for bugfixes and new features.

Thanks goes to Stefan Negrea for running the migration.

The repository can be found at rhq-project/rhq

The old existing repository on FedoraHosted has been set to read-only mode.

RHQ state of the union address, January 2014

Posted by pilhuhn Jan 23, 2014

[ This is the first of a series of postings describing where we stand with RHQ and where we may want to go to. ]

As many of you already know, RHQ is the upstream of JBoss Operations Network (JBoss ON) and so RHQ is following the requirements of JBoss ON.

RHQ 4.9 was released in September 2013, and another release is due. That next release will include some performance enhancements on inventory syncing and agent footprint as well as a lot of fixes since RHQ 4.9. It is supposed to happen within the next month.

The RHQ team recently had a face to face meeting in Brno at the same time that the WildFly team was there as well. There are a few photos available ([1] and [2]), but we failed to take an all-team photo.

Some of the outcomes of that meeting are:

RHQ source code will definitively be moved from FedoraHosted to GitHub in the near future.
RHQ will in the future use the JIRA instance at jboss.org instead of Bugzilla (timing unknown at this point).

Those two actions will make it easier to contribute in the future if you already have an account at jboss.org.

From a technical standpoint, we will tackle the following in the near to mid term - please comment on the respective wiki pages:

Audit Subsystem [3]
"Canned DynaGroup expressions" [4]
Add support for https connections of the as7-plugin
Add support for Bundles and Drift for the Domain mode of as7/WildFly
Provide a mechanism for out-of-band signaling of state and events from the agent to the server [7] (the current proposal on the wiki is a first draft to illustrate the idea and needs much more work).
Implement signaling of as7/WildFly "dirty" state over the mechanism in the previous item

Of course we were also talking about the longer term and here mostly about two items:

UI: The current dependency on the SmartGWT widget set becomes more and more problematic as SmartGWT is not able to keep up with all the browser changes. In addition, the end of life of GWT DevMode is also reducing our ability to further improve the UI.

We have thus decided to investigate other options like using Angular.JS for further UI work. We will create a separate UI project on the RHQ GitHub account [5] and will create a UI that can be dropped into RHQ to get more insight into how such an Angular.JS driven UI could work for RHQ. This Angular UI will talk via the existing REST-interface to the RHQ server.

Agent: Our current agent is a powerful workhorse, but has its shortcomings. One of those is its runtime footprint, which is currently being addressed. But even afterwards we want to try to reduce its impact, so that it is easier to run the agent on managed systems.

The other part is that the agent currently is only extensible in Java, while operations people are often much more versed in languages like Python. Also the possibility to ad-hoc define some metrics to be evaluated is rather limited (especially when it comes to arbitrary MBean attributes).

This means that we will evaluate alternatives to the current agent and also evaluate how the current agent can be enhanced to support polyglot and smaller runtime environments.

In addition to the above we were also talking about enhancing the Alerting subsystem [8] and introducing a 2-phase-discovery to improve on some of the shortcomings of the current resource discovery [6].

Last but not least we want to thank all contributors for their suggestions, bug reports, enhancement requests and code submissions.

Heiko on behalf of the RHQ team

Links:

[1] https://plus.google.com/114249341487134308671/posts/5nC9PXCDHT8

[2] https://www.facebook.com/media/set/?set=a.250786711757107.1073741829.250589718443473&type=1

[3] https://docs.jboss.org/author/display/RHQ/Audit+Subsystem

[4] https://docs.jboss.org/author/display/RHQ/Canned+DynaGroup+expressions

[5] https://github.com/rhq-project/angular-ui

[6] https://docs.jboss.org/author/display/RHQ/2-phase+discovery

[7] https://docs.jboss.org/author/display/RHQ/Agent-Server-Backchannel

[8] https://docs.jboss.org/author/display/RHQ/Wintermute

RHQ Standalone container with 10% more awesome now

Posted by pilhuhn Mar 23, 2011

I have committed two changes to the standalone container that will make life much easier when you are using it in the plugin development cycle:

There is a new command 'stdin' that can be used from within scripts to give control back to the console. This allows you to write a script to set up a scenario without manual intervention and to continue from there when this automatic setup is done.

This is a sample script:

disc i # discover platform services

disc s # discover servers

disc i # discover services below the servers

find r standard-sockets # find the resource with the name 'standard-sockets'

set r $r # store its resource id

src port-offset=1 # set the port-offset in the resource configuration to 1

set r -18 # now use the resource with id -18

rc # show its resource configuration

stdin # give control to stdin

New command 'src' (name may still change) to set a resource configuration. This has already been shown in the example above.

Format of the arguments is key=value

Those additions are now in master in the git repository and will make it into the next beta release of RHQ.

For more information check out this post.

If it were possible to detect debugger attachment, a 'debug' command would be possible to run script commands and then wait for the debugger before continuing.

Using the standalone container for development

Posted by pilhuhn Jan 20, 2011

I've recently created a video about how to use the so called 'standalone container' for rapid deployment and testing of plugins.

The standalone container itself is nothing but a tiny wrapper around the standard plugin-container in the agent. So by using the standalone container, you can deploy the plugins in a plugin container without the need to start a full blown RHQ server/agent setup.

The video is hosted on Vimeo.

Last week I was talking to students at the HfT Stuttgart and, as an exercise, had them write a simple plugin and test it in the standalone container. While it was most natural for me to know the names of operations or metrics, one of the students came up with the idea of adding a command to list the allowed operations of a resource. So master now has the option to pass a '-list' parameter to the invoke and measure commands to list the operations and metrics available.

UPDATE: Start scripts are now available from SourceForge.
