6 Replies Latest reply on Nov 12, 2010 3:51 AM by pilhuhn

Some Questions = Alerts trigger other Programs | Monitoring Intervals | Suggestions for Improvements

maildevelopmenttest Nov 10, 2010 9:58 PM

Hello Experts,

RHQ seems to be a great product for monitoring resources. But my requirement is not only to monitor a resource but also to trigger "external actions" based on "alert conditions". For example, a server goes down. RHQ allows me to send an email, but what I need is, for example, to call a web service that will bring up a backup server! Does RHQ provide some kind of plugin concept for alerts, so that an alert can not only send an email but also execute an arbitrary plugin or Java code?

Improvement possibility: It seems RHQ has a 30-second minimum monitoring interval (as seen in another posting, this limit can be patched manually). However, I think a large percentage of RHQ users will not have hundreds of thousands of services to monitor. I think a standard setup for a lot of users comprises only 5 to 10 servers, monitoring only the most important services. So the data collected is quite limited and will not overload the system. But what is actually required many times is a faster monitoring interval. I have to monitor a web server and bring up an alternative service. Therefore I need to operate on a 1-second monitoring basis. But this is of course limited to only one or two critical services per server. So the generated data is limited and could also be discarded or aggregated before storing it in the DB for historical purposes.

=> You are losing a big market share and user base by limiting the polling intervals to 30 seconds. Making this configurable for every user will allow a lot more users to use your system! RHQ is a great platform to integrate further plugins into, but a 30-second monitoring limit is too much for critical services!

Now my question related to monitoring intervals: Does RHQ offer something like the following (or what would be the best approach to implement it)? I have, for example, a monitoring interval of 10 seconds. If the service goes down (the first check returns DOWN status), I want to start monitoring with an interval of 1 second and reconfirm(!!) the DOWN status. I will do, for example, 5 checks (one check every second), and only after these 5 failed checks do I want to report the service as down. Is there a possibility to configure existing plugins (the alerts) to work like this?
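A minimal sketch of this reconfirmation logic in plain Java, assuming the availability check can be wrapped in a simple callback. `DownConfirmer` and its parameters are hypothetical names for illustration and are not part of RHQ or its plugin API:

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not an RHQ class): re-checks a DOWN result before reporting it. */
class DownConfirmer {
    private final BooleanSupplier isUp;     // one availability check; true means UP
    private final int recheckCount;         // e.g. 5 re-checks
    private final long recheckDelayMillis;  // e.g. 1000 ms between re-checks

    DownConfirmer(BooleanSupplier isUp, int recheckCount, long recheckDelayMillis) {
        this.isUp = isUp;
        this.recheckCount = recheckCount;
        this.recheckDelayMillis = recheckDelayMillis;
    }

    /** Run once per normal interval (e.g. every 10 s); true only if DOWN was reconfirmed. */
    boolean isConfirmedDown() {
        if (isUp.getAsBoolean()) {
            return false; // the regular check passed, nothing to reconfirm
        }
        for (int i = 0; i < recheckCount; i++) {
            try {
                Thread.sleep(recheckDelayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // interrupted: do not report DOWN on incomplete evidence
            }
            if (isUp.getAsBoolean()) {
                return false; // recovered during reconfirmation: treat as a glitch
            }
        }
        return true; // the first check and every re-check failed
    }
}
```

With `new DownConfirmer(check, 5, 1000)` the service is reported as down only after the first failed check plus five failed re-checks, one second apart. Note that this blocks the calling thread for up to five seconds, which may or may not be acceptable inside a real availability check.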

In my own plugins (Heiko has written a great tutorial about plugins, thanks!!!) I would have to implement this logic directly in the method that checks for up and down status? Or is there any clever way to configure this kind of monitoring (changing the monitoring interval for reconfirmation) in the alert logic in the GUI?

Thank you very much for any advice!!!!

Jens

1. Re: Some Questions = Alerts trigger other Programs | Monitoring Intervals | Suggestions for Improvements

joe.marques Nov 11, 2010 12:46 AM (in response to maildevelopmenttest)

The minimum collection interval of 30 seconds is hard-coded in the system and today can not be changed without modifying the source code; it is to prevent users from figuratively shooting themselves in the foot.

Even if we were to collect the data faster, you wouldn't see the data any faster. Measurement reports are not sent up to the server immediately after the data is collected; they too are based on a periodic interval, currently 1 minute. So the values you're looking for would still be delayed.

-----

Your comment about collecting additional data, processing it in some way, and then storing something different for auditing / historical purposes is intriguing. We've discussed this before (recently) in the context of sending up full availability reports (each resource reports UP or DOWN every reporting interval), which would give us the ability to create alerts like "DOWN collected 3 times in a row" (which helps buffer users from "jitters"), but then only storing the RLE (run-length encoded) data in the alert history table. A similar style could be used for metrics that are collected quickly too. A metric collected every 5 seconds would have 12 data points in each metric report, where they could all be processed, but only 2 of them need to be stored to simulate an effective 30-second collection interval (or perhaps each point can be an average of 6 of the points, who knows).
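To make the run-length idea concrete, here is a small sketch in plain Java. It assumes an availability report is just a sequence of UP/DOWN flags; `AvailabilityRle` and `Run` are illustrative names and say nothing about how RHQ stores availability internally:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: run-length encoding of a sequence of UP/DOWN reports. */
class AvailabilityRle {
    /** One run: a state and the number of consecutive reports that carried it. */
    static final class Run {
        final boolean up;
        final int count;

        Run(boolean up, int count) {
            this.up = up;
            this.count = count;
        }

        @Override
        public String toString() {
            return (up ? "UP" : "DOWN") + "x" + count;
        }
    }

    /** Collapses consecutive identical reports into runs. */
    static List<Run> encode(boolean[] reports) {
        List<Run> runs = new ArrayList<>();
        int i = 0;
        while (i < reports.length) {
            int j = i;
            while (j < reports.length && reports[j] == reports[i]) {
                j++;
            }
            runs.add(new Run(reports[i], j - i));
            i = j;
        }
        return runs;
    }

    /** True if some DOWN run is at least 'threshold' reports long ("DOWN n times in a row"). */
    static boolean downInARow(List<Run> runs, int threshold) {
        for (Run r : runs) {
            if (!r.up && r.count >= threshold) {
                return true;
            }
        }
        return false;
    }
}
```

Seven reports UP, UP, DOWN, UP, DOWN, DOWN, DOWN collapse into four runs, and only the final run satisfies a "3 times in a row" condition; the single DOWN in the middle is ignored as jitter.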

The point is, if we allow more data to come across the wire, we can process all of it but we only need to store some of it (or transform it, or aggregate it, or average it) for historic purposes.
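As a rough illustration of "process all of it, store some of it", the following sketch averages fixed-size groups of samples, so that 12 five-second points become 2 stored points. `Downsampler` is a made-up name and averaging is only one of the possible reductions mentioned above:

```java
/** Illustrative only: reduces a report by averaging fixed-size groups of samples. */
class Downsampler {
    static double[] averageBuckets(double[] samples, int bucketSize) {
        if (bucketSize <= 0) {
            throw new IllegalArgumentException("bucketSize must be positive");
        }
        // Round up so that a trailing partial bucket is averaged too.
        int buckets = (samples.length + bucketSize - 1) / bucketSize;
        double[] out = new double[buckets];
        for (int b = 0; b < buckets; b++) {
            int from = b * bucketSize;
            int to = Math.min(from + bucketSize, samples.length);
            double sum = 0;
            for (int i = from; i < to; i++) {
                sum += samples[i];
            }
            out[b] = sum / (to - from);
        }
        return out;
    }
}
```

Calling `averageBuckets` on the twelve values 1 through 12 with a bucket size of 6 yields the two values 3.5 and 9.5, which corresponds to an effective 30-second interval for data collected every 5 seconds.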

Would this feature help satisfy some of your requirements?
2. Re: Some Questions = Alerts trigger other Programs | Monitoring Intervals | Suggestions for Improvements

maildevelopmenttest Nov 11, 2010 3:33 AM (in response to joe.marques)

Hello Joseph,

thank you very much for your answer!

Well, first of all: I really like the idea of RHQ as a Java-based "system monitoring and management" system (or whatever you would like to call it). I do not know the background of what RHQ really wants to be (management vs. monitoring), but I think one of the most widely deployed server applications worldwide is actually Nagios. Every enterprise has the need to monitor resources. RHQ takes a big step in the right direction, but currently it seems it is not able to substitute for Nagios. And what is so "sad" is that RHQ offers everything needed to be the "ultimate" Java Nagios substitute. But limiting the monitoring intervals to 30 seconds and the agent communication to 1 minute really kills this idea completely. I have deployed only Java applications, and I would love to drop Nagios and use a pure Java application like RHQ to monitor my systems and develop any plugins in Java. But for real uptime monitoring (and triggering failovers) I need "sub-second" monitoring intervals (I actually mean roughly one second). I know this is not cheap, but I think this should be left as a problem for me as the administrator. A 1-second interval means 60 x 60 = 3600 samples per hour. This is nothing compared to what one single web page request consumes in bandwidth (one page of mine has 100 KB of HTML and 200 KB of image data) and in CPU time for generating the page from the database... So in the end, 3600 samples cost less than one normal application server request, and I will have hundreds of thousands of those per hour on a powerful server... If you think this might introduce too many problems, then you could print a big notice when someone configures small intervals like one second. But not offering this feature means that a lot of people cannot use RHQ in their setup for monitoring, and furthermore not as a platform for developing new cool plugins. My idea was to develop some kind of (ISP-specific) failover plugin, but for that I need sub-second monitoring intervals.

I think there's a huge market for monitoring out there, as every enterprise has the need to monitor! RHQ offers 99.9 percent, but because it does not offer "sub-second" monitoring intervals, it is like offering 0%, as I cannot use it. 30 seconds is way too much in today's world. And as I said, an interval of 1 second is nothing for today's 8-core, 8 GB RAM machines. Does RHQ know its user base from any surveys? I think a lot of companies will use RHQ in setups with 5-10 servers, so the amount of data is very limited. And secondly, of course not all monitoring will be "sub-second". This will only be one (!!) or two "services" per platform that are mission critical.

I am writing this to "help" RHQ. I have had a look at so many monitoring solutions like Zenoss, OpenEMS, Nagios, Hyperic and many more, and I would be very happy with RHQ. I hate coding shell monitoring scripts like the ones for Nagios... (as I am a Java programmer; a lot of other Java programmers are surely also looking for a Nagios substitute in Java). RHQ offers its great plugin concept and is the ideal framework to implement further logic.

###############
Yes, your suggestion is great. It would be absolutely great to have the following:

1.) Agent configuration: In the XML config file of the agent (or even remotely over the RHQ admin interface) I can specify how "fast" the agent should send its "collected" data to the server. This means that in theory it is possible to achieve "near real-time" communication between the agent and the server (with sub-second sampling).

2.) In the server it would be great to have configuration options for:
A) the interval at which samples for one "resource/monitoring point" should be _received_ from the remote agent and _processed_,
B) the interval at which the samples should be _stored_,
C) and the logic (as you have written) by which the samples should be "aggregated, averaged or transformed" in case storing and processing do not use the same interval.
=> So it's like a funnel. RHQ should receive more or less real-time data (when configured), and then I as the plugin developer or administrator can define for every resource which of the data should be thrown out. This can all happen in cheap memory.

3.) Maybe it is also necessary to introduce some logic / configuration option for what the alerts should be based on: the "real-time data" or the "filtered/averaged/transformed..." data.

=> RHQ is such a great "monitoring framework", and internally it should work with the highest precision (keeping the data in memory for processing) so that I can develop my own plugins that act on this "real-time" data with high precision. Storing, however, is in most cases not so important. To save disk space it would be OK to throw away some of the high-precision data, downsample it in memory, and store it efficiently in the database.

4.) The coolest thing (though it might make things complicated) would be if I could dynamically(!) change the "interval for persisting". For example, I sample the data with an interval of 1 second. In the database, however, I store only the average value every ten seconds. Now I configure some kind of alert/event that fires if the value goes below a certain threshold. If, while monitoring every second, one value falls below the defined threshold, I will trigger the alert/event and(!) also store the values in the database with a resolution of 1 second for a specified time frame. This is like a "zoom" function. If any problem happens, I will have a very detailed view of it, as I switch the resolution of the sampled data (also in the database). When everything runs normally again, the system switches back and stores only the averaged data. (I think this is very similar to what you described.)
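The "zoom" behaviour from point 4 could be sketched roughly as follows. This is plain Java with hypothetical names (`ZoomRecorder`, `onSample`); an in-memory list stands in for the database, and a sample counter stands in for the "specified time frame":

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: stores averages normally and raw samples after a threshold breach. */
class ZoomRecorder {
    private final double threshold;  // a breach is any sample below this value
    private final int bucketSize;    // samples per stored average in normal mode, e.g. 10
    private final int zoomLength;    // number of raw samples to store after a breach
    private final List<Double> stored = new ArrayList<>(); // stands in for the database
    private double bucketSum = 0;
    private int bucketCount = 0;
    private int zoomRemaining = 0;

    ZoomRecorder(double threshold, int bucketSize, int zoomLength) {
        this.threshold = threshold;
        this.bucketSize = bucketSize;
        this.zoomLength = zoomLength;
    }

    /** Called for every sample, e.g. once per second. */
    void onSample(double value) {
        if (value < threshold) {
            flushBucket();              // keep the stored values in time order
            zoomRemaining = zoomLength; // start or extend the high-resolution window
        }
        if (zoomRemaining > 0) {
            stored.add(value);          // zoomed in: store the raw sample
            zoomRemaining--;
            return;
        }
        bucketSum += value;             // normal mode: accumulate towards an average
        bucketCount++;
        if (bucketCount == bucketSize) {
            flushBucket();
        }
    }

    private void flushBucket() {
        if (bucketCount > 0) {
            stored.add(bucketSum / bucketCount);
            bucketSum = 0;
            bucketCount = 0;
        }
    }

    List<Double> stored() {
        return stored;
    }
}
```

Every sample is seen by `onSample`, so an alert could be evaluated on the full-resolution data, while the stored series stays coarse until a breach occurs. A further breach inside the zoom window simply extends the window.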

So I am writing this because I was really very happy to have found RHQ, but I am still afraid I will have to use Nagios, as I cannot use RHQ due to its long monitoring intervals. Maybe this post helps you think about it. I think this would be a great step forward in making RHQ some kind of "Java monitoring framework" where other developers can plug in their own plugins (also for more mission-critical applications that require sub-second data).

Thank you very much
Jens
3. Re: Some Questions = Alerts trigger other Programs | Monitoring Intervals | Suggestions for Improvements

pilhuhn Nov 11, 2010 3:53 AM (in response to maildevelopmenttest)

send an email. But what I have to achieve is, that I need for example to call a webservice that will bring up a backup server! Does RHQ provide some "Plugins-Concepts" for the Alerts to not only send an email but rather execute any arbitrary Plugin or Java Code

Actually that is possible in RHQ3 via Alert Sender plugins. See e.g. this posting, the original design page or the server side plugin writing documentation.

In my own plugins (Heiko has written a great tutorial about Plugins, Thanks!!!) I would have to implement this logic directly in the method that checks for up and down status? Or is there any clever way to configure this kind of monitoring (changing the monitoring interval for reconfirmation) in the alert logic in the GUI?

If you write your own alert sender plugin (and you can have multiple and stack them), you can write one that changes the measurement schedule from within that plugin. You then need to define a "recovery alert", which takes the schedule interval back to its 'normal' level when there is no more error condition.
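The pairing of a problem alert with a recovery alert can be sketched like this. Note that `ScheduleControl` and `IntervalSwitcher` are hypothetical stand-ins written for this illustration; they are not the RHQ alert sender API, and the way a real plugin updates a measurement schedule has to be taken from the server-side plugin documentation:

```java
/** Hypothetical stand-in for whatever the server offers to update a measurement schedule. */
interface ScheduleControl {
    void setIntervalSeconds(String metric, int seconds);
}

/** Illustrative only: one action tightens the schedule, the recovery action relaxes it again. */
class IntervalSwitcher {
    private final ScheduleControl schedules;
    private final String metric;
    private final int normalSeconds;
    private final int fastSeconds;

    IntervalSwitcher(ScheduleControl schedules, String metric, int normalSeconds, int fastSeconds) {
        this.schedules = schedules;
        this.metric = metric;
        this.normalSeconds = normalSeconds;
        this.fastSeconds = fastSeconds;
    }

    /** What the sender attached to the problem alert would do. */
    void onProblemAlert() {
        schedules.setIntervalSeconds(metric, fastSeconds);
    }

    /** What the sender attached to the recovery alert would do. */
    void onRecoveryAlert() {
        schedules.setIntervalSeconds(metric, normalSeconds);
    }
}
```

Without the recovery half, the metric would stay on the fast schedule forever after the first alert, which is why both alert definitions are needed.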

As Joseph said, we have the 30sec minimum to prevent users from shooting themselves in the foot (and we have just recently seen a case where 30sec on hundreds of resources with dozens of metrics can still be harmful).
Having reiterated this, I could imagine something like a floating minimum, where we say "server and agents can process MPS (metrics + availabilities) per second". Depending on how close I am to this value, I can have shorter or longer intervals. If I am close to MPS and I want 1s for one metric, I need to e.g. re-schedule some others to 10 mins or such.
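A back-of-the-envelope version of such a floating minimum, assuming that a schedule with an interval of i seconds contributes 1/i collections per second and that the capacity is a single number. `MpsBudget` and its methods are illustrative names and nothing like this exists in RHQ today:

```java
/** Illustrative only: derives the shortest allowed interval from the remaining capacity. */
class MpsBudget {
    /** Collections per second produced by the given schedule intervals (in seconds). */
    static double load(int[] intervalsSeconds) {
        double perSecond = 0;
        for (int s : intervalsSeconds) {
            if (s <= 0) {
                throw new IllegalArgumentException("interval must be positive");
            }
            perSecond += 1.0 / s;
        }
        return perSecond;
    }

    /**
     * Shortest interval in whole seconds that one additional metric may use
     * without exceeding maxPerSecond, or -1 if there is no headroom left.
     */
    static int shortestAllowedInterval(int[] existingIntervals, double maxPerSecond) {
        double headroom = maxPerSecond - load(existingIntervals);
        if (headroom <= 0) {
            return -1;
        }
        return (int) Math.max(1, Math.ceil(1.0 / headroom));
    }
}
```

If the existing schedules already use 0.75 of a capacity of 1.0 collections per second, a new metric gets 4 seconds at best; to get 1 second, other schedules first have to be moved to longer intervals, as described above.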

You don't want to work on this? :-)

Btw.: What plugin are you writing?
4. Re: Some Questions = Alerts trigger other Programs | Monitoring Intervals | Suggestions for Improvements

pilhuhn Nov 11, 2010 4:14 AM (in response to maildevelopmenttest)

Jens, why don't you join us on #rhq on irc.freenode.net (especially in the German afternoon), when the devs are hanging around?
5. Re: Some Questions = Alerts trigger other Programs | Monitoring Intervals | Suggestions for Improvements

maildevelopmenttest Nov 12, 2010 2:06 AM (in response to pilhuhn)

Hello Heiko,

thank you very much for your answer!!

Well, my intention was to develop at least one plugin that allows me to use RHQ to trigger failovers for the monitored hosts. For example, I have five servers and I will monitor them individually. Every node has its own list of "fallback/backup" nodes; when one goes down, I will redirect its traffic to one of them, either via internal rerouting (failover IP) or by changing the DNS records (IP / DNS failover). Of course this will be very ISP-specific, as I have to use the "API" of the respective ISP to do the IP/DNS failover. One idea is to abstract it a little bit (and provide a generic plugin) by allowing external "API shell scripts" to be called. Then everyone can include their own API clients (maybe provided by the ISP)...
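A sketch of the generic part of such a plugin, in plain Java. Everything here is an assumption made for illustration: the class name `FailoverPlanner`, the idea that fallbacks are tried in list order, and the script calling convention `<script> <failedNode> <target>`:

```java
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Illustrative only: picks a fallback node and hands the switch to an external script. */
class FailoverPlanner {
    private final Map<String, List<String>> fallbacks; // node -> ordered fallback nodes

    FailoverPlanner(Map<String, List<String>> fallbacks) {
        this.fallbacks = fallbacks;
    }

    /** Returns the first fallback of failedNode that is not itself down, if any. */
    Optional<String> chooseTarget(String failedNode, Set<String> downNodes) {
        List<String> candidates = fallbacks.getOrDefault(failedNode, Collections.<String>emptyList());
        for (String candidate : candidates) {
            if (!downNodes.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    /**
     * Runs a user-supplied, ISP-specific script as "<script> <failedNode> <target>"
     * (an assumed convention) and returns its exit code.
     */
    static int runFailoverScript(String scriptPath, String failedNode, String target)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(scriptPath, failedNode, target)
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

Keeping the node selection separate from the script call means the ISP-specific part stays outside the plugin, which is the abstraction described above.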

Thanks for your invitation to join your chat. Well, please don't laugh, but I do not have any IRC client installed. If I decide to use RHQ for my monitoring, I will of course try to contribute (for example by writing a plugin). I am afraid implementing your MPS functionality is currently too ambitious for me, as I am quite new to programming (up till now I have worked more in the conceptual area).

Nevertheless please let me also add two further thoughts about RHQ:
1.) Regarding your idea of "metrics per second" (MPS): I really like the idea!! But when thinking about it, I was reminded of Nagios (I hope I recall it correctly, no guarantee): Nagios has a quite simple, easy to understand and also scalable concept for this, by showing the "delay" / "backlog". Of course the MPS could be built the way Vista does it, by "probing/benchmarking" your hardware and then calculating the maximum MPS, as the MPS will most likely be very hardware-dependent. I do not yet know the internal workings of RHQ, but I would assume that one of the performance bottlenecks will most likely be the persisting of data (and the rebuilding of the DB indexes)? So instead of limiting the metrics in advance, I think it would be more helpful to have some kind of "backlog" indicator in the top right corner that states whether the system can cope with the data it receives. This would make it very obvious and furthermore(!!) provide a real benefit and concrete information. Even with an MPS that is dynamically calculated (like the performance index in Vista), the system may be temporarily slowed down for other reasons (other applications consuming the CPU, a high disk load...). So even with the metrics limited and within the allowed MPS range, RHQ may not be able to cope with the data it receives (due to a temporarily slow/busy system). With some kind of "backlog/performance" indicator, however, I would see at once that the system is currently under too heavy a load and cannot cope with the data that comes in. Then I can either shut off other applications, buy a better machine or reduce the number of metrics... So I think showing the "current workload/backlog" would be more user-friendly, easier to understand and would provide a real benefit / information. But I would also be very happy with something like the MPS; the most important aspect is simply that the limit of 1 minute / 30 seconds is reduced to under a second (if configured so)...
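Such a backlog indicator could be as simple as counting the reports that have arrived but are not yet processed, plus the age of the oldest one. A minimal sketch with made-up names (`BacklogGauge`), assuming reports are processed in arrival order:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative only: tracks how far processing lags behind incoming reports. */
class BacklogGauge {
    // Arrival time (in milliseconds) of every report that is still waiting to be processed.
    private final Deque<Long> pending = new ArrayDeque<>();

    /** Call when a report arrives. */
    void received(long nowMillis) {
        pending.addLast(nowMillis);
    }

    /** Call when the oldest waiting report has been processed. */
    void processed() {
        pending.pollFirst();
    }

    /** Number of reports waiting: the "backlog". */
    int backlog() {
        return pending.size();
    }

    /** Age of the oldest waiting report: the "delay". Zero when nothing is waiting. */
    long oldestAgeMillis(long nowMillis) {
        Long oldest = pending.peekFirst();
        return oldest == null ? 0L : nowMillis - oldest;
    }
}
```

A steadily growing `backlog()` or `oldestAgeMillis()` would show that the system cannot keep up right now, whatever the reason, which a precomputed MPS limit alone would not reveal.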

2.) Another thought about RHQ (related to limiting the metrics per second): RHQ is great, I was really happy when I discovered it. As already written, monitoring is a basic need of every company out there. I personally think Nagios is not the best approach to monitoring, but I know so many companies that use it nevertheless. Looking top-down at the millions of companies out there, the most important use case of a monitoring solution is simply to monitor the "availability" of their services. And by "availability" I mean the "end-user availability". RHQ is absolutely great. I installed it and I think I had (with a more or less completely new Debian installation) about 50 metrics. But when I think about what the most important metric is to me, then it's the "end-user availability" for
1.) HTTP (Content Regexp Check, and Processing / Round Trip Time)
2.) SMTP
3.) Ping (for network speed withouth the higher App Stacks)
4.) Maybe DNS...
It's great that RHQ offers such deep insight into systems / JVMs / Tomcat / JBoss and the like. But in the end only 3 checks are the most important. Don't get me wrong, I know it's very important to prevent(!) outages by monitoring quite a lot of parameters. But in large systems there are so many interdependencies that in the end "availability" monitoring will always be the most important metric! (Take for example deadlocks, bugs and things like that: these can only be detected by monitoring the end-user availability, e.g. by regexp-checking the result...) My impression was that these very important use cases are not addressed that prominently by RHQ? Is there no SMTP/DNS plugin yet?
And a generic HTTP plugin also seems not to be available (besides your tutorial)? => So one reason for performance problems is that you offer such great functionality and so many good plugins. This means that no one disables anything. But to be honest, the most important metric is the "availability" metric, which I want to have in real time.
However, the availability metric is not(!?) really included in the "standard" plugin set in the form of a "ping" or "http" monitoring plugin (I might be wrong, but I have not found a ping or http plugin in the default install). => When installing an agent, there should be a default plugin/metric for monitoring at least the ping and HTTP availability of the respective agent (using the agent's hostname as default)...

There are thousands of companies even offering commercial (billed every month) "web-based availability monitoring for http/smtp/dns...".
I think only a very little bit of tweaking is missing to address this huge market and provide the "best monitoring" solution on the market ;-)

thank you very much!
jens
6. Re: Some Questions = Alerts trigger other Programs | Monitoring Intervals | Suggestions for Improvements

pilhuhn Nov 12, 2010 3:51 AM (in response to maildevelopmenttest)

Jens,
wow this is lengthy - I need to read that again, but just wanted to quickly point you at the 'netservices' plugin which already does ping and http.

I think for smtp to be really useful, one would need a more sophisticated plugin than just pinging the smtp port; this plugin would also need to see if the email actually got delivered, e.g. by checking a special pop3 account.

Btw: contributing is not only coding, but also asking good questions and giving ideas about how to implement stuff. So welcome to the club :-)

Heiko