Hello Joseph,
thank you very much for your answer!
First of all: I really like the idea of RHQ as a Java-based "system monitoring and management" system (or whatever you would like to call it). I do not know the background of what RHQ really wants to be (management vs. monitoring), but I think one of the most widely deployed server applications worldwide is actually Nagios. Every enterprise needs to monitor resources. RHQ takes a big step in the right direction, but currently it does not seem able to substitute for Nagios. What is so "sad" is that RHQ offers everything needed to be the "ultimate" Java Nagios substitute, but limiting the monitoring intervals to 30 seconds and the agent communication to 1 minute kills this idea completely.
I have deployed only Java applications, and I would love to drop Nagios and use a pure Java application like RHQ to monitor my systems and develop any plugins in Java. But for real uptime monitoring (and triggering failovers) I need "sub second" monitoring intervals (I actually mean roughly one second). I know this is not cheap, but I think this should be left as a problem for me as the administrator. A 1-second interval means 60 x 60 = 3600 samples per hour. This is nothing compared to what a single web page request consumes in bandwidth (one page of mine has 100 KB of HTML and 200 KB of image data) and in CPU time for generating the page from the database. So in the end, 3600 samples cost less than one normal application server request, and I will have hundreds of thousands of those per hour on my powerful server.
If you think this might introduce too many problems, then you could print a big warning when someone configures small intervals like one second. But not offering this feature means that a lot of people cannot use RHQ in their setup for monitoring, and furthermore not as a platform for developing cool new plugins. My idea was to develop some kind of (ISP-specific) failover plugin, but for that I need sub-second monitoring intervals.
I think there is a huge market for monitoring out there, as every enterprise needs to monitor! RHQ offers 99.9 percent, but because it does not offer "sub second" monitoring intervals, it is like offering 0%, as I cannot use it. 30 seconds is way too long in today's world. And as I said, an interval of 1 second is nothing for today's 8-core, 8 GB RAM machines. Does RHQ know its user base from any surveys? I think a lot of companies will use RHQ in setups with 5-10 servers, so the amount of data is very limited. And secondly, of course not all monitoring will be "sub second". It will only be one (!!) or two mission-critical "services" per platform.
I am writing this to "help" RHQ. I have had a look at many monitoring solutions like Zenoss, OpenEMS, Nagios, Hyperic and many more, and I would be very happy with RHQ. I hate coding shell monitoring scripts like the ones for Nagios, as I am a Java programmer. (A lot of other Java programmers are surely also looking for a Nagios substitute in Java.) RHQ offers its great plugin concept and is the ideal framework to implement further logic.
###############
Yes, your suggestion is great. It would be absolutely great to have the following:
1.) Agent configuration: In the XML config file of the agent (or even remotely via the RHQ admin interface) I can specify how "fast" the agent should send its "collected" data to the server. This means that in theory it is possible to achieve "near realtime" communication between the agent and the server (with sub-second sampling).
2.) In the server it would be great to have configuration options for:
A) the interval at which samples for one "resource/monitoring point" should be received from the remote agent and _processed_,
B) the interval at which the samples should be _stored_,
C) and the logic (as you have written) by which the samples should be "aggregated, averaged or transformed" in case the storing and processing intervals are not the same.
=> So it is like a funnel. RHQ should receive more or less realtime data (when configured), and then I as the plugin developer or administrator can define for every resource which of the data should be thrown away. This can all happen in cheap memory.
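To make the funnel idea a bit more concrete, here is a minimal sketch in plain Java. It is not RHQ code and does not use any RHQ API; the class `SampleFunnel`, the `Aggregation` enum and all method names are made up for illustration. It assumes every incoming sample is processed in memory and only one aggregated value per window (for example 10 samples at a 1-second interval) is handed to storage, which is represented here by a simple list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical funnel (not an RHQ class): every sample is processed, one aggregate per window is stored. */
class SampleFunnel {
    /** Point C: how a window of raw samples is reduced to one stored value. */
    enum Aggregation { AVERAGE, MIN, MAX, LAST }

    // Ratio of storing interval (B) to processing interval (A), e.g. 10 for 1 s sampling and 10 s storage.
    private final int samplesPerStoredValue;
    private final Aggregation aggregation;
    private final List<Double> window = new ArrayList<>(); // raw samples kept in memory only
    private final List<Double> stored = new ArrayList<>(); // stand-in for the database

    SampleFunnel(int samplesPerStoredValue, Aggregation aggregation) {
        if (samplesPerStoredValue < 1) {
            throw new IllegalArgumentException("samplesPerStoredValue must be >= 1");
        }
        this.samplesPerStoredValue = samplesPerStoredValue;
        this.aggregation = aggregation;
    }

    /** Called for every sample received from the agent. */
    void onSample(double value) {
        window.add(value);
        if (window.size() == samplesPerStoredValue) {
            stored.add(reduce());
            window.clear(); // the raw samples are thrown away after aggregation
        }
    }

    private double reduce() {
        switch (aggregation) {
            case MIN:
                return window.stream().mapToDouble(Double::doubleValue).min().getAsDouble();
            case MAX:
                return window.stream().mapToDouble(Double::doubleValue).max().getAsDouble();
            case LAST:
                return window.get(window.size() - 1);
            default: // AVERAGE
                return window.stream().mapToDouble(Double::doubleValue).average().getAsDouble();
        }
    }

    /** Returns a copy of everything that would have been persisted so far. */
    List<Double> storedValues() {
        return new ArrayList<>(stored);
    }
}
```

With `new SampleFunnel(10, SampleFunnel.Aggregation.AVERAGE)` and one call to `onSample` per second, a plugin could still look at every raw value, while only one averaged value per ten seconds would reach the database.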
3.) Maybe it is also necessary to introduce some logic or a configuration option for what the alerts should be based on: the "realtime data" or the "filtered/averaged/transformed..." data.
=> RHQ is such a great "monitoring framework", and internally it should work with the highest precision (keeping the data in memory for processing) so that I can develop my own plugins that act on this "realtime" data with high precision. Storing, however, is in most cases not so important. To save disk space it would be OK to throw away some of the high-precision data, downsample it in memory and store it efficiently in the database.
4.) The coolest thing (though it might make things complicated) would be if I could dynamically (!) change the "interval rate for persisting". For example, I sample the data at an interval of 1 second. In the database, however, I store only the average value every ten seconds. Now I configure some kind of alert/event for when the value goes below a certain threshold. If, while monitoring every second, one value falls below the defined threshold, I trigger the alert/event and (!) also store the values in the database at a resolution of 1 second for a specified timeframe. This is like a "zoom" function. If any problem happens, I will have a very detailed view of it, as I switch the resolution of the sampled data (also in the database). When everything runs normally again, the system switches back and stores only the averaged data. (I think this is very similar to what you described.)
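The "zoom" behaviour could look roughly like the following sketch. Again this is plain Java with made-up names (`ZoomingRecorder`, `onSample`, and so on), not anything that exists in RHQ. The assumptions are: the alert fires when a sample drops below a threshold, the zoom timeframe is counted in samples, every further low sample restarts that timeframe, and a partly filled averaging window is written out before zooming so no data is silently dropped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical "zoom" recorder (not an RHQ class): averages normally, stores raw samples after an alert. */
class ZoomingRecorder {
    private final int normalWindow;    // samples averaged into one stored value in normal mode
    private final double lowThreshold; // a sample below this value triggers the alert
    private final int zoomSamples;     // number of samples stored at full resolution after a low sample

    private int zoomRemaining = 0;
    private double windowSum = 0.0;
    private int windowCount = 0;
    private int alertsFired = 0;
    private final List<Double> stored = new ArrayList<>(); // stand-in for the database

    ZoomingRecorder(int normalWindow, double lowThreshold, int zoomSamples) {
        if (normalWindow < 1 || zoomSamples < 1) {
            throw new IllegalArgumentException("window sizes must be >= 1");
        }
        this.normalWindow = normalWindow;
        this.lowThreshold = lowThreshold;
        this.zoomSamples = zoomSamples;
    }

    /** Called once per sampling interval, e.g. every second. */
    void onSample(double value) {
        if (value < lowThreshold) {
            if (zoomRemaining == 0) {
                alertsFired++;       // one alert per zoom episode
                flushAverage();      // persist the partly filled window before zooming in
            }
            zoomRemaining = zoomSamples; // (re)start the full-resolution timeframe
        }
        if (zoomRemaining > 0) {
            stored.add(value);       // zoomed in: store every sample
            zoomRemaining--;
        } else {
            windowSum += value;      // normal mode: only accumulate
            windowCount++;
            if (windowCount == normalWindow) {
                flushAverage();
            }
        }
    }

    private void flushAverage() {
        if (windowCount > 0) {
            stored.add(windowSum / windowCount);
            windowSum = 0.0;
            windowCount = 0;
        }
    }

    int alertsFired() {
        return alertsFired;
    }

    List<Double> storedValues() {
        return new ArrayList<>(stored);
    }
}
```

For the example above one would use something like `new ZoomingRecorder(10, threshold, 60)`: ten-second averages in normal operation, and 60 seconds of one-second values once the alert has fired.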
So I am writing this because I was really very happy to have found RHQ, but I am still afraid I will have to use Nagios, as I cannot use RHQ due to its long monitoring intervals. Maybe this post helps you think about it. I think this would be a great step forward to make RHQ some kind of "Java monitoring framework" where other developers can plug in their own plugins (also for more mission-critical applications that require sub-second data).
Thank you very much
Jens