Your plan sounds interesting. I know of installations with 400+ platforms - I don't know the exact numbers. And I think scaling to more (from a pure connection point of view) should be possible - Mazz will surely chime in on this later.
I did not completely understand what you mean with respect to importing and this "for each type of server" - could you elaborate a bit more?
With respect to the JMX plugin, our general approach is different from that of e.g. Nagios (as far as I understand Nagios) or e.g. jconsole:
RHQ tries to give things semantics. A value of 42 by itself does not mean anything, so we add metadata describing that attribute x has a unit of kilobytes, is steadily increasing, and so on, to give it meaning. This metadata is what you see in the plugin descriptor.
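To make that concrete, the metadata in a plugin descriptor looks roughly like the fragment below. The property name, display name, and interval here are invented for illustration, not taken from an actual plugin:

```xml
<!-- Hypothetical fragment of an rhq-plugin.xml descriptor.
     It tells RHQ that this attribute is measured in kilobytes
     and is a steadily increasing counter. -->
<metric property="UsedBufferSize"
        displayName="Used Buffer Size"
        description="Amount of buffer space currently in use"
        units="kilobytes"
        measurementType="trendsup"
        defaultInterval="600000"/>
```

Without such a declaration, the raw number coming back from the MBean has no unit and no trend, so RHQ could not graph or baseline it meaningfully.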
What kinds of JMX resources are we talking about here? We already have support for a bunch of things via specialized plugins.
The dynamic metadata work (in the 'Nagios' branch in git) is a first step, a proof of concept. But even there, you (will) need some way of adding the metadata to the discovered resource types (e.g. via some sort of translation table).
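Such a translation table could be sketched, for example, like this. The attribute names and the fallback behavior are my own assumptions for illustration; they are not part of the actual branch:

```python
# Sketch of a "translation table" mapping a discovered JMX attribute
# name to the measurement metadata RHQ would need for it.
# Attribute names below are examples, not an actual RHQ API.
METRIC_METADATA = {
    "HeapMemoryUsage.used": {"units": "bytes", "measurementType": "dynamic"},
    "TotalStartedThreadCount": {"units": "none", "measurementType": "trendsup"},
}

def lookup(attribute):
    # Unknown attributes fall back to untyped, dynamic metadata.
    return METRIC_METADATA.get(
        attribute, {"units": "none", "measurementType": "dynamic"})

print(lookup("HeapMemoryUsage.used")["units"])  # bytes
```

The interesting design question is exactly this fallback: what semantics to assign to a discovered attribute that nobody has described yet.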
Back to the 500-1k agents: with this number of platforms and the possible number of resources on them, you need to make sure that the database can cope with the load imposed by incoming metrics (what is better than many spindles? More spindles).
It would be interesting to learn more about what you are trying to do.
I have heard of some installations going up to at least 400 machines. IIRC, we tested with more than that (using our simulated agents and the agentspawn) - I thought we got close to 1000, but I really can't remember the upper limit we tested to.
The number of agents that can be supported will depend on several things.
First, the number of RHQ Servers you have in your setup. If you have only one RHQ Server, you probably won't be able to support 400 machines. You'll need 2 or 3 RHQ Servers.
Second, it depends on the hardware that your RHQ Servers are running on. Obviously, 3 RHQ Servers running on laptops won't cut it - but 3 RHQ Servers running on quad-core, 16 GB machines would probably do it.
Third, it depends on the number of servers/services you have in inventory AND the number of metrics you have enabled AND how fast your collection intervals are in your metric schedules. If you have 400 machines each with a JBossAS server running a single WAR web app, and you are collecting a few metrics every 10 minutes, the performance behavior of RHQ will be drastically different than if, say, each of your machines has 10 JBossAS servers each hosting several EAR/WAR apps and you are collecting double the number of metrics every 2 minutes. The data flow into the server is a major component in determining the behavior of the RHQ system.
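A back-of-envelope calculation makes the data-flow point concrete. The per-server metric counts below are made-up assumptions, not RHQ defaults:

```python
# Rough metric-insert rate reaching the RHQ servers (values per minute).
def inserts_per_minute(machines, servers_per_machine,
                       metrics_per_server, interval_minutes):
    return machines * servers_per_machine * metrics_per_server / interval_minutes

# 400 machines, one JBossAS each, 10 metrics collected every 10 minutes:
light = inserts_per_minute(400, 1, 10, 10)
# 400 machines, 10 JBossAS each, 20 metrics collected every 2 minutes:
heavy = inserts_per_minute(400, 10, 20, 2)
print(light, heavy)  # 400.0 40000.0
```

Same number of machines, but a hundredfold difference in the write load the database has to absorb - which is why the inventory depth and schedule intervals matter as much as the agent count.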
Fourth, the database. This is probably one of the most important factors in the RHQ setup, if not the most important. We have people collecting massive amounts of metrics with large inventories (dozens of machines running tens, if not a hundred, of JBossAS servers). But this requires a beefy database setup - a highly tuned Oracle instance running on a large machine is typically required.
We also ran some internal tests in the past with a database on solid-state drives (SSDs), and I was amazed at the performance and throughput we were able to witness. It was very impressive.
Thanks for your helpful information.
I'll need to learn more on my own about what sort of configuration we're looking to go with. I don't really have requirements right now.
We are looking at gathering information about Hadoop and JBoss servers. The dataset will probably be limited to a number of important indicators for each service, not every thread pool, etc.
RHQ is good for monitoring JBossAS servers, and it has a (very basic) Hadoop plugin - this may be a starting point for your investigation.