4 Replies Latest reply on Jan 24, 2011 7:13 AM by pilhuhn

    RHQ managing 500+ machines

    genman

      I help manage a Nagios installation that currently has hundreds of machines, and we are considering RHQ. The machines run a variety of server packages.

       

      Nagios isn't great for a lot of things we need, such as tracking trends, allowing easily configured tolerances, monitoring log files, dealing with aggregate samples, JMX, etc. So I've taken a look at RHQ (and several competing products). Customisation is very important as well.

       

      But given the design of RHQ, I wonder whether it can handle, say, 500-1000 agent connections at once. If it's possible to tune this, say, to have the agents talk less often, that'd be ideal.

       

      I'm also concerned about inventory management, specifically importing a large number of nodes while still being able to use the UI effectively. I guess what'd be nice is if I could "import managed resources" from one of each type of server, then assign the rest as groups with the same set of resources.

       

      I'd also like to see better support for JMX. The JMX Plugin seems okay but requires quite a bit of XML work. I did take a look at support for dynamic metrics (?) but I suspect it's not going to appear in the next release.

        • 1. RHQ managing 500+ machines
          pilhuhn

          Hey Elias,

           

          Your plan sounds interesting. I know of installations with 400+ platforms, though I don't know the exact numbers. I think scaling to more (from the pure connection point of view) should be possible - Mazz will for sure chime in later on this.

           

          I did not completely understand what you mean with respect to importing and this "for each type of server" - could you elaborate a bit more?

           

          With respect to the JMX plugin, our general approach is different from that of e.g. Nagios (as far as I understand Nagios) or jconsole:

          RHQ tries to give things semantics. Just returning a value of 42 by itself does not mean anything. So we add metadata which describes, for example, that the attribute x has a unit of kilobytes and is steadily increasing, and this gives the value a meaning. This metadata is what you see in the plugin descriptor.
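
          To illustrate the idea of attaching meaning to a raw value, here is a minimal sketch in Python. It is not RHQ's plugin descriptor format or API; the class names, the lookup table and the display name "Attribute x" are all hypothetical, and only the unit (kilobytes) and the "steadily increasing" trend come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class Trend(Enum):
    # Hypothetical trend categories, for illustration only.
    DYNAMIC = "goes up and down"
    TRENDS_UP = "steadily increasing"
    TRENDS_DOWN = "steadily decreasing"


@dataclass(frozen=True)
class MetricMetadata:
    display_name: str
    unit: str
    trend: Trend


# Hypothetical lookup table: raw attribute name -> descriptive metadata.
METADATA = {
    "x": MetricMetadata("Attribute x", "kilobytes", Trend.TRENDS_UP),
}


def describe(attribute: str, value: float) -> str:
    """Render a raw value together with its metadata, if any is known."""
    meta = METADATA.get(attribute)
    if meta is None:
        # Without metadata the number carries no meaning on its own.
        return f"{attribute} = {value} (no metadata, meaning unknown)"
    return f"{meta.display_name}: {value} {meta.unit} ({meta.trend.value})"


print(describe("x", 42))
print(describe("y", 42))
```

          The same number, 42, is only interpretable for the attribute that has an entry in the table; for the other one the sketch can do no more than echo the bare value.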

           

          What kinds of JMX resources are we talking about here? We already have support for a bunch of stuff via specialized plugins.

          The dynamic metadata work (in the 'Nagios' branch in git) is a first step, a proof of concept. But even there, you (will) need some way of adding the metadata to the discovered resource types (e.g. via some sort of translation table).

           

          Back to the 500-1k agents: with this number of platforms and the possible number of resources on them, you need to make sure that the database can cope with the load imposed by incoming metrics (what is better than many spindles? More spindles).

           

          It would be interesting to learn more about what you are trying to do.

             Heiko

          • 2. RHQ managing 500+ machines
            mazz

            I have heard of some going up to at least 400 machines. IIRC, we tested with more than that (using our simulated agents and the agentspawn) - I thought we got close to 1000, but I really can't remember the upper limit we tested to.

             

            The number of agents that can be supported will depend on several things.

             

            First, the number of RHQ Servers you have in your setup. If you have only one RHQ Server, you probably won't be able to support 400 machines. You'll need 2 or 3 RHQ Servers.

             

            Second, it depends on the hardware that your RHQ Servers are running on. Obviously, 3 RHQ Servers running on laptops won't cut it - but 3 RHQ Servers running on quad-core, 16gig machines would probably do it.

             

            Third, it depends on the number of servers/services you have in inventory AND the number of metrics you have enabled AND how fast your collection intervals are in your metric schedules. If you have 400 machines each with a JBossAS server running a single WAR web app, and you are collecting a few metrics at 10 minutes apiece, the performance behavior of RHQ will be drastically different than if, say, each of your machines has 10 JBossAS servers each hosting several EAR/WAR apps and you are collecting double the number of metrics every 2 minutes. The data flow into the server is a major component in determining the behavior of the RHQ system.
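
            As a rough back-of-the-envelope illustration of that third point, the two setups can be compared with a few lines of Python. The figures of 5 metrics per resource for "a few" and 3 apps per server for "several" are assumptions picked only for the sketch, and the platform resource itself is not counted; real numbers depend on the plugins and schedules in use.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    machines: int
    servers_per_machine: int
    apps_per_server: int          # assumed value for "single" / "several"
    metrics_per_resource: int     # assumed value for "a few" / "double"
    interval_minutes: float

    def resources(self) -> int:
        # Each server plus each app it hosts counts as one monitored resource.
        return self.machines * self.servers_per_machine * (1 + self.apps_per_server)

    def data_points_per_minute(self) -> float:
        # Every metric on every resource reports once per collection interval.
        return self.resources() * self.metrics_per_resource / self.interval_minutes


light = Scenario(machines=400, servers_per_machine=1, apps_per_server=1,
                 metrics_per_resource=5, interval_minutes=10)
heavy = Scenario(machines=400, servers_per_machine=10, apps_per_server=3,
                 metrics_per_resource=10, interval_minutes=2)

for name, s in (("light", light), ("heavy", heavy)):
    print(f"{name}: {s.resources()} resources, "
          f"{s.data_points_per_minute():.0f} data points/minute")
```

            Under these assumed numbers the same 400 machines produce 400 data points per minute in the first setup and 80000 in the second, which is why the machine count alone says little about the load.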

             

            Fourth, the database. This is probably one of the most important factors in the RHQ setup, if not the most important. We have people collecting massive amounts of metrics with large inventories (dozens of machines running tens, if not a hundred, JBossAS servers). But this requires a beefy database setup - a highly tuned Oracle instance running on a large machine is typically required.

             

            We had also done some internal tests in the past with a database using solid-state drives (SSD), and I was amazed at the performance and throughput we were able to witness. It was very impressive.

            • 3. RHQ managing 500+ machines
              genman

              Thanks for your helpful information.

               

              I'll need to learn more on my own about what sort of configuration we're looking to go with. I don't really have requirements right now.

               

              We are looking at gathering information about Hadoop and JBoss servers. Probably the dataset will be limited to a number of important indicators for each service, and not every thread pool, etc.

              • 4. RHQ managing 500+ machines
                pilhuhn

                RHQ is good for monitoring JBossAS servers and it has a (very basic) Hadoop plugin - this may be a starting point for your investigations.