4 Replies Latest reply on Nov 8, 2011 6:01 PM by genman

    HBase/HDFS (Hadoop) as a metrics store


      I asked a few weeks back about monitoring 500+ systems. I haven't had time to pursue this project, but I have thought about it a lot.


      Given the large amount of data, this seems to require a high-end database configuration and hardware, e.g. Oracle RAC and lots of fast disks. Such a system sounds quite expensive and labor-intensive to set up.


      How easy would it be to store just the inventory (the complicated relational DB bits) in, say, Oracle, and log/gather metrics/events in HBase? HBase is a high-performance datastore built on top of Hadoop's HDFS. It would be quite a bit more cost-effective for a large organization. Metrics data isn't valuable enough to require a relational store, and HBase fits the logging/reporting model of fast appends and frequent reads better anyway.
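      To make the "fast append/frequent reads" point concrete: HBase keeps rows sorted by their row-key bytes, so a key made of a schedule id followed by a timestamp turns "all values for one metric in a time window" into a single range scan. The following is only a sketch of that idea using the Java standard library, not HBase client code; the 12-byte key layout and the names `MetricRowKey`, `encode`, and `scan` are assumptions made for illustration, and the sorted map merely stands in for a table.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.NavigableMap;
import java.util.TreeMap;

final class MetricRowKey {
    // Hypothetical key layout: 4-byte schedule id, then 8-byte timestamp (epoch millis).
    // ByteBuffer writes big-endian by default, so non-negative ids and timestamps
    // keep their numeric order when keys are compared as unsigned bytes.
    static byte[] encode(int scheduleId, long timestampMillis) {
        return ByteBuffer.allocate(12).putInt(scheduleId).putLong(timestampMillis).array();
    }

    // Reads the timestamp back out of an encoded key.
    static long timestamp(byte[] key) {
        return ByteBuffer.wrap(key).getLong(4);
    }

    // Range scan for one schedule over the half-open time window [from, to).
    static NavigableMap<byte[], Double> scan(NavigableMap<byte[], Double> table,
                                             int scheduleId, long from, long to) {
        return table.subMap(encode(scheduleId, from), true, encode(scheduleId, to), false);
    }

    public static void main(String[] args) {
        // A sorted map with unsigned byte ordering stands in for a sorted table.
        NavigableMap<byte[], Double> table = new TreeMap<byte[], Double>(Arrays::compareUnsigned);
        table.put(encode(7, 1000L), 0.25);
        table.put(encode(7, 2000L), 0.50);
        table.put(encode(8, 1500L), 0.75);

        for (var row : scan(table, 7, 0L, 3000L).entrySet()) {
            System.out.println(timestamp(row.getKey()) + " -> " + row.getValue());
        }
    }
}
```

      One known trade-off of a layout like this is that writes for a single busy schedule always land next to each other, so a real design might salt or hash the key prefix.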

        • 1. HBase/HDFS (Hadoop) as a metrics store


          You are not exactly running into an open door - but at least it is not locked shut :-) In fact, we have already discussed using such a NoSQL store for the events subsystem (Events tab, following of log files). But you are absolutely right that metrics do not require full relational and transactional guarantees, and eventual consistency should be good enough.


          Pushing them to Hadoop/HBase would also have the advantage that more sophisticated computations like baselines (or future statistical analysis) could be run on the Hadoop nodes as map-reduce jobs.
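          As a rough illustration of why baselines map well onto map-reduce: the "map" step keys each raw sample by its schedule id, and the "reduce" step folds all values for one key into summary statistics. The sketch below uses plain Java streams rather than the Hadoop API, and it assumes a baseline is simply the min, max, and mean over a window; `BaselineSketch` and `Sample` are hypothetical names.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class BaselineSketch {
    // Hypothetical raw sample: one numeric value for one measurement schedule.
    static final class Sample {
        final int scheduleId;
        final double value;

        Sample(int scheduleId, double value) {
            this.scheduleId = scheduleId;
            this.value = value;
        }
    }

    // "Map": key each sample by schedule id. "Reduce": fold the values per key
    // into min/max/mean/count. In a real job these would be separate phases.
    static Map<Integer, DoubleSummaryStatistics> baselines(List<Sample> samples) {
        return samples.stream().collect(
                Collectors.groupingBy(s -> s.scheduleId,
                        Collectors.summarizingDouble(s -> s.value)));
    }

    public static void main(String[] args) {
        List<Sample> samples = List.of(
                new Sample(1, 10.0), new Sample(1, 20.0), new Sample(2, 5.0));

        baselines(samples).forEach((id, stats) ->
                System.out.println(id + ": min=" + stats.getMin()
                        + " max=" + stats.getMax() + " mean=" + stats.getAverage()));
    }
}
```

          Min, max, sum, and count can each be combined from partial results, which is what lets the work be spread across nodes and merged afterwards.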


          Having said this - it is not entirely trivial to "just switch", but most of the metric storage and retrieval is hidden behind one session bean, which could be swapped out for a Hadoop version. The harder part would be rewriting the baseline computation.
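          The "swap out one bean" idea amounts to putting storage and retrieval behind a narrow interface so callers do not care which backend is in use. The sketch below is not the actual RHQ session bean; `RawMetricStore`, its two methods, and the in-memory implementation are all hypothetical, and a Hadoop/HBase-backed class would be a second implementation of the same interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MetricStoreSketch {
    // Hypothetical seam: everything callers need from raw metric storage.
    interface RawMetricStore {
        void add(int scheduleId, long timestampMillis, double value);

        // Values for one schedule in the half-open window [fromMillis, toMillis),
        // ordered by timestamp.
        List<Double> find(int scheduleId, long fromMillis, long toMillis);
    }

    // Stand-in backend; a relational or Hadoop-backed class would implement
    // the same interface.
    static final class InMemoryMetricStore implements RawMetricStore {
        private final Map<Integer, TreeMap<Long, Double>> series = new HashMap<>();

        @Override
        public void add(int scheduleId, long timestampMillis, double value) {
            series.computeIfAbsent(scheduleId, id -> new TreeMap<>())
                  .put(timestampMillis, value);
        }

        @Override
        public List<Double> find(int scheduleId, long fromMillis, long toMillis) {
            TreeMap<Long, Double> s = series.get(scheduleId);
            if (s == null) {
                return new ArrayList<>();
            }
            return new ArrayList<>(s.subMap(fromMillis, true, toMillis, false).values());
        }
    }

    public static void main(String[] args) {
        RawMetricStore store = new InMemoryMetricStore();
        store.add(7, 1000L, 1.5);
        store.add(7, 2000L, 2.5);
        System.out.println(store.find(7, 0L, 3000L));
    }
}
```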


          Would you be interested in doing some coding in that area?



          • 2. HBase/HDFS (Hadoop) as a metrics store

            Of course I'm interested, but it comes down to whether I can prove a need for this, whether I can work on it, and whether I can share my changes back with the community. So there are a lot of "ifs."

            • 3. Re: HBase/HDFS (Hadoop) as a metrics store


              I think a proof of concept would be nice, where the storage and retrieval of raw metrics (rhq_meas_data_num_r?? tables) would go to Hadoop.

              This would let us get a feel for it (what is needed setup-wise, what the performance could look like).

              If this looks good, it could serve as a starting point for more work. Having a PoC would also let us get feedback from other users.


              How can I help get you going here?



              • 4. Re: HBase/HDFS (Hadoop) as a metrics store

                Having learned a lot about Hadoop, I've found that Hive provides a SQL-like interface on top of Hadoop's storage.




                Steps off the top of my head:

                * Define the Hive schema, including the definition of partitions.

                * Necessary query changes

                * Necessary code changes... I'm not sure Hibernate would work out of the box.

                * Necessary logic changes... For example, it may not be necessary to compact old data. It may be necessary to cache certain things, since Hive is quite slow.

                * Build integration

                * Configuration and installation. As part of setup, you'd indicate a secondary data store.

                * Testing, etc.
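
                On the partitioning step: a common choice for time-series data in Hive is one partition per day, so that queries over a time window only touch the matching partitions and old data can be dropped a partition at a time. A minimal sketch of deriving such a partition value on the Java side follows; the `dt` column name, the UTC day boundary, and the `DayPartition` class are assumptions for illustration, not part of any existing schema.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

final class DayPartition {
    // Maps a sample timestamp (epoch millis) to a hypothetical day partition
    // spec such as "dt=2011-11-08". Days are cut at UTC midnight (an assumption).
    static String partitionFor(long epochMillis) {
        LocalDate day = Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate();
        return "dt=" + day; // LocalDate.toString() yields ISO yyyy-MM-dd
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(System.currentTimeMillis()));
    }
}
```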


                Do you have access to a Hadoop cluster at Red Hat?