-
1. HBase/HDFS (Hadoop) as a metrics store
pilhuhn Feb 7, 2011 3:41 AM (in response to genman)
Hey,
you are not exactly pushing at an open door - but at least it is not locked shut :-) In fact we have already had discussions about using such a NoSQL store for the events subsystem (Events tab, following of log files). But you are absolutely right that the metrics do not require the full relational and transactional guarantees, and eventual consistency should be good enough.
Pushing them to Hadoop/HBase would also have the advantage that more sophisticated computations like baselines (or future statistical analysis) could be run on the Hadoop nodes as map-reduce jobs.
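For illustration, here is a minimal sketch of what such a map-reduce job could look like, assuming the raw metrics were exported as plain "scheduleId,timestamp,value" text lines; the input format and class names are made up for this example:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BaselineJob {

        // Emits (scheduleId, value) for each "scheduleId,timestamp,value" input line.
        public static class BaselineMapper
                extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split(",");
                ctx.write(new Text(fields[0]),
                        new DoubleWritable(Double.parseDouble(fields[2])));
            }
        }

        // Aggregates min/max/avg per schedule -- the core of a baseline calculation.
        public static class BaselineReducer
                extends Reducer<Text, DoubleWritable, Text, Text> {
            protected void reduce(Text scheduleId, Iterable<DoubleWritable> values,
                    Context ctx) throws IOException, InterruptedException {
                double min = Double.POSITIVE_INFINITY;
                double max = Double.NEGATIVE_INFINITY;
                double sum = 0;
                long count = 0;
                for (DoubleWritable v : values) {
                    double d = v.get();
                    min = Math.min(min, d);
                    max = Math.max(max, d);
                    sum += d;
                    count++;
                }
                ctx.write(scheduleId, new Text(min + "," + max + "," + (sum / count)));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "rhq-baselines");
            job.setJarByClass(BaselineJob.class);
            job.setMapperClass(BaselineMapper.class);
            job.setReducerClass(BaselineReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(DoubleWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }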
Having said this - it is not entirely trivial to "just switch", but most of the metric storage and retrieval is hidden behind one session bean, which could be swapped out for a Hadoop version. The harder part would be rewriting the baseline computation.
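As a sketch of that seam, the relational bean and a Hadoop-backed bean would both implement one facade; the interface below is hypothetical and much smaller than RHQ's real session bean API:

    import java.util.List;

    // Hypothetical facade over raw numeric metric storage; names are illustrative only.
    public interface RawMetricStore {

        // Persist one raw measurement for a metric schedule.
        void store(int scheduleId, long timestamp, double value);

        // Fetch the raw values for a schedule inside a time window.
        List<Double> findRaw(int scheduleId, long beginTime, long endTime);
    }

The existing JPA-backed bean and an HBase-backed bean would then simply be two implementations of the same contract, chosen at deployment time.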
Would you be interested in doing some coding in that area?
Heiko
-
2. HBase/HDFS (Hadoop) as a metrics store
genman Feb 7, 2011 6:45 PM (in response to pilhuhn)
Of course I'm interested, but it's going to be a question of whether I can prove a need for this, whether I can work on it, and whether I can share my changes back with the community. So there are a lot of "ifs."
-
3. Re: HBase/HDFS (Hadoop) as a metrics store
pilhuhn Feb 8, 2011 12:02 PM (in response to genman)
Hey,
I think a proof of concept would be nice, where the storage and retrieval of raw metrics (the rhq_meas_data_num_r?? tables) would go to Hadoop.
This would allow us to get a feel for it (what is needed setup-wise, what the performance could look like).
If this looks good, it could serve as a starting point for more work. Having a PoC would also let us get feedback from other users.
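To make the idea concrete, here is a rough sketch of such a PoC against the HBase client API; the table name, column family, and row-key layout are assumptions for illustration, not an agreed design:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RawMetricPoc {

        private static final byte[] FAMILY = Bytes.toBytes("d");

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "rhq_metrics");

            // Store one raw data point for a schedule.
            int scheduleId = 10001;
            long now = System.currentTimeMillis();
            Put put = new Put(rowKey(scheduleId, now));
            put.add(FAMILY, Bytes.toBytes("value"), Bytes.toBytes(42.0d));
            table.put(put);

            // Retrieve all points of the schedule in a time window via a range scan.
            Scan scan = new Scan(rowKey(scheduleId, 0L),
                    rowKey(scheduleId, Long.MAX_VALUE));
            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                System.out.println(Bytes.toDouble(
                        r.getValue(FAMILY, Bytes.toBytes("value"))));
            }
            scanner.close();
            table.close();
        }

        // Row key = scheduleId + timestamp, so all points of one schedule
        // sort together and a time window becomes a contiguous scan.
        private static byte[] rowKey(int scheduleId, long timestamp) {
            return Bytes.add(Bytes.toBytes(scheduleId), Bytes.toBytes(timestamp));
        }
    }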
How can I help to get you going here?
Heiko
-
4. Re: HBase/HDFS (Hadoop) as a metrics store
genman Nov 8, 2011 6:01 PM (in response to pilhuhn)
Having learned a lot about Hadoop, I can say that Hive provides a SQL-like interface on top of Hadoop's storage.
https://cwiki.apache.org/Hive/hivejdbcinterface.html
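Here is a minimal sketch of querying raw metrics through that interface, assuming a hypothetical rhq_measurement table; the driver class and URL scheme are the ones documented on that wiki page:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveMetricQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // Hypothetical table mirroring the rhq_meas_data_num_r?? rows.
            ResultSet rs = stmt.executeQuery(
                    "SELECT schedule_id, time_stamp, value"
                    + " FROM rhq_measurement WHERE schedule_id = 10001");
            while (rs.next()) {
                System.out.println(rs.getInt(1) + " " + rs.getLong(2)
                        + " " + rs.getDouble(3));
            }
            con.close();
        }
    }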
Steps off the top of my head:
* Define the Hive schema, including the definition of partitions (see the sketch after this list).
* Necessary query changes
* Necessary code changes... I'm not sure Hibernate would work out of the box?
* Necessary logic changes... For example, it may not be necessary to compact old data, and it may be necessary to cache certain things, since Hive queries are quite slow.
* Build integration
* Configuration and installation. As part of setup, you'd indicate a secondary data store.
* Testing, etc.
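For the schema step, here is a rough sketch of a table definition, created over the same JDBC interface as above; the table name, columns, and day-based partitioning are all assumptions:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveSchemaSetup {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // Hypothetical raw-metric table. Partitioning by day means retention
            // can be "drop old partitions" instead of row-level compaction.
            stmt.execute(
                    "CREATE TABLE rhq_measurement ("
                    + " schedule_id INT,"
                    + " time_stamp BIGINT,"
                    + " value DOUBLE)"
                    + " PARTITIONED BY (day STRING)"
                    + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
                    + " STORED AS TEXTFILE");
            con.close();
        }
    }

Partitioning by day would tie into the logic changes above: expiring old data becomes dropping a partition rather than compacting rows.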
Do you have access to a Hadoop cluster at Red Hat?