13 Replies Latest reply on Apr 11, 2014 8:59 PM by pilhuhn

    Data Model

    heiko.braun

      Can somebody provide a reference to the current data model used by RHQ? I.e. the Cassandra schema, supported metric types, etc.?

        • 1. Re: Data Model
          heiko.braun

          Some additional pointers about bucketing, rollups, etc. would help to understand how the data is stored and processed.

          • 2. Re: Data Model
            theute

            This https://docs.jboss.org/author/display/RHQ/Cassandra+Back+End+Design+Notes (and the subpages) may be a good start; John or Stefan may have other info.


              • 4. Re: Data Model
                heiko.braun

                For the record: some information about the aggregation scheme can be found here: Aggregation Schema Changes - RHQ - Project Documentation Editor

                • 5. Re: Data Model
                  john.sanda

                  Hi Heiko,

                   

                  I will write up some additional docs within the next day or two that provide an overview. Aggregation Schema Changes - RHQ - Project Documentation Editor is a good reference, but it only covers schema changes being introduced in RHQ 4.11.

                  • 6. Re: Data Model
                    heiko.braun

                    Thanks John. You've already done a great job of documenting the C* architecture and design.

                     

                    I do not yet fully understand how the raw data is rolled up into aggregates and how it breaks down into column families. Maybe you can provide a high-level overview of how the data moves through the system and how this corresponds to the column families, rows, and columns?


                    For instance, what's the relation between the raw data, the index, and the buckets? I guess it's something like "values", "to be computed", and "results"?

                    • 7. Re: Re: Data Model
                      john.sanda

                      Let me provide an abridged version for now, and I will follow up with a more comprehensive doc. Since RHQ uses CQL, I will try to stick with CQL terminology as well. Things are a lot more involved with the changes described in https://docs.jboss.org/author/display/RHQ/Aggregation+Schema+Changes, so the details I cover here will pertain more to RHQ 4.9.

                       

                      RHQ agents report raw metrics to the RHQ server, which in turn stores the data in Cassandra. Every hour a job runs that aggregates raw data into 1 hour aggregate metrics. Every six hours, we generate 6 hour aggregate metrics using the 1 hour data as input. And every 24 hours we generate 24 hour aggregate metrics using the 6 hour aggregate metrics as input. There are separate tables corresponding to each of the raw, 1 hr, 6 hr, and 24 hr metrics.
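
                      To make this concrete, here is a minimal CQL sketch of the table shapes. The raw_metrics columns match the cqlsh output later in this thread; the aggregate table is sketched from memory, and the type column encoding max/min/avg is an assumption, so check the actual DDL in the RHQ source:

                      CREATE TABLE raw_metrics (
                          schedule_id int,
                          time timestamp,
                          value double,
                          PRIMARY KEY (schedule_id, time)  -- schedule_id is the partition key
                      );

                      -- six_hour_metrics and twenty_four_hour_metrics follow the same shape
                      CREATE TABLE one_hour_metrics (
                          schedule_id int,
                          time timestamp,
                          type int,        -- distinguishes max/min/avg (assumed encoding)
                          value double,
                          PRIMARY KEY (schedule_id, time, type)
                      );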

                       

                      Each raw datum that the agent reports corresponds to a measurement schedule, which in turn is derived or realized from a measurement definition. The measurement definition is system-wide metadata that describes the measurement, e.g., units, trend up/down, etc. The measurement schedule is per resource. It specifies the collection interval used by the agent. The schedules are stored in the relational database. Each schedule has a unique id, and we use that id as the partition key for each of the metrics tables in Cassandra. Note that many CQL rows can make up a single partition, which in terms of physical storage layout is a single, wide row on disk.

                       

                      The metrics_index table is a custom index that is used to determine which schedules have data to be aggregated for a given time slice. For example, suppose the aggregation job runs at 14:00. It will aggregate raw data that was collected during the 13:00 - 14:00 time slice. We query the index to get a set of all the schedule ids for which data was inserted during that time slice.
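
                      The shape of the index and the query pattern look roughly like this (column names are from memory and may not match the deployed schema exactly):

                      CREATE TABLE metrics_index (
                          bucket varchar,     -- which rollup the entry feeds, e.g. 'one_hour_metrics'
                          time timestamp,     -- start of the time slice
                          schedule_id int,
                          PRIMARY KEY ((bucket, time), schedule_id)
                      );

                      -- which schedules reported raw data during the 13:00 - 14:00 slice?
                      SELECT schedule_id FROM metrics_index
                      WHERE bucket = 'one_hour_metrics' AND time = '2014-04-08 13:00:00';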

                       

                      In the schema design doc, I used the term buckets to generically refer to the raw_metrics, one_hour_metrics, six_hour_metrics, and twenty_four_hour_metrics tables.

                       

                      As an aside, most of the schema changes that went into RHQ 4.9 were made early on in the Cassandra 1.2.x time frame, while a lot of CQL features were still being added. With CQL collections at our disposal, we can more easily store data for all the buckets in a single CQL table. I believe there may be some advantages to doing so, but of course further testing and analysis is needed.
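
                      Purely as an illustration of that idea (not an actual RHQ table), a collapsed design could look something like:

                      CREATE TABLE metrics (
                          schedule_id int,
                          bucket text,               -- 'raw', 'one_hour', 'six_hour', 'twenty_four_hour'
                          time timestamp,
                          value map<text, double>,   -- e.g. {'max': ..., 'min': ..., 'avg': ...}
                          PRIMARY KEY ((schedule_id, bucket), time)
                      );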

                       

                      Hope this helps.

                      • 8. Re: Data Model
                        heiko.braun

                         Thanks John, that explains it.

                        • 9. Re: Data Model
                          heiko.braun
                          Each schedule has a unique id, and we use that id as the partition key for each of the metrics tables in Cassandra. Note that many CQL rows can make up a single partition, which in terms of physical storage layout is a single, wide row on disk.

                           

                          So the schedule ID becomes the row key, right? Regarding "many rows, single partition": can you elaborate on this?

                          • 10. Re: Re: Data Model
                            john.sanda

                            Yes, the schedule id is the row key, or partition key in CQL terminology. An example might help. Here is some sample output from cqlsh,

                             

                             cqlsh:rhq> select * from raw_metrics;

                              schedule_id | time                     | value
                             -------------+--------------------------+--------
                                      100 | 2014-04-08 16:20:00-0400 |      3
                                      100 | 2014-04-08 16:40:00-0400 |      5
                                      100 | 2014-04-08 17:20:00-0400 |     11
                                      100 | 2014-04-08 17:40:00-0400 |     16
                                      131 | 2014-04-08 16:07:00-0400 |   3.14
                                      101 | 2014-04-08 16:15:00-0400 | 0.0032
                                      101 | 2014-04-08 16:30:00-0400 |  0.104
                                      101 | 2014-04-08 17:30:00-0400 |  0.092
                                      101 | 2014-04-08 17:45:00-0400 | 0.0733

                             

                             Note that there are 4 (CQL) rows with schedule_id 100. Those 4 rows all live in the same partition, whose partition/row key is 100. In terms of physical storage, it would look like this:

                             

                             Row Key | Column name/value       | Column name/value       | Column name/value       | Column name/value
                                 100 | 2014-04-08 16:20:00     | 2014-04-08 16:40:00     | 2014-04-08 17:20:00     | 2014-04-08 17:40:00
                                     | 3                       | 5                       | 11                      | 16

                             

                             Hashing is done using the partition key. Each replica owns an entire partition, which means a single node can satisfy a query like SELECT * FROM raw_metrics WHERE schedule_id = 100 AND time < '2014-04-08 17:00:00'.
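
                             For contrast (and depending on the Cassandra version this may even be rejected outright), a query that does not restrict the partition key, such as

                             SELECT * FROM raw_metrics
                             WHERE time < '2014-04-08 17:00:00' ALLOW FILTERING;

                             cannot be routed to a single replica; Cassandra has to consider every partition in the table, which is why CQL makes you spell out ALLOW FILTERING.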

                            • 11. Re: Re: Data Model
                              john.sanda

                               The hashing remark might not have been clear. Let's say we have a 5 node cluster with a replication_factor of 3. This means that a copy of the partition/row with schedule id 100 will exist on 3 nodes. Those 3 nodes are the replicas for that partition.
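
                               In CQL, that factor is set when the keyspace is created, something like the following (illustrative, not necessarily RHQ's exact keyspace definition):

                               CREATE KEYSPACE rhq
                                 WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};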

                              • 12. Re: Data Model
                                heiko.braun

                                 Thanks. Now I see what you mean, and at the same time I realize why people have difficulties bridging from CQL to the actual CF layout. But this example explains it perfectly.

                                • 13. Re: Re: Data Model
                                  pilhuhn

                                   On 08.04.2014 at 16:27, John Sanda <do-not-reply@jboss.com> wrote:

                                  In the schema design doc, I used the term buckets to generically refer to the raw_metrics, one_hour_metrics, six_hour_metrics, and twenty_four_hour_metrics tables.

                                   

                                   Within RHQ we also sometimes use the term "Bucket" to denote one "bar" inside a metrics graph.

                                   Our metrics graphs have 60 such bars by default, and the default display interval is 8h for the whole graph, so a bar corresponds to all the raw metrics that were collected during a certain 8 minute interval (8h = 480 minutes, and 480 / 60 bars = 8 minutes per bar) - they fall into this bucket.