-
1. Re: Data Model
heiko.braun Apr 8, 2014 5:02 AM (in response to heiko.braun)Some additional pointers about bucketing, rollups, etc. would help in understanding how the data is stored and processed.
-
2. Re: Data Model
theute Apr 8, 2014 5:43 AM (in response to heiko.braun)This https://docs.jboss.org/author/display/RHQ/Cassandra+Back+End+Design+Notes (and its subpages) may be a good start; John or Stefan may have other info.
-
4. Re: Data Model
heiko.braun Apr 8, 2014 6:21 AM (in response to heiko.braun)For the record: some information about the aggregation scheme can be found here: Aggregation Schema Changes - RHQ - Project Documentation Editor
-
5. Re: Data Model
john.sanda Apr 8, 2014 8:12 AM (in response to heiko.braun)Hi Heiko,
I will write up some additional docs within the next day or two that provide an overview. Aggregation Schema Changes - RHQ - Project Documentation Editor is a good reference, but it only covers schema changes being introduced in RHQ 4.11.
-
6. Re: Data Model
heiko.braun Apr 8, 2014 9:27 AM (in response to john.sanda)Thanks John. You've already done a great job of documenting the C* architecture and design.
I do not yet fully understand how the raw data is computed into aggregates and how it breaks down into column families. Maybe you can provide a high-level overview of how the data moves through the system and how this corresponds to the column families, rows, and columns?
For instance, what's the relation between the raw data, the index, and the buckets? I guess it's something like "values", "to be computed", and "results"?
-
7. Re: Re: Data Model
john.sanda Apr 8, 2014 10:26 AM (in response to heiko.braun)Let me provide an abridged version for now, and I will follow up with a more comprehensive doc. Since RHQ uses CQL, I will try to stick with CQL terminology as well. Things are a lot more involved with the changes described in https://docs.jboss.org/author/display/RHQ/Aggregation+Schema+Changes, so the details I cover here pertain mostly to RHQ 4.9.
RHQ agents report raw metrics to the RHQ server, which in turn stores the data in Cassandra. Every hour a job runs that aggregates raw data into 1 hour aggregate metrics. Every six hours, we generate 6 hour aggregate metrics using the 1 hour data as input. And every 24 hours we generate 24 hour aggregate metrics using the 6 hour aggregate metrics as input. There are separate tables corresponding to each of the raw, 1 hr, 6 hr, and 24 hr metrics.
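For reference, here is a minimal CQL sketch of those tables. The raw_metrics definition matches the cqlsh output later in this thread; the aggregate table layout (in particular the type column for min/avg/max and its encoding) is my assumption, shown only to illustrate the pattern.

-- Raw data reported by the agents (columns match the cqlsh output below)
CREATE TABLE raw_metrics (
    schedule_id int,
    time timestamp,
    value double,
    PRIMARY KEY (schedule_id, time)
);

-- Illustrative layout for the aggregate tables; the type column
-- distinguishing min/avg/max is assumed
CREATE TABLE one_hour_metrics (
    schedule_id int,
    time timestamp,
    type int,        -- e.g. 0 = max, 1 = min, 2 = avg (assumed encoding)
    value double,
    PRIMARY KEY (schedule_id, time, type)
);
-- six_hour_metrics and twenty_four_hour_metrics follow the same pattern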
Each raw datum that the agent reports corresponds to a measurement schedule, which in turn is derived or realized from a measurement definition. The measurement definition is system-wide metadata that describes the measurement, e.g., units, trend up/down, etc. The measurement schedule is per resource. It specifies the collection interval used by the agent. The schedules are stored in the relational database. Each schedule has a unique id, and we use that id as the partition key for each of the metrics tables in Cassandra. Note that many CQL rows can make up a single partition, which in terms of physical storage layout is a single, wide row on disk.
The metrics_index table is a custom index that is used to determine which schedules have data to be aggregated for a given time slice. For example, suppose the aggregation job runs at 14:00. It will aggregate raw data that was collected during the 13:00 - 14:00 time slice. We query the index to get a set of all the schedule ids for which data was inserted during that time slice.
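As a sketch of how that index could look (the exact column layout and bucket naming here are assumptions on my part, not the literal RHQ schema):

-- Maps a bucket and time slice to the schedules that received data in that slice
CREATE TABLE metrics_index (
    bucket text,       -- e.g. 'one_hour_metrics' (naming assumed)
    time timestamp,    -- start of the time slice
    schedule_id int,
    PRIMARY KEY ((bucket, time), schedule_id)
);

-- The 14:00 aggregation run would then ask something like:
SELECT schedule_id FROM metrics_index
 WHERE bucket = 'one_hour_metrics' AND time = '2014-04-08 13:00:00';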
In the schema design doc, I used the term buckets to generically refer to the raw_metrics, one_hour_metrics, six_hour_metrics, and twenty_four_hour_metrics tables.
As an aside, most of the schema changes that went into RHQ 4.9 were made early on in the Cassandra 1.2.x time frame, while a lot of CQL features were still being added. With CQL collections at our disposal, we can more easily store data for all the buckets in a single CQL table. I believe there may be some advantages to doing so, but of course further testing and analysis is needed.
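Purely to illustrate that idea, a hypothetical single-table layout using a collection (not an actual RHQ table):

CREATE TABLE metrics (
    schedule_id int,
    bucket text,              -- 'raw', 'one_hour', 'six_hour', 'twenty_four_hour'
    time timestamp,
    value map<text, double>,  -- e.g. {'min': ..., 'avg': ..., 'max': ...}
    PRIMARY KEY ((schedule_id, bucket), time)
);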
Hope this helps.
-
8. Re: Data Model
heiko.braun Apr 8, 2014 11:08 AM (in response to john.sanda)Thanks John, that explains it.
-
9. Re: Data Model
heiko.braun Apr 8, 2014 11:15 AM (in response to john.sanda)Each schedule has a unique id, and we use that id as the partition key for each of the metrics tables in Cassandra. Note that many CQL rows can make up a single partition, which in terms of physical storage layout is a single, wide row on disk.
So the schedule ID becomes the row key, right? Regarding "many rows, single partition": can you elaborate on this?
-
10. Re: Re: Data Model
john.sanda Apr 8, 2014 11:30 AM (in response to heiko.braun)Yes, the schedule id is the row key, or partition key in CQL terminology. An example might help. Here is some sample output from cqlsh:
cqlsh:rhq> select * from raw_metrics;
schedule_id | time | value
-------------+--------------------------+--------
100 | 2014-04-08 16:20:00-0400 | 3
100 | 2014-04-08 16:40:00-0400 | 5
100 | 2014-04-08 17:20:00-0400 | 11
100 | 2014-04-08 17:40:00-0400 | 16
131 | 2014-04-08 16:07:00-0400 | 3.14
101 | 2014-04-08 16:15:00-0400 | 0.0032
101 | 2014-04-08 16:30:00-0400 | 0.104
101 | 2014-04-08 17:30:00-0400 | 0.092
101 | 2014-04-08 17:45:00-0400 | 0.0733
Note that there are 4 (CQL) rows with schedule_id 100. Those 4 rows all live in the same partition having the partition/row key of 100. In terms of physical storage, it would look like:
Row Key | Column name/value       | Column name/value       | Column name/value        | Column name/value
100     | 2014-04-08 16:20:00 : 3 | 2014-04-08 16:40:00 : 5 | 2014-04-08 17:20:00 : 11 | 2014-04-08 17:40:00 : 16
Hashing is done using the partition key. Each replica owns a partition, which means a single node can satisfy a query like SELECT * FROM raw_metrics WHERE schedule_id = 100 AND time < '2014-04-08 17:00:00'.
-
11. Re: Re: Data Model
john.sanda Apr 8, 2014 11:36 AM (in response to john.sanda)The hashing remark might not have been clear. Let's say we have a 5-node cluster with a replication_factor of 3. This means that a copy of the partition/row with schedule id 100 will exist on 3 nodes. Those 3 nodes are the replicas for that partition.
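For context, the replication factor is a property of the keyspace. Something like the following would set it up (the keyspace name rhq matches the cqlsh prompt above; SimpleStrategy is just the simplest choice for illustration):

CREATE KEYSPACE rhq
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};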
-
12. Re: Data Model
heiko.braun Apr 8, 2014 12:43 PM (in response to john.sanda)Thanks. Now I see what you mean, and at the same time I realize why people have difficulties bridging from CQL to the actual CF layout. But this example explains it perfectly.
-
13. Re: Re: Data Model
pilhuhn Apr 11, 2014 8:59 PM (in response to john.sanda)On 08.04.2014 at 16:27, John Sanda <do-not-reply@jboss.com> wrote:
In the schema design doc, I used the term buckets to generically refer to the raw_metrics, one_hour_metrics, six_hour_metrics, and twenty_four_hour_metrics tables.
Within RHQ we also sometimes use the term "Bucket" to denote one "bar" inside a metrics graph.
Our metrics graphs by default have 60 such bars, and the default display interval for the whole graph is 8 hours, so a bar corresponds to all the raw metrics collected during a certain 8-minute interval (480 min / 60 bars = 8 min per bar) - those metrics fall into that bucket.