-
1. Re: Pre-computed aggregates vs. ad-hoc queries
pilhuhn Apr 8, 2014 6:01 AM (in response to heiko.braun)
In the "General idea" document I have described the batched aggregations as a default service. One reason is that for long-term storage we want to reduce the amount of data to be stored, and thus only keep min/avg/max tuples for time slices of 1h, 6h and 1d, each of which expires after a certain time. On top of that, the raw data is kept for a certain amount of time (7 days currently in RHQ).
When you now query data from the last 8h you can get the raw data, while if you look at data from 9 months ago, you will get the daily avg tuples.
On top of that we currently have in RHQ the notion of buckets, where the UI always displays 60 slots of data no matter what timespan you look at. So here we have a service that automatically computes 60 min/avg/max tuples for a metric over the given timespan. This is done ad hoc at the moment the data is requested.
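To make the bucketing concrete, here is a minimal sketch of computing 60 min/avg/max slots over a given timespan. The class and method names are made up for illustration and are not RHQ's actual implementation:

    import java.util.List;

    // Hypothetical sketch of the 60-bucket aggregation described above.
    // A raw data point is just a (timestamp, value) pair.
    class DataPoint {
        final long timestamp; // millis since epoch
        final double value;
        DataPoint(long timestamp, double value) { this.timestamp = timestamp; this.value = value; }
    }

    class Bucket {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        long count = 0;
        double avg() { return count == 0 ? Double.NaN : sum / count; }
    }

    class BucketService {
        static final int NUM_BUCKETS = 60; // the UI always shows 60 slots

        // Distribute raw points into 60 equal-width buckets over [start, end).
        static Bucket[] aggregate(List<DataPoint> points, long start, long end) {
            Bucket[] buckets = new Bucket[NUM_BUCKETS];
            for (int i = 0; i < NUM_BUCKETS; i++) buckets[i] = new Bucket();
            long width = Math.max(1, (end - start) / NUM_BUCKETS);
            for (DataPoint p : points) {
                if (p.timestamp < start || p.timestamp >= end) continue;
                int i = (int) ((p.timestamp - start) / width);
                Bucket b = buckets[Math.min(i, NUM_BUCKETS - 1)];
                b.min = Math.min(b.min, p.value);
                b.max = Math.max(b.max, p.value);
                b.sum += p.value;
                b.count++;
            }
            return buckets;
        }
    }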
-
2. Re: Re: Pre-computed aggregates vs. ad-hoc queries
john.sanda Apr 8, 2014 9:58 AM (in response to pilhuhn)
RHQ definitely utilizes pre-computed aggregates, as Heiko Rupp pointed out. The bucketing that is mentioned is done in real time when the request is made. The computation of the aggregated data sets (1 hour, 6 hour, 1 day) is not done in real time; that is done in a batch processing job that runs hourly. The following table shows how we determine which data set to use.
Date Range*    Metrics
< 7 days       raw data
< 14 days      1 hour data
< 31 days      6 hour data
>= 31 days     1 day data

* For simplicity assume that the upper bound is now.
Are there other types of aggregates that we might want to pre-compute? With respect to ad-hoc queries, I guess I need an example or two to see where it might make sense. And of course with Cassandra, ad-hoc query capabilities are very limited.
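As a sketch, the table above translates into a simple selection based on how far back the start of the requested range lies. The names here are hypothetical, not the actual RHQ code:

    import java.time.Duration;

    // Hypothetical sketch of the data-set selection shown in the table above.
    enum DataSet { RAW, ONE_HOUR, SIX_HOUR, ONE_DAY }

    class DataSetSelector {
        // 'age' is how far back the start of the requested range lies from now.
        static DataSet select(Duration age) {
            if (age.toDays() < 7)  return DataSet.RAW;      // < 7 days: raw data
            if (age.toDays() < 14) return DataSet.ONE_HOUR; // < 14 days: 1 hour data
            if (age.toDays() < 31) return DataSet.SIX_HOUR; // < 31 days: 6 hour data
            return DataSet.ONE_DAY;                         // >= 31 days: 1 day data
        }
    }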
-
3. Re: Pre-computed aggregates vs. ad-hoc queries
heiko.braun Apr 8, 2014 11:05 AM (in response to john.sanda)
Does the data expire along these lines? I.e., after 7 days the corresponding raw data is purged?
-
4. Re: Pre-computed aggregates vs. ad-hoc queries
john.sanda Apr 8, 2014 11:11 AM (in response to heiko.braun)
Yes, this corresponds to the retention periods, except that the 1 day/24 hr data has a retention of one year. We expire data using Cassandra's TTL feature. One of the changes in Cassandra 2.0 is being able to use bind variables for the TTL in prepared statements. This change will make it much easier to support dynamically configurable retention periods.
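For illustration, a minimal sketch of such a prepared statement with the DataStax Java driver; the keyspace, table and column names are assumptions, not RHQ's actual schema:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    // Sketch of binding the TTL in a prepared statement (possible since
    // Cassandra 2.0). Keyspace, table and column names are hypothetical.
    public class TtlExample {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("metrics");

            // The TTL is a bind variable, so the retention period can be
            // changed at runtime without re-preparing the statement.
            PreparedStatement insert = session.prepare(
                "INSERT INTO raw_metrics (metric_id, time, value) VALUES (?, ?, ?) USING TTL ?");

            int sevenDays = 7 * 24 * 60 * 60; // retention in seconds
            session.execute(insert.bind("cpu.load", new java.util.Date(), 0.42, sevenDays));

            cluster.close();
        }
    }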
-
5. Re: Re: Pre-computed aggregates vs. ad-hoc queries
pilhuhn Apr 11, 2014 8:59 PM (in response to john.sanda)
On 08.04.2014 at 15:58, John Sanda <do-not-reply@jboss.com> wrote:
That is done in a batch processing job that runs hourly. The following table shows how we determine which data set to use.
Date Range* Metrics
< 7 days raw data
< 14 days 1 hour data
< 31 days 6 hour data
>= 31 days 1 day data
An interesting question here is what happens if the user requests data for e.g. day 6 to day 8: the first part of the requested range could be satisfied from the raw table, while the part older than 7 days can not.
Of course, if we use 60 display buckets, then the additional precision from using the raw data is not needed.
But this latter argument could already be applied to 2 days of data, i.e. from 2 days ago until now.
This changes IMO if the user wants to process the data in an external system where the highest possible precision is needed; there, for a request covering the last 14 days, the most recent 7 days should come from the raw table and the next 7 days from the 1h table. A sketch of such a split follows below.
(In other words: we should not rely on the current 60 display buckets to always hold true for rhq-metrics.)
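A minimal sketch of how such a straddling request could be split across tables; the boundary values and table names are assumed, not existing rhq-metrics code:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: split a requested [start, end) range into sub-ranges,
    // each served by the highest-precision table whose retention still covers it.
    class RangeSplitter {

        static class SubQuery {
            final String table;
            final Instant start, end;
            SubQuery(String table, Instant start, Instant end) {
                this.table = table; this.start = start; this.end = end;
            }
        }

        static List<SubQuery> split(Instant start, Instant end, Instant now) {
            Instant rawLimit = now.minus(Duration.ofDays(7));      // raw kept 7 days
            Instant oneHourLimit = now.minus(Duration.ofDays(14)); // 1h kept 14 days
            List<SubQuery> parts = new ArrayList<>();
            // Oldest part first: anything beyond the 1h retention has to come
            // from coarser tables (6h/1d); those are lumped together here.
            if (start.isBefore(oneHourLimit)) {
                parts.add(new SubQuery("six_hour_metrics", start, min(end, oneHourLimit)));
            }
            if (start.isBefore(rawLimit) && end.isAfter(oneHourLimit)) {
                parts.add(new SubQuery("one_hour_metrics", max(start, oneHourLimit), min(end, rawLimit)));
            }
            if (end.isAfter(rawLimit)) {
                parts.add(new SubQuery("raw_metrics", max(start, rawLimit), end));
            }
            return parts;
        }

        static Instant min(Instant a, Instant b) { return a.isBefore(b) ? a : b; }
        static Instant max(Instant a, Instant b) { return a.isAfter(b) ? a : b; }
    }

For the day 6 to day 8 example above, this yields two sub-queries: days 8 to 7 from the 1h table and days 7 to 6 from the raw table.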
-
6. Re: Re: Pre-computed aggregates vs. ad-hoc queries
pilhuhn Apr 18, 2014 5:43 AM (in response to john.sanda)
John Sanda wrote:
Yes, this corresponds to the retention periods, except that the 1 day/24 hr data has a retention of one year. We expire data using Cassandra's TTL feature. One of the changes in Cassandra 2.0 is being able to use bind variables for the TTL in prepared statements. This change will make it much easier to support dynamically configurable retention periods.
In our Summit session we also had a participant asking if we could shorten the retention period to e.g. only 30 days.
I think we need to take this into account and, e.g., in this case drop (= not compute) the 1h aggregates entirely.
We should not only allow shortening the total retention time, but also, e.g., allow keeping raw data for 30 days and only then fall back to using the 1h, ... aggregates.
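As a sketch of what such configurability could look like, with per-tier retention and the option to disable a tier entirely; none of these names exist in rhq-metrics today:

    import java.util.EnumMap;
    import java.util.Map;

    // Hypothetical sketch of per-tier retention configuration. A retention
    // of 0 means "do not compute/store this tier at all".
    public class RetentionConfig {
        enum Tier { RAW, ONE_HOUR, SIX_HOUR, ONE_DAY }

        private final Map<Tier, Integer> retentionDays = new EnumMap<>(Tier.class);

        void setRetention(Tier tier, int days) { retentionDays.put(tier, days); }

        boolean isEnabled(Tier tier) { return retentionDays.getOrDefault(tier, 0) > 0; }

        int ttlSeconds(Tier tier) { return retentionDays.getOrDefault(tier, 0) * 24 * 60 * 60; }

        public static void main(String[] args) {
            // The case from the discussion: keep raw data for 30 days and
            // skip the 1h aggregates entirely.
            RetentionConfig config = new RetentionConfig();
            config.setRetention(Tier.RAW, 30);
            config.setRetention(Tier.ONE_HOUR, 0); // disabled: not computed
            config.setRetention(Tier.SIX_HOUR, 31);
            config.setRetention(Tier.ONE_DAY, 365);

            System.out.println("1h tier enabled: " + config.isEnabled(Tier.ONE_HOUR));
            System.out.println("raw TTL (s): " + config.ttlSeconds(Tier.RAW));
        }
    }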