6 Replies Latest reply on Apr 18, 2014 5:43 AM by pilhuhn

    Pre-computed aggregates vs. ad-hoc queries

    heiko.braun

      Where do RHQ metrics fit into the picture?

        • 1. Re: Pre-computed aggregates vs. ad-hoc queries
          pilhuhn

          In the General idea document I have described the batched aggregations as a default service. One reason is that for long-term storage we want to reduce the amount of data to be stored, and thus only keep min/avg/max tuples for time slices of 1h, 6h and 1d, which all expire after a certain time. On top of that, the raw data is kept for a certain amount of time (currently 7 days in RHQ).

           

          When you now query data from the last 8h you can get the raw data, while if you look at data from 9 months ago, you will get the daily avg tuples.

           

          On top of that we currently have in RHQ the notion of buckets, where the UI always displays 60 slots of data no matter what timespan you look at. So here we have a service that automatically computes 60 min/avg/max tuples for a metric over the given timespan. This is done ad hoc at the moment the data is requested.
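The bucketing described above can be sketched roughly like this (a minimal illustration, not RHQ's actual code; class and record names are made up):

```java
import java.util.Arrays;

// Sketch of display bucketing: divide the requested timespan into a fixed
// number of slots (60 in RHQ's UI) and compute a min/avg/max tuple per slot
// from the raw data points that fall into it.
public class DisplayBuckets {

    static final int NUM_BUCKETS = 60;

    // One aggregated slot of the display timeline.
    record Bucket(double min, double avg, double max, boolean empty) {}

    // timestamps[i] / values[i] form one raw data point; range is [begin, end).
    static Bucket[] bucketize(long begin, long end, long[] timestamps, double[] values) {
        long width = Math.max(1, (end - begin) / NUM_BUCKETS);
        double[] min = new double[NUM_BUCKETS];
        double[] max = new double[NUM_BUCKETS];
        double[] sum = new double[NUM_BUCKETS];
        int[] count = new int[NUM_BUCKETS];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);

        for (int i = 0; i < timestamps.length; i++) {
            if (timestamps[i] < begin || timestamps[i] >= end) continue;
            int b = (int) Math.min(NUM_BUCKETS - 1, (timestamps[i] - begin) / width);
            min[b] = Math.min(min[b], values[i]);
            max[b] = Math.max(max[b], values[i]);
            sum[b] += values[i];
            count[b]++;
        }

        Bucket[] out = new Bucket[NUM_BUCKETS];
        for (int b = 0; b < NUM_BUCKETS; b++) {
            out[b] = count[b] == 0
                ? new Bucket(Double.NaN, Double.NaN, Double.NaN, true)
                : new Bucket(min[b], sum[b] / count[b], max[b], false);
        }
        return out;
    }
}
```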

          • 2. Re: Re: Pre-computed aggregates vs. ad-hoc queries
            john.sanda

            RHQ definitely utilizes pre-computed aggregates as Heiko Rupp pointed out. The bucketing that is mentioned is done in real time when the request is made. The computation of the chosen data set (1 hour, 6 hour, 1 day) is not done in real time. That is done in a batch processing job that runs hourly. The following table shows how we determine which data set to use.

             

            Date Range*     Metrics
            < 7 days        raw data
            < 14 days       1 hour data
            < 31 days       6 hour data
            >= 31 days      1 day data

             

            * For simplicity assume that the upper bound is now.
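The selection in the table above could be expressed along these lines (a sketch under the same simplifying assumption that the upper bound is now; names are illustrative, not RHQ's actual API):

```java
import java.time.Duration;

// Sketch of the data-set selection table: pick the most precise table that
// still covers the whole requested range, assuming the range ends "now".
public class MetricsTableSelector {

    enum Table { RAW, ONE_HOUR, SIX_HOUR, ONE_DAY }

    static Table select(Duration range) {
        if (range.compareTo(Duration.ofDays(7)) < 0) return Table.RAW;
        if (range.compareTo(Duration.ofDays(14)) < 0) return Table.ONE_HOUR;
        if (range.compareTo(Duration.ofDays(31)) < 0) return Table.SIX_HOUR;
        return Table.ONE_DAY;
    }
}
```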

             

             

            Are there other types of aggregates that we might want to pre-compute? With respect to ad-hoc queries, I guess I need an example or two to see where it might make sense. And of course with Cassandra, ad-hoc query capabilities are very limited.

            • 3. Re: Pre-computed aggregates vs. ad-hoc queries
              heiko.braun

              Does the data expire along these lines? I.e. after 7 days the corresponding raw data is purged?

              • 4. Re: Pre-computed aggregates vs. ad-hoc queries
                john.sanda

                Yes, this does correspond to retention periods, except that the 1 day/24 hr data has a retention of one year. We expire data using Cassandra's TTL feature. One of the changes in Cassandra 2.0 is being able to use bind variables for the TTL in prepared statements. This change will make it much easier to support dynamically configurable retention periods.
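To illustrate the point about bind variables: with Cassandra 2.0 the TTL in a prepared statement can itself be a bind marker, so the retention period becomes a runtime value rather than part of the statement text. The table and column names below are illustrative, not RHQ's actual schema:

```java
import java.util.concurrent.TimeUnit;

// Sketch of retention periods expressed as Cassandra TTLs. Cassandra 2.0
// allows "USING TTL ?" in a prepared statement, so a configurable retention
// period only needs to be converted to seconds and bound at execution time.
public class RetentionTtl {

    // Illustrative schema; the trailing bind marker is the TTL.
    static final String INSERT_RAW =
        "INSERT INTO raw_metrics (schedule_id, time, value) VALUES (?, ?, ?) USING TTL ?";

    // Cassandra TTLs are specified in seconds.
    static int ttlSeconds(int retentionDays) {
        return (int) TimeUnit.DAYS.toSeconds(retentionDays);
    }
}
```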

                • 5. Re: Re: Pre-computed aggregates vs. ad-hoc queries
                  pilhuhn

                  On 08.04.2014 at 15:58, John Sanda <do-not-reply@jboss.com> wrote:

                   

                  That is done in a batch processing job that runs hourly. The following table shows how we determine which data set to use.

                   

                   

                  Date Range*     Metrics

                  < 7 days     raw data

                  < 14 days     1 hour data

                  < 31 days     6 hour data

                  >= 31 days     1 day data

                   

                   

                  An interesting question here is e.g. what happens if the user requests data for day 6 through day 8: the first day in the requested range could be satisfied from the raw table, while the second day cannot.

                  Of course if we use 60 display buckets, then the additional precision from using the raw data is not needed.

                  But this latter argument already applies to 2 days of data, i.e. from 2 days ago until now.

                   

                  This changes IMO if the user wants to process the data in an external system where the highest possible precision is needed: for such a request, the data for the most recent 7 days should come from the raw table and the data for the preceding 7 days from the 1h table.

                   

                  (in other words: we should not rely on the current 60 display buckets always applying for rhq-metrics)
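One way to serve such a mixed request is to split the range at the retention boundaries and answer each piece from the most precise table that still holds it. A sketch under the retention periods discussed above (7 days raw, 14 days 1h data); the table names and the split policy are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of splitting a requested time range across tables so each sub-range
// is served from the most precise data still retained.
public class RangeSplitter {

    record SubQuery(String table, long begin, long end) {}

    // begin, end, now in milliseconds since the epoch; begin < end <= now.
    static List<SubQuery> split(long begin, long end, long now) {
        long day = 24L * 60 * 60 * 1000;
        long rawCutoff = now - 7 * day;      // raw data older than this has expired
        long oneHourCutoff = now - 14 * day; // 1h data older than this has expired

        List<SubQuery> parts = new ArrayList<>();
        if (begin < oneHourCutoff) {
            parts.add(new SubQuery("six_hour_metrics", begin, Math.min(end, oneHourCutoff)));
        }
        if (end > oneHourCutoff && begin < rawCutoff) {
            parts.add(new SubQuery("one_hour_metrics",
                Math.max(begin, oneHourCutoff), Math.min(end, rawCutoff)));
        }
        if (end > rawCutoff) {
            parts.add(new SubQuery("raw_metrics", Math.max(begin, rawCutoff), end));
        }
        return parts;
    }
}
```

For the day 6 through day 8 example above, this yields one sub-query against the 1h table (day 8 to day 7) and one against the raw table (day 7 to day 6).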

                  • 6. Re: Re: Pre-computed aggregates vs. ad-hoc queries
                    pilhuhn

                    John Sanda wrote:

                     

                    Yes, this does correspond to retention periods, except that the 1 day/24 hr data has a retention of one year. We expire data using Cassandra's TTL feature. One of the changes in Cassandra 2.0 is being able to use bind variables for the TTL in prepared statements. This change will make it much easier to support dynamically configurable retention periods.

                     

                    In our Summit session we also had a participant who asked whether we could shorten the retention period to e.g. 30 days only.

                    I think we need to take this into account and e.g. in this case drop (= not compute) the 1h aggregates entirely.


                    We should not only allow shortening the total retention time, but also e.g. allow keeping raw data for 30 days and only then fall back to using the 1h,... aggregates
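Such a configurable retention could drive which aggregate levels get computed at all. A sketch of one possible policy (the thresholds mirror the selection table earlier in the thread; everything here is illustrative, not a proposal for RHQ's actual configuration):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: derive the aggregate levels worth computing from a configurable
// retention. If raw data outlives an aggregate level's useful window, that
// level can be dropped (= not computed) entirely.
public class AggregationPlan {

    static List<String> levelsToCompute(int rawRetentionDays, int totalRetentionDays) {
        List<String> levels = new ArrayList<>();
        if (totalRetentionDays <= rawRetentionDays) {
            return levels; // raw data covers the whole retention, no aggregates needed
        }
        // 1h data only fills the gap between raw expiry and the ~14 day 6h cutoff
        if (rawRetentionDays < 14) levels.add("1h");
        if (totalRetentionDays > 14) levels.add("6h");
        if (totalRetentionDays > 31) levels.add("1d");
        return levels;
    }
}
```

With raw data kept for 30 days the 1h level is skipped, and with a total retention of 30 days and raw kept just as long, no aggregates are computed at all.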