3 Replies Latest reply on Jan 16, 2015 11:03 AM by john.sanda

    Do we need pre-computed aggregates?


      I want to first provide some terminology and background to help frame the discussion.


      When querying time series data, resolution refers to the number of data points for a given time range. The highest resolution would provide every available data point for a time range. So if I want a query to use the highest resolution and if there are 100 data points, then the query result should include every one of those 100 points.


      As the number of data points increases, providing results at higher resolutions becomes less effective. For instance, increasing the resolution to the point where a graph in the UI includes 1 million points is probably no more effective than if the graph included only 10,000 or even 1,000 data points. The higher resolution could degrade user experience as rendering time increases. Latency on server response time is also likely to increase.


      Downsampling is a technique to provide data at lower resolution. It may involve applying one or more aggregation functions to the time series data across some discrete number of intervals, where the sum of the intervals' durations equals the duration of the original time range. Note that downsampling is done at query time.
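      To make the definition concrete, here is a minimal sketch of query-time downsampling. The function name `downsample`, its parameters, and the sample data are all assumptions made for illustration; this is not how RHQ Metrics or any of the systems discussed here implements it.

```python
from statistics import mean


def downsample(points, start, end, num_buckets, aggregate=mean):
    """Split [start, end) into equal intervals and aggregate each one.

    points: iterable of (timestamp, value) pairs.
    Returns a list of (bucket_start, aggregated_value) pairs; a bucket
    with no data points gets None as its value.
    """
    width = (end - start) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    for ts, value in points:
        if start <= ts < end:
            # Clamp to the last bucket to guard against float rounding.
            index = min(int((ts - start) / width), num_buckets - 1)
            buckets[index].append(value)
    return [
        (start + i * width, aggregate(values) if values else None)
        for i, values in enumerate(buckets)
    ]


# 100 raw data points reduced to 10 points, once with avg and once with max.
raw = [(t, float(t)) for t in range(100)]
averages = downsample(raw, 0, 100, 10)
maxima = downsample(raw, 0, 100, 10, aggregate=max)
```

      The interval widths add up to the original time range, and any aggregation function (avg, min, max, and so on) can be plugged in.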


      Downsampling is a necessary technique for dealing with high resolution data, but is it sufficient? There are a couple of issues to take into consideration. First, the process itself can become CPU-intensive to the point where it increases response latency. Second, there is the issue of storage. Suppose we lengthen the time range on our queries such that it spans 1 trillion data points. Whether it is 1 million or 1 trillion, at some point storing that many data points for our metrics becomes cost prohibitive.


      Pre-computed aggregation is the process of continually downsampling a time series and storing the lower resolution data for future analysis or processing. Pre-computed aggregates are often combined with data expiration/retention policies to address the aforementioned storage problem. Higher resolution data is stored for shorter periods of time than lower resolution data. Pre-computed aggregation can also alleviate the CPU utilization and latency problems. Instead of downsampling 1 million data points, we can query the pre-computed aggregated data points and perform downsampling on 10,000 data points.
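      The combination of rollups and retention can be sketched roughly as follows. The helper names `rollup` and `expire`, the hourly interval, and the one-day raw retention are assumptions chosen for illustration only; they do not reflect the defaults of any particular system.

```python
from collections import defaultdict

HOUR = 3600
DAY = 24 * HOUR


def rollup(raw_points, interval):
    """Reduce (timestamp, value) pairs to one (min, max, avg) tuple per interval."""
    grouped = defaultdict(list)
    for ts, value in raw_points:
        grouped[ts - ts % interval].append(value)
    return {
        bucket: (min(vals), max(vals), sum(vals) / len(vals))
        for bucket, vals in sorted(grouped.items())
    }


def expire(points, now, retention):
    """Drop points that are older than the retention period."""
    return [(ts, v) for ts, v in points if ts >= now - retention]


# One sample every 5 minutes for two days (synthetic values).
raw = [(ts, float(ts % HOUR)) for ts in range(0, 2 * DAY, 300)]

# Lower resolution data is computed ahead of time and kept long term ...
hourly = rollup(raw, HOUR)
# ... while the raw, high resolution data is kept for one day only.
raw = expire(raw, now=2 * DAY, retention=DAY)
```

      In this sketch a later query over the full two days would read 48 hourly tuples instead of 576 raw points, and only 288 raw points remain on disk.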


      The following table provides a summary of a few systems and their support (or lack thereof) for downsampling and pre-computed aggregation.


      System | Supports downsampling? | Supports pre-computed aggregates? | Pre-computed aggregation configurable?


      There are plenty of other systems that could be included in this table, but this is enough to be representative for this discussion. At one end of the spectrum we have OpenTSDB, which provides no support for pre-computed aggregates. They would have to be handled completely by the client. OpenTSDB only stores data at its original resolution. RHQ is at the other end of the spectrum in that it does provide pre-computed aggregates, but they are in no way configurable. There is no way, for example, to specify for which metrics you do or do not want to generate pre-computed aggregates. Nor is there a way to specify the intervals at which the downsampling is performed. InfluxDB falls in the middle of the spectrum in that it does provide pre-computed aggregates, which it calls continuous queries in its documentation, and they are configurable. You can define the continuous queries that you want along with the intervals and functions to use.


      Back to the original question. Does RHQ Metrics need to provide pre-computed aggregates? I believe that the answer is yes, but it has to be configurable. If we have metrics with retentions as short as a week or even a month, the costs of pre-computed aggregates may outweigh the benefits. But for other metrics whose data we want to keep around longer, possibly indefinitely, then pre-computed aggregates make a lot more sense.

        • 1. Re: Do we need pre-computed aggregates?

          +1 to pre-computed aggregated data, agreed that it goes in conjunction with retention.


          Here are a few use cases I have in mind:

          • Free trial in a SaaS environment: you want users to try your product and you store metrics for them, so it needs to be "cheap", both in CPU and in disk space. In this case you would offer 24-hour retention with no pre-aggregation.
          • CPU usage monitoring: in an ideal world you would want fine granularity and data kept forever (just in case). In reality, because of the cost of disk usage, one has to accept a tradeoff, and it will vary per user. Some will be ready to pay for 1 year of raw data and discard anything older; some may want 2 weeks of raw data but keep only the min/max/average tuple for 5 years.
          • A metric that doesn't change frequently (like a max DB pool): it should be completely unnecessary to collect data every few seconds. Ideally we would just store a data point when the value has changed (with the precision of the sampling). This case is particularly interesting because it could be an "event": if Wildfly can notify the system of a change instead of the data being polled constantly, it would give better precision without having to poll Wildfly frequently.
          • 2. Re: Do we need pre-computed aggregates?

            Can we please first discuss use cases for the pre-computed aggregates? We should not implement them just because RHQ has them.

            I am not saying that they are not useful, but we should try to better understand what we get from them and when they are used. For example, suppose you have 1h, 6h, and 1d averages and 7 days of raw retention. Which data from which aggregate (raw, 1d, 6h, ...) is used for, say, a display range of 15 days ago to 6 days ago? This clearly spans the whole range.

            Also do we know that min/max/avg is really what we want? What about 99%ile or such?

            For older data when we keep only 1d data, the precision of what happens inside and outside business hours is totally lost.

            For a CPU value, one could potentially just return 0%, 50%, 100% and not be off from reality (joking).

            As Thomas indicated, we may try to find better ways to reduce data volume without compromising on precision.

            I want to try to expose my concerns a bit more with the following examples.

            Suppose you have 24 hours of data with exactly 100 data points that go like this:

            data <- numeric(100)
            data[10] <- 1000
            data[20] <- 1000
            data[50] <- 1000
            data[60] <- 1000
            data[80] <- 1000

            Then min=0, max=1000, avg=50. While this is mathematically correct, those values do not convey any information about the shape of the data, which looks like this:

            [Screenshot: plot of the data above (Bildschirmfoto 2015-01-16 um 11.22.35.png)]

            Green is the median, blue the average, and orange the standard deviation (sd=219) of the data.

            When I now change the input data to

            data <- numeric(100)
            data[10] <- 1000
            data[20] <- 500
            data[50] <- 500
            data[60] <- 500
            data[80] <- 500
            data[11] <- 500
            data[21] <- 500
            data[51] <- 500
            data[61] <- 500

            the min/max/avg/median stay the same, but the look is totally different:

            [Screenshot: plot of the changed data (Bildschirmfoto 2015-01-16 um 11.29.38.png)]

            (The standard deviation is down to 166.)
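            The two examples can be checked with a short script. The following is a Python transcription of the R snippets above, using only the standard library; the helper name `summarize` is made up for this sketch. `statistics.stdev` computes the sample standard deviation, which is also what R's `sd` does.

```python
from statistics import mean, median, stdev


def summarize(data):
    """Return the summary statistics discussed above for one series."""
    return {
        "min": min(data),
        "max": max(data),
        "avg": mean(data),
        "median": median(data),
        "sd": stdev(data),  # sample standard deviation, like R's sd()
    }


# First series: five spikes of 1000 (R vectors are 1-based, hence i - 1).
first = [0.0] * 100
for i in (10, 20, 50, 60, 80):
    first[i - 1] = 1000.0

# Second series: one spike of 1000 and eight values of 500.
second = [0.0] * 100
second[10 - 1] = 1000.0
for i in (20, 50, 60, 80, 11, 21, 51, 61):
    second[i - 1] = 500.0

a = summarize(first)
b = summarize(second)
```

            Both series give min=0, max=1000, avg=50 and median=0. Only the standard deviation differs (about 219.0 versus about 166.7), so a stored min/max/avg tuple cannot tell the two shapes apart.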

            • 3. Re: Do we need pre-computed aggregates?

              "Can we please first discuss use cases for the pre-computed aggregates? We should not implement them just because RHQ has them.

              I am not saying that they are not useful, but we should try to better understand what we get from them"

              Did you read my initial post? If not, please do so. If you did read it, please go back and read it again. I clearly explain the motivations for pre-computed aggregates. I cite RHQ as an example for reference, but nowhere do I say or suggest that because RHQ generates pre-computed aggregates, RHQ Metrics should as well. The main issue I point out about RHQ is the lack of configurability. Every time series database I have looked at supports pre-computed aggregates (aka rollups). Some do it on the server side, and some require the client to do it.


              "As Thomas indicated, we may try to find better ways to reduce data volume without compromising on precision."

              I am open to suggestions. Consider this example. Suppose we sample a metric every 5 minutes. That results in 12 data points per hour. Now I want to query for data from the past 6 months. That results in about 52,584 data points. We would not want to return all of those data points to the UI, so we are probably going to perform some downsampling and aggregation. With that many data points involved, we are likely to see higher latencies that may in turn have a negative impact on user experience. We could persist the computations so that subsequent queries do not incur the performance hit, but that does not deal with the data growth on disk in general. It only addresses those metrics that you are querying at a particular point in time.
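              The arithmetic behind this example can be sketched as follows. Treating "6 months" as half of a 365.25-day year is an assumption of this sketch, so the exact count differs slightly depending on how the months are counted; the hourly and daily rollup intervals are likewise only examples.

```python
SAMPLE_INTERVAL_MIN = 5
DAYS = 365.25 / 2  # assumption: "6 months" as half an average year

points_per_hour = 60 // SAMPLE_INTERVAL_MIN   # 12 samples per hour
points_per_day = points_per_hour * 24         # 288 samples per day
raw_points = round(points_per_day * DAYS)     # raw points in the query range

# Points a query would have to read if rollups were stored instead.
hourly_points = raw_points // points_per_hour
daily_points = raw_points // points_per_day
```

              Under these assumptions the query range holds roughly 52,600 raw points for a single metric, against about 4,400 hourly or about 180 daily pre-computed points.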