
    Ingestion Rates and Retention Periods

    john.sanda

      I have been thinking about what kinds of ingestion rates we want or need to support, partly because this could affect schema design. By ingestion rate, I really mean the collection rate for a single metric. Let me talk a little about RHQ to better frame the discussion. Unlike RHQ Metrics, RHQ has only one metrics collector, the RHQ agent (let's set aside the RHQ REST API for the moment). Due to limitations on the agent side, the maximum collection frequency for any single metric is 30 seconds, and RHQ has a fixed retention period of seven days for raw data. This works out to a maximum of 20,160 live values per partition. RHQ partitions data by measurement schedule id; similarly, RHQ Metrics has been partitioning data by metric id thus far.

      If we intend to support much faster ingestion rates per metric, then we may want to consider partitioning data differently. Suppose we support an ingestion rate of 1 second (again, this is the rate at which we consume data points for a single metric), and assume for the moment the same retention period of 7 days. That works out to a maximum of 604,800 live cells/values. Personally, I do not see any reason why we would not support or allow sub-second or even sub-millisecond ingestion rates; in fact, Cassandra offers the TimeUUID data type since Java does not support sub-millisecond timestamp resolution. And I definitely think we want to allow for retention periods longer than 7 days. The numbers get pretty big pretty fast, and storing all of those data points in a single partition would be detrimental to performance.
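      To make the partition sizes concrete, here is a minimal CQL sketch of the one-partition-per-metric model under discussion (table and column names are illustrative, not the actual RHQ Metrics schema):

          CREATE TABLE raw_metrics (
              metric_id text,
              time      timeuuid,  -- timeuuid gives sub-millisecond resolution
              value     double,
              PRIMARY KEY (metric_id, time)  -- metric_id alone is the partition key
          );

          -- Every data point for a metric lands in a single partition:
          --   30 s rate * 7 day retention =  20,160 live cells
          --    1 s rate * 7 day retention = 604,800 live cells
          INSERT INTO raw_metrics (metric_id, time, value)
          VALUES ('heap-used', now(), 512.0)
          USING TTL 604800;  -- the 7 day retention expressed as a TTL in seconds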

       

      Depending on the ingestion rates and retention periods we support, we might want to partition data by metric id and by month, or by week, or even by day for example. Doing so will not have any negative impact on write performance, but it will certainly make queries more complex as multiple reads will have to be performed with results being merged client side.
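      As a rough sketch of what that could look like (the day granularity and all names are made up for illustration), partitioning by metric id plus a date bucket means a multi-day query turns into one read per bucket:

          CREATE TABLE raw_metrics_by_day (
              metric_id text,
              day       text,      -- e.g. '2014-09-29'
              time      timeuuid,
              value     double,
              PRIMARY KEY ((metric_id, day), time)  -- composite partition key
          );

          -- Reading the last three days hits three partitions,
          -- with the results merged client side:
          SELECT time, value FROM raw_metrics_by_day
           WHERE metric_id = 'heap-used' AND day = '2014-09-28';
          SELECT time, value FROM raw_metrics_by_day
           WHERE metric_id = 'heap-used' AND day = '2014-09-29';
          SELECT time, value FROM raw_metrics_by_day
           WHERE metric_id = 'heap-used' AND day = '2014-09-30';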

       

      It would be nice if there was a "one size fits all solution", but I am skeptical that we will find that to be the case. I think it is more likely that we will need to dynamically adjust our partitioning based on ingestion rates and retention periods.

       

      Is there a maximum ingestion rate that we should impose? What should we support, handle, plan for, etc. out of the box?

       

      I have similar questions regarding retention periods, but I will address those in a separate thread because there are some other details that I want to discuss that do not pertain to ingestion rates.

        • 1. Re: Ingestion Rates and Retention Periods
          tsegismont

          Could we have a schema such that partitioning by metric_id+day, metric_id+week, metric_id+month, or metric_id+year would all be supported in parallel?

           

          Suppose users are able to pre-define metrics with their specific retention period and estimated ingestion rate. We would be able to choose which of the models above would best fit.

           

          Later, as users change the retention period, or as we detect a significant increase/decrease in the ingestion rate, we may move the metric data from one model to the other (see the sketch below).

           

          Is that possible?
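          For what it's worth, a rough CQL sketch of the per-metric settings this idea would need (purely illustrative names, not a committed design):

              CREATE TABLE metric_settings (
                  metric_id          text PRIMARY KEY,
                  retention_days     int,   -- user-supplied retention period
                  est_points_per_min int,   -- user-supplied ingestion rate estimate
                  partition_scheme   text   -- 'day', 'week', 'month', or 'year',
                                            -- chosen from the two values above
              );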

          • 2. Re: Ingestion Rates and Retention Periods
            theute

            First, from a user's point of view, I am not convinced that we need a rate below one second.

             

            There should be a limit, but it should be the limit imposed by the implementation (users should not be able to bring down the server; AFAIK the 30s restriction in RHQ exists for that reason).

            We may impose a rate limitation at the API level, such that one can't ask the server to digest more than x data points in y timeframe. For instance, on a 1 min timeframe we may say that we accept 60 data points; that could be one every second, or nothing except a burst of 60 values within a single second.

            • 3. Re: Ingestion Rates and Retention Periods
              theute

              New Relic actually imposes such a limitation on its API:

              https://docs.newrelic.com/docs/plugins/plugin-developer-resources/developer-reference/plugin-api-specification#frequency

               

              20,000 metrics per POST and no more than 2 POSTs per minute, so 40,000 metrics per minute max (per agent?): "requests larger than this are subject to rejection or automatic data aggregation."

              • 4. Re: Ingestion Rates and Retention Periods
                tsegismont

                I'm fine with a rate limitation (automatic aggregation sounds like a good idea BTW) but not that low.

                I really don't expect users to have tons of metrics requiring more than 60 points/minute. But I feel like if the door is closed for this use case, they will look somewhere else.

                • 5. Re: Ingestion Rates and Retention Periods
                  theute

                  The number was just to illustrate.

                  • 6. Re: Ingestion Rates and Retention Periods
                    john.sanda

                    Thomas Segismont wrote:

                     

                    Could we have a schema such that partitioning by metric_id+day, metric_id+week, metric_id+month, or metric_id+year would all be supported in parallel?

                     

                    Suppose users are able to pre-define metrics with their specific retention period and estimated ingestion rate. We would be able to choose which of the models above would best fit.

                     

                    Later, as users change the retention period, or as we detect a significant increase/decrease in the ingestion rate, we may move the metric data from one model to the other.

                     

                    Is that possible?

                    We could support multiple partitioning schemes in parallel. It would require storing additional metadata so that we can track which date ranges are covered by a given partition.
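                    A rough sketch of what that metadata might look like in CQL (illustrative names, not a committed design):

                        CREATE TABLE partition_index (
                            metric_id   text,
                            range_start timestamp,  -- inclusive start of the covered date range
                            range_end   timestamp,  -- exclusive end
                            scheme      text,       -- which partitioning scheme holds this range
                            PRIMARY KEY (metric_id, range_start)
                        );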

                    • 7. Re: Ingestion Rates and Retention Periods
                      john.sanda

                      The New Relic info is interesting and helpful. I like the idea of imposing limits at the API level. Right now, though, I am more concerned about the schema and whether or not changes might be necessary. It sounds like we may want some additional partitioning; however, it seems premature to start adding support for multiple partitioning schemes in parallel. That can come in the future if and when necessary.