
    Ingestion Rates and Retention Periods

    john.sanda

      I have been thinking about what kinds of ingestion rates we want or need to support, partly because this could affect schema design. By ingestion rate, I really mean the collection rate for a single metric. Let me talk a little about RHQ to better frame the discussion. Unlike RHQ Metrics, RHQ has only one metrics collector, the RHQ agent (let's set aside the RHQ REST API for the moment). Due to limitations on the agent side, the maximum collection frequency for any single metric is 30 seconds, and RHQ has a fixed retention period of seven days for raw data. This works out to a maximum of 20,160 live values per partition. RHQ partitions data by measurement schedule id; similarly, RHQ Metrics has been partitioning data by metric id thus far.

      If we intend to support much faster ingestion rates per metric, then we may want to consider partitioning data differently. Suppose we support an ingestion rate of 1 second (again, this is the rate at which we consume data points for a single metric), and assume for the moment the same retention period of 7 days. That works out to a maximum of 604,800 live cells/values. Personally, I do not see any reason why we would not support or allow sub-second or even sub-millisecond ingestion rates; in fact, Cassandra offers the TimeUUID data type since Java does not support sub-millisecond timestamp resolution. And I definitely think we want to allow for retention periods longer than 7 days. The numbers get pretty big pretty fast, and storing all of those data points in a single partition would be detrimental to performance.
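      To make the partition sizes concrete, here is a minimal CQL sketch of the one-partition-per-metric model under discussion (table and column names are illustrative, not the actual RHQ Metrics schema):

          CREATE TABLE raw_metrics (
              metric_id text,
              time      timeuuid,  -- timeuuid gives sub-millisecond resolution
              value     double,
              PRIMARY KEY (metric_id, time)  -- metric_id alone is the partition key
          );

          -- Every data point for a metric lands in a single partition:
          --   30 s rate * 7 day retention =  20,160 live cells
          --    1 s rate * 7 day retention = 604,800 live cells
          INSERT INTO raw_metrics (metric_id, time, value)
          VALUES ('heap-used', now(), 512.0)
          USING TTL 604800;  -- the 7 day retention expressed as a TTL in seconds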

       

      Depending on the ingestion rates and retention periods we support, we might want to partition data by metric id and by month, or by week, or even by day for example. Doing so will not have any negative impact on write performance, but it will certainly make queries more complex as multiple reads will have to be performed with results being merged client side.
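      As a rough sketch of what that could look like (the day granularity and all names are made up for illustration), partitioning by metric id plus a date bucket means a multi-day query turns into one read per bucket:

          CREATE TABLE raw_metrics_by_day (
              metric_id text,
              day       text,      -- e.g. '2014-09-29'
              time      timeuuid,
              value     double,
              PRIMARY KEY ((metric_id, day), time)  -- composite partition key
          );

          -- Reading the last three days hits three partitions,
          -- with the results merged client side:
          SELECT time, value FROM raw_metrics_by_day
           WHERE metric_id = 'heap-used' AND day = '2014-09-28';
          SELECT time, value FROM raw_metrics_by_day
           WHERE metric_id = 'heap-used' AND day = '2014-09-29';
          SELECT time, value FROM raw_metrics_by_day
           WHERE metric_id = 'heap-used' AND day = '2014-09-30';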

       

      It would be nice if there was a "one size fits all solution", but I am skeptical that we will find that to be the case. I think it is more likely that we will need to dynamically adjust our partitioning based on ingestion rates and retention periods.

       

      Is there a maximum ingestion rate that we should impose? What should we support, handle, plan for, etc. out of the box?

       

      I have similar questions regarding retention periods, but I will address those in a separate thread because there are some other details that I want to discuss that do not pertain to ingestion rates.

        • 1. Re: Ingestion Rates and Retention Periods
          tsegismont

          Could we have a schema such that partitioning by metric_id+day, metric_id+week, metric_id+month, or metric_id+year would all be supported in parallel?

           

          Suppose users are able to pre-define metrics with their specific retention period and estimated ingestion rate. We would be able to choose which of the models above would best fit.

           

          Later, as users change the retention period, or as we detect a significant increase/decrease in the ingestion rate, we may move the metric data from one model to the other (see the sketch below).

           

          Is that possible?
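          For what it's worth, a rough CQL sketch of the per-metric settings this idea would need (purely illustrative names, not a committed design):

              CREATE TABLE metric_settings (
                  metric_id          text PRIMARY KEY,
                  retention_days     int,   -- user-supplied retention period
                  est_points_per_min int,   -- user-supplied ingestion rate estimate
                  partition_scheme   text   -- 'day', 'week', 'month', or 'year',
                                            -- chosen from the two values above
              );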

          • 2. Re: Ingestion Rates and Retention Periods
            theute

            First, from a user's point of view, I am not convinced that we need a rate below one second.

             

            There should be a limit, but it should be the limit imposed by the implementation (users should not be able to bring down the server; AFAIK the 30s restriction in RHQ exists for that reason).

            We may impose a rate limitation at the API level, such that one can't ask the server to digest more than x data points in y timeframe. For instance, on a 1 min timeframe we may say that we accept 60 data points; that could be one every second, or nothing except a burst of 60 values within a single second.

            • 3. Re: Ingestion Rates and Retention Periods
              theute

              New Relic actually imposes such a limitation on its API:

              https://docs.newrelic.com/docs/plugins/plugin-developer-resources/developer-reference/plugin-api-specification#frequency

               

              20,000 metrics per POST and no more than 2 POSTs per minute, so 40,000 metrics per minute max (per agent?): "requests larger than this are subject to rejection or automatic data aggregation."

              • 4. Re: Ingestion Rates and Retention Periods
                tsegismont

                I'm fine with a rate limitation (automatic aggregation sounds like a good idea BTW) but not that low.

                I really don't expect users to have tons of metrics requiring more than 60 points/minute. But I feel like if the door is closed for this use case, they will look somewhere else.

                • 5. Re: Ingestion Rates and Retention Periods
                  theute

                  The number was just to illustrate.

                  • 6. Re: Ingestion Rates and Retention Periods
                    john.sanda

                    Thomas Segismont wrote:

                     

                    Could we have a schema such that partitioning by metric_id+day, metric_id+week, metric_id+month, or metric_id+year would all be supported in parallel?

                     

                    Suppose users are able to pre-define metrics with their specific retention period and estimated ingestion rate. We would be able to choose which of the models above would best fit.

                     

                    Later, as users change the retention period, or as we detect a significant increase/decrease in the ingestion rate, we may move the metric data from one model to the other.

                     

                    Is that possible?

                    We could support multiple partitioning schemes in parallel. It would require storing additional metadata so that we can track which date ranges are covered by a given partition.
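                    A rough sketch of what that metadata might look like in CQL (illustrative names, not a committed design):

                        CREATE TABLE partition_index (
                            metric_id   text,
                            range_start timestamp,  -- inclusive start of the covered date range
                            range_end   timestamp,  -- exclusive end
                            scheme      text,       -- which partitioning scheme holds this range
                            PRIMARY KEY (metric_id, range_start)
                        );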

                    • 7. Re: Ingestion Rates and Retention Periods
                      john.sanda

                      The New Relic info is interesting and helpful. I like the idea of imposing limits at the API level. Right now, though, I am more concerned about the schema and whether or not changes might be necessary. It sounds like we may want some additional partitioning; however, it seems premature to start adding support for multiple partitioning schemes in parallel. That can come in the future if and when necessary.