data modeling with time series data
john.sanda Aug 1, 2012 10:29 PM

I have started exploring Infinispan for managing time series data in RHQ. Since I am new to Infinispan, I would love to get some feedback on the following. The data involves various types of metrics, and each measurement is basically a tuple: {schedule_id, timestamp, value}. Collections occur at most every 30 seconds, which means up to 120 data points per hour per schedule. I need to support date range queries. The RHQ agent sends a metrics report to the RHQ server, which then persists the data. The report consists of N data points (tuples), and the schedule ids within a given report are unique. I have considered a few different approaches for how I might store the data in the cache.
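To make the discussion below concrete, here is a minimal sketch of the tuple described above. The class and field names are my own for illustration, not RHQ's actual types:

```java
// Illustrative representation of one metric measurement:
// {schedule_id, timestamp, value}. Names are hypothetical, not RHQ's.
class MetricDataPoint {
    final int scheduleId;
    final long timestamp; // epoch millis
    final double value;

    MetricDataPoint(int scheduleId, long timestamp, double value) {
        this.scheduleId = scheduleId;
        this.timestamp = timestamp;
        this.value = value;
    }
}
```

A metrics report is then just a list of these, with at most one point per schedule id.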
Approach 1: Store 1 cache entry per data point
This approach should make storing the data fast and efficient, which is important since the workload is write intensive. It does not seem to lend itself well to reads, though. Assuming for the moment that I know the keys, I could do a separate get() call for each data point, which is no good. I am reluctant to use the search module out of concern that the overhead added by the Lucene indexes would degrade write performance. So while storing the data at this fine-grained level may satisfy my requirements for write performance, I don't think it holds up well for queries.
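A rough sketch of approach 1, using a ConcurrentMap to stand in for the Infinispan Cache API (which implements ConcurrentMap). The key format and method names are my own illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Approach 1 sketch: one cache entry per data point.
// A ConcurrentMap stands in for the Infinispan cache; key format is illustrative.
class PerPointStore {
    private final ConcurrentMap<String, Double> cache = new ConcurrentHashMap<>();

    // Writes are a single put with no reads -- fast and write-friendly.
    void store(int scheduleId, long timestamp, double value) {
        cache.put(scheduleId + ":" + timestamp, value);
    }

    // Range query needs one get() per possible 30-second slot in the range
    // (and only works if the timestamps are known/aligned) -- this is the
    // part that does not hold up for reads.
    List<Double> query(int scheduleId, long startMillis, long endMillis) {
        List<Double> results = new ArrayList<>();
        for (long t = startMillis; t < endMillis; t += 30_000L) {
            Double v = cache.get(scheduleId + ":" + t);
            if (v != null) {
                results.add(v);
            }
        }
        return results;
    }
}
```

Even a one-hour slice costs up to 120 get() calls per schedule here, which is what makes this approach unattractive for date range queries.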
Approach 2: Store data points in a more coarse-grained data structure
Instead of storing one data point/tuple per cache entry, I could store all the data reported in a given hour in a collection such as a sorted set. The key for the cache entry would consist of the schedule id and the timestamp rounded down to the start of the hour. Maybe I could make the cache entry even more coarse-grained, storing 2 hours' worth of data per entry. Clients can easily determine the keys for a date range and use substantially fewer get() calls to retrieve the data in that range. There is additional overhead when writing the data into the cache, though. If I have N data points, then I need to perform N reads in order to insert/update the cache entries. The coarse-grained cache entries help with getting a time slice of data, but having to do a read/get before every put won't scale.
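A sketch of approach 2, again with plain maps standing in for the cache. In a real Infinispan cache the whole sorted collection is the entry value, so each write is a get of the collection, a modification, and a put back; the computeIfAbsent below hides that read-modify-write, but it is still there. Names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Approach 2 sketch: one coarse-grained entry per (schedule, hour).
class HourBucketStore {
    static final long HOUR_MILLIS = 60 * 60 * 1000L;

    // Round a timestamp down to the start of its hour -- this forms the key.
    static long hourBucket(long timestamp) {
        return timestamp - (timestamp % HOUR_MILLIS);
    }

    private final ConcurrentMap<String, ConcurrentNavigableMap<Long, Double>> cache =
            new ConcurrentHashMap<>();

    // Read-modify-write: the bucket must be fetched before the point is added.
    void store(int scheduleId, long timestamp, double value) {
        String key = scheduleId + ":" + hourBucket(timestamp);
        cache.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>())
             .put(timestamp, value);
    }

    // Date range query: only one get() per hour in the range.
    List<Double> query(int scheduleId, long startMillis, long endMillis) {
        List<Double> results = new ArrayList<>();
        for (long hour = hourBucket(startMillis); hour < endMillis; hour += HOUR_MILLIS) {
            ConcurrentNavigableMap<Long, Double> bucket = cache.get(scheduleId + ":" + hour);
            if (bucket != null) {
                results.addAll(bucket.subMap(startMillis, endMillis).values());
            }
        }
        return results;
    }
}
```

A 6-hour slice now costs 6 gets per schedule instead of up to 720, which is the appeal of this layout.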
Approach 3: Store 1 data point per cache entry and use distributed updates to generate coarse-grained entries
This is somewhat of a hybrid approach which I think has potential. Like the first approach, I store 1 data point per cache entry. After all the data in a metrics report is persisted, I use the DistributedExecutorService to update my cache. My DistributedCallable will create or update the more coarse-grained cache entries described in approach 2. This seems like a win-win, since I don't incur the overhead of having to do any reads when I write the metrics data into the cache, and I am still able to get at a time slice of data reasonably fast.
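A rough sketch of the hybrid, with plain maps standing in for the two caches. In the real thing, aggregate() would be the body of a DistributedCallable submitted via the DistributedExecutorService after the report is persisted, and each node would run it over its locally owned keys; everything here, names included, is just an illustration of the idea:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Approach 3 sketch: write fine-grained, aggregate to coarse-grained afterwards.
class HybridStore {
    static final long HOUR_MILLIS = 60 * 60 * 1000L;

    // Raw cache: one entry per data point; writes never read (approach 1).
    final ConcurrentMap<String, Double> rawCache = new ConcurrentHashMap<>();
    // Coarse cache: one sorted collection per (schedule, hour) (approach 2).
    final ConcurrentMap<String, ConcurrentNavigableMap<Long, Double>> hourCache =
            new ConcurrentHashMap<>();

    // Write path stays a pure put, so the metrics report is persisted fast.
    void store(int scheduleId, long timestamp, double value) {
        rawCache.put(scheduleId + ":" + timestamp, value);
    }

    // Aggregation pass: builds/updates the hour buckets from the raw entries.
    // In Infinispan this is what the DistributedCallable would do, node-local.
    void aggregate() {
        for (Map.Entry<String, Double> e : rawCache.entrySet()) {
            String[] parts = e.getKey().split(":");
            long timestamp = Long.parseLong(parts[1]);
            long hour = timestamp - (timestamp % HOUR_MILLIS);
            hourCache.computeIfAbsent(parts[0] + ":" + hour,
                            k -> new ConcurrentSkipListMap<>())
                     .put(timestamp, e.getValue());
        }
    }
}
```

The read-modify-write cost from approach 2 doesn't disappear; it is moved off the write path into the asynchronous aggregation step, which is the trade-off this approach is making.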
Is this a reasonable/sound approach?
Is this a good use of DistributedExecutorService?
Are there other ways I might want to consider storing the data that would make the time slice queries efficient without sacrificing write performance?
Thanks
- John