data modeling with time series data
john.sanda Aug 1, 2012 10:29 PM

I have started exploring Infinispan for managing time series data in RHQ. Since I am new to Infinispan, I would love to get some feedback on the following. The data involves various types of metrics, and each measurement is basically a tuple: {schedule_id, timestamp, value}. Collections occur at most every 30 seconds, which means up to 120 data points per hour per schedule. I need to support date range queries. The RHQ agent sends a metrics report to the RHQ server, which then persists the data. The report consists of N data points (tuples), and the schedule ids within a given report are unique. I have considered a few different approaches for how I might store the data in the cache.
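To make the discussion below concrete, here is a minimal sketch of the tuple described above. The class and field names are my own for illustration, not RHQ's actual types:

```java
// Illustrative representation of one metric measurement:
// {schedule_id, timestamp, value}. Names are hypothetical, not RHQ's.
class MetricDataPoint {
    final int scheduleId;
    final long timestamp; // epoch millis
    final double value;

    MetricDataPoint(int scheduleId, long timestamp, double value) {
        this.scheduleId = scheduleId;
        this.timestamp = timestamp;
        this.value = value;
    }
}
```

A metrics report is then just a list of these, with at most one point per schedule id.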
Approach 1: Store 1 cache entry per data point
This approach should make storing the data fast and efficient, which is important since the workload is write intensive. It does not seem to lend itself well to reads, though. Assuming for the moment that I know the keys, I could do a separate get() call for each data point, which is no good. I am reluctant to use the search module out of concern that the overhead added by the Lucene indexes would degrade write performance. So while storing the data at this fine-grained level may satisfy my requirements for write performance, I don't think it holds up well for queries.
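A rough sketch of approach 1, using a ConcurrentMap to stand in for the Infinispan Cache API (which implements ConcurrentMap). The key format and method names are my own illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Approach 1 sketch: one cache entry per data point.
// A ConcurrentMap stands in for the Infinispan cache; key format is illustrative.
class PerPointStore {
    private final ConcurrentMap<String, Double> cache = new ConcurrentHashMap<>();

    // Writes are a single put with no reads -- fast and write-friendly.
    void store(int scheduleId, long timestamp, double value) {
        cache.put(scheduleId + ":" + timestamp, value);
    }

    // Range query needs one get() per possible 30-second slot in the range
    // (and only works if the timestamps are known/aligned) -- this is the
    // part that does not hold up for reads.
    List<Double> query(int scheduleId, long startMillis, long endMillis) {
        List<Double> results = new ArrayList<>();
        for (long t = startMillis; t < endMillis; t += 30_000L) {
            Double v = cache.get(scheduleId + ":" + t);
            if (v != null) {
                results.add(v);
            }
        }
        return results;
    }
}
```

Even a one-hour slice costs up to 120 get() calls per schedule here, which is what makes this approach unattractive for date range queries.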
Approach 2: Store data points in a more coarse-grained data structure
Instead of storing one data point/tuple per cache entry, I could store all the data reported in a given hour in a collection such as a sorted set. The key for the cache entry would consist of the schedule id and the timestamp rounded down to the start of the hour. Maybe I could make the cache entry even more coarse-grained, storing 2 hours' worth of data per entry. Clients can easily determine the keys for a date range and use substantially fewer get() calls to retrieve the data in that range. There is additional overhead when writing the data into the cache, though. If I have N data points, then I need to perform N reads in order to insert/update the cache entries. The coarse-grained cache entries help with getting a time slice of data, but having to do a read/get before every put won't scale.
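A sketch of approach 2, again with plain maps standing in for the cache. In a real Infinispan cache the whole sorted collection is the entry value, so each write is a get of the collection, a modification, and a put back; the computeIfAbsent below hides that read-modify-write, but it is still there. Names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Approach 2 sketch: one coarse-grained entry per (schedule, hour).
class HourBucketStore {
    static final long HOUR_MILLIS = 60 * 60 * 1000L;

    // Round a timestamp down to the start of its hour -- this forms the key.
    static long hourBucket(long timestamp) {
        return timestamp - (timestamp % HOUR_MILLIS);
    }

    private final ConcurrentMap<String, ConcurrentNavigableMap<Long, Double>> cache =
            new ConcurrentHashMap<>();

    // Read-modify-write: the bucket must be fetched before the point is added.
    void store(int scheduleId, long timestamp, double value) {
        String key = scheduleId + ":" + hourBucket(timestamp);
        cache.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>())
             .put(timestamp, value);
    }

    // Date range query: only one get() per hour in the range.
    List<Double> query(int scheduleId, long startMillis, long endMillis) {
        List<Double> results = new ArrayList<>();
        for (long hour = hourBucket(startMillis); hour < endMillis; hour += HOUR_MILLIS) {
            ConcurrentNavigableMap<Long, Double> bucket = cache.get(scheduleId + ":" + hour);
            if (bucket != null) {
                results.addAll(bucket.subMap(startMillis, endMillis).values());
            }
        }
        return results;
    }
}
```

A 6-hour slice now costs 6 gets per schedule instead of up to 720, which is the appeal of this layout.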
Approach 3: Store 1 data point per cache entry and use distributed updates to generate coarse-grained entries
This is somewhat of a hybrid approach which I think has potential. Like the first approach, I store 1 data point per cache entry. After all the data in a metrics report is persisted, I use the DistributedExecutorService to update my cache. My DistributedCallable will create or update the more coarse-grained cache entries described in approach 2. This seems like a win-win, since I don't incur the overhead of having to do any reads when I write the metrics data into the cache, and I am still able to get at a time slice of data reasonably fast.
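A rough sketch of the hybrid, with plain maps standing in for the two caches. In the real thing, aggregate() would be the body of a DistributedCallable submitted via the DistributedExecutorService after the report is persisted, and each node would run it over its locally owned keys; everything here, names included, is just an illustration of the idea:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Approach 3 sketch: write fine-grained, aggregate to coarse-grained afterwards.
class HybridStore {
    static final long HOUR_MILLIS = 60 * 60 * 1000L;

    // Raw cache: one entry per data point; writes never read (approach 1).
    final ConcurrentMap<String, Double> rawCache = new ConcurrentHashMap<>();
    // Coarse cache: one sorted collection per (schedule, hour) (approach 2).
    final ConcurrentMap<String, ConcurrentNavigableMap<Long, Double>> hourCache =
            new ConcurrentHashMap<>();

    // Write path stays a pure put, so the metrics report is persisted fast.
    void store(int scheduleId, long timestamp, double value) {
        rawCache.put(scheduleId + ":" + timestamp, value);
    }

    // Aggregation pass: builds/updates the hour buckets from the raw entries.
    // In Infinispan this is what the DistributedCallable would do, node-local.
    void aggregate() {
        for (Map.Entry<String, Double> e : rawCache.entrySet()) {
            String[] parts = e.getKey().split(":");
            long timestamp = Long.parseLong(parts[1]);
            long hour = timestamp - (timestamp % HOUR_MILLIS);
            hourCache.computeIfAbsent(parts[0] + ":" + hour,
                            k -> new ConcurrentSkipListMap<>())
                     .put(timestamp, e.getValue());
        }
    }
}
```

The read-modify-write cost from approach 2 doesn't disappear; it is moved off the write path into the asynchronous aggregation step, which is the trade-off this approach is making.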
Is this a reasonable/sound approach?
Is this a good use of DistributedExecutorService?
Are there other ways I might want to consider storing the data that would make the time slice queries efficient without sacrificing write performance?
Thanks
- John