3 Replies Latest reply on Apr 9, 2014 10:09 AM by john.sanda

    Aggregation schema changes

    heiko.braun

      Some questions regarding the proposed changes in Aggregation Schema Changes - RHQ - Project Documentation Editor

       

      During aggregation, a single query is executed against metrics_index to obtain all of the schedule ids with raw data to be aggregated. Then for each schedule id, a query is executed against raw_metrics to fetch the data to be aggregated into 1 hour metrics.


      If the query against metrics_index returns N measurement schedule ids, then N + 1 queries are executed to aggregate raw data.
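The pattern above can be sketched as follows. The in-memory dicts and the aggregate_time_slice helper are hypothetical stand-ins for the actual CQL queries against metrics_index and raw_metrics, just to make the N + 1 shape concrete:

```python
# Illustrative sketch of the N+1 query pattern (not the actual RHQ code).
# metrics_index maps a time slice to the schedule ids with raw data;
# raw_metrics maps a schedule id to its raw values for that slice.
metrics_index = {"2014-04-09 03:00:00": [100, 101, 102]}
raw_metrics = {
    100: [1.0, 2.0, 3.0],
    101: [4.0, 6.0],
    102: [5.0],
}

def aggregate_time_slice(time_slice):
    """Runs 1 index query plus N raw-data queries: N + 1 in total."""
    queries = 0
    schedule_ids = metrics_index[time_slice]   # query 1: against metrics_index
    queries += 1
    one_hour = {}
    for sid in schedule_ids:
        raw = raw_metrics[sid]                 # one query per schedule id
        queries += 1
        one_hour[sid] = (min(raw), max(raw), sum(raw) / len(raw))
    return one_hour, queries

result, query_count = aggregate_time_slice("2014-04-09 03:00:00")
# 3 schedule ids -> 4 queries executed
```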

       

      I understand that N+1 queries are not desirable and can cause a significant performance overhead. But I am wondering why N+1 queries are needed at all. If we know what time_slice to process next (index tables), isn't it possible to do a single query across all rows and select all columns within a specific time span?

       

      It's probably my limited C* knowledge, but maybe someone can shed some light on this?



        • 1. Re: Aggregation schema changes
          john.sanda

          Doing that single query would be equivalent to a distributed full table scan across all cluster nodes. For any appreciable amount of data, we would probably see a lot of timeouts. The coordinator node, the node that receives the client request, might time out waiting for other nodes to reply. And the client could time out waiting for the coordinator because it does not send the response until it has the full results. When lots of data has to be fetched, it can be a lot more efficient to execute multiple queries in parallel and start processing results as they arrive instead of waiting for the full result set.
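A minimal sketch of that last point, using a thread pool and a made-up fetch_raw_data helper in place of a real asynchronous C* driver: each per-schedule query runs in parallel, and results are processed as soon as each one completes rather than waiting for the full result set.

```python
# Sketch: execute per-schedule queries in parallel and process results
# as they arrive. fetch_raw_data is a hypothetical stand-in for a
# single-partition query against raw_metrics.
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_raw_data(schedule_id):
    # Pretend this is a network round trip to one replica.
    return schedule_id, [float(schedule_id), float(schedule_id) + 1.0]

schedule_ids = [100, 101, 102]
aggregated = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch_raw_data, sid) for sid in schedule_ids]
    for future in as_completed(futures):   # handle each result when ready
        sid, values = future.result()
        aggregated[sid] = sum(values) / len(values)
```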

           

          The most efficient type of query is one executed against a single partition. The request can be serviced by a single replica and requires the fewest disk seeks.

           

          There is one other aspect to consider. Not all schedules will have data to process for the given time slice, which could result in a lot of unnecessary reads.

          • 2. Re: Aggregation schema changes
            heiko.braun
            "The most efficient type of query is one executed against a single partition."

            I see what you mean. That makes sense.

             

            Just out of curiosity, what would it look like if we had time slices as row keys and schedule ids as column keys? This would restrict all queries for a certain time slice to a single partition, right? I am currently just brainstorming and trying to understand the benefits and drawbacks of each approach. But since you have a lot more experience with the C* data model layout, I am wondering what your thoughts are?

            • 3. Re: Aggregation schema changes
              john.sanda

              Great question. Let's call the number of schedule ids N. If N is < 100,000 this might work out well. But as N increases, the query is going to get more resource intensive in terms of heap usage. This could also lead to hot spots, meaning writes for the time slice are not distributed across the cluster. To avoid these problems, we would probably want to partition the row.

              There are different ways of doing it. One is to append an integer such that keys would look like { 2014-04-09 03:00:00:0}, { 2014-04-09 03:00:00:1}, { 2014-04-09 03:00:00:2}, etc. On writes we could alternate the partition randomly or in a round robin fashion. Another approach might be to divide the time slice into smaller units; so instead of { 2014-04-09 03:00:00} as a key, we could divide it into 4, where we would have { 2014-04-09 03:00:00}, { 2014-04-09 03:15:00}, etc. Of course you need to keep track of the number of partitions so that you know how many queries to execute. As far as how many partitions are needed, it depends on the number of schedules, the resources allocated to each node (JVM heap, RAM on the box, etc.), and the number of nodes.
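Both partitioning schemes can be sketched roughly like this. The key formats, NUM_PARTITIONS, and the helper names are illustrative assumptions, not the actual RHQ implementation:

```python
# Sketch of the two row-partitioning schemes described above.
from datetime import datetime
from itertools import count

NUM_PARTITIONS = 4  # must be tracked so we know how many queries to run

# Approach 1: append an integer to the time slice key, rotating
# round robin across sub-partitions on each write.
_counter = count()

def round_robin_key(time_slice):
    return f"{time_slice}:{next(_counter) % NUM_PARTITIONS}"

# Approach 2: divide the hour into smaller sub-slices and key by the
# sub-slice a timestamp falls into.
def sub_slice_key(timestamp, minutes=15):
    sub = timestamp.replace(minute=(timestamp.minute // minutes) * minutes,
                            second=0, microsecond=0)
    return sub.strftime("%Y-%m-%d %H:%M:%S")

# Four consecutive writes land in four different sub-partitions.
keys = [round_robin_key("2014-04-09 03:00:00") for _ in range(4)]
```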

               

              One nice thing is that it is easy to dynamically increase or decrease the number of partitions. If we have an efficient way to track the number of schedule ids, we could use that information to decide whether more or less partitioning is needed. Whenever we query the time slice, we can make future adjustments based on the query results. One challenge I see is that data for a schedule id could be spread across multiple partitions, which means we would need to do an in-memory sort. Without thinking about it some more, I am not sure how this could be done without loading all of the data for the time slice into memory.
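One possible angle on the sort problem, purely as a sketch: if each partition's rows come back already ordered by timestamp (Cassandra returns rows within a partition in clustering order), a streaming k-way merge could recombine them without materializing the whole time slice. The partition contents below are made up:

```python
# Sketch: merge per-partition results that are each sorted by timestamp.
# heapq.merge consumes the inputs lazily, so the full time slice never
# has to be held in memory at once.
import heapq

partition_0 = [(1, 3.0), (4, 1.5)]   # (timestamp, value) pairs for one schedule
partition_1 = [(2, 2.0), (3, 5.0)]

merged = list(heapq.merge(partition_0, partition_1))  # sorted by timestamp
```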

               

              With the changes described in Aggregation Schema Changes - RHQ - Project Documentation Editor, we store all of the data for a time slice in a single partition, but only for a subset of schedules. All of the data for those schedules is co-located within that single partition.