6 Replies Latest reply on Nov 18, 2014 5:04 AM by tsegismont

    Storing Multiple Metrics in a Single Time Series

    john.sanda

      There has been discussion recently about storing multiple metrics in a single time series. I was talking about this yesterday with tsegismont while we were looking at some sample output from cAdvisor that eventually gets stored in InfluxDB. Here is a simpler example from the InfluxDB docs that illustrates storing multiple metrics in a time series:

       

      [
        {
          "name": "response_times",
          "columns": ["code", "value", "controller_action"],
          "points": [
            [200, 234, "users#show"]
          ]
        }
      ]

       

      If we wanted to transpose this into a format supported by RHQ Metrics, we would have three separate metrics like response_times.code, response_times.value, and response_times.controller_action. This is fine for writes, but it does incur some overhead for reads. Three separate queries are needed since each metric is stored in its own partition. If we are usually querying this data together, we want to optimize for that read path, ideally reading from a single partition.

       

      Here is a simplified version of the data table in the schema-changes branch,

       

      CREATE TABLE data (
        metric text,
        time timeuuid,
        attributes map<text, text> static,
        value double,
        PRIMARY KEY (metric, time)
      );
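
      As a rough sketch of that read path (the transposed metric names are the illustrative ones from above, not anything RHQ Metrics defines), fetching the group back from this table means one query per partition:

      ```cql
      -- Each metric name is its own partition key, so each SELECT reads a different partition.
      SELECT time, value FROM data WHERE metric = 'response_times.code';
      SELECT time, value FROM data WHERE metric = 'response_times.value';
      -- controller_action is text, so it would not fit the double value column as-is;
      -- it is listed only to show the third read.
      SELECT time, value FROM data WHERE metric = 'response_times.controller_action';
      ```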
      

       

      Yesterday's discussion got me thinking about something I had previously considered,

       

      CREATE TABLE grouped_data (
        group text,
        metric text,
        time timeuuid,
        attributes map<text, text> static,
        value double,
        PRIMARY KEY (group, time, metric)
      );
      

       

      With this schema, response_times would be the group, and code, value, and controller_action would be the metric columns (we will ignore for the moment that controller_action is text and not numeric).

       

      INSERT INTO grouped_data (group, time, metric, value) VALUES ('response_times', now(), 'code', 200);
      INSERT INTO grouped_data (group, time, metric, value) VALUES ('response_times', now(), 'value', 234);
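
      A minimal sketch of the batched write and the single-partition read, assuming the grouped_data table above. The timeuuid literal is an arbitrary made-up value, written out explicitly so that both rows share the same time; as far as I know, calling now() in each statement would generate two different timeuuids.

      ```cql
      BEGIN BATCH
        -- Both rows go to the 'response_times' partition and share one time value.
        INSERT INTO grouped_data (group, time, metric, value)
          VALUES ('response_times', 5d0f4e10-6f0a-11e4-9803-0800200c9a66, 'code', 200);
        INSERT INTO grouped_data (group, time, metric, value)
          VALUES ('response_times', 5d0f4e10-6f0a-11e4-9803-0800200c9a66, 'value', 234);
      APPLY BATCH;

      -- One query, one partition: returns a row per (time, metric) pair.
      SELECT time, metric, value FROM grouped_data WHERE group = 'response_times';
      ```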
      

       

      Now we can store the data points with a single write using a batch update, and we can fetch the data by reading from only a single partition. After giving this some thought, I wound up with some concerns that make me think a different approach (the one I think tsegismont was suggesting) would be better. Here is the simplified schema:

       

      CREATE TABLE data (
        metric text,
        time timeuuid,
        attributes map<text, text> static,
        value double,
        values map<text, double>,
        PRIMARY KEY (metric, time)
      );
      

       

      The only difference here from the first data table definition is the addition of the values map. If we want to store multiple metrics within a single time series, then we write to the values column instead of the value column. Let's consider cpu usage metrics as an example. I am collecting cpu usage for my 4 cores. Since I will likely collect, write, and read this data together as a group, it is a perfect candidate for the values map. Here is an example of inserting data,

       

      INSERT INTO data (metric, time, values) VALUES ('myserver.cpu', now(), {
        'cpu0': 100,
        'cpu1': 100,
        'cpu2': 100,
        'cpu3': 100
      });
      

       

      I am still working through some of the details on how to expose this in the APIs. There are two scenarios I am focusing on right now. In the first, I know from the outset that I want to group metric data in a single time series. There needs to be something in the APIs (both REST and Java) to indicate that the metrics should be grouped. In the second scenario, I have some existing metrics that are not grouped, and I decide that I want to group them. We will need to expose a "grouping" function which creates a new time series. In this case, I think we would keep the original metric time series and create the new group time series, which would be implicitly updated any time the server stores data for one of the original metrics. The nice thing here is that you still have the ability to query the individual metrics.
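
      Purely as an assumption about how the implicit update in the second scenario might look at the storage level (nothing here is settled), the server could write the original metric and the group entry together. The metric name 'myserver.cpu0' and the timeuuid literal are hypothetical values chosen for illustration:

      ```cql
      BEGIN BATCH
        -- The original, ungrouped time series keeps receiving its data point.
        INSERT INTO data (metric, time, value)
          VALUES ('myserver.cpu0', 5d0f4e10-6f0a-11e4-9803-0800200c9a66, 97.5);
        -- The group time series gets the same point added to its values map.
        UPDATE data SET values['cpu0'] = 97.5
          WHERE metric = 'myserver.cpu' AND time = 5d0f4e10-6f0a-11e4-9803-0800200c9a66;
      APPLY BATCH;
      ```

      The two statements touch different partitions, so this trades some write cost for keeping both the individual and the grouped series queryable.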

        • 1. Re: Storing Multiple Metrics in a Single Time Series
          tsegismont

          Can you give some details on why you think the second approach is better? I believe that the second one is more "natural" for the purpose of grouping values in a single time series, but maybe there's a more technical reason?

           

          About the second approach:

           

          Do you have any idea how we'd be able to use the map data type for storing any type of values (not only doubles)? Maybe by creating multiple columns: values_double map<text,double>, values_text map<text,text>?

           

          Is there a good reason to keep the value double column? When a user creates a new time-series with just one column (as in Influx terminology), couldn't we simply add the date in the relevant map column (in this case, values_double)?

           

          When we talked about this I had in mind that storing named values could be the default for any time series. It would make it very easy to add new columns to the time series. And storing only one double value would just be a specific case.

          • 2. Re: Storing Multiple Metrics in a Single Time Series
            tsegismont

            For the record, Cassandra does not yet allow fetching a subset of a collection, but it's planned.
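
            To make that concrete with the values map from the first post: a read returns the whole map for each row, and picking out a single entry such as cpu0 has to happen on the client. A small sketch, assuming that schema:

            ```cql
            -- Returns the full values map for every data point in the partition;
            -- there is currently no way to ask for only the 'cpu0' entry in the SELECT clause.
            SELECT time, values FROM data WHERE metric = 'myserver.cpu';
            ```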

            • 3. Re: Storing Multiple Metrics in a Single Time Series
              john.sanda

              The first approach is actually much more flexible and more natural when you consider that the purpose of clustering columns is grouping. I am concerned though because it always forces the grouping. I really look at this as an optimization which may be premature. I see the second approach as safer since it makes the optimization optional.

               

              My examples left out additional columns for simplicity. The actual data table will include additional columns to support pre-computed aggregates, availability, and probably log event data.

               

              You could use blobs or user defined types (UDTs) for storing different types of values. The blob type would provide the most flexibility but requires the client to handle serializing and deserializing values. What use cases do you have in mind?

               

              Thomas Segismont wrote:

               

              Is there a good reason to keep the value double column? When a user creates a new time-series with just one column (as in Influx terminology), couldn't we simply add the date in the relevant map column (in this case, values_double)?

               

              When we talked about this I had in mind that storing named values could be the default for any time series. It would make it very easy to add new columns to the time series. And storing only one double value would just be a specific case.

               

              The value column is a double because we are primarily interested in numeric metrics. We could add additional columns (or tables if more appropriate) for string and value types if necessary. How likely do you think it is that clients will want to send data points that consist of both numeric and string data?

               

              We cannot add the date/timestamp to the map column. The timestamp is a clustering column. Maybe you are confusing Cassandra and InfluxDB terminology? We will support naming data points via tagging.

              • 4. Re: Re: Storing Multiple Metrics in a Single Time Series
                tsegismont

                john.sanda wrote:

                The first approach is actually much more flexible and more natural when you consider that the purpose of clustering columns is grouping. I am concerned though because it always forces the grouping. I really look at this as an optimization which may be premature. I see the second approach as safer since it makes the optimization optional.

                Yeah I guess it still looks more natural to me because of my relational thinking...

                john.sanda wrote:

                My examples left out additional columns for simplicity. The actual data table will include additional columns to support pre-computed aggregates, availability, and probably log event data.

                Ok, just wanted to be sure.

                john.sanda wrote:

                You could use blobs or user defined types (UDTs) for storing different types of values. The blob type would provide the most flexibility but requires the client to handle serializing and deserializing values. What use cases do you have in mind?

                I think double, bigint and strings are enough. I was thinking about a model like this:

                CREATE TABLE data (
                  metric text,
                  time timeuuid,
                  attributes map<text, text> static,
                  values_double map<text, double>,
                  values_bigint map<text, bigint>,
                  values_text map<text, text>,
                  PRIMARY KEY (metric, time)
                );

                john.sanda wrote:
                
                

                 

                The value column is a double because we are primarily interested in numeric metrics. We could add additional columns (or tables if more appropriate) for string and value types if necessary. How likely do you think it is that clients will want to send data points that consist of both numeric and string data?

                I'd say things this way: numeric metrics are the top priority as they are the only kind of time-series supported in RHQ's Storage Nodes, but we're interested as much in numeric metrics as in logs, call time... etc.

                 

                I expect data points that mix numeric and string data to be quite common if we do provide Influx compatibility. But even putting Influx aside, it can be useful for storing "complex" data or data with context.

                 

                Complex data: REST Endpoint Calltime

                INSERT INTO data (metric, time, values_bigint, values_text) VALUES (
                     'myapp.resources.performance',
                     12315601088,
                     {
                          'calltime': 128,
                          'response_code': 200
                     },
                     {'uri': '/path/to/resource'}
                )
                

                 

                Data with context: disk usage

                INSERT INTO data (metric, time, values_bigint, values_text) VALUES (
                     'disk_stats',
                     12315601088,
                     {
                          'free': 2000848089,
                          'used': 701515156165
                     },
                     {
                          'datacenter': 'marseille',
                          'room': 'calanques',
                          'host': 'sugiton',
                          'mount_point': '/pg_data'
                     }
                )

                john.sanda wrote:
                

                 

                We cannot add the date/timestamp to the map column. The timestamp is a clustering column. Maybe you are confusing Cassandra and InfluxDB terminology? We will support naming data points via tagging.

                Not sure what you're referring to.

                • 5. Re: Re: Re: Storing Multiple Metrics in a Single Time Series
                  john.sanda

                  With respect to mixing string and numeric or complex metrics, I also think it could open up some interesting possibilities. With that said, it appears from the cAdvisor output and the corresponding InfluxDB mapping code that only numeric values are getting stored.

                   

                  Storing context with data is important. We have attributes and tags (so far) for that. In your disk usage example, it probably makes more sense to store the second map, {'datacenter': 'marseille', 'room': 'calanques', 'host': 'sugiton', 'mount_point': '/pg_data'}, as attributes and/or tags. You would definitely want to utilize tagging for the added filtering it provides.
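
                  A sketch of what the attributes half of that could look like, assuming the data table with the static attributes column and the values map from my first post (tagging is left out here because its write path is not defined yet):

                  ```cql
                  -- Context is written once per metric; a static column is shared by every row in the partition.
                  UPDATE data SET attributes = {
                    'datacenter': 'marseille',
                    'room': 'calanques',
                    'host': 'sugiton',
                    'mount_point': '/pg_data'
                  } WHERE metric = 'disk_stats';

                  -- Each data point then carries only the numeric values.
                  INSERT INTO data (metric, time, values) VALUES ('disk_stats', now(), {
                    'free': 2000848089,
                    'used': 701515156165
                  });
                  ```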

                   

                  Your REST endpoint calltime example is a good illustration of why I characterize the grouping aspect of the discussion as a performance enhancement. calltime, response_code, and uri could all be stored as separate, individual metrics. We can provide the grouping at the API level but still store them separately. There may be times where storing the data in separate time series makes more sense. Writes are distributed across more partitions which can be good for balancing load across the cluster. Reads across multiple partitions can be executed in parallel, increasing throughput.

                  • 6. Re: Re: Re: Storing Multiple Metrics in a Single Time Series
                    tsegismont

                    john.sanda wrote:

                    With respect to mixing string and numeric or complex metrics, I also think it could open up some interesting possibilities. With that said, it appears from the cAdvisor output and the corresponding InfluxDB mapping code that only numeric values are getting stored.

                    To be honest, I'm not sure I understand the linked Go code. But I'll try to set up cAdvisor + Influx on my box and check how time series are defined.

                    john.sanda wrote:

                     

                    Storing context with data is important. We have attributes and tags (so far) for that. In your disk usage example, it probably makes more sense to store the second map, {'datacenter': 'marseille', 'room': 'calanques', 'host': 'sugiton', 'mount_point': '/pg_data'}, as attributes and/or tags. You would definitely want to utilize tagging for the added filtering it provides.

                    This is how tags would be modeled in Cassandra, AFAIK:

                    CREATE TABLE tags (
                        tenant_id text,
                        tag text,
                        type text,
                        metric text,
                        time timeuuid,
                        raw_data double,
                        aggregates set<frozen <aggregate_data>>,
                        availability blob,
                        PRIMARY KEY ((tenant_id, tag), type, metric, time)
                    );

                     

                    Could you show me how you would insert data for the disk usage example?

                    john.sanda wrote:

                     

                    Your REST endpoint calltime example is a good illustration of why I characterize the grouping aspect of the discussion as a performance enhancement. calltime, response_code, and uri could all be stored as separate, individual metrics. We can provide the grouping at the API level but still store them separately. There may be times where storing the data in separate time series makes more sense. Writes are distributed across more partitions which can be good for balancing load across the cluster. Reads across multiple partitions can be executed in parallel, increasing throughput.

                    Yeah, grouping at the API level is important, I think. I was asking about grouping at the storage level because I assumed it would be more efficient.