9 Replies Latest reply on Jun 13, 2014 6:42 AM by theute

API, support for tags

theute May 26, 2014 9:18 AM

OpenTSDB has a notion of tags which is pretty useful IMO.

Today a metric is composed of:

id
timestamp
value

And we may want to add

tags

Tags are pretty useful to categorize and search. For instance:

{id=cpu_mhz, timestamp=1234, value=1475, {ip=127.0.0.1, cpu=0, core=0}}

With this additional metadata you should be able to query data in a single request for all cores, or for a single core, without knowing the number of cores upfront.
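
A minimal in-memory sketch of that query idea, in Python. The `Metric` class, the `query` helper, and the extra sample values are hypothetical and only for illustration; nothing here is an existing rhq-metrics API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Metric:
    """Hypothetical data point: id, timestamp, value, plus free-form tags."""
    id: str
    timestamp: int
    value: float
    tags: Dict[str, str] = field(default_factory=dict)


def query(metrics: List[Metric], metric_id: str, **tag_filter: str) -> List[Metric]:
    """Return the points for metric_id whose tags match every given filter."""
    return [
        m for m in metrics
        if m.id == metric_id
        and all(m.tags.get(name) == value for name, value in tag_filter.items())
    ]


# Made-up sample data modeled on the example above.
data = [
    Metric("cpu_mhz", 1234, 1475, {"ip": "127.0.0.1", "cpu": "0", "core": "0"}),
    Metric("cpu_mhz", 1234, 1480, {"ip": "127.0.0.1", "cpu": "0", "core": "1"}),
    Metric("cpu_mhz", 1234, 1502, {"ip": "10.0.0.2", "cpu": "0", "core": "0"}),
]

# All cores of one host, without knowing how many cores there are.
all_cores = query(data, "cpu_mhz", ip="127.0.0.1")
# A single core of that host.
one_core = query(data, "cpu_mhz", ip="127.0.0.1", core="1")
```

Leaving a tag out of the filter is what selects "all cores"; adding it narrows the result to one.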

Should we support tags? Do we need to support this now?

1. Re: API, support for tags

heiko.braun May 26, 2014 9:52 AM (in response to theute)

+1

It does allow you to build some kind of multi-tenancy support, e.g., when tags describe hosts.

Since it will very likely impact data locality in Cassandra, we should consider it upfront. But I think John can tell us more about the problems of trying to retrofit something like this.

2. Re: API, support for tags

john.sanda May 26, 2014 10:35 AM (in response to theute)
We have introduced something similar in RHQ with the aggregation schema changes described at https://docs.jboss.org/author/display/RHQ/Aggregation+Schema+Changes. In the metrics_cache and metrics_cache_index tables we have a start_schedule_id column. The sole purpose of this column is to group data in the metrics_cache table to reduce query overhead during aggregation. Ids in RHQ are monotonically increasing integers, so we can easily auto-calculate the start_schedule_id or tag. In RHQ Metrics, however, the client would have to decide how to perform that grouping.

What you have in mind is slightly different and would almost certainly require some additional changes other than the ones described in the aggregation schema changes doc. The most obvious approach to add support for tags is with a custom index, e.g.,

CREATE TABLE tags ( tag_name text, tag_value text, metric_id text, PRIMARY KEY (tag_name, tag_value) );

If we want a tag to apply to all metric data for a particular metric id, then this is all we need for schema changes. But if we want more fine-grained tagging where we support tagging specific values, we would have to consider other, possibly different schema changes.
3. Re: API, support for tags

heiko.braun May 26, 2014 1:02 PM (in response to john.sanda)

John Sanda wrote:

"If we want a tag to apply to all metric data for a particular metric id," ...

Well, maybe we should talk about how tags could be used. Some use cases I am aware of:

a) use tags to provide "multi tenancy" support:
e.g. "ds-pool-size", "host=master"

which allows several tenants to provide the same metrics

b) advanced query capabilities (i.e. some level of ad-hoc query support)

e.g. "http-response-time", "browser=chrome", "path=/checkout", "os=linux", "origin=foo.bar.com"

this would allow querying for all clients that refer to the same origin or use the same browser.

The first case could be expressed using a single tenant_id as opposed to multiple tags. The second case easily leads to tag explosion and might quickly hit a performance wall.
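
A rough back-of-the-envelope sketch of tag explosion, in Python. The per-tag counts are invented for illustration and not measured from any real system; the point is only that the number of possible tag combinations is the product of the distinct values per tag.

```python
from math import prod

# Hypothetical number of distinct values seen for each tag of one metric
# ("http-response-time"). These counts are assumptions for illustration.
distinct_values = {"browser": 5, "path": 200, "os": 4, "origin": 50}

# Upper bound on distinct tag combinations (and so on distinct series)
# for this single metric name.
max_series = prod(distinct_values.values())

# For comparison: one tenant_id attribute with the same 50 origins as tenants
# yields at most one series per tenant.
max_series_tenant_only = distinct_values["origin"]

print(max_series, max_series_tenant_only)
```

With these assumed counts the ad-hoc case allows up to 200,000 combinations for one metric, against 50 when only a single tenant-like attribute is used.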

What use cases did you have in mind?
4. Re: API, support for tags

john.sanda May 26, 2014 10:53 PM (in response to heiko.braun)

heikobraun, these are some great examples.

I do not think tags are the way to go for implementing multi-tenancy support. Multi tenancy should provide filtering or partitioning of all data for clients. If I am one client organization or tenant and you are another, then I should only be able to access data for my organization, such that it seems as though I am the only client, and vice versa for you. This suggests to me the need for an org_id or tenant_id column in the metrics and tags tables, and probably in any other tables that we add. I do think that we need multi-tenancy support early on for cloud environments like OpenShift. Multi tenancy merits its own separate thread, assuming we are in agreement that tags are not the best way to implement it.

I am not worried about tag explosion so long as we identify the use cases we want to handle and factor them into our data model. The example schema I mentioned might be fine for filtering by a particular tag. But let's suppose we want to support filtering by an arbitrary number of tags. Queries can get less efficient as we filter by more tags. We would probably want to think about more and/or different schema changes to address the use case of filtering by multiple tags.
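
A minimal sketch, in Python, of why filtering by more tags costs more with a simple tag index: each tag in the filter needs its own lookup, and the per-tag results then have to be intersected. The index layout and function names are hypothetical and only mirror the (tag_name, tag_value) to metric_id idea; this is not how Cassandra executes queries.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

TagIndex = Dict[Tuple[str, str], Set[str]]


def build_index(rows: Iterable[Tuple[str, str, str]]) -> TagIndex:
    """Build (tag_name, tag_value) -> {metric_id} from (name, value, id) rows."""
    index: TagIndex = defaultdict(set)
    for tag_name, tag_value, metric_id in rows:
        index[(tag_name, tag_value)].add(metric_id)
    return index


def find_metric_ids(index: TagIndex, **tags: str) -> Set[str]:
    """Return metric ids carrying ALL given tags: one lookup per tag."""
    per_tag = [index.get((name, value), set()) for name, value in tags.items()]
    if not per_tag:
        return set()
    # The intersection step is the extra work that grows with each added tag.
    return set.intersection(*per_tag)


# Made-up sample rows.
index = build_index([
    ("browser", "chrome", "m1"),
    ("browser", "chrome", "m2"),
    ("os", "linux", "m2"),
    ("os", "linux", "m3"),
])

single = find_metric_ids(index, browser="chrome")
multi = find_metric_ids(index, browser="chrome", os="linux")
```

Filtering by one tag is a single lookup; filtering by two already needs two lookups plus the intersection on the client side.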

I did not have a specific use case in mind for fine-grained tagging. I really just wanted to point out that with the example schema I gave, the tagging is at the data stream/id level and not at the individual data point level. I could see tagging of individual data points being useful with alerting. Here's another example. Suppose I make some temporary change to the resource I am monitoring, e.g., increase the JVM heap, and I want an indication of that in the metrics I am collecting.
5. Re: API, support for tags

pilhuhn May 27, 2014 1:43 AM (in response to heiko.braun)

It does allow you to build some kind of multi tenancy support. I.e when tags describe hosts.

Which leads to the question of how we define a tenant. I think a tenant would be a completely separate user, like a different OpenShift account, which can then have multiple hosts/gears to monitor.
In this case I would not include that inside the tags, but rather in a separate column.
The same goes for the host, if we decide that every metric has to have a host associated (and I think this is not always true, e.g. if an application wants to record business data that is not host-specific).
We should also consider the space requirements if arbitrary sets of tags can be supplied.

Reading through the OpenTSDB example, it looks a lot like what we do in RHQ, but encoded on the stored data instead of on the Schedule we have in RHQ.

>> mysql.bytes_received 1287333217 327810227706 schema=foo host=db1

In RHQ that would be a MeasurementDefinition of "mysql.bytes_received" (where the definition also adds support for units, description, type (dynamic, monotonically increasing)) applied to a resource "schema=foo" as a child resource of "host=db1". In RHQ we then store the numerical id of this schedule in the tuple of <id,time,value>.
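
A minimal sketch of that mapping, in Python. `ScheduleRegistry` and its method are hypothetical names invented for illustration and are not RHQ classes; the sketch only shows the descriptive data being resolved once to a numeric id, which is then the only thing stored with each data point.

```python
from itertools import count
from typing import Dict, Tuple

ResourcePath = Tuple[str, ...]


class ScheduleRegistry:
    """Hypothetical registry: (definition, resource path) -> numeric schedule id."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, ResourcePath], int] = {}
        self._next_id = count(1)  # monotonically increasing integers

    def schedule_id(self, definition: str, resource_path: ResourcePath) -> int:
        key = (definition, resource_path)
        if key not in self._ids:
            self._ids[key] = next(self._next_id)
        return self._ids[key]


registry = ScheduleRegistry()

# "schema=foo" as a child resource of "host=db1", as in the example above.
sid = registry.schedule_id("mysql.bytes_received", ("host=db1", "schema=foo"))

# The stored tuple carries only <id, time, value>; no tags travel with the data.
point = (sid, 1287333217, 327810227706)
```

The trade-off compared with OpenTSDB-style tags is that the descriptive data lives in one place (the schedule) rather than being repeated on every stored point.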
6. Re: API, support for tags

heiko.braun May 27, 2014 4:44 AM (in response to pilhuhn)

how do we define tenant?

Yes, my example was somewhat ambiguous. From my point of view, a "tenant_id" attribute can be used for anything like "host", "application" or "user". Basically, any client that wants metrics of the same kind to be separated along an organisational constraint.

I hope this makes sense.
7. Re: API, support for tags

mithomps May 27, 2014 5:41 PM (in response to heiko.braun)

More possible use cases for tags:

1) Being able to tag a single metric point(s) as interesting, giving it whatever tag name I want so I can later search for this metric "incident". For instance, being able to right click on a point in a graph to tag it as whatever.
2) Grouping multiple graphs together so that they can be viewed as a cohesive group. For instance, resource A has metric 1, metric 3 and metric 6 that should be shown to provide the proper context around an "issue".
8. Re: Re: API, support for tags

john.sanda May 27, 2014 10:50 PM (in response to mithomps)
1) Being able to tag a single metric point(s) as interesting, giving it whatever tag name I want so I can later search for this metric "incident". For instance, being able to right click on a point in a graph to tag it as whatever.

This got me thinking some about a possible schema. I came up with a slight modification from before,

CREATE TABLE tags ( tag_name text, tag_value text, metric_id text, time timestamp, value map<int, double>, PRIMARY KEY (tag_name, tag_value, metric_id) );

This schema allows us to insert a tag for an entire time series,

INSERT INTO tags (tag_name, tag_value, metric_id) VALUES ('test', 'one', 'bar');

Note that we do not have to specify the time or value columns. We can insert a tag for a single data point as well,

INSERT INTO tags (tag_name, tag_value, metric_id, time, value) VALUES ('test', 'one', 'foo', '2014-05-21 22:16:27-0400', {0: 787.57});

In terms of querying, we can filter by tag_name, tag_name and tag_value, or tag_name, tag_value, and metric_id,

SELECT * FROM tags WHERE tag_name = 'test';
SELECT * FROM tags WHERE tag_name = 'test' AND tag_value = 'one';
SELECT * FROM tags WHERE tag_name = 'test' AND tag_value = 'one' AND metric_id = 'foo';
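
A small in-memory model of these three query shapes, in Python. It assumes the key order (tag_name, tag_value, metric_id) and treats each query as a match on a leading prefix of that key; the `insert` and `select` helpers and the third sample row are hypothetical and do not use any Cassandra driver.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str, str]  # (tag_name, tag_value, metric_id)

rows: Dict[Key, dict] = {}


def insert(tag_name: str, tag_value: str, metric_id: str,
           time: Optional[str] = None, value: Optional[dict] = None) -> None:
    """Store one row; time and value stay empty for a whole-series tag."""
    rows[(tag_name, tag_value, metric_id)] = {"time": time, "value": value}


def select(tag_name: str, tag_value: Optional[str] = None,
           metric_id: Optional[str] = None) -> List[Key]:
    """Return keys matching a leading prefix of (tag_name, tag_value, metric_id)."""
    if metric_id is not None and tag_value is None:
        # Key columns can only be restricted in order, left to right.
        raise ValueError("metric_id can only be given together with tag_value")
    prefix = tuple(p for p in (tag_name, tag_value, metric_id) if p is not None)
    return sorted(k for k in rows if k[:len(prefix)] == prefix)


insert("test", "one", "bar")                        # tag on an entire series
insert("test", "one", "foo",
       "2014-05-21 22:16:27-0400", {0: 787.57})     # tag on a single data point
insert("test", "two", "baz")                        # extra made-up row

by_name = select("test")
by_name_value = select("test", "one")
by_full_key = select("test", "one", "foo")
```

Each additional key column in the filter narrows the previous result, which is why the three queries can be served from the same table.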

This makes it easy, for example, to find all metric_ids with a particular tag name/value. The schema does not, however, provide support for storing multiple values for a tag_name, tag_value, metric_id tuple. I started investigating user defined types (UDTs) as a possible solution. A possible solution might look something like tags with UDTs. Not sure why, but the syntax highlighter is clipping my code sample. This offers a flexible approach for storing data points as well. UDTs are already available in Cassandra 2.1-beta2; however, support in the driver is not all that great yet. Another option could be to store the values in an encoded format such as JSON or a more performant binary format like http://msgpack.org.
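
A minimal sketch of the encoded-format option, in Python, using JSON from the standard library. The helper names, the list-of-pairs layout, and the sample timestamps and second value are assumptions for illustration; the idea is only that several (timestamp, value) pairs for one tag_name, tag_value, metric_id tuple can be packed into a single text column.

```python
import json
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp in epoch seconds, value)


def encode_points(points: List[Point]) -> str:
    """Pack several data points into one compact JSON string."""
    return json.dumps([[ts, v] for ts, v in points], separators=(",", ":"))


def decode_points(blob: str) -> List[Point]:
    """Unpack the JSON string back into (timestamp, value) tuples."""
    return [(ts, v) for ts, v in json.loads(blob)]


# Made-up sample points for one tagged metric.
points = [(1400724987, 787.57), (1400725047, 790.12)]

blob = encode_points(points)      # what would be written to the text column
restored = decode_points(blob)    # what a reader would get back
```

The cost of this approach is that the database can no longer filter or index on the individual values; they have to be decoded by the client.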
9. Re: API, support for tags

theute Jun 13, 2014 6:42 AM (in response to john.sanda)

After discussing with Kevin Conner, it seems that they would need support for tags with multiple values.
It would be good to clearly define exactly what tags would be for and how filtering would work, to make sure that we are all on the same page.