6 Replies Latest reply on May 27, 2014 6:47 PM by Mike Thompson

    External data model

    Heiko Rupp Master

      Hey,

so we now have some backend implementations for RHQ-metrics at rhq-project/rhq-metrics · GitHub, and in both cases JSON data can be posted and retrieved via the REST API.

       

It looks like for posting new data to the REST server, both currently use

       

{
  "id": "foo",
  "timestamp": 1234,
  "value": 12.3
}


as the format for an individual data point. In previous discussions we were not clear on whether the id is an int or a String.
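As a concrete illustration, here is a minimal sketch of building and serializing a data point in this format; the helper name is made up for this example, and forcing the id to a string is one possible resolution of the int-vs-String question, not a settled decision:

```python
import json

def make_data_point(metric_id, timestamp_ms, value):
    """Build a single data point in the proposed wire format.

    The id is coerced to a string here; clients may use integers
    internally, but the wire representation stays a string.
    """
    return {
        "id": str(metric_id),
        "timestamp": int(timestamp_ms),
        "value": float(value),
    }

payload = json.dumps(make_data_point("foo", 1234, 12.3))
```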


For retrieval of data we currently have two different formats. One is fairly brief, as the API call only returns data for a single id: [{"timestamp":1399984022503,"value":11.0}]


      and then

{
  "bucket": "raw",
  "id": "100",
  "data": [
    { "time": 1398891828116, "value": 5.0 },
    { ... }
  ]
}

       

(What is that "bucket"? Are we leaking internal implementation details here?)

Other than the bucket, this looks like a wrapper around the slimmer format above.

I think we should standardize the property names, e.g. time vs. timestamp.
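Until the names are standardized, clients end up writing tolerant glue like the sketch below; the canonical key chosen here ("timestamp") is an assumption for illustration, not a decision:

```python
def normalize_point(point):
    """Map a raw data point to a canonical {'timestamp', 'value'} shape,
    accepting either 'time' or 'timestamp' as the key for epoch millis."""
    ts = point.get("timestamp", point.get("time"))
    if ts is None:
        raise ValueError("data point has neither 'timestamp' nor 'time'")
    return {"timestamp": ts, "value": point["value"]}
```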

       

And there was a recent discussion about multi-tenancy etc.:

      Do we want to include an explicit tenant id, which may just be empty / zero at the start?

       


        • 1. Re: External data model
          John Sanda Apprentice

I agree that we definitely need to standardize on names and whatnot. I honestly haven't thought too much about the APIs or message formats yet, but here is one thing I did consider. If I am retrieving data for a specific id, I don't see a need to include an id field in each of the returned time/value objects. The same goes for inserting data.

           

The bucket field corresponds to the bucket column in the metrics table. Using RHQ as our example, its value would be one of raw, one_hour, six_hour, or twenty_four_hour. I am not sure it needs to be included in the response; I need to think about it some more.

           

The id is stored as a string. Clients, such as RHQ, can use integers, but the ids need to be stored as strings. More importantly, it is up to the client to ensure that ids are unique.

           

Great question about tenancy. Maybe we can address that in a separate discussion?

          • 2. Re: External data model
            nstefan Newbie

I think it is a good idea to reply with as much data as possible, as long as system security is not compromised. Twitter is a good example where they blast API consumers with way too much information [1]. 140 characters + user name? Forget about that! How about over 200 lines of JSON content per tweet? And there is nothing wrong with that, because it's cheap for clients to ignore fields they don't need.


            [1] https://dev.twitter.com/docs/api/1/get/search

             

I would use the same approach with RHQ Metrics. Return as much info as possible, because consumers will consume/aggregate the data in their own ways. And I do not think the size of the data is a concern, because the aggregated data has a limited and well-defined number of data points.

            • 3. Re: External data model
              Heiko Braun Master

              External data model?

               

              Two questions in response:

               

              a) Can we expect a single external data format or do alternative transports bring their own, optimised wire format?

b) Why use JSON in the first place? IMO a simple text-based data format, like StatsD [1], would be sufficient. JSON looks good, but parsing it is terrible and introduces the requirement for further libraries on both the client and the server.

               

              [1] etsy/statsd · GitHub

              • 4. Re: Re: External data model
                nstefan Newbie

                Heiko Braun wrote:

                 

                External data model?

                 

                Two questions in response:

                 

                a) Can we expect a single external data format or do alternative transports bring their own, optimised wire format?

b) Why use JSON in the first place? IMO a simple text-based data format, like StatsD [1], would be sufficient. JSON looks good, but parsing it is terrible and introduces the requirement for further libraries on both the client and the server.

                 

                [1] etsy/statsd · GitHub

                 

The expectation is that a public REST interface supports at least JSON (similar to SOAP and XML). For Java, JSON support is just a library away, and there are a few good choices. And if I am not mistaken, things will get even better because of JSR-353.


One advantage of JSON is a degree of forward and backward compatibility across data model changes. Adding an extra field does not require any changes to client code; removing a field requires code changes only if the client is not already guarded. Complications arise only when a field is renamed.
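To illustrate the compatibility point: a client that reads fields defensively keeps working when the server adds fields, and can supply a default when an optional field disappears. The "tags" field below is hypothetical, purely to show the pattern:

```python
def parse_point(obj):
    """Defensively parse a data point from a decoded JSON object."""
    # Unknown extra fields (e.g. a future 'bucket' or 'tenantId')
    # are simply ignored by picking only what we need.
    return {
        "timestamp": obj["timestamp"],
        "value": obj["value"],
        # Hypothetical optional field with a default: its removal
        # on the server side does not break this client.
        "tags": obj.get("tags", {}),
    }
```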

                 

I briefly looked at the etsy/statsd format and I already see a few problems. Clients will be forced to implement a parsing library, since it is not a standard format, so the barrier to entry is higher. And the representation would be less expressive: basically, the only way to discover everything is via documentation.

                 

                Sample from etsy/statsd:

                <metricname>:<value>|<type>

                 

We could implement something like this as an additional data format. JSON needs to be primary because of REST's implicit requirements. Then we need to factor in the time required to implement the server-side library as well as to maintain the data format and data model. Plus, I would make it a requirement to provide a reference implementation for encoding & decoding data in Java, so the barrier to entry is lowered at least for Java clients.
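For comparison, parsing the etsy/statsd line format shown above is short, but every client would have to hand-roll it; a sketch, assuming only the basic `<metricname>:<value>|<type>` shape:

```python
def parse_statsd_line(line):
    """Parse a '<metricname>:<value>|<type>' line into (name, value, type)."""
    name, rest = line.split(":", 1)
    raw_value, metric_type = rest.split("|", 1)
    return name, float(raw_value), metric_type
```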

                • 5. Re: External data model
                  Mike Thompson Novice

I think JSON is a very reasonable choice of data format, as it is the new lingua franca of data formats: easily consumable by any language/platform and easily readable. Probably a majority of our consumers will be custom UIs in JavaScript, where parsing is built in. To help drive overall acceptance, I think using JSON will go a long way toward keeping the bar low for integration. There is a whole ecosystem of JSON tools out there as well.

                   

If JSON ever becomes a performance issue, we can address it then and perhaps offer a high-performance API with a different representation (at that point we would probably also want to consider binary formats, if size/speed is really the objective).

                  • 6. Re: External data model
                    Mike Thompson Novice

IMO both continuous data and discrete-interval (aggregated and bucketized) data APIs should exist. It is all too easy to pull back heaps of data and overwhelm the client. With discrete intervals we never have to worry about too much data (give me the data between Date A and Date B in 60 slots). It's easier to ensure nicer-looking graphs, and it won't clog up mobile clients.

                     

Sure, we can send all the data back and have the client do this, but then we are *sending* all of it back just to process it down (which could cause network bottlenecks because of the huge documents being sent around). Also, the aggregation would need to be written in each client's target language, introducing the possibility of different aggregation bugs across languages/platforms.
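The server-side slotting Mike describes could be sketched roughly like this; the function name, the averaging choice, and reporting empty slots as None are all assumptions for illustration, not the actual RHQ-metrics implementation:

```python
def bucketize(points, start, end, slots=60):
    """Aggregate (timestamp, value) pairs falling in [start, end) into a
    fixed number of slots, averaging the values in each slot.
    Slots with no data are reported as None."""
    width = (end - start) / slots
    sums = [0.0] * slots
    counts = [0] * slots
    for ts, value in points:
        if start <= ts < end:
            # Clamp to the last slot to guard against float rounding at the edge.
            i = min(int((ts - start) / width), slots - 1)
            sums[i] += value
            counts[i] += 1
    return [sums[i] / counts[i] if counts[i] else None for i in range(slots)]
```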