
    Dev environment

    john.sanda

      Executing unit/integration tests requires a running instance of Cassandra 2.x. I wrote up a brief set of build instructions[1] that suggest using ccm[2] to install and run a cluster for development/testing. ccm is a good tool built by one of the lead Cassandra developers, and I definitely think it is a good tool to have in the developer toolbox; however, it can make getting started with rhq-metrics a bit cumbersome. People might run into problems installing ccm; I have, as have others. You can go ahead and download and install/configure Cassandra yourself, but that is cumbersome as well, especially if you are not very familiar with Cassandra.

       

      There was discussion on #rhq about disabling the Cassandra bits in the build by default, but as Cassandra is the target/primary data store, I would prefer to provide a solution that makes things easier for developers. What about doing something similar to what we do in RHQ?

       

      In RHQ we automatically deploy and tear down clusters for running integration tests. We could do something similar here and expand on it so that developers could easily create a cluster for use beyond running tests, such as for working with a REST client he is developing. The tooling would be limited to deploying clusters that run only on localhost. I think that is ok though, because once a developer gets a little invested in the project, setting up a cluster on his own, if necessary, becomes more bearable.

       

      Thoughts?

       

      [1] rhq-project/rhq-metrics · GitHub

      [2] https://github.com/pcmanus/ccm

        • 1. Re: Dev environment
          john.sanda

          My original statement about the build is incorrect. A running Cassandra instance is required even if you disable tests, e.g., mvn -DskipTests install. There is a script that executes in the test-compile phase which attempts to connect to a cluster and install a schema for testing.
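
          In rough terms, the script does something like the following (an illustrative sketch using the DataStax Java driver 2.x; the keyspace and table are made up here, not the actual test schema):

          import com.datastax.driver.core.Cluster;
          import com.datastax.driver.core.Session;

          public class InstallTestSchema {
              public static void main(String[] args) {
                  Cluster cluster = Cluster.builder()
                          .addContactPoint("127.0.0.1")
                          .build();
                  try {
                      // This connect() is why the build fails without a running
                      // cluster, even with -DskipTests.
                      Session session = cluster.connect();
                      session.execute(
                          "CREATE KEYSPACE IF NOT EXISTS rhq_metrics_test WITH replication = " +
                          "{'class': 'SimpleStrategy', 'replication_factor': 1}");
                      session.execute(
                          "CREATE TABLE IF NOT EXISTS rhq_metrics_test.metrics " +
                          "(id text, time timestamp, value double, PRIMARY KEY (id, time))");
                  } finally {
                      cluster.close();
                  }
              }
          }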

          • 2. Re: Dev environment
            pilhuhn

            > There was discussion on #rhq about disabling the Cassandra bits in the build by default, but as Cassandra is the target/primary data store,

            > I would prefer to provide a solution that makes things easier for developers. What about doing something similar to what we do in RHQ?


            Which developers are we talking about here?


            For a developer who is working on the C*-interfacing code it is for sure easier when C* is always running. If I am developing the REST API and want to see how different JSON/XML encodings work in the wild, I am happy with an in-memory "backend" that is able to store a few values to be displayed later. For creating fancy new graphs it is probably even enough to have a hard-coded list of values served via REST, so that a tiny container suffices.

            We got a lot of benefit from allowing a real DB to be used in tests, but we also had a lot of pain because of it.


            The other side of the coin is the users. If you want to integrate rhq-metrics e.g. into WildFly, you don't want to be required to (download, install &) start a C* cluster before starting WildFly for development.

            • 3. Re: Dev environment
              john.sanda

              I am talking about both developers and users. By developer, I mean someone who wants to make changes within the rhq-metrics code, whereas a user would be someone who wants to write code against the APIs provided by rhq-metrics. Both scenarios, though, at least for now, require cloning the Git repo and building the project. I agree that a user is probably less interested in the backend, but how much time and effort do we want to spend making a pluggable backend? I would rather have everyone using the same Cassandra backend so that we can more quickly and easily identify bugs, use cases, configuration issues, etc. If there is solid tooling available to make setting up the backend easy, then it should not be a problem for the user. RHQ is a good example with respect to Cassandra. You can clone the RHQ repo and execute the tests without even being aware that Cassandra is deployed.

               

              If I want to do some WildFly development that involves JPA, JMS, or whatever, I am going to have to do some configuration to talk to those external resources. It might be that the database is embedded, but some configuration is still required. It may be that the configuration is already done by default, but the point is, some configuration is required. If we can easily spin up a development cluster in the RHQ build and in the rhq-metrics build (yet to be done), why can't the same approach be taken with WildFly development?

              • 4. Re: Dev environment
                theute

                I am +1 on in-memory out of the box (and option to compile/test without requiring C*).

                It's better to "pay-the-price" as you go rather than requiring more complex setup from the beginning.

                 

                Very similar to using H2 in the application server by default, not something you would do in production but extremely convenient.

                 

                The middle-ground solution doesn't seem that appealing to me.

                 

                I really want to be able to build rhq-metrics by doing "mvn clean install", find a script in the bin directory, run it, and hit a few REST endpoints to start playing around.

                • 5. Re: Dev environment
                  nstefan

                  Thomas Heute wrote:

                   

                  I am +1 on in-memory out of the box (and option to compile/test without requiring C*).

                  It's better to "pay-the-price" as you go rather than requiring more complex setup from the beginning.

                   

                  Very similar to using H2 in the application server by default, not something you would do in production but extremely convenient.

                   

                   

                  With SQL it should be easier to find common ground between multiple implementations. Even so, H2 fell out of the support wagon for RHQ because of performance optimizations and the time needed to support three engines. Not to mention the Microsoft SQL Server effort that did not go anywhere. And SQL is standardized ...

                   

                  This task would be much more complex because any in-memory store would be completely different from Cassandra. And then there are questions about persisting data on shutdown and about migrations. Would there be migration paths between the engines? What about running in production with in-memory? What if a user runs with in-memory by mistake and takes down a physical server? How do we prevent that? Should we prevent that? And the list would go on and on ...

                   

                  The proper way to do this is a completely pluggable architecture. We would make the storage completely pluggable and write plugins for as many stores as possible: Postgres, Redis, MongoDB, etc. And one of them could be H2 (or something similar) to fit the in-memory requirements. But that would go against one of the initial goals of the project: fast, robust, scalable data storage for metrics. We selected Cassandra for all those reasons (it just fits all the needs), and so far the metrics code has been optimized around that. Support for in-memory data stores (or any other data store) would be a somewhat big departure from the initial goal and not a simple undertaking.

                   

                   

                  Thomas Heute wrote:


                  The middle-ground solution doesn't seem that appealing to me.

                   

                  I really want to build rhq-metrics by doing "mvn clean install", find a script in bin directory, run it and be able to hit few REST endpoints to already play around.

                   

                  John is working (see link below) on automating the life-cycle of Cassandra for testing and development purposes. This is the same approach that we took in RHQ. Developers do not know when or how a Cassandra cluster gets deployed when integration tests run in the mvn life-cycle. Developers just run the typical mvn command, and when the tests execute, a Cassandra server (or cluster) is configured, started, used, and stopped. This accomplishes "mvn clean install" for developers without complicating the code or adding extra features that might not be needed, while still keeping the initial goal of the project.

                   

                  And this sort of automation can be extended to the actual project deliverables. We need to spend more time finding a good set of requirements and coming up with a design, but the idea of a single executable that starts/controls the project code and Cassandra was done in RHQ and can be done again.

                   

                  PoC for automating C* deployment for development/testing by jsanda · Pull Request #2 · rhq-project/rhq-metrics · GitHub

                  • 6. Re: Dev environment
                    theute

                    Note that we are not talking about database support, "only" in-memory and Cassandra.

                     

                    In-memory is important so as not to add a hard dependency on Cassandra. If you take EAP as an example, it collects memory consumption in memory and displays a graph; RHQ Metrics should be able to replace that, and when a user wants to persist the data, they pay the price of installing Cassandra. There is no need to migrate data from in-memory to Cassandra.

                     

                    Configure, start, use, stop Cassandra for tests is fine, but for the default runtime I am less confident (for the same reasons we don't provision Oracle or MySQL from EAP).

                    • 7. Re: Re: Dev environment
                      john.sanda

                      I want to follow up on a couple of the points that nstefan made. Not only do we have non-standard APIs for the different data stores, but the different architectures of each can also have a big impact on behavior. For example, someone using an in-memory data store will not have to worry about consistency. With Cassandra though, consistency is configurable, and we may want to expose that to the user in some cases. We might want to let the user decide, per query, whether latency or consistency is more important.
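
                      With the DataStax Java driver that choice can be made per statement. A sketch (table/column names made up):

                      import com.datastax.driver.core.ConsistencyLevel;
                      import com.datastax.driver.core.ResultSet;
                      import com.datastax.driver.core.Session;
                      import com.datastax.driver.core.SimpleStatement;

                      public class TunableReads {
                          public static ResultSet findData(Session session, boolean preferLatency) {
                              SimpleStatement query = new SimpleStatement(
                                      "SELECT time, value FROM metrics WHERE id = 'cpu.load'");
                              // ONE = fastest, may read slightly stale data;
                              // QUORUM = consistent, slower
                              query.setConsistencyLevel(preferLatency ? ConsistencyLevel.ONE
                                                                      : ConsistencyLevel.QUORUM);
                              return session.execute(query);
                          }
                      }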

                       

                      Supporting multiple data stores makes the API design more complex. Consider the List<String> listMetrics() method that was added to the API. For the in-memory implementation, this simply involves returning the map keys. If this is an operation that we actually need, we may very well want to introduce a new table in the Cassandra schema to support it. Depending on the use cases for the operation, this could otherwise be rather inefficient as it would involve querying multiple nodes instead of a single replica.
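
                      To make the asymmetry concrete (a sketch, not the actual implementation):

                      import java.util.ArrayList;
                      import java.util.List;
                      import java.util.Map;
                      import java.util.concurrent.ConcurrentHashMap;

                      public class InMemoryExample {
                          private final Map<String, List<Double>> data =
                                  new ConcurrentHashMap<String, List<Double>>();

                          // In memory, listMetrics() is trivial: the metric names are the map keys.
                          public List<String> listMetrics() {
                              return new ArrayList<String>(data.keySet());
                          }
                      }
                      // In Cassandra there is no cheap equivalent: without a dedicated index
                      // table that is written on every insert, listing names means touching
                      // data spread across the whole cluster instead of a single replica.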

                       

                      In RHQ we probably see a lot more Oracle-specific than Postgres-specific production bugs. I think that this can in large part be attributed to the fact that Oracle simply is not used nearly as much in development.

                       

                      We can build on my pull request (https://github.com/rhq-project/rhq-metrics/pull/2) to provide even more robust support. We can easily add support for starting/stopping nodes, adding more nodes, and deploying different versions of Cassandra. We can also add a dev-mode config option to the vert.x server that, when set, causes a verticle/module to be deployed which handles basic cluster management. The nice thing about this is that it fully encapsulates the basic management support, such that it would not even require the user to first build rhq-metrics.
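
                      Roughly what I have in mind for the dev-mode option, sketched against the vert.x 2 API (the cluster-manager verticle is hypothetical):

                      import org.vertx.java.core.json.JsonObject;
                      import org.vertx.java.platform.Verticle;

                      public class ServerVerticle extends Verticle {
                          @Override
                          public void start() {
                              JsonObject config = container.config();
                              if (config.getBoolean("dev-mode", false)) {
                                  // Only in dev mode: deploy the verticle that
                                  // creates/starts/stops the local Cassandra nodes.
                                  container.deployVerticle(
                                          "org.rhq.metrics.dev.ClusterManagerVerticle", config);
                              }
                              // ... deploy the REST and backend verticles as usual ...
                          }
                      }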

                      • 8. Re: Dev environment
                        pilhuhn

                        Stefan Negrea wrote:

                         

                        This task would be much more complex because any in-memory would be completely different than Cassandra. And then there are questions about persisting data at stop and migrations. Would there be migration paths between the engines?

                         

                         

                        In memory is not meant to have the same feature set as a "full" solution. It is merely meant to store/retrieve metrics and perhaps deliver them in different formats, like bucketized.

                        This will allow for embedding in situations where people want to keep a few metrics around and graph them for as long as the hosting JVM is up (*).

                        Also, I expect the REST API to talk against interfaces that then call either into the C* backend or into the in-memory one. So there is no point in having the REST calls speak SQL or CQL directly.

                         

                        Remember: one of the pain points we identified with current RHQ is the effort it takes to get started (small).

                         

                        *) And if someone is willing to provide a cache-loader/-persister, we can just add that, enable it on demand, and we are set

                        • 9. Re: Dev environment
                          nstefan

                          Heiko Rupp wrote:

                           

                          In memory is not meant to have the same feature set as a "full" solution. It is merely meant to store/retrieve metrics and perhaps deliver them in different formats, like bucketized.

                           

                          This will allow for embedding in situations where people want to keep a few metrics around and graph them for as long as the hosting JVM is up (*).

                          ...

                          *) And if someone is willing to provide a cache-loader/-persister, we can just add that, enable it on demand, and we are set

                          Can you please give more details about this in-memory data store? Do you have something in mind already? Would we build it from scratch? How would somebody transition between in-memory and something else?

                           

                          I am really not sure what you mean by the cache-loader/persister ...

                           

                          Heiko Rupp wrote:

                          ...

                          Also I expect the rest-api to talk against interfaces, that then either call into the C* backend or just into the memory one. So there is no point in having the rest-calls directly speak SQL or CQL.

                          ...

                           

                          In John's vert.x prototype the REST Verticle calls into a backend Verticle that handles the heavy CQL workload. But that is about the extent of the interfaces and levels of indirection. If we are to implement multiple data stores, we would need to add at least one more level of indirection. The diagrams below are generic, not necessarily related to the vert.x implementation; I am just using them as a concrete example.

                           

                          John's prototype:

                          Rest Interface -> Backend -> Actual Storage

                           

                          Multiple storage engines:

                          Rest Interface -> Backend -> Storage Interface -> Actual Storage

                           

                          My earlier point was that if we go the route of in-memory then let's generalize and have many Storage Interface implementations. We start with H2 (or any other reputable in-memory engine) and Cassandra, and we could add MongoDB, Redis, and Riak implementations as soon as possible. If we are going to support more than one storage engine, it is easier to build a generalized solution than to patch something up with special cases to support just two.
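
                          To make the Storage Interface concrete, it could be as small as this (a sketch; the method names are made up):

                          import java.util.List;
                          import java.util.Map;

                          public interface MetricsStore {
                              void addData(String metricId, long timestamp, double value);
                              Map<Long, Double> findData(String metricId, long start, long end);
                              List<String> listMetrics();
                          }
                          // CassandraStore, H2Store, MongoStore, etc. would each implement
                          // this, and the backend layer would only ever see the interface.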

                          • 10. Re: Dev environment
                            tsegismont

                            On 12/05/2014 17:24, John Sanda wrote:

                            There was discussion on #rhq about disabling the Cassandra bits in the build by default, but as Cassandra is the target/primary data store, I would prefer to provide a solution that makes things easier for developers. What about doing something similar to what we do in RHQ?

                             

                            Jumping into the conversation

                             

                            How about using cassandra-maven-plugin? rhq-metrics is a Maven based project, right?

                             

                            http://mojo.codehaus.org/cassandra-maven-plugin/

                             

                            If we go down the road of re-using the RHQ CCM, I can help build a Maven plugin wrapper.

                            • 11. Re: Re: Dev environment
                              pilhuhn

                              Supporting multiple data stores makes the API design more complex. Consider the List<String> listMetrics() method that was added to the API. For the in-memory implementation, this simply involves returning the map keys. If this is an operation that we actually need, we may very well want to introduce a new table in the Cassandra schema to support it. Depending on the use cases for the operation, this could otherwise be rather inefficient as it would involve querying multiple nodes instead of a single replica.

                              I don't know if we need listMetrics() or not - I just feel that we should offer more than storing a tuple somewhere and, if you are lucky, retrieving it again.

                              Without listMetrics() it sounds to me like a file system where you can create and read files when you know their names, but cannot discover them otherwise.

                              There will for sure be use cases where storing stuff like in rrdb is enough. And then there will be other cases where you want additional things to live with the metrics, e.g. metadata (units, or dynamic vs. trendsup), and to be able to make sense of it.

                              While it does not directly require listMetrics(), a key property of a RESTful API is browsability from a very limited set of starting points, with (hyper)linking between items.

                              • 12. Re: Re: Dev environment
                                pilhuhn

                                Stefan Negrea wrote:


                                Can you please give more details about this in-memory data store? Do you have something in mind already? Would we build it from scratch? How would somebody transition between in-memory and something else?

                                ... If we are going to support more than one storage engine, it is easier to build a generalized solution than to patch something up with special cases to support just two.

                                In memory can be as simple as a Map<id, List<(timestamp, value)>>, perhaps with a tiny bit of code around it to evict the oldest items as soon as a certain threshold is hit (e.g. 10k values in total, or values older than 8h). In fact I have taken John's MetricsService (rhq-metrics/metrics-core/src/main/java/org/rhq/metrics/core/MetricsService.java at master · rhq-project/rhq-metrics · GitHub), turned it into an interface, and added an in-memory metrics backend ( https://github.com/rhq-project/rhq-metrics/blob/master/metrics-core/src/main/java/org/rhq/metrics/impl/memory/MemoryMetricsService.java ). In the Servlet impl, the backend implementing that interface is then selected here: rhq-metrics/rest-servlet/src/main/webapp/WEB-INF/web.xml at master · rhq-project/rhq-metrics · GitHub. In a different setup (not based on the Servlet API), the selection can be done via a properties file or whatever fits.
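
                                A minimal sketch of that map-plus-eviction idea (the thresholds and names are illustrative, not the actual MemoryMetricsService code):

                                import java.util.HashMap;
                                import java.util.Map;
                                import java.util.SortedMap;
                                import java.util.TreeMap;

                                public class SimpleMemoryStore {
                                    private static final long MAX_AGE_MS = 8L * 60 * 60 * 1000; // e.g. 8h

                                    private final Map<String, TreeMap<Long, Double>> data =
                                            new HashMap<String, TreeMap<Long, Double>>();

                                    public synchronized void addData(String id, long timestamp, double value) {
                                        TreeMap<Long, Double> series = data.get(id);
                                        if (series == null) {
                                            series = new TreeMap<Long, Double>();
                                            data.put(id, series);
                                        }
                                        series.put(timestamp, value);
                                        // Evict everything older than the threshold.
                                        series.headMap(System.currentTimeMillis() - MAX_AGE_MS).clear();
                                    }

                                    public synchronized SortedMap<Long, Double> findData(String id, long start, long end) {
                                        TreeMap<Long, Double> series = data.get(id);
                                        if (series == null) {
                                            return new TreeMap<Long, Double>();
                                        }
                                        return new TreeMap<Long, Double>(series.subMap(start, end));
                                    }
                                }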

                                 

                                If you look at the "General idea" diagram, the light green box is the server, which can use those different backends. There is no need to keep/persist/migrate any data when switching backends.

                                The point here is that we want a version of this that can easily be embedded, e.g. into a running WildFly server, to be able to work on the charts without first having to install Python in order to install Cassandra, and to start Cassandra, before I can even compile rhq-metrics.

                                Of course Cassandra is the chosen real production storage backend. But for testing out stuff, e.g. as a developer of rhq-metrics clients, I should not need to have it running.
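
                                Outside the servlet container, the selection could be as simple as a system property (a sketch; the property name is made up, and it assumes the impl class has a no-arg constructor):

                                import org.rhq.metrics.core.MetricsService;
                                import org.rhq.metrics.impl.memory.MemoryMetricsService;

                                public class BackendSelector {
                                    public static MetricsService select() throws Exception {
                                        // Defaults to the in-memory backend; pass
                                        // -Drhq-metrics.backend=<impl class> to use another one.
                                        String impl = System.getProperty("rhq-metrics.backend",
                                                MemoryMetricsService.class.getName());
                                        return (MetricsService) Class.forName(impl).newInstance();
                                    }
                                }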

                                 

                                I am really not sure what you mean by the cache-loader/persister ...

                                 

                                This could be two simple methods that dump the above-mentioned map into a file and load it back.
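
                                For example (a sketch using plain Java serialization; the names are made up):

                                import java.io.File;
                                import java.io.FileInputStream;
                                import java.io.FileOutputStream;
                                import java.io.IOException;
                                import java.io.ObjectInputStream;
                                import java.io.ObjectOutputStream;
                                import java.util.HashMap;
                                import java.util.TreeMap;

                                public class MapPersister {

                                    public static void dump(HashMap<String, TreeMap<Long, Double>> data, File file)
                                            throws IOException {
                                        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
                                        try {
                                            out.writeObject(data);
                                        } finally {
                                            out.close();
                                        }
                                    }

                                    @SuppressWarnings("unchecked")
                                    public static HashMap<String, TreeMap<Long, Double>> load(File file)
                                            throws IOException, ClassNotFoundException {
                                        ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
                                        try {
                                            return (HashMap<String, TreeMap<Long, Double>>) in.readObject();
                                        } finally {
                                            in.close();
                                        }
                                    }
                                }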

                                 

                                > Multiple storage engines:

                                > Rest Interface -> Backend -> Storage Interface -> Actual Storage

                                 

                                > My earlier point was that if we go the route of in-memory then let's generalize and have many Storage Interface implementations.

                                > We start with H2 (or any other reputable in-memory engine) and Cassandra, and we could add MongoDB, Redis, and Riak

                                > implementations as soon as possible. If we are going to support more than one storage engine, it is easier to build a generalized

                                > solution than to patch something up with special cases to support just two.

                                 

                                I am not sure we need all this - especially I do not think we would support it.

                                Defining an interface like MetricsService that somewhat abstracts the concrete backend away could even make sense with only Cassandra, as it would allow us to implement a new backend if, e.g., Cassandra 3 has very different APIs than Cassandra 2. In that case we would want to have both backends in parallel for a while, until the C3 backend is stable and can take over.

                                • 13. Re: Re: Dev environment
                                  pilhuhn

                                  Thomas Segismont wrote:

                                   

                                  On 12/05/2014 17:24, John Sanda wrote:

                                  There was discussion on #rhq about disabling the Cassandra bits in the build by default, but as Cassandra is the target/primary data store, I would prefer to provide a solution that makes things easier for developers. What about doing something similar to what we do in RHQ?

                                   

                                  Jumping into the conversation

                                   

                                  How about using cassandra-maven-plugin? rhq-metrics is a Maven based project, right?

                                   

                                  http://mojo.codehaus.org/cassandra-maven-plugin/

                                   

                                  If we go down the road of re-using the RHQ CCM, I can help build a Maven plugin wrapper.

                                   

                                  Everything that makes the process painless, helps.

                                  The discussion I brought up some days ago was that the project was *not even compiling* when C* was not running, which according to the docs required installing Python first. This does not encourage anyone to contribute.

                                  • 14. Re: Dev environment
                                    nstefan

                                    Heiko Rupp wrote:

                                     

                                    Supporting multiple data stores makes the API design more complex. Consider the List<String> listMetrics() method that was added to the API. For the in-memory implementation, this simply involves returning the map keys. If this is an operation that we actually need, we may very well want to introduce a new table in the Cassandra schema to support it. Depending on the use cases for the operation, this could otherwise be rather inefficient as it would involve querying multiple nodes instead of a single replica.

                                    I don't know if we need listMetrics() or not - I just feel that we should offer more than storing a tuple somewhere and, if you are lucky, retrieving it again.

                                     

                                     

                                    This is one of the reasons why we need a clearly defined interface for storage if we are to support multiple types of data stores (even if only two for now). Otherwise things like this will creep up through the layers, and it will very quickly become impossible to know what is what ...
