
    JCR modelling with versioning best-practice question

    chrispoulsen

      Hi,

       

      We are currently building a model editing tool for one of our other products.

       

      The models have some similarity to the contents of a CMS system (images, videos, markup text and some other structures) and we need versioning on a large part (actually most) of the node hierarchy - so it is possible to go back in history and see all changes made to a model (or how it looked at an earlier point in time).

       

      A change to the model would go something like this:

       

      1. VersionManager.checkout( "/modelling/customerxxx/" )
      2. Perform changes somewhere deep in the hierarchy below "customerxxx" and save
      3. VersionManager.checkin( "/modelling/customerxxx/" )
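       
      In JCR API terms, a minimal sketch of that cycle might look like the following (the child path and property name below are just placeholders, not our actual model structure):

        import javax.jcr.Node;
        import javax.jcr.RepositoryException;
        import javax.jcr.Session;
        import javax.jcr.version.VersionManager;

        // Minimal sketch of the checkout/modify/checkin cycle above; "session" is
        // an already-authenticated JCR Session.
        void changeModel(Session session) throws RepositoryException {
            VersionManager vm = session.getWorkspace().getVersionManager();

            vm.checkout("/modelling/customerxxx");                   // 1. make the subtree editable

            Node model = session.getNode("/modelling/customerxxx");
            model.getNode("images/logo").setProperty("title", "v2"); // 2. change something deep below
            session.save();

            vm.checkin("/modelling/customerxxx");                    // 3. snapshot into version history
        }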

       

      I'm thinking that we'd clean out older versions of the version history periodically (or define "release versions" and remove the versions between releases).
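       
      A rough sketch of that cleanup using only the standard JCR version API (keeping the root version, the base version, and anything labelled "release" - the label convention is just an assumption):

        import javax.jcr.RepositoryException;
        import javax.jcr.Session;
        import javax.jcr.version.Version;
        import javax.jcr.version.VersionHistory;
        import javax.jcr.version.VersionIterator;
        import javax.jcr.version.VersionManager;

        // Rough sketch of pruning intermediate versions from a customer's history.
        void pruneVersions(Session session) throws RepositoryException {
            VersionManager vm = session.getWorkspace().getVersionManager();
            VersionHistory history = vm.getVersionHistory("/modelling/customerxxx");
            Version root = history.getRootVersion();
            Version base = vm.getBaseVersion("/modelling/customerxxx");
            for (VersionIterator it = history.getAllVersions(); it.hasNext(); ) {
                Version v = it.nextVersion();
                boolean keep = v.isSame(root) || v.isSame(base)
                        || history.hasVersionLabel(v, "release");
                if (!keep) {
                    history.removeVersion(v.getName()); // drop it from the history
                }
            }
        }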

       

      Is it advisable/possible to version the "whole" model tree using a node close to the root of the repository as the amount of data grows? Are there any caveats that we need to look out for, or alternative (better) ways to achieve the same functionality?

       

      (Currently we are persisting our data using the JPA Source if that matters)

       

      --

      Chris

        • 1. Re: JCR modelling with versioning best-practice question
          rhauch

          I'm thinking that we'd clean out older versions of the version history periodically (or define "release versions" and remove the versions between releases).

           

          That will definitely help.

           

          Is it advisable/possible to version the "whole" model tree using a node close to the root of the repository as the amount of data grows? Are there any caveats that we need to look out for, or alternative (better) ways to achieve the same functionality?

           

          Generally, no, it is not advisable to mark only the higher-level nodes as versionable. Checking in a versionable node really results in a new copy of the versionable subgraph being created and placed into version history. So if you're only versioning at the higher-level nodes and have a substantial amount of content below each versionable node, each snapshot will contain a copy of all nodes below that versionable node. This may be efficient if most of the nodes have changed, but obviously will be very inefficient if only a small fraction of the nodes have changed (because many of the versions will contain copies of the same nodes in the same state).

           

          One way to circumvent the inefficiency is to mark nodes that are lower in the hierarchy as versionable. The advantage is that your application can then check in just the nodes that have changed; the disadvantage is that your application has to explicitly check in each of the versionable nodes.
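           
          For illustration, a sketch of what that fine-grained approach looks like with plain JCR calls (the property change is a placeholder; the application is assumed to know which node changed):

            import javax.jcr.Node;
            import javax.jcr.RepositoryException;
            import javax.jcr.Session;
            import javax.jcr.version.VersionManager;

            // Only the node that actually changed is checked out and checked in,
            // so each snapshot covers a small subgraph.
            void changeOneNode(Session session, String path) throws RepositoryException {
                VersionManager vm = session.getWorkspace().getVersionManager();
                Node node = session.getNode(path);
                if (!node.isNodeType("mix:versionable")) {
                    node.addMixin("mix:versionable"); // mark it versionable the first time
                    session.save();
                }
                vm.checkout(path);
                node.setProperty("title", "updated"); // placeholder change
                session.save();
                vm.checkin(path);
            }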

           

          A lot depends on what's below each customer node (e.g., "/modelling/customerxxx"), how much of that subgraph changes each time it is checked in, how big each subgraph is (e.g., number of nodes and size of the property values), how many versions are made between releases, and how many releases there are.

           

          (Currently we are persisting our data using the JPA Source if that matters)

          Generally it doesn't, but that connector certainly has the fewest limitations. Be sure to look at How To Tune ModeShape for Better Performance.

          • 2. Re: JCR modelling with versioning best-practice question
            chrispoulsen

            Thank you for your comments, they are highly appreciated!

            Randall Hauch wrote:

            Is it advisable/possible to version the "whole" model tree using a node close to the root of the repository as the amount of data grows? Are there any caveats that we need to look out for, or alternative (better) ways to achieve the same functionality?

             

            Generally, no, it is not advisable to mark only the higher-level nodes as versionable. Checking in a versionable node really results in a new copy of the versionable subgraph being created and placed into version history. So if you're only versioning at the higher-level nodes and have a substantial amount of content below each versionable node, each snapshot will contain a copy of all nodes below that versionable node. This may be efficient if most of the nodes have changed, but obviously will be very inefficient if only a small fraction of the nodes have changed (because many of the versions will contain copies of the same nodes in the same state).

             

            One way to circumvent the inefficiency is to mark nodes that are lower in the hierarchy as versionable. The advantage is that your application can then check in just the nodes that have changed; the disadvantage is that your application has to explicitly check in each of the versionable nodes.

             

            A lot depends on what's below each customer node (e.g., "/modelling/customerxxx"), how much of that subgraph changes each time it is checked in, how big each subgraph is (e.g., number of nodes and size of the property values), how many versions are made between releases, and how many releases there are.

            We are expecting that changes to customer models will mostly be changes to a small part of the subgraph, so a more fine-grained versioning scheme is most likely the way to go for us.

             

            The reason why we initially started out versioning near the root of the data is that we need Subversion-like functionality (e.g., to be able to extract what a model looked like 2 days ago / list all changes to a model that have happened over the past 2 days, and so on).

             

            Is there a clever way to achieve this functionality while running versioning on subgraphs "deeper" in the hierarchy? It would be nice if it were possible to use one of the query languages to extract a node hierarchy based on when changes happened, but as far as I can tell the query functionality is not able to help there.

             

            I guess I can whip up some code that can traverse a large graph extracting the "correct" version (according to some criteria) when it hits a versioned node, but it seems inefficient and cumbersome.

             

            I've also considered tracking changes in a side-table using the observation feature, but haven't looked deeply into that feature yet.
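             
            For what it's worth, a minimal sketch of that observation idea with the standard javax.jcr.observation API (the side-table write itself is left as a stub):

              import javax.jcr.RepositoryException;
              import javax.jcr.Session;
              import javax.jcr.observation.Event;
              import javax.jcr.observation.EventIterator;
              import javax.jcr.observation.EventListener;
              import javax.jcr.observation.ObservationManager;

              // Listen for changes below a customer subtree and record them somewhere
              // queryable; the side-table write is only a comment stub here.
              void watchCustomer(Session session) throws RepositoryException {
                  ObservationManager om = session.getWorkspace().getObservationManager();
                  EventListener listener = new EventListener() {
                      public void onEvent(EventIterator events) {
                          while (events.hasNext()) {
                              Event event = events.nextEvent();
                              // record event.getType(), event.getPath(), event.getDate()
                              // in the side-table
                          }
                      }
                  };
                  om.addEventListener(listener,
                          Event.NODE_ADDED | Event.NODE_REMOVED | Event.PROPERTY_CHANGED,
                          "/modelling/customerxxx", // watch this subtree...
                          true,                     // ...including all descendants
                          null, null,               // no uuid / node-type filtering
                          false);                   // include this session's own changes
              }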

            • 3. Re: JCR modelling with versioning best-practice question
              rhauch

              The reason why we initially started out versioning near the root of the data is that we need Subversion-like functionality (e.g., to be able to extract what a model looked like 2 days ago / list all changes to a model that have happened over the past 2 days, and so on).

              IMO, this is a major shortcoming of JCR's versioning system. The specification requires that the Versions in the version history have different "jcr:primaryType" and "jcr:uuid" properties than the real versioned nodes, making the Version subgraphs different enough from the actual content that you may not be able to use the same code. JCR versioning seems to have been designed for recovery/restoration of content from the past, not necessarily for accessing multiple versions of the same content in the same way.

               

              Is there a clever way to achieve this functionality while running versioning on subgraphs "deeper" in the hierarchy? It would be nice if it were possible to use one of the query languages to extract a node hierarchy based on when changes happened, but as far as I can tell the query functionality is not able to help there.

               

              ModeShape 2.x excluded the content under "/jcr:system" (where version storage exists) from queries, but ModeShape 3 has fixed this. Perhaps you might be able to find all of the Version nodes for the versionable nodes below the customer's path (this is the tricky part), adding criteria to find the Version nodes created before the date of interest. It may be worth investigating a little.
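               
              Something like the following JCR-SQL2 sketch might express the date criterion (the date is a placeholder; tying each result back to a versionable node under the customer's path is the tricky part and is not shown):

                import javax.jcr.RepositoryException;
                import javax.jcr.Session;
                import javax.jcr.query.Query;
                import javax.jcr.query.QueryResult;

                // Sketch: find Version nodes created on or before a date of interest.
                QueryResult findVersionsBefore(Session session) throws RepositoryException {
                    String sql = "SELECT * FROM [nt:version] AS v "
                               + "WHERE v.[jcr:created] <= CAST('2012-08-20T00:00:00.000Z' AS DATE)";
                    Query query = session.getWorkspace().getQueryManager()
                                         .createQuery(sql, Query.JCR_SQL2);
                    return query.execute();
                }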

              I've also considered tracking changes in a side-table using the observation feature, but haven't looked deeply into that feature yet.

               

              Interesting. If you stored the changes for a customer under the customer node, then it might make it pretty easy to "roll back" the application's view of the customer to the state at a previous time. It certainly would give you a very good understanding of what changed. Of course, it also depends on the number of changes and the kinds of changes (e.g., mostly property changes, or lots of structural changes?). If you decide to go this route on your own and are willing to share your design, please keep us informed - perhaps we could incorporate it as a feature in some future version of ModeShape.

               

              You could consider maintaining your own snapshots of the customer subgraph, perhaps more for the "releases". The big advantage is that you can see the various "released" states of a given customer using the exact same code - it's all content. An analogy is how a Maven repository provides access to all versions, since the version is just another segment in the paths. Plus, the snapshotting/copying could be encapsulated in a utility method (akin to "checkin") called by the application.
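               
              A sketch of that snapshot idea using Workspace.copy (the "/releases" layout and numbering are assumptions):

                import javax.jcr.RepositoryException;
                import javax.jcr.Session;

                // "Versioning as content": copy the customer subgraph into a release
                // layer, with the release number as just another path segment (Maven-style).
                void snapshot(Session session, int releaseNumber) throws RepositoryException {
                    String src = "/modelling/customerxxx";
                    String dest = "/releases/customerxxx/" + releaseNumber;
                    session.getWorkspace().copy(src, dest); // immediate, persisted copy
                }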

               

              Again, I think versioning at the customer level would work fine (and would be the easiest solution) if you can minimize the number of versions for each customer. ModeShape 3 has some improvements here, as well: any binary values in the versioned subgraph don't need to be copied as part of the "checkin" process, since they're maintained separately. And, you mentioned earlier that you could clean out older intermediate versions from the customer's history when they're not needed.

               

              With my very limited understanding of your scenario, I might consider three options:

               

              1. Introduce a "release version" layer into the hierarchy and maintain separate copies of the customer data as it is changed (e.g., "snapshots"). Then, at some point, "release" the changes by keeping the latest snapshot and removing the other snapshots made since the prior release. (Essentially you're doing your own versioning, but as regular content and by managing and removing intermediate versions of each customer that are not needed for posterity.) PRO: Everything is content with the same structure and is thus easy to compare/analyze. CON: You have to manage all versioning activities.
              2. Introduce a "release version" layer into the hierarchy, where you copy the latest state of a customer's subgraph into a new "release". But, for recurring changes (since the last release), use JCR versioning. (When a new release is made, remove the versions from the version history.) PRO: All releases are content with the same structure and are easy to compare/analyze. CON: You have to manage the "release" logic, and intermediate changes are tracked differently.
              3. Use JCR versioning at the customer level, accept that each version will be larger, but address the performance implication by removing versions of a customer where possible. PRO: You use the JCR versioning API for all tracking. CON: The version history doesn't have exactly the same structure as the live content.

               

              #3 is essentially what you were originally asking about, whereas #1 is basically one form of "doing it all" in your application. Of course, you probably should investigate and prototype any design approach to make sure it truly does fit your needs - only you know enough to make an educated decision.

               

              Hope this helps!

               

              P.S. Thanks again for starting this discussion. I've found it extremely valuable, and I'm sure others will, too.

              • 4. Re: JCR modelling with versioning best-practice question
                chrispoulsen

                Thank you for your suggestions/insights, they are highly appreciated!

                 

                I expect that most of the changes to our customer models will be property changes once the models are created.

                 

                I am expecting to switch to ModeShape 3 when it leaves beta. The new feature set looks very interesting, and we've also had some trouble with Hibernate leaking memory during byte[] reads from Oracle, so moving to Infinispan with a JDBC cache loader sounds tempting.

                 

                Btw, are we supposed to be able to hold all JCR data (model graphs) in memory/Infinispan in MS3 at once, or is it clever enough to allow us to have large models with big movies etc. attached? (The large binary contents are usually only referenced "directly and one at a time", as some download servlet provides them as streams when web clients (browsers) link to them - so they are not needed for most model operations.)

                 

                As we are interested in having tree-wide versioning like that known from Subversion, I've thought about implementing the following scheme:

                 

                Our content nodes will have a "d:versionable" mixin requiring that a revision number/timestamp is specified (a sketch of such a mixin appears after the list below).

                • When nodes are created, the revision number is extracted from a sequence and set on the nodes.
                • When nodes are updated, we use JCR versioning to keep history for those particular nodes (this will in almost every case happen deep in the graph and only affect a few nodes), and the "current node" will receive a new revision number from the sequence.
                • When nodes are "deleted", they are versioned using JCR versioning (as for updates) - but instead of removing them, they are marked with a "d:deleted" mixin to allow me to programmatically investigate their JCR version history looking for a relevant revision number.
                • When working with "historical" data, things should be read-only to avoid screwing things up.
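                 
                A hypothetical sketch of registering such a mixin through the standard JCR node type API (the "d" namespace URI and the "d:revision" property name are made up):

                  import javax.jcr.PropertyType;
                  import javax.jcr.RepositoryException;
                  import javax.jcr.Session;
                  import javax.jcr.nodetype.NodeTypeManager;
                  import javax.jcr.nodetype.NodeTypeTemplate;
                  import javax.jcr.nodetype.PropertyDefinitionTemplate;

                  // "d:versionable" mixin with a mandatory revision number.
                  void registerMixin(Session session) throws RepositoryException {
                      session.getWorkspace().getNamespaceRegistry()
                             .registerNamespace("d", "http://example.com/d/1.0");

                      NodeTypeManager ntm = session.getWorkspace().getNodeTypeManager();
                      NodeTypeTemplate mixin = ntm.createNodeTypeTemplate();
                      mixin.setName("d:versionable");
                      mixin.setMixin(true);

                      PropertyDefinitionTemplate revision = ntm.createPropertyDefinitionTemplate();
                      revision.setName("d:revision");
                      revision.setRequiredType(PropertyType.LONG);
                      revision.setMandatory(true);
                      mixin.getPropertyDefinitionTemplates().add(revision);

                      ntm.registerNodeType(mixin, false);
                  }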

                 

                As far as I can tell this will prevent us from having to copy all the untouched nodes around for every little model change, while allowing us to retrieve a former version of the graph relatively easily.

                 

                Say we want revision NNN of a customer model: it is a matter of traversing the graph looking for nodes with revisions equal to or less than the wanted revision. When JCR-versioned (updated/deleted) nodes are encountered, their history/deleted flag is checked to make sure the correct graph is constructed.
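                 
                A rough sketch of that traversal (simplified - in particular, the frozen-state handling and the read-only reconstruction are only hinted at):

                  import javax.jcr.Node;
                  import javax.jcr.NodeIterator;
                  import javax.jcr.RepositoryException;
                  import javax.jcr.Session;
                  import javax.jcr.version.VersionHistory;
                  import javax.jcr.version.VersionIterator;

                  // Walk the live graph; for nodes newer than (or deleted since) the wanted
                  // revision, fall back to the frozen state in the JCR version history.
                  void visit(Session session, Node node, long wantedRevision) throws RepositoryException {
                      long revision = node.getProperty("d:revision").getLong();
                      if (revision > wantedRevision || node.isNodeType("d:deleted")) {
                          VersionHistory history = session.getWorkspace().getVersionManager()
                                                          .getVersionHistory(node.getPath());
                          for (VersionIterator it = history.getAllVersions(); it.hasNext(); ) {
                              Node frozen = it.nextVersion().getFrozenNode();
                              if (frozen.hasProperty("d:revision")
                                      && frozen.getProperty("d:revision").getLong() <= wantedRevision) {
                                  // use "frozen" as this node's state in the wanted revision
                              }
                          }
                      }
                      for (NodeIterator children = node.getNodes(); children.hasNext(); ) {
                          visit(session, children.nextNode(), wantedRevision);
                      }
                  }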

                 

                Does this sound like a viable way of achieving the "model-wide" versioning without having to cope with lots of unchanged nodes being copied around by JCR versioning?

                 

                --

                Chris

                • 5. Re: JCR modelling with versioning best-practice question
                  rhauch

                  I am expecting to switch to ModeShape 3 when it leaves beta. The new feature set looks very interesting, and we've also had some trouble with Hibernate leaking memory during byte[] reads from Oracle, so moving to Infinispan with a JDBC cache loader sounds tempting.

                  We'll release Beta3 tomorrow, and probably a few more before Final in mid-September. Of course, if you've got time, we'd love for you to try it out. What is your deployment model: embedding ModeShape in your (web) application, or using the JBoss AS integration, etc.?

                   

                   

                  Btw, are we supposed to be able to hold all JCR data (model graphs) in memory/Infinispan in MS3 at once, or is it clever enough to allow us to have large models with big movies etc. attached? (The large binary contents are usually only referenced "directly and one at a time", as some download servlet provides them as streams when web clients (browsers) link to them - so they are not needed for most model operations.)

                   

                  ModeShape 3 should be able to handle many hundreds of GBs (or larger) of repository content, while relying upon Infinispan to effectively manage the memory. Obviously, keeping all that in memory on a single machine is impossible (except for $$$ machines), so for non-clustered repositories you'll definitely want to configure the Infinispan cache to persist any information it doesn't keep in memory. For small clusters, you can configure the Infinispan cache to be replicated, but you still probably want to persist any information not kept in memory. But for larger clusters, you can configure the Infinispan cache as a data grid and distribute the information across the cluster while keeping at least N copies of the data; in these situations, all content can be stored in memory, because the available heap is combined from the heap on all the processes in the data grid.

                   

                  The beautiful part of this is that, once your repositories (and Infinispan caches) are configured, your repository clients don't have to worry about any of this, and can access/modify the content the same way regardless of the repository's size, configuration, or persistence mechanism.

                   

                  As for large binary content, ModeShape 3 does a far better job at managing and storing (very) large binary values. However, many applications stream content (e.g., large movies and videos) using special infrastructure to handle the large volume of users, so in these cases it's probably best to have your repository content store a reference (e.g., a URL) to the streamable resources stored outside the repository. But documents, images and other resources that don't require or use special infrastructure are often more effectively stored within the repository.

                   

                  When large binary values are stored within the repository, they're stored within ModeShape 3's "binary store". We have several storage options: on the file system, in a relational database, in separate Infinispan caches (two are required for each repository, and they must be distinct from the cache used by the repository for regular content), and in MongoDB. The file system binary store is actually quite efficient: the InputStream returned from the Binary value is actually a buffered input stream over the underlying file itself, so it's fast, efficient, and consumes minimal heap (just enough for the buffering). All of the other stores are very new and may not be fully usable in Beta3, but we're stabilizing them pretty quickly.
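                   
                  For example, streaming a stored file back out through the standard JCR 2.0 Binary API looks roughly like this (the nt:file/jcr:content/jcr:data layout is the standard one; the servlet response handling is elided):

                    import java.io.InputStream;
                    import javax.jcr.Binary;
                    import javax.jcr.Node;
                    import javax.jcr.Session;

                    // With the file system binary store, "in" is a buffered stream over the
                    // underlying file, so little heap is consumed.
                    void streamBinary(Session session, String filePath) throws Exception {
                        Node file = session.getNode(filePath);
                        Binary binary = file.getNode("jcr:content").getProperty("jcr:data").getBinary();
                        InputStream in = binary.getStream();
                        try {
                            // copy "in" to the servlet response output stream ...
                        } finally {
                            in.close();
                            binary.dispose();
                        }
                    }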

                   

                  As we are interested in having tree-wide versioning like that known from Subversion, I've thought about implementing the following scheme:

                   

                  Our content nodes will have a "d:versionable" mixin requiring that a revision number/timestamp is specified.

                  • When nodes are created, the revision number is extracted from a sequence and set on the nodes.
                  • When nodes are updated, we use JCR versioning to keep history for those particular nodes (this will in almost every case happen deep in the graph and only affect a few nodes), and the "current node" will receive a new revision number from the sequence.
                  • When nodes are "deleted", they are versioned using JCR versioning (as for updates) - but instead of removing them, they are marked with a "d:deleted" mixin to allow me to programmatically investigate their JCR version history looking for a relevant revision number.
                  • When working with "historical" data, things should be read-only to avoid screwing things up.

                   

                  As far as I can tell this will prevent us from having to copy all the untouched nodes around for every little model change, while allowing us to retrieve a former version of the graph relatively easily.

                   

                  Say we want revision NNN of a customer model: it is a matter of traversing the graph looking for nodes with revisions equal to or less than the wanted revision. When JCR-versioned (updated/deleted) nodes are encountered, their history/deleted flag is checked to make sure the correct graph is constructed.

                   

                  Does this sound like a viable way of achieving the "model-wide" versioning without having to cope with lots of unchanged nodes being copied around by JCR versioning?

                   

                  Yes, this sounds very interesting and viable! I'm anxious to hear how it turns out.

                  • 6. Re: JCR modelling with versioning best-practice question
                    chrispoulsen

                    Randall Hauch wrote:

                     

                    We'll release Beta3 tomorrow, and probably a few more before Final mid September. Of course, if you've got time, we'd love for you to try it out. What is your deployment model: embedding ModeShape in your (web) application, or using the JBoss AS integration, etc?

                    That sounds great. I've already tried setting up Beta2 in a toy project at home, just to get a feel for it. It wasn't too bad; most of the time was spent figuring out how to get JBoss AS 7.1 to behave.

                     

                    We are currently deploying MS2 on JBoss AS 5.1, using a heavily modified "modeshape-services.jar" package and looking up the repository using JNDI in the web application.

                     

                    Randall Hauch wrote:

                     

                    As for large binary content, ModeShape 3 does a far better job at managing and storing (very) large binary values...

                    Thank you for the insights on large binaries and MS3; it is good to know that there are alternative options if it turns out that storing the large binaries in an RDBMS becomes too heavy/troublesome.

                     

                    The application I'm currently working on is a kind of authoring tool to allow customers to create "models" to be used in their guided troubleshooting software. Part of a model is some reusable text units and some binary units (images/movies); the "rest" is a complex object graph of definitions to be fed to a kind of engine that sets up Bayesian networks based on the "definition" parts of the model.

                     

                    Text units may include each other and binary units to avoid duplication (we haven't decided on how tight we want the linking to be yet: references/weak refs/paths...).

                     

                    The troubleshooting system built on top of the engine outputs things like "questions" to the users so they can solve their problems in the most efficient manner; questions/solutions etc. will be using the texts and binaries from the model.

                     

                    So we are not looking into serving hundreds of concurrent users; it is more likely that we will be closer to 10 than 100 (I think). When a model is ready to release, it is wrapped up (tagged?) and exported to the system(s) handling the troubleshooting parts. Emphasis is on versioning, workflows, managing concurrent editing, durability etc., not scaling to many users.

                     

                    Chris Poulsen wrote:

                     

                    <some suggestion on how to address the versioning requirements/>

                     

                    Randall Hauch wrote:

                     

                    Yes, this sounds very interesting and viable! I'm anxious to hear how it turns out.

                     

                    Again, thank you! - I'll start making some local tests to see if things will work as expected.

                     

                    --

                    Chris