10 Replies Latest reply on Feb 15, 2012 6:41 PM by bwallis42

    Versions

    bwallis42

      I'm trying to work my way through the versioning in the JCR and am not quite getting it.

       

      My requirement is to maintain a history of all changes made to a tree of nodes. The tree is between 5 and 8 nodes deep but can be quite wide at the leaves (up to a few thousand), and the leaf nodes can be nt:file nodes with quite large binary contents (typically 256K, less commonly more than 50M).

       

      So, I've been playing with a CND model and trying to understand how to configure the model so that I can checkout the node at the top of the tree, make changes within the tree and then check the top node back in again.

       

      The following CND seems to work,

       

      {code}
      [A] > mix:versionable orderable
        - name (string) copy
        + b (B) multiple copy

      [B] > mix:versionable orderable
        - name (string) copy
        + c (C) multiple copy

      [C] > mix:versionable
        - name (string) copy
      {code}

       

      but I am concerned about how much space this is going to take up. Is a complete copy of the tree made every time I checkout/checkin the tree? Including all the MBytes in the nt:file nodes?

       

      thanks,

        • 1. Re: Versions
          rhauch

          Is a complete copy of the tree made every time I checkout/checkin the tree? Including all the MBytes in the nt:file nodes?

           

          By specifying COPY for on-parent-version (OPV), you are expressly asking for copies to be made every time the versioned node (at the top of the copied nodes/properties) is checked in.

           

          The good news is that the BINARY values (or "large" STRING values) won't always be physically copied when a new version is created, but whether they are depends on the kind of connector you're using and if/how you're using federation. The main storage connectors (e.g., JPA, Infinispan, disk storage) do not need to physically copy the BINARY values, as they're managing such values centrally (based upon SHA-1), and these should be used for the system workspace. If you're using one of these connectors without federation, BINARY values should not need to be copied. However, if you're versioning content stored in any of the access connectors (e.g., SVN, file system, etc.), the initial version may be expensive, as the BINARY values need to be copied from the access connector into version storage. Subsequent versions will be faster as long as the same BINARY values are used.

           

          We're hopefully making this far easier in 3.x, with our improved and centralized BinaryStore that will be used by content and version storage alike. However, even with 3.x, any versionable content that is federated from external systems may still suffer the same copy-on-version issue. Longer term, we'd like to make it possible for 3.x connectors to manage the version histories for their own content.

           

          One approach that may help get around this is to have your application manage the large files itself in a separate section of the repository, in a relatively flat hierarchy based upon SHA-1 (or any other content-based hash). Because the location of each file is derived from its content-based hash (e.g., SHA-1), the content at a given location will never change and you never need to use JCR versioning for this file storage area. The rest of the content can have a lightweight reference to the files (via a property containing the content-based hash or a REFERENCE to the file node), so no matter how you version this content the large values are never contained in the version histories.
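A minimal sketch of how such a content-based location could be computed, using only the JDK. The `/files` root and the two-level fan-out on the first hex characters are assumptions made for illustration, not a ModeShape convention; `ContentAddress` and `pathFor` are hypothetical names.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ContentAddress {
    private ContentAddress() {}

    // Returns the lowercase hex SHA-1 digest of the given bytes.
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder(40);
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every compliant JDK ships SHA-1, so this should not happen.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Assumed layout: /files/<hex chars 0-1>/<hex chars 2-3>/<full hash>.
    // Identical content always maps to the same path, so a stored file never changes.
    static String pathFor(byte[] content) {
        String hash = sha1Hex(content);
        return "/files/" + hash.substring(0, 2) + "/" + hash.substring(2, 4) + "/" + hash;
    }
}
```

The versioned content would then store only the hash (or a REFERENCE to the node at that path), so each new version copies a short string rather than the binary.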

           

          Another approach to versioning is to use an OPV of 'version' for the nodes that change at different times than others. The advantage is that they're versioned independently (perhaps less often), but the disadvantage is the extra work.

           

          BTW, the other OPV values (e.g., 'INITIALIZE', 'COMPUTE', 'IGNORE', 'ABORT') are not very useful for what you're trying to do.

           

          Hope this helps!

          • 2. Re: Versions
            bwallis42

            Thanks for that. I've done some more reading after reading your reply and I think I understand how it can work for our model now. We have a bunch of nodes with data about a patient and folders containing document nodes. Each document can be quite large, might have dozens of child nodes and may have quite a lot of metadata (30-50 properties?).

             

            So I think what I will do is mark the patient node and the base document node as mix:versionable. The references to the document nodes will have an OPV attribute of "version"; everything else will have an OPV attribute of "copy". This is not that different from what our current system does, although we don't currently maintain history at the patient level, just for each document.

             

            So when I want to change a document I just check it out, change it and then check it back in.

             

            If I want to add a document then I need to check out the patient node, add the document and then check both of them in.

             

            If I want to move a document from one patient to another I would only need to check out the two patient nodes, the document is unchanged.

             

            The version history of a patient record with a few hundred documents will contain links to every contained document's version history but will not have a copy of the document's data.

             

            If I want to restore a version I would need to locate the appropriate point in the version history for the patient and for all of the contained documents (labels would be useful here or just the checkin date). I would need to restore the patient node and then restore every document node (for that point in time). How does this work for documents that were added after the restore point? Are they left dangling? Garbage collected?

             

            Question: Is there a way I can get a view of a node and its subnodes at a point in the past without doing a restore? Can I do a restore in a different workspace without affecting the current state in the default workspace (and then discard that other workspace) or something like that?

             

            thanks again.

            • 3. Re: Versions
              rhauch

              Your description sounds like a great approach.

              Question: Is there a way I can get a view of a node and its subnodes at a point in the past without doing a restore? Can I do a restore in a different workspace without affecting the current state in the default workspace (and then discard that other workspace) or something like that?

               

              You can always access the VersionHistory object for a versionable node. To do that, get the VersionManager from the Session's Workspace, and use the "getVersionHistory(String path)" method. That VersionHistory contains all of the Version objects, which are essentially snapshots of the state of the versionable node when it was checked in.

               

              Use the "VersionHistory.getAllLinearVersions()" method to get an iterator over all of the Version objects, starting with the earliest and going towards the latest.

               

              The JCR API doesn't provide a way to directly iterate over the Versions in the opposite direction (from latest to earliest), but you can do that relatively easily by getting the "base" version (that is, the most recent version) and then getting the predecessor version using the "Version.getLinearPredecessor()" method, and continuing this as needed until the "getLinearPredecessor()" method returns null. BTW, ModeShape does this a bit more efficiently than "VersionHistory.getAllLinearVersions()".

               

              If your application needs to associate some "label" with a version, then you can add (and remove) labels with the VersionHistory object. This basically allows you to quickly find a particular Version by a logical label string. Just be aware that the "label" is expected to be a valid JCR name -- normally this isn't that big a deal.

              • 4. Re: Versions
                rhauch

                Oops. Missed a question.

                If I want to restore a version I would need to locate the appropriate point in the version history for the patient and for all of the contained documents (labels would be useful here or just the checkin date). I would need to restore the patient node and then restore every document node (for that point in time). How does this work for documents that were added after the restore point? Are they left dangling? Garbage collected?

                IIRC, restoring a patient to an older version, Vn, would remove all documents (which you said would be versionable with an OPV of 'version') that were added after Vn was created. See Section 15.7 of the JCR 2.0 specification for details about restore.

                 

                Note that after restoring, any further checkins will result in a branch in the version history. For example, consider that we have a version history

                 

                A -> B -> C -> D -> E
                

                 

                where A is the first version in the history and E is the last. If the versionable node is restored to version C, changes are made to the versionable node, and it is checked in, then the version history becomes:

                 

                A -> B -> C -> D -> E
                          |
                          +-> F
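To make the branch concrete, here is a toy model of the history above. `ToyVersion` is a hypothetical stand-in written only for this illustration (it is not `javax.jcr.version.Version`); it keeps a single predecessor link per version, which is enough to show that walking back from F skips D and E.

```java
import java.util.ArrayList;
import java.util.List;

final class ToyVersion {
    final String name;
    final ToyVersion predecessor; // null for the first version in the history

    ToyVersion(String name, ToyVersion predecessor) {
        this.name = name;
        this.predecessor = predecessor;
    }

    // Collects version names from the given version back to the first one,
    // following predecessor links until there are none left.
    static List<String> walkBack(ToyVersion from) {
        List<String> names = new ArrayList<>();
        for (ToyVersion v = from; v != null; v = v.predecessor) {
            names.add(v.name);
        }
        return names;
    }
}
```

With A through E chained in order and F created with C as its predecessor, walking back from E visits E, D, C, B, A, while walking back from F visits F, C, B, A.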
                

                 

                Best regards.

                • 5. Re: Versions
                  bwallis42

                  We will have a service layer over the repository that will implement the various business-level operations that we want to perform on the repository. One of the requirements of this layer is that we can get read-only access to the patient record as it was at some point in the past. The versioning as discussed here is the JCR way of keeping that historical information. I want to use the same business code to read those historical views; I don't want that code to know about version history nodes or anything else to do with versioning.

                   

                  The best way to do this seems to me to be: create a temporary workspace; clone the patient node and its subtree into it; in that workspace, restore the patient node (and the attached document nodes) to the versions that were current at the desired time in the past; then run the same code we use for accessing the current version of the patient data, only against the temporary workspace.

                   

                  Once we have finished our explorations of the past we can just delete that temporary workspace. The workspace name can just be a timestamp representing the time in the past that we are exploring.

                   

                  This is not an operation that is frequently performed in the system, so it doesn't have to be particularly efficient. I don't think there are any operations that would require more than one patient node to be cloned to the temporary workspace, so the number of nodes in the temp workspace is limited to between a few and a few hundred.

                   

                  Does this sound like a reasonable approach? Does a workspace with a cloned part of the original data have much of a resource footprint?

                   

                  thanks,

                  brian...

                  • 6. Re: Versions
                    rhauch

                    That certainly would work. Creating workspaces is pretty lightweight. If you could create and reuse a single workspace, then you're saving a bit of overhead.

                     

                    But remember the VersionHistory and Version nodes that are kept in the version storage area? Each Version contains a "frozen node" that actually is the subgraph snapshot. For the most part, this "frozen" subgraph is identical to the state that would be restored. IIRC, the only differences are that the "jcr:primaryType", "jcr:mixinTypes" and "jcr:uuid" properties are moved into the "jcr:frozenPrimaryType", "jcr:frozenMixinTypes", and "jcr:frozenUuid" properties within the snapshot.

                     

                    So if you're not directly using these 3 properties, the content should look identical to what's normally in the workspace. If you are using these three properties, perhaps a simple toggle in the code might allow it to read regular or frozen content. Then restore operations wouldn't even need to be used.
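One way such a toggle might look, as a sketch only: a small helper that maps the three property names to their frozen counterparts when the code is reading a snapshot. `FrozenNames` and `propertyName` are hypothetical names; only the property names themselves come from the description above.

```java
final class FrozenNames {
    private FrozenNames() {}

    // Returns the property name to read: unchanged for regular content,
    // or the "frozen" counterpart when reading from a version's frozen node.
    static String propertyName(String name, boolean frozen) {
        if (!frozen) {
            return name;
        }
        switch (name) {
            case "jcr:primaryType":
                return "jcr:frozenPrimaryType";
            case "jcr:mixinTypes":
                return "jcr:frozenMixinTypes";
            case "jcr:uuid":
                return "jcr:frozenUuid";
            default:
                // All other properties keep their names inside the snapshot.
                return name;
        }
    }
}
```

The business code would route every property lookup through this helper and pass the flag according to whether it was handed a workspace node or a frozen node.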

                     

                    It's at least something else to consider. :-)

                    • 7. Re: Versions
                      rhauch

                      I guess another difference with the frozen state is that it ends at nodes that are versionable. So if your documents are versionable, the frozen subgraph of a patient's Version will end with placeholders for those documents.

                       

                      Each placeholder node has a primary type of "nt:versionedChild" and contains a single "jcr:childVersionHistory" REFERENCE property to the VersionHistory of the referenced document. When we restore, we find the Version of the versioned (document) node whose timestamp is equal to or earlier than that of the (patient) version being restored.
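If the business code were to resolve those placeholders itself, the selection step could look roughly like the following. This is a sketch under assumptions: `Snapshot` is a hypothetical stand-in for a document version (name plus creation time), and picking the latest version at or before the cutoff is my reading of the rule above rather than a statement of ModeShape's exact behavior.

```java
import java.time.Instant;
import java.util.List;
import java.util.Optional;

final class VersionPicker {
    private VersionPicker() {}

    // Hypothetical stand-in for a checked-in document version.
    static final class Snapshot {
        final String name;
        final Instant created;

        Snapshot(String name, Instant created) {
            this.name = name;
            this.created = created;
        }
    }

    // Returns the most recent snapshot created at or before the cutoff,
    // or empty if the document did not exist yet at that time.
    static Optional<Snapshot> latestAtOrBefore(List<Snapshot> history, Instant cutoff) {
        Snapshot best = null;
        for (Snapshot s : history) {
            boolean eligible = !s.created.isAfter(cutoff);
            if (eligible && (best == null || s.created.isAfter(best.created))) {
                best = s;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Here the cutoff would be the creation time of the patient version being viewed, and an empty result corresponds to a document that was added after that point.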

                       

                      Anyway, it is possible, though this approach may be more complicated than what you prefer to deal with.

                      • 8. Re: Versions
                        bwallis42

                        OK, thanks, I was wondering if that was the case. So if we are just a little careful with how we write the business code (and we do need access to the mixinTypes and uuid) then we can probably make this work quite well without the additional workspaces, cloning and restoring.

                         

                        Thanks for that.

                        • 9. Re: Versions
                          rhauch

                          (I just updated my previous post to talk about another issue. Be sure to read that before trying to proceed down this path.)

                          • 10. Re: Versions
                            bwallis42

                            So it looks like a tradeoff between complexity in the business code and the additional costs of creating a workspace, cloning some nodes, restoring the versioned state of those nodes and eventually removing that workspace.

                             

                            I tend to like the idea of simplifying the business code so it is easier to understand and maintain over time (our current product has some code in it up to 13 years old), so I think I will probably go with the workspace/clone approach for now. In fact I don't need to actually implement it the first time around anyway; I'm just planning the way forward at the moment.

                             

                            thanks again.