5 Replies Latest reply on Nov 3, 2012 6:01 PM by bwallis42

    Binaries and JCR to backing store mapping

    bwallis42

      In the json file schema (repository-config-schema.json) the description says that minimumBinarySizeInBytes applies to both binary and string properties. I have also read somewhere that you can set the value to 0 to get all binary values stored in the binary store and I assume this means all strings as well. This would pretty much equate to all the property values in my system as most properties are string values.

       

      What I would like to do is store all binary property values in a separate store from all the other property values, including strings. Our repository only uses the binary type for jcr:data properties in nt:resource nodes, and these are used to store the documents in our system.

       

      So, is there any way I can configure this so that all documents (property values of type BINARY) are kept in a separate store from everything else?

       

      The reason for this is a customer requirement: all the documents should be stored in a single logical place so that, if the customer wants to move to a different system, all of their documents can be extracted for migration without relying on our system to do that.

       

      I will also need to be able to match some metadata against the documents to identify each one. Nothing complex, probably just a string property or two from a containing node. I could place some mapping metadata into an additional binary property so it is stored alongside the documents.

       

      Is there any documentation about the mapping from the JCR model into the storage? Is what I'm describing feasible?

       

      thanks.

        • 1. Re: Binaries and JCR to backing store mapping
          rhauch

          We use the same configuration field value to dictate which strings and which binary values are stored in the binary store, so with 3.0 it's not possible to independently configure the threshold for large strings and large binary values.

           

          When a STRING property value is encountered and it is larger than the threshold, ModeShape converts this large string to a Binary value and stores it in the binary store. Thus the binary store is just storing binary values, so there's no specialization that can be made at the binary store level. Note that the JCR API was designed so that the client always asks for the value in the type that the client wants (e.g., a STRING), and the implementation can attempt to perform the conversion. Per the JCR spec, a BINARY value can always be converted into a STRING (and vice versa), and this is why ModeShape can store a large string as a BINARY value without the client caring.

           

          However, I think it's fairly straightforward to add a configuration field that dictates the minimum size of the STRING values that should be converted and stored in the binary store. If it's not there, we can default it to the existing "minimumBinarySizeInBytes" value (which itself has a default). I think this new field would give you the control you need; you would simply set the new configuration field value to a large number. The only side effect is that large strings will always be stored with the node.
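
To make the proposal concrete, here is a minimal sketch of the decision it describes. This is not ModeShape code: the function name, the `min_string_size` parameter, and the two return labels are made up for illustration, and treating the threshold as inclusive (`>=`) is an assumption based on the word "minimum" in the field name.

```python
from typing import Optional, Union


def storage_location(value: Union[str, bytes],
                     min_binary_size: int,
                     min_string_size: Optional[int] = None) -> str:
    """Return where a property value would be kept under the proposed scheme.

    min_binary_size stands in for "minimumBinarySizeInBytes".
    min_string_size is the proposed (hypothetical) separate string threshold.
    """
    if min_string_size is None:
        # Proposed fallback: reuse the binary threshold when the new
        # field is not configured, which matches the 3.0 behaviour.
        min_string_size = min_binary_size

    if isinstance(value, bytes):
        size, threshold = len(value), min_binary_size
    else:
        # Assumption: string size is measured on the UTF-8 encoded form.
        size, threshold = len(value.encode("utf-8")), min_string_size

    return "binary-store" if size >= threshold else "node-document"
```

With `min_binary_size=0` and a very large `min_string_size`, every BINARY value lands in the binary store while every STRING stays with its node, which is the separation asked for above.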

           

          If you agree with my assessment, please log a feature enhancement.

          • 2. Re: Binaries and JCR to backing store mapping
            bwallis42

            Done, MODE-1699 created.

             

            thanks

            • 3. Re: Binaries and JCR to backing store mapping
              bwallis42

              Do you have any documentation or suggestions on where to look in the code to understand how the data is eventually persisted in the database?

               

              From my initial examination of the storage table for the repository, it looks like the keys are sometimes UUIDs and other times a hex number with -ref appended (at a ratio of 2 UUIDs to 1 -ref), with a serialised Java object in the data column (although I couldn't pull that object apart, so I'm not sure about that).

               

              The binarydata contains the binary data stored plus something prepended to it.

               

              The binarymetadata might be serialised instances of org.modeshape.jcr.value.binary.infinispan.Metadata

               

              As I said initially, I don't want to do anything with this data but I would like to be able to document how the data is persisted.

               

              If the data stored is indeed serialised java objects, it begs the question about upgrades and how data is migrated across changes to the storage format.

              • 4. Re: Binaries and JCR to backing store mapping
                rhauch

                We've not yet had the time to document how we store information in Infinispan. In lieu of that, here's a brief summary.

                 

                ModeShape stores several different kinds of "documents" in the Infinispan cache, where each Document object is serialized by Infinispan as BSON. Because BSON is a standard format, you should be able to use any code that can read BSON to extract the information. ModeShape's "Schematic" library has code that can convert between BSON and JSON, and actually contains the Java object representation. Note that within memory, the Document objects are directly accessible; only when Infinispan needs to serialize them (for persistence or transport to another process) will the documents be serialized to BSON. In our discussions, we often show the documents as JSON, since that's human readable. Oh, and because it's BSON, it's independent of Java serialization and ModeShape versions.

                 

                • Every node is represented as a separate document keyed by the "node key", which is a string in which the first 7 characters are the first part of the SHA1 of the "source" name, the next 7 characters are the first part of the SHA1 of the workspace name, and the remainder is the node's identifier string (often a UUID, though special nodes like the root or system nodes will have special identifiers) that is retrievable via Node.getIdentifier(). The "source" name is the name of the Infinispan cache used for content, but external nodes (for federation) will have different source names. Some examples of the documents are given (and described) on the Federation design (for 3.1 and later) page. Note that anytime another document is referenced, the document's full key string is included in the document.
                • Normally a node document will contain references to all of the node's children. But a node that has a lot of children would result in a large document that would be more expensive to serialize and deserialize, so ModeShape can optimize this by breaking the list of child references into "blocks": the document will contain the first block of children, but subsequent blocks will be stored as separate documents, each of which references the next block (like a forward linked list). Again, Federation design (for 3.1 and later) describes this a bit more and has an example document. The key for each block document is a string that follows the node key format, but the identifier part of a block document's key will always be a UUID.
                • When ModeShape stores a binary value, it stores some metadata (primarily reference counts) about the binary value in a small "metadata" document. The key will be a hexadecimal SHA1 with "-ref" appended.
                • ModeShape stores a few documents containing repository metadata (e.g., workspace names, source keys, version info, etc.). These documents are keyed by well-known string constants that will never clash with any of the other keys (the first 7 characters always contain non-hexadecimal characters, and thus can never clash with any SHA1-based keys).
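
The key layout described in the list above can be sketched roughly as follows. This is only an illustration, not ModeShape's implementation: all function names are made up, and it assumes that "the first part of the SHA1" means the first 7 characters of the lowercase hexadecimal digest of the UTF-8 encoded name. The classifier is a heuristic built purely on the rules stated above.

```python
import hashlib
import re

PREFIX_LEN = 7  # characters taken from each SHA1, per the description above


def sha1_prefix(name: str) -> str:
    # Assumption: lowercase hex digest of the UTF-8 bytes, truncated to 7 chars.
    return hashlib.sha1(name.encode("utf-8")).hexdigest()[:PREFIX_LEN]


def node_key(source_name: str, workspace_name: str, identifier: str) -> str:
    """Build a key: source prefix + workspace prefix + node identifier."""
    return sha1_prefix(source_name) + sha1_prefix(workspace_name) + identifier


def split_node_key(key: str) -> tuple:
    """Split a node key into (source prefix, workspace prefix, identifier)."""
    return key[:PREFIX_LEN], key[PREFIX_LEN:2 * PREFIX_LEN], key[2 * PREFIX_LEN:]


_BINARY_REF = re.compile(r"^[0-9a-f]{40}-ref$")   # full SHA1 plus "-ref"
_NODE_PREFIX = re.compile(r"^[0-9a-f]{14}")       # two 7-char SHA1 prefixes


def classify_key(key: str) -> str:
    """Guess which kind of document a stored key refers to."""
    if _BINARY_REF.match(key):
        return "binary-metadata"
    if _NODE_PREFIX.match(key):
        # Node documents and child-block documents share this format.
        return "node-or-child-block"
    # Anything with non-hex characters in the prefix is repository metadata.
    return "repository-metadata"
```

For example, `classify_key(node_key("content", "default", some_uuid))` returns `"node-or-child-block"`, where `"content"` and `"default"` are placeholder cache and workspace names. The identifier returned by `split_node_key` is what `Node.getIdentifier()` would report for that node.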

                 

                The backup process actually extracts all of these document values within the Infinispan cache (used for content storage) and writes them out to a series of files, where each file contains (usually) 100K JSON-serialized documents and is GZIP-ed. (IOW, run a backup, and try uncompressing any of the backup files, then open up the uncompressed files in an editor to see the documents. Strictly speaking, these uncompressed files are not valid JSON documents, but are actually concatenated JSON documents.) The backup also contains the binary values (in files named with the SHA1 of the value).
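
Because the uncompressed backup files hold concatenated JSON documents rather than one valid JSON document, a plain `json.load` will fail on them. Below is a small sketch of one way to read such a file with only the Python standard library. The function names are made up, UTF-8 encoding is assumed, and the whole file is read into memory for simplicity.

```python
import gzip
import json
from typing import Any, Iterator


def iter_concatenated_json(text: str) -> Iterator[Any]:
    """Yield each JSON document from a string of back-to-back documents."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # raw_decode returns the parsed value and the index just past it.
        doc, pos = decoder.raw_decode(text, pos)
        yield doc


def read_backup_file(path: str) -> Iterator[Any]:
    """Decompress one GZIP-ed backup file and yield its documents."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        yield from iter_concatenated_json(fh.read())
```

Calling `list(read_backup_file(path))` on one of the backup files should give one Python dict per stored document, which can then be matched against the binary files by their SHA1 names.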

                 

                ModeShape persists all information in an open format so that developers can always get to all of the data, though doing so may require some processing to put it into another format or system.

                • 5. Re: Binaries and JCR to backing store mapping
                  bwallis42

                  Thanks for that. Exactly the information that I need!