6 Replies Latest reply on Dec 21, 2011 9:08 PM by rhauch

    Which Connector?

    bwallis42

      I need to store a large number of documents (20 million+), ranging from a few KB to a few MB each (the repository could grow to 20 TB). The documents need to be persisted for many years (15+), I need some form of replication across at least two systems, and I need transactional semantics.

       

      The documents being stored have a reasonably natural hierarchical structure and can have a few dozen metadata values associated with each document. The hierarchy can be quite wide at the second level (up to a million, maybe, although we may be able to split that over two levels). Documents are also versioned, so once a document is stored it is not updated or replaced. It might be superseded by a newer document, but I need to be able to find and refer to the older version(s). Our current system (proprietary, filesystem based) does this using a link in the metadata of a new version to the older version of the document.

       

      Other parts of the application are using JPA into a relational database (Postgres 9.x) and the transactions will need to span storage across the domain objects (JPA), the message queues (ActiveMQ, persisted in Postgres) and the repository.

       

      I've read the "How to Select the Right Connector" article and only the JPA connector stands out as being suitable but we are going to be storing quite large BLOBs at times. Our average document size is probably 300-400 Kbytes.

       

      Not all the data needs to be stored via a single connector to a single data store; we can have multiple stores via the federated connector.

       

      I would appreciate any advice on how ModeShape might be used given the above requirements, or on whether I should consider some other way of doing this.

       

      thanks

      brian wallis...

        • 1. Re: Which Connector?
          lexsoto

          I would suggest looking into HBase (Hadoop Database). It supports versioning and, while it does not offer full transaction semantics, single-record updates are atomic. Volume would not be a problem, nor would replication or fault tolerance.

           

          Cheers,

          Alex

          • 2. Re: Which Connector?
            rhauch

            JCR might be a better fit if you're storing much more metadata, or even for just storing the metadata. ModeShape 2.x is also not able to participate in XA transactions. So given how you've described the system, I agree with Alex that other technologies (like HBase or even Gluster) might be a better fit.

             

            Having said that, this is a use case that we'd like to support with ModeShape 3, which we're working on now. So if it's possible, I'd love to hear more detail on the hierarchical structure, the sizes of the files, access patterns, etc.

             

            ModeShape 3 uses a completely different architecture and is using an Infinispan data grid for persisting content. Infinispan is massively scalable and fault tolerant, both achieved through distributing data across one or more clusters. ModeShape 3 will be able to use Infinispan just as a cache (where a subset of the total content is kept in memory within the Infinispan grid) or as its store (where all of the content is stored in the Infinispan in-memory grid, optionally backed by another persistence mechanism). ModeShape 3 will also be transactional and able to participate in XA transactions. Finally, ModeShape 3 will allow binary values to be stored separately from the rest of the content, with options for storing in a different Infinispan cache, on the file system, in a database (post 3.0), in Hadoop/HBase (post 3.0), or using a custom solution. We're planning on releasing our first alpha in a few weeks.

             

            Best regards,

             

            Randall

            • 3. Re: Which Connector?
              bwallis42

              Thanks Alex, Randall,

              I have some time to look at alternatives and also to consider ModeShape 3, as I have a project delivery date of end 2012. We are in preliminary investigation mode at the moment. I keep coming back to Infinispan and have previously looked at Hadoop, but it is time to have another look. I will keep an eye out for ModeShape 3; it sounds interesting.

               

              thanks for the help

              regards

              brian...

              • 4. Re: Which Connector?
                rhauch

                Hi, Brian:

                 

                I have a few questions to help me better understand the use case to help our ModeShape 3 work. If you can't answer them, that's okay.

                 

                 

                The documents being stored have a reasonably natural hierarchical structure and can have a few dozen metadata values associated with each document. The hierarchy can be quite wide at the second level (up to a million, maybe, although we may be able to split that over two levels).

                Is the content accessed by lookup (e.g., by identifier or path)? Will the metadata be queried, and/or will full-text search be used over the files? Will all children under a certain parent have unique names, or will same-name-sibling indexes be used? A million child nodes under a single parent should be possible, since ModeShape 3 breaks up the list of child references into blocks. (See this page in our very-much-a-draft 3.0 documentation for an explanation of how it works.) Having said that, the large numbers of children will add some overhead to some operations (e.g., getPath() and getName()) compared to smaller numbers of children, so we'll have to see what that's like as we do more testing. However, looking up a node by identifier should be quite efficient, as should looking up any children of that node.

                 

                Documents are also versioned, so once a document is stored it is not updated or replaced. It might be superseded by a newer document, but I need to be able to find and refer to the older version(s). Our current system (proprietary, filesystem based) does this using a link in the metadata of a new version to the older version of the document.

                You talk about documents not being updated or replaced, but might they be moved? As to referring to older versions, JCR REFERENCE properties should be a great way to do this, as dereferencing and finding which nodes reference a particular node are both very efficient and direct operations.
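To make the two operations concrete (following a link to older versions, and finding what refers to a given document), here is a minimal plain-Java sketch. It is only a model of the idea and does not use the JCR API; the class `VersionLinks` and its method names are hypothetical, and it assumes each document supersedes at most one older document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of "newer version links to older version".
class VersionLinks {
    // newer document id -> id of the document it supersedes
    private final Map<String, String> supersedes = new HashMap<>();
    // older document id -> ids of the documents that point at it
    private final Map<String, List<String>> referrers = new HashMap<>();

    // Record that newerId supersedes olderId.
    void link(String newerId, String olderId) {
        supersedes.put(newerId, olderId);
        referrers.computeIfAbsent(olderId, k -> new ArrayList<>()).add(newerId);
    }

    // Walk the links from id back to the oldest version (newest first).
    List<String> olderVersions(String id) {
        List<String> chain = new ArrayList<>();
        String current = supersedes.get(id);
        // The contains() check guards against an accidental cycle.
        while (current != null && !chain.contains(current)) {
            chain.add(current);
            current = supersedes.get(current);
        }
        return chain;
    }

    // Reverse lookup: which documents reference this one?
    List<String> referencedBy(String id) {
        return Collections.unmodifiableList(
                referrers.getOrDefault(id, Collections.emptyList()));
    }
}
```

In a repository, the forward link would be a REFERENCE property on the newer node, and the reverse lookup would be answered by the repository itself rather than by a map the application maintains.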

                 

                Again, I really hope that this kind of use case is a sweet spot for ModeShape 3, and that you'll give it a try.

                • 5. Re: Which Connector?
                  bwallis42

                  Is the content accessed by lookup (e.g., by identifier or path)? Will the metadata be queried, and/or will full-text search be used over the files? Will all children under a certain parent have unique names, or will same-name-sibling indexes be used? A million child nodes under a single parent should be possible, since ModeShape 3 breaks up the list of child references into blocks. (See this page in our very-much-a-draft 3.0 documentation for an explanation of how it works.) Having said that, the large numbers of children will add some overhead to some operations (e.g., getPath() and getName()) compared to smaller numbers of children, so we'll have to see what that's like as we do more testing. However, looking up a node by identifier should be quite efficient, as should looking up any children of that node.

                   

                  The main mode of access is via the path, with unique names. Same-name siblings are in general not required. The documents being stored are patient records, in particular scanned documents (JPEGs, TIFFs, PDFs, etc.). So a typical path would be:

                   

                  /<ns>/<patientid>/<admissionid>/<documentid>/page1/side1.jpg

                   

                  Where

                    ns is the namespace (a handful of these are generally defined)

                    patientid is a unique patient identifier; there can be 1M of these, but spread across a number of namespaces

                    admissionid is a unique identifier for an admission to the hospital

                    documentid is a unique identifier for a document
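As a rough illustration of the layout above, a path could be assembled like this. The class `DocumentPath`, its parameters, and the validation rules are hypothetical and only there to show the shape of the path; the real identifiers may need escaping that this sketch does not attempt.

```java
// Hypothetical helper that builds
// /<ns>/<patientid>/<admissionid>/<documentid>/page<N>/side<M>.<ext>
class DocumentPath {
    static String of(String ns, String patientId, String admissionId,
                     String documentId, int page, int side, String extension) {
        // Reject segments that would change the shape of the path.
        for (String s : new String[] {ns, patientId, admissionId, documentId, extension}) {
            if (s == null || s.isEmpty() || s.contains("/")) {
                throw new IllegalArgumentException("invalid path segment: " + s);
            }
        }
        if (page < 1 || side < 1) {
            throw new IllegalArgumentException("page and side are 1-based");
        }
        return "/" + String.join("/", ns, patientId, admissionId, documentId,
                "page" + page, "side" + side + "." + extension);
    }
}
```

For example, `DocumentPath.of("ns1", "p42", "a7", "d3", 1, 1, "jpg")` yields `/ns1/p42/a7/d3/page1/side1.jpg`.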

                   

                  Most metadata about patients and admissions is kept externally in a more traditional JPA-based store. There is a small amount of metadata on the documents, including some document type information, the original source of the document, maybe a document author, etc.

                   

                  This covers the bulk of documents, but some are stored in either text or XML formats, and we may in the future want to enable some form of search over the content of these documents.

                  You talk about documents not being updated or replaced, but might they be moved? As to referring to older versions, JCR REFERENCE properties should be a great way to do this, as dereferencing and finding which nodes reference a particular node are both very efficient and direct operations.

                   

                  Again, I really hope that this kind of use case is a sweet spot for ModeShape 3, and that you'll give it a try.

                   

                  References are essential, as a patient may have multiple IDs in multiple namespaces. This happens when patients attend two different places and two records are created. These records are then merged when the duplication is discovered, but both IDs remain valid into the future.

                   

                  One of the IDs is designated as the master ID and all documents should be found under this ID.

                   

                  In the case of a merge the documents are currently moved into the master ID record so it displays a complete record for the patient.

                   

                  At the moment we do not physically move documents; older documents may be kept on read-only filesystems and cannot be changed or moved. The relationship between a physical copy of a document in a filestore and the patient is maintained in a relational database, so for the merges we just update the database to logically move the documents.

                   

                  This of course brings up the point that we need to do storage management under the connectors as well. With large stores we often get the requirement from customers to store some records on slower storage and some on faster storage (cost/speed tradeoffs). At the moment this is a manual and error-prone task of moving files to new filesystems and creating symlinks or new mount points so the physical filesystem path is maintained. It would be nice to make this a less difficult and more flexible task. If we go for a pure database connector then this is a database admin problem, but I'm not sure that would be the best approach for us, and it would certainly limit flexibility.

                   

                  ModeShape 3 sounds interesting and, given our timeframe, is definitely under consideration.

                   

                  Are there plans for a commercially supported version from Red Hat running under the next enterprise version of JBoss (which I think would be version 6, based on the open source version 7.1)?

                   

                  Thanks for your interest

                  regards,

                  brian wallis...

                  • 6. Re: Which Connector?
                    rhauch

                    Brian, thanks for sharing all this information. It's always great to see good concrete use cases for larger repositories.

                    The main mode of access is via the path, with unique names. Same-name siblings are in general not required. The documents being stored are patient records, in particular scanned documents (JPEGs, TIFFs, PDFs, etc.). So a typical path would be:

                     

                    /<ns>/<patientid>/<admissionid>/<documentid>/page1/side1.jpg

                     

                    Where

                      ns is the namespace (a handful of these are generally defined)

                      patientid is a unique patient identifier; there can be 1M of these, but spread across a number of namespaces

                      admissionid is a unique identifier for an admission to the hospital

                      documentid is a unique identifier for a document

                     

                    From a JCR perspective, this seems like it would make a good hierarchical structure for a repository. I can see why same-name siblings (SNS) are not really applicable (which is fine). Also, path-based access would work quite well. Another option is that the external database(s) could store the string identifier for a node, allowing an application to quickly find the document (or page or whatever) via the Session.getNodeByIdentifier() method, without having to use the path. Note that this is independent of whether the node is "mix:referenceable", since JCR 2.0 stipulates that all nodes have an identifier of some form.

                    Most metadata about patients and admissions is kept externally in a more traditional JPA-based store. There is a small amount of metadata on the documents, including some document type information, the original source of the document, maybe a document author, etc.

                     

                    This covers the bulk of documents, but some are stored in either text or XML formats, and we may in the future want to enable some form of search over the content of these documents.

                    Searching shouldn't be a problem. ModeShape's text extractors can pull out the terms and use them while indexing the node. ModeShape will work out of the box with text, XML, and quite a few others, but you'd always be able to define your own if need be.

                     

                    You may also find sequencing to be useful. The XML sequencer, for example, could extract the XML structure of a file and turn that into a corresponding node structure that can be accessed, searched, etc. without having to read the file. Of course, you don't need to enable any sequencers.

                     

                    References are essential, as a patient may have multiple IDs in multiple namespaces. This happens when patients attend two different places and two records are created. These records are then merged when the duplication is discovered, but both IDs remain valid into the future.

                     

                    One of the IDs is designated as the master ID and all documents should be found under this ID.

                    Makes a lot of sense. JCR reference properties should work perfectly, if you wanted to use them.

                     

                    In the case of a merge the documents are currently moved into the master ID record so it displays a complete record for the patient.

                     

                    At the moment we do not physically move documents; older documents may be kept on read-only filesystems and cannot be changed or moved. The relationship between a physical copy of a document in a filestore and the patient is maintained in a relational database, so for the merges we just update the database to logically move the documents.

                    The binary values used to store files in a repository are stored in ModeShape 3 in a binary store, where they are keyed by their SHA-1 hash. Since the hash is determined by the content of the file (rather than the name or any other identifier), as long as the file content never changes, the file can always be identified and found by its hash. For example, if you're using the file system storage option (there will be others), the files are stored under directories and filenames derived from the SHA-1 hash.
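A minimal sketch of the content-addressed idea, using only the JDK: hash the bytes with SHA-1 and derive a file location from the hex digest. The class `HashedBinaryPath` is hypothetical, and the directory layout (first two hex characters, next two, then the full hash) is an assumption made for illustration, not necessarily the layout ModeShape 3 uses.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: maps file content to a location based on its SHA-1 hash.
class HashedBinaryPath {
    // Lower-case hex SHA-1 digest of the content.
    static String sha1Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    // Assumed layout: <root>/<hash[0..2]>/<hash[2..4]>/<hash>
    static Path pathFor(Path root, byte[] content) {
        String hash = sha1Hex(content);
        return root.resolve(hash.substring(0, 2))
                   .resolve(hash.substring(2, 4))
                   .resolve(hash);
    }

    public static void main(String[] args) {
        byte[] doc = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(pathFor(Paths.get("binaries"), doc));
    }
}
```

Because the location depends only on the bytes, two identical scans map to the same file, and renaming or logically moving a document never changes where its content lives.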

                     

                    ModeShape 3 will also provide a way to obtain the hash of any binary value, so you can perform the under-the-covers management. For example, if a set of documents is deemed "old", it'd be a simple matter of figuring out where the file is stored so a process you write could copy the file into read-only storage (and after that succeeds remove it from the binary store). It also would be pretty straightforward to enable the file store to access the read-only filesystem so that the content is still visible to the repository clients. Oh, and even if you're doing all this "under-the-covers" management, the repository appears unchanged and remains accessible (though if needed you could capture on the node that a document has been moved to read-only storage).
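The copy-then-remove step could look roughly like the following, again using only the JDK. The class `BinaryArchiver` and its method are hypothetical; locating the file for a given hash and making the archive visible to the repository are left out, and the byte-for-byte comparison reads both files into memory, which is acceptable for documents of a few MB but not for very large binaries.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical helper: move a stored binary into an archive directory,
// deleting the original only after the copy has been verified.
class BinaryArchiver {
    static Path archive(Path source, Path archiveDir) throws IOException {
        Files.createDirectories(archiveDir);
        Path target = archiveDir.resolve(source.getFileName());

        // Copy first; the source stays in place until the copy is verified.
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);

        // Verify the copy before touching the original.
        if (!Arrays.equals(Files.readAllBytes(source), Files.readAllBytes(target))) {
            throw new IOException("archived copy differs from source: " + target);
        }

        // Only now remove the original from the live store.
        Files.delete(source);
        return target;
    }
}
```

If the copy or the verification fails, the exception leaves the original file untouched, so the process can simply be retried.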

                    This of course brings up the point that we need to do storage management under the connectors as well. With large stores we often get the requirement from customers to store some records on slower storage and some on faster storage (cost/speed tradeoffs). At the moment this is a manual and error-prone task of moving files to new filesystems and creating symlinks or new mount points so the physical filesystem path is maintained. It would be nice to make this a less difficult and more flexible task. If we go for a pure database connector then this is a database admin problem, but I'm not sure that would be the best approach for us, and it would certainly limit flexibility.

                    Part of this depends on how it's determined which records go on slower storage. And by "records" do you mean nodes in a repository, or uploaded files, or maybe both? One option is to add the distinction into your hierarchy design, and use federation to keep the different parts of the repository in separate stores. Another option is to use properties to denote which ones are which. This is a little more involved, but certainly something I'd be interested in trying to provide out of the box.

                     

                    ModeShape 3 sounds interesting and, given our timeframe, is definitely under consideration.

                     

                    Are there plans for a commercially supported version from Red Hat running under the next enterprise version of JBoss (which I think would be version 6, based on the open source version 7.1)?

                     

                    ModeShape 2 is currently included and fully supported in JBoss' Enterprise Data Services (EDS) (version 5.1 or later), which is an add-on to the JBoss SOA-P platform (which is built on top of EAP 5.x). The current plan is to include ModeShape 3 in the next major version of the EDS platform, which will be based upon EAP 6 (and AS7). BTW, all JBoss "products" (what we call platforms) are still open source; they're really just tested, qualified, and certified integrations of many (many!) open source projects, and come with Red Hat's outstanding support.

                     

                    Thanks again for sharing all this information. And please continue to offer suggestions, requirements, and use cases - it's still possible to influence how some of the features work or how they grow over time.

                     

                    Best regards