-
1. Re: Which Connector?
lexsoto Dec 14, 2011 8:30 AM (in response to bwallis42)
I would suggest looking into HBase (Hadoop Database). It supports versioning and, while it does not offer full transaction semantics, single-record updates are atomic. Volume would not be a problem, nor would replication or fault tolerance.
Cheers,
Alex
-
2. Re: Which Connector?
rhauch Dec 14, 2011 8:57 AM (in response to lexsoto)
JCR might be a better fit if you're storing much more metadata, or even for just storing the metadata. ModeShape 2.x is also not able to participate in XA transactions. So given how you've described the system, I agree with Alex that other technologies (like HBase or even Gluster) might be a better fit.
Having said that, this is a use case that we'd like to support with ModeShape 3, which we're working on now. So if it's possible, I'd love to hear more detail on the hierarchical structure, the sizes of the files, access patterns, etc.
ModeShape 3 uses a completely different architecture and is using an Infinispan data grid for persisting content. Infinispan is massively scalable and fault tolerant, both achieved through distributing data across one or more clusters. ModeShape 3 will be able to use Infinispan just as a cache (where a subset of the total content is kept in memory within the Infinispan grid) or as its store (where all of the content is stored in the Infinispan in-memory grid, optionally backed by another persistence mechanism). ModeShape 3 will also be transactional and able to participate in XA transactions. Finally, ModeShape 3 will allow binary values to be stored separately from the rest of the content, with options for storing in a different Infinispan cache, on the file system, in a database (post 3.0), in Hadoop/HBase (post 3.0), or using a custom solution. We're planning on releasing our first alpha in a few weeks.
Best regards,
Randall
-
3. Re: Which Connector?
bwallis42 Dec 20, 2011 11:03 PM (in response to rhauch)
Thanks Alex, Randall,
I have some time to look at alternatives and also to consider ModeShape 3 as I have a project delivery date of end 2012. We are in preliminary investigation mode at the moment. I keep coming back to Infinispan and have previously looked to Hadoop but it is time to have another look. I will keep an eye out for ModeShape 3, sounds interesting.
thanks for the help
regards
brian...
-
4. Re: Which Connector?
rhauch Dec 21, 2011 8:52 AM (in response to bwallis42)
Hi, Brian:
I have a few questions to help me better understand the use case to help our ModeShape 3 work. If you can't answer them, that's okay.
The documents being stored have a reasonably natural hierarchical structure and can have a few dozen metadata values associated with each document. The hierarchy can be quite wide at the second level (up to a million maybe, although we may be able to split that over two levels).
Is the content accessed by lookup (e.g., by identifier or path)? Will the metadata be queried, and/or will full-text search be used over the files? Will all children under a certain parent have unique names, or will same-name-sibling indexes be used? A million child nodes under a single parent should be possible, since ModeShape 3 breaks up the list of child references into blocks. (See this page in our very-much-a-draft 3.0 documentation for an explanation of how it works.) Having said that, the large numbers of children will add some overhead to some operations (e.g., getPath() and getName()) compared to smaller numbers of children, so we'll have to see what that's like as we do more testing. However, looking up a node by identifier should be quite efficient, as should looking up any children of that node.
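To make the "blocks of child references" idea concrete, here is a minimal sketch. It is purely illustrative and is not ModeShape's implementation: the class name `BlockedChildList`, the fixed block size, and the append-only behaviour are all assumptions. The point it shows is that a positional lookup touches only the one block that holds the child, rather than one list with a million entries.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: child names split into fixed-size blocks. Not ModeShape's actual design. */
final class BlockedChildList {
    private final int blockSize;
    private final List<List<String>> blocks = new ArrayList<>();
    private int size;

    BlockedChildList(int blockSize) {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
    }

    /** Appends a child name, starting a new block when the last one is full. */
    void add(String childName) {
        if (blocks.isEmpty() || blocks.get(blocks.size() - 1).size() == blockSize) {
            blocks.add(new ArrayList<>(blockSize));
        }
        blocks.get(blocks.size() - 1).add(childName);
        size++;
    }

    /** Returns the child at a position by reading only the block that holds it. */
    String get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return blocks.get(index / blockSize).get(index % blockSize);
    }

    int size() {
        return size;
    }

    int blockCount() {
        return blocks.size();
    }

    public static void main(String[] args) {
        BlockedChildList children = new BlockedChildList(3);
        for (int i = 0; i < 7; i++) {
            children.add("child" + i);
        }
        System.out.println(children.blockCount() + " blocks, child at 4 = " + children.get(4));
    }
}
```

In a real store each block would presumably be loaded on demand, so a parent with very many children never has to be materialised in one piece.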
Documents are also versioned, so once a document is stored it is not updated or replaced. It might be superseded by a newer document, but I need to be able to find and refer to the older version(s). Our current system (proprietary filesystem based) does this using a link in the metadata of a new version to the older version of the document.
You talk about documents not being updated or replaced, but might they be moved? As to referring to older versions, JCR REFERENCE properties should be a great way to do this, as dereferencing and finding which nodes reference a particular node are both very efficient and direct operations.
Again, I really hope that this kind of use case is a sweet spot for ModeShape 3, and that you'll give it a try.
-
5. Re: Which Connector?
bwallis42 Dec 21, 2011 7:07 PM (in response to rhauch)
Is the content accessed by lookup (e.g., by identifier or path)? Will the metadata be queried, and/or will full-text search be used over the files? Will all children under a certain parent have unique names, or will same-name-sibling indexes be used? A million child nodes under a single parent should be possible, since ModeShape 3 breaks up the list of child references into blocks. (See this page in our very-much-a-draft 3.0 documentation for an explanation of how it works.) Having said that, the large numbers of children will add some overhead to some operations (e.g., getPath() and getName()) compared to smaller numbers of children, so we'll have to see what that's like as we do more testing. However, looking up a node by identifier should be quite efficient, as should looking up any children of that node.
The main mode of access is via the path with unique names. Same-name siblings are in general not required. The documents being stored are patient records, in particular scanned documents (JPEGs, TIFFs, PDFs, etc.). So a typical path would be
/<ns>/<patientid>/<admissionid>/<documentid>/page1/side1.jpg
Where
ns is the namespace (a handful of these are generally defined)
patientid is a unique patient identifier, there can be 1M of these but spread across a number of namespaces
admissionid is a unique identifier for an admission to the hospital
documentid is a unique identifier for a document
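As a rough illustration of that layout, the path can be built and taken apart with a small helper. This is only a sketch: the class `DocumentPath` and its field names are made up for the example and are not part of JCR or ModeShape, and the sample identifiers are invented.

```java
/** Hypothetical helper for the /<ns>/<patientid>/<admissionid>/<documentid>/... layout. */
final class DocumentPath {
    final String namespace;
    final String patientId;
    final String admissionId;
    final String documentId;

    DocumentPath(String namespace, String patientId, String admissionId, String documentId) {
        this.namespace = namespace;
        this.patientId = patientId;
        this.admissionId = admissionId;
        this.documentId = documentId;
    }

    /** Builds the absolute path to one scanned side of one page. */
    String pathTo(int page, int side, String extension) {
        return "/" + namespace + "/" + patientId + "/" + admissionId + "/" + documentId
                + "/page" + page + "/side" + side + "." + extension;
    }

    /** Parses the first four segments of an absolute path; later segments are ignored. */
    static DocumentPath parse(String path) {
        String[] parts = path.split("/");
        // parts[0] is the empty string before the leading "/"
        if (parts.length < 5 || !parts[0].isEmpty()) {
            throw new IllegalArgumentException("Unexpected path: " + path);
        }
        return new DocumentPath(parts[1], parts[2], parts[3], parts[4]);
    }

    public static void main(String[] args) {
        DocumentPath doc = new DocumentPath("ns1", "P123", "A456", "D789");
        System.out.println(doc.pathTo(1, 1, "jpg"));
    }
}
```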
Most metadata about patients and admissions is kept externally in a more traditional JPA-based store. There is a small amount of metadata on the documents, including some document type information, the original source of the document, maybe a document author, etc.
This covers the bulk of documents, but some are stored in either text or XML formats, and we may in the future want to enable some form of search over the content of these documents.
You talk about documents not being updated or replaced, but might they be moved? As to referring to older versions, JCR REFERENCE properties should be a great way to do this, as dereferencing and finding which nodes reference a particular node are both very efficient and direct operations.
Again, I really hope that this kind of use case is a sweet spot for ModeShape 3, and that you'll give it a try.
References are essential as a patient may have multiple IDs in multiple namespaces. This happens when patients attend two different places and two records are created. These records are then merged when the duplication is discovered, but both IDs remain valid into the future.
One of the IDs is designated as the master ID and all documents should be found under this ID.
In the case of a merge the documents are currently moved into the master ID record so it displays a complete record for the patient.
At the moment we do not physically move documents, older documents may be kept on readonly filesystems and cannot be changed or moved. The relationship between a physical copy of a document in a filestore and the patient is maintained in a relational database so for the merges we just update the database to logically move the documents.
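A logical merge of this kind can be sketched as a lookup that redirects a merged ID to its master, without touching the stored files. The class below is a hypothetical in-memory stand-in for the relational table; the name `PatientIdRegistry` and the "namespace/patientid" key format are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a logical merge; an in-memory stand-in for a database table. */
final class PatientIdRegistry {
    // Maps a merged (non-master) patient key to the master key it now points at.
    private final Map<String, String> masterOf = new HashMap<>();

    /** Records that 'duplicate' has been merged into 'master'. Both keys stay valid for lookup. */
    void merge(String duplicate, String master) {
        String masterRoot = resolve(master);
        String duplicateRoot = resolve(duplicate);
        if (masterRoot.equals(duplicateRoot)) {
            throw new IllegalArgumentException("IDs are already merged: " + duplicate + ", " + master);
        }
        // Only roots are linked, so the redirect chain can never form a cycle.
        masterOf.put(duplicateRoot, masterRoot);
    }

    /** Follows merge links until the master key is reached. */
    String resolve(String key) {
        String current = key;
        while (masterOf.containsKey(current)) {
            current = masterOf.get(current);
        }
        return current;
    }

    public static void main(String[] args) {
        PatientIdRegistry registry = new PatientIdRegistry();
        registry.merge("ns2/P900", "ns1/P123");
        System.out.println(registry.resolve("ns2/P900"));
    }
}
```

Documents are then always looked up under `resolve(id)`, so the physical copy on a read-only filesystem never has to move.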
This of course brings up the point that we need to do storage management under the connectors as well. With large stores we often get the requirement from customers to store some records on slower storage and some on faster storage (cost/speed tradeoffs). At the moment this is a manual and error prone task of moving files to new filesystems and creating symlinks or new mount points so the physical filesystem path is maintained. It would be nice to make this a less difficult and more flexible task. If we go for a pure database connector then this is a database admin problem, but I'm not sure that would be the best approach for us and would certainly limit flexibility.
ModeShape 3 sounds interesting and, given our timeframe, is definitely under consideration.
Are there plans for a commercially supported version from Red Hat running under the next enterprise version of JBoss (which I think would be version 6, based on the open source version 7.1)?
Thanks for your interest
regards,
brian wallis...
-
6. Re: Which Connector?
rhauch Dec 21, 2011 9:08 PM (in response to bwallis42)
Brian, thanks for sharing all this information. It's always great to see good concrete use cases for larger repositories.
The main mode of access is via the path with unique names. Same-name siblings are in general not required. The documents being stored are patient records, in particular scanned documents (JPEGs, TIFFs, PDFs, etc.). So a typical path would be
/<ns>/<patientid>/<admissionid>/<documentid>/page1/side1.jpg
Where
ns is the namespace (a handful of these are generally defined)
patientid is a unique patient identifier, there can be 1M of these but spread across a number of namespaces
admissionid is a unique identifier for an admission to the hospital
documentid is a unique identifier for a document
From a JCR perspective, this seems like it would make a good hierarchical structure for a repository. I can see why SNS are not really applicable (which is fine). Also, path-based access would work quite well. Another option is that the external database(s) could store the string identifier for a node, allowing an application to quickly find the document (or page or whatever) via the Session.getNodeByIdentifier() method, without having to use the path. Note that this is independent of whether the node is "mix:referenceable", since JCR 2.0 stipulates that all nodes have an identifier of some form.
Most metadata about patients and admissions is kept externally in a more traditional JPA-based store. There is a small amount of metadata on the documents, including some document type information, the original source of the document, maybe a document author, etc.
This covers the bulk of documents, but some are stored in either text or XML formats, and we may in the future want to enable some form of search over the content of these documents.
Searching shouldn't be a problem. ModeShape's text extractors can pull out the terms and use them while indexing the node. ModeShape will work out of the box with text, XML, and quite a few others, but you'd always be able to define your own if need be.
You may also find sequencing to be useful. The XML sequencer, for example, could extract the XML structure of a file and turn that into a corresponding node structure that can be accessed, searched, etc. without having to read the file. Of course, you don't need to enable any sequencers.
References are essential as a patient may have multiple IDs in multiple namespaces. This happens when patients attend two different places and two records are created. These records are then merged when the duplication is discovered, but both IDs remain valid into the future.
One of the IDs is designated as the master ID and all documents should be found under this ID.
Makes a lot of sense. JCR reference properties should work perfectly, if you wanted to use them.
In the case of a merge the documents are currently moved into the master ID record so it displays a complete record for the patient.
At the moment we do not physically move documents, older documents may be kept on readonly filesystems and cannot be changed or moved. The relationship between a physical copy of a document in a filestore and the patient is maintained in a relational database so for the merges we just update the database to logically move the documents.
The binary values used to store files in a repository are stored in ModeShape 3 in a binary store where they are keyed by their SHA-1 hash. Since the hash is determined by the content of the file (rather than the name or any other identifier), as long as the file content never changes, the file can always be identified and found by its hash. For example, if you're using the file system storage option (there will be others), the files are stored in directories and filenames based upon the SHA-1 hash.
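To give a feel for content-based keys, here is a small sketch that computes the SHA-1 hash of some bytes and derives a file location from it. The directory layout (first two hex characters, then the next two, then the full hash) and the names `Sha1Keys`, `sha1Hex`, and `pathFor` are assumptions for illustration; the actual layout used by ModeShape 3's file system store may well differ.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative content-addressed naming; the directory layout is an assumption. */
final class Sha1Keys {
    /** Returns the SHA-1 hash of the content as 40 lowercase hex characters. */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Assumed layout: <root>/<first 2 hex chars>/<next 2 hex chars>/<full hash>. */
    static Path pathFor(Path root, String hex) {
        return root.resolve(hex.substring(0, 2)).resolve(hex.substring(2, 4)).resolve(hex);
    }

    public static void main(String[] args) {
        String key = sha1Hex("abc".getBytes(StandardCharsets.UTF_8));
        System.out.println(key);
        System.out.println(pathFor(Paths.get("binary-store"), key));
    }
}
```

Because the key depends only on the bytes, two uploads of the same scan map to the same location, and moving a node in the hierarchy does not change where its file lives.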
ModeShape 3 will also provide a way to obtain the hash of any binary value, so you can perform the under-the-covers management. For example, if a set of documents is deemed "old", it'd be a simple matter of figuring out where the file is stored so a process you write could copy the file into read-only storage (and after that succeeds remove it from the binary store). It also would be pretty straightforward to enable the file store to access the read-only filesystem so that the content is still visible to the repository clients. Oh, and even if you're doing all this "under-the-covers" management, the repository appears unchanged and remains accessible (though if needed you could capture on the node that a document has been moved to read-only storage).
This of course brings up the point that we need to do storage management under the connectors as well. With large stores we often get the requirement from customers to store some records on slower storage and some on faster storage (cost/speed tradeoffs). At the moment this is a manual and error prone task of moving files to new filesystems and creating symlinks or new mount points so the physical filesystem path is maintained. It would be nice to make this a less difficult and more flexible task. If we go for a pure database connector then this is a database admin problem, but I'm not sure that would be the best approach for us and would certainly limit flexibility.
Part of this depends on how it's determined to store some records on slower storage. And by "records" do you mean nodes in a repository, or uploaded files, or maybe both? One option is to add the distinction into your hierarchy design, and use federation to keep the different parts of the repository in separate stores. Another option is to use properties to denote which ones are which - this is a little more involved but certainly something I'd be interested in trying to provide out of the box.
ModeShape 3 sounds interesting and, given our timeframe, is definitely under consideration.
Are there plans for a commercially supported version from Red Hat running under the next enterprise version of JBoss (which I think would be version 6, based on the open source version 7.1)?
ModeShape 2 is currently included and fully supported in JBoss' Enterprise Data Services (EDS) (version 5.1 or later), which is an add-on to the JBoss SOA-P platform (which is built on top of EAP 5.x). The current plan is to include ModeShape 3 in the next major version of the EDS platform, which will be based upon EAP 6 (and AS7). BTW, all JBoss "products" (what we call platforms) are still open source; they're really just tested, qualified, and certified integrations of many (many!) open source projects, and come with Red Hat's outstanding support.
Thanks again for sharing all this information. And please continue to offer suggestions, requirements, and use cases - it's still possible to influence how some of the features work or how they grow over time.
Best regards