2 Replies Latest reply on Mar 5, 2012 4:18 AM by fcarriedos

    Use case for Modeshape

    fcarriedos

      Hi there,

       

      I have been reading about ModeShape and checking the documentation and, at least in terms of functionality, it seems to be what I am looking for. Since I have read some use case assessments in this forum, I thought it would be worth presenting mine, to get a clearer idea of what it would take to build the system I need.

       

      Let's take it in parts. For data structures, file sizes and user volumes, I have the following requirements:

       

      - store files ranging from 1 KB to 30 MB

      - store metadata for each file, about 10 properties per file

      - very high user volume, ranging from 10K to 1M users

      - some workspaces (no clear idea yet, but at least two of them)

      - automatic metadata extraction

       

      About the architecture, I would like a physically distributed repository that works as follows:

       

      - clustered ModeShape running on JBoss AS server instances, preferably providing fault tolerance and sticky sessions

      - a load balancer for the users to HTTP GET their files (and maybe PUT them through WebDAV too, but I ran into some issues with Nginx when writing; I don't know whether any other load balancer works well with the WebDAV methods used by ModeShape)

      - a central system that will read/write the same files described above (locking will be useful), through the same load balancer using WebDAV

      - a shared disk drive accessed through NFS for the file repository (the main idea is to give the machine hosting the drive enough network interfaces to exploit the drive's entire bandwidth on the machine's internal bus)

      - a shared database for the metadata repository

       

      About security:

       

      - both authentication and authorization for the volume of users I described (for example, ACL management is not implemented through WebDAV in other JCR 2.0 implementations such as Jackrabbit)

       

       

      Further questions:

       

      - For my scenario I find it very interesting not to copy a file twice (SHA-1 hash based), to save bandwidth in certain situations. How does it work? Should the WebDAV client calculate the hash, query the repository to find out whether the file is already present and, if it is, skip the transfer and just create another "link" to the existing one? Would this work correctly with the federated connector joining the database and file system connectors?

      - Garbage collection: is a file actually deleted from disk by a garbage collector once it has been deleted from the repository (no references to it remain)?

      - Could the central system's reads/writes be done directly against the shared drive, with ModeShape then taking care of completing the metadata work on its own? Would there be synchronization issues?

      - So far, I have read about the database connector (appropriate for the metadata, I guess), the file system connector (appropriate for the files, I guess) and the federated connector (to join the two and offer a single repository interface to the logic layer on top of ModeShape). This is where the "magic" that gives developers transparency happens, isn't it? Are there synchronization issues here?

      - Monitoring facilities through the JBoss console, which I find very interesting as well: would they also work in the clustered scenario described?

       

      First of all, if you reached this point within the post, thank you very much!!!

       

      Please find attached some images to support the explanations I wrote above.

       

      Thanks in advance for your responses, every contribution is explicitly welcome!!!

        • 1. Re: Use case for Modeshape
          rhauch

          This sounds like a very interesting system, and JCR and ModeShape sound like a good fit.

           

          - For my scenario I find it very interesting not to copy a file twice (SHA-1 hash based), to save bandwidth in certain situations. How does it work? Should the WebDAV client calculate the hash, query the repository to find out whether the file is already present and, if it is, skip the transfer and just create another "link" to the existing one? Would this work correctly with the federated connector joining the database and file system connectors?

          Although ModeShape does internally store BINARY and large STRING values keyed by their SHA-1s, this is mostly an implementation detail in 2.x and thus it is much harder for an application to take advantage of this. Firstly, in this case the client would have to precompute the SHA-1, send it to the server, the server would have to trust the client and then find whether that SHA-1 is already stored and communicate this to the client, and the client would then behave accordingly. This may be possible/feasible if you have complete control of all clients, but if not then it might be harder to reliably and securely get working.

           

          Secondly, there are challenges with leveraging ModeShape's internal storage mechanism for your business logic. It's much harder to figure out if/whether a SHA-1 is used and, if so, how to get the BINARY value. Only the disk connector stores the BINARY and large STRING values using the SHA-1s in a way that you can (easily) access by other server code, because of the way it stores content on the disk.

           

          (Note: This is changing a fair amount in 3.0, which is still in alpha ATM. We may introduce a ModeShape-specific public API for finding/getting the (read-only) BINARY value by the SHA-1 hash, and this would mean you could more easily reuse ModeShape's internal SHA-1-based storage. Unfortunately, retrofitting this into the 2.x codebase would be extremely difficult and involved.)

           

          Of course, another option is to implement the SHA-1-based file storage in your application logic, using only the JCR API. For example, you could calculate the SHA-1 and store the files in a separate node hierarchy leveraging the SHA-1 structure -- e.g., use the first two characters of the SHA-1 to dictate the name of the first "folder" level, the next two characters the second "folder" level, same for the third, and finally the remaining characters (or the full hash) in the fourth "file" level. You could then use this node hierarchy within your application to easily find content by SHA-1. It'd also be very easy to store additional, content-specific metadata in this separate area. You could even use the file system connector on this separate area, allowing your WebDAV layer to easily access it. Your regular content could have properties to capture the SHA-1 of referenced files, and you could use queries to find what other content references/uses a particular SHA-1.
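
          As a rough illustration of the layout described above, here is a minimal Java sketch that uses only the JDK. The class and method names (Sha1Paths, sha1Hex, shardedPath) are hypothetical and not part of ModeShape or the JCR API, and the choice of keeping only the remaining 34 characters at the "file" level is one of the two variants mentioned.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper for illustration; not part of ModeShape or the JCR API. */
final class Sha1Paths {

    /** Returns the lowercase hex SHA-1 digest (40 characters) of the content. */
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder(40);
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Three two-character "folder" levels, then the remaining characters as the "file" level. */
    static String shardedPath(String sha1Hex) {
        if (sha1Hex.length() != 40) {
            throw new IllegalArgumentException("expected a 40-character hex SHA-1");
        }
        return sha1Hex.substring(0, 2) + "/"
             + sha1Hex.substring(2, 4) + "/"
             + sha1Hex.substring(4, 6) + "/"
             + sha1Hex.substring(6);
    }
}
```

          The resulting relative path could then be used as the node path under whatever root you reserve for this separate area, so that finding content by SHA-1 becomes a direct path lookup rather than a query.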

           

          - Could the central system's reads/writes be done directly against the shared drive, with ModeShape then taking care of completing the metadata work on its own? Would there be synchronization issues?

           

          If your application owned the management of the files and used the file system connector to back the node hierarchy, the "nt:folder" and "nt:file" nodes in this hierarchy would be written to disk as folders and files. That means your application could write files there, and the corresponding "nt:file" and "nt:folder" nodes would automatically appear to the JCR layer.

           

          - So far, I have read about the database connector (appropriate for the metadata, I guess), the file system connector (appropriate for the files, I guess) and the federated connector (to join the two and offer a single repository interface to the logic layer on top of ModeShape). This is where the "magic" that gives developers transparency happens, isn't it? Are there synchronization issues here?

          This is correct. In fact, in my outline above, I'd recommend using the file system connector for the portion of the repository where you're storing your SHA-1s, using another connector (JPA connector, for example) for everything else, and using the federation connector to wire the two together.

           

          - Monitoring facilities through the JBoss console, which I find very interesting as well: would they also work in the clustered scenario described?

          Yes, monitoring works in clustered scenarios, too. Note that our RHQ plugin also works with JON, which is the JBoss product for monitoring and management of clusters.

           

          Here are a few more miscellaneous answers (out of order):

          1. You could use sequencers for "automatic metadata extraction" - that's exactly what they were designed to do.
          2. You can use the authentication/authorization providers we supply, or you can implement an entirely custom one, which would give you more control over access.

           

          As I said earlier, this seems like a pretty good fit for ModeShape. Of course, the best idea is to build a prototype to see for yourself, to identify any potential issues, and to check for performance.

           

          I hope this was helpful! Best regards.

          • 2. Re: Use case for Modeshape
            fcarriedos

            First of all, thank you very much, Randall, for your detailed response; it was really useful and gave me some orientation.

             

            As you suggested, I have been playing around a bit with ModeShape, and it really seems to be what I was looking for. So far:

             

            - I have a repository using the file system connector

             

            - REST to operate it (instead of WebDAV, which causes trouble with the load balancer; BTW I was using Nginx)

             

            - File system connector: I checked that files added to the directory are seen by ModeShape and directly exposed as JCR content (so the baseline architecture I described here would be possible). What about scalability, then? I mean sequencer performance in particular.

             

            - I just read something about Aperture, since the repository must provide reliable metadata on the contents it stores (if there's a better option, please point it out).

             

            - What about JBoss 7: is ModeShape ready to be deployed on it? If not, maybe I can contribute a little on this, since we are migrating all our apps to JBoss 7, but I cannot promise anything, as it doesn't depend only on me...

             

            - Do you find the architecture described in the figure attached to this new post unfeasible? Your response would be really valuable, as you know many implications that I will presumably not see... As soon as I am sure such an architecture is correct and feasible, I will start working on it.

             

            - Despite having files persisted in my repository, I want to work with JCR nodes, sessions and so on. With the deployment I am using right now (modeshape-services.jar for the standalone service within JBoss and modeshape-rest.war for the REST API), to operate the repository using the JCR API through the REST API I need to follow the steps described here, and I will get a repository object with which developers can obtain the repository handle, decide when to create/save/close sessions, and manage nodes and their properties. An appropriate API for the development team, am I right?

             

             

            Retrieval of contents to the end customer

            -----------------------------------------------------

             

            - Delivering the files stored in the repository to the end customer (through web browsers and so on) would be as simple as writing a servlet that gets the file through whatever interface (REST in my case), checks that the user is correctly authenticated and authorized, and writes the binary data into the response (possibly setting the appropriate headers based on the metadata). Is that right?

             

             

            The options you mentioned

            -----------------------------------

             

            If I understood correctly, there were two suggestions in your response about leveraging SHA-1 to save bandwidth:

             

            - use the file system connector to store the files within a SHA-1-keyed folder structure, to keep search complexity low.

             

            - maintain the hashes and metadata in a parallel structure with the JPA connector, use the file system connector for the files, and wire them together through the federated connector to offer both as a whole to the next layer.

             

            If so, would this be fully operable (with no restrictions) through the REST interface?

             

            Using just the file system connector (exactly what I am doing now, since I commented out the JPA connector) would result in getting only the metadata coming from the file system, right? To get your suggestion working (the second option above), I would need to enable somewhere to store the metadata that is not available right now. That is why, despite having some sequencers up and running (I can see them in the administration console within JBoss), such data is stored nowhere and I do not get it in the JSON returned when querying the repository through curl. Is this so?

             

             

            Re-entering questions

            ----------------------------

             

            - Just to be sure: this http://modeshape.wordpress.com/2009/01/21/preview-of-upcoming-features/ (search for SHA-1 in the text) refers to not storing the same file twice once it is already "within ModeShape", I mean, already transferred over the network in my scenario, so no bandwidth is saved, right?

             

            - I assume that sequencers will extract metadata and store it in the JPA data source whether files are added through a Java interface (REST, WebDAV...) or placed directly in the file system, right?

             

            - Assume the clients of the JCR repository (let's forget about the end users for now; these are just instances of our own system) can compute the SHA-1 hash with no harm to performance (they will have access to the whole file to be uploaded anyway). What about adding a property that stores the hash when saving a node, and then using the search engine on later transfers to determine whether the file is already present (then create a link) or not (upload the file for the first time)? I assume that searching across the volume of files I mentioned would be expensive, and that the price of saving some transfers might be degraded performance from the beginning (a complex search for each and every transfer, right?). Would this last issue also apply to the retrieval by SHA-1 hash that you mentioned for ModeShape 3.0?
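
            To make the flow in the last question concrete, here is a minimal Java sketch of the check-before-upload idea. It is only an assumption of how it could look: DedupStore is an in-memory stand-in for the repository (its contains method plays the role of the search on the hash property), and none of the names below come from ModeShape or the JCR API. The store recomputes the hash on upload, so it does not have to trust the value sent by the client.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for the repository; all names are hypothetical. */
final class DedupStore {
    private final Map<String, byte[]> contentBySha1 = new HashMap<>(); // hash -> stored bytes
    private final Map<String, String> sha1ByPath = new HashMap<>();    // logical path -> hash
    private long bytesTransferred = 0;

    static String sha1Hex(byte[] content) {
        try {
            StringBuilder hex = new StringBuilder(40);
            for (byte b : MessageDigest.getInstance("SHA-1").digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stands in for the search "is there content whose hash property equals this value?". */
    boolean contains(String sha1) {
        return contentBySha1.containsKey(sha1);
    }

    /** First transfer: the bytes are sent, verified against the claimed hash, and stored. */
    void upload(String path, String claimedSha1, byte[] content) {
        bytesTransferred += content.length;
        if (!sha1Hex(content).equals(claimedSha1)) {
            throw new IllegalArgumentException("hash does not match content");
        }
        contentBySha1.put(claimedSha1, content.clone());
        sha1ByPath.put(path, claimedSha1);
    }

    /** Later transfers of the same content: only the hash is sent and a "link" is recorded. */
    void link(String path, String sha1) {
        if (!contains(sha1)) {
            throw new IllegalStateException("unknown hash: " + sha1);
        }
        sha1ByPath.put(path, sha1);
    }

    byte[] read(String path) {
        String sha1 = sha1ByPath.get(path);
        return sha1 == null ? null : contentBySha1.get(sha1).clone();
    }

    long bytesTransferred() {
        return bytesTransferred;
    }
}

/** Client side of the flow: hash locally, ask the store, then upload or link. */
final class DedupClient {
    /** Returns true if the content had to be transferred, false if it was only linked. */
    static boolean store(DedupStore store, String path, byte[] content) {
        String sha1 = DedupStore.sha1Hex(content);
        if (store.contains(sha1)) {
            store.link(path, sha1);
            return false;
        }
        store.upload(path, sha1, content);
        return true;
    }
}
```

            In this sketch the lookup is a plain map access; in the real repository its cost would depend on how the hash is stored and indexed, which is exactly the performance concern raised above.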

             

             

             

            Thanks for your attention in advance!!!