9 Replies Latest reply on Sep 27, 2011 4:23 AM by sverker

    Amazon S3 Connector

    sverker

      Hi

      I'm currently working on a ModeShape connector for the Amazon S3 service, based on the FileSystem connector. There is still some work to do: not all test cases pass yet, and some optimization work remains.

       

      My question: would you be interested in including this as one of the standard connectors in the ModeShape distribution once it is finished, and if so, what would the process be?

      /SVerker

        • 1. Re: Amazon S3 Connector
          rhauch

          Hi, Sverker. A ModeShape connector for Amazon S3 would indeed be very interesting, and we'd very much be interested in including it if you are willing to contribute it to the community and we're able to work out the testing challenges.

           

          For example, are your tests running directly against S3? Is there any way to test this without requiring direct S3 access? And how does S3 authorization work for builds, especially for automated CI builds on remote Hudson boxes or when somebody downloads the sources and tries to build? I suspect that none of these are roadblocks - I just haven't really spent much time thinking about what's needed.

           

          Also, we recently introduced the disk connector, which is far faster, more efficient, and more capable than the file system connector. The disk connector is able to store any and all content, whereas the file system connector is limited to just storing "nt:file" and "nt:folder" nodes. It may be just as easy to base your connector on the disk connector, and might be something to consider.

           

          WDYT?

          • 2. Re: Amazon S3 Connector
            sverker

            Hi Randall

            From my point of view I would like to contribute it, since the more it's used, the more its stability and performance will improve. Testing might prove a challenge, though: I'm currently not aware of a way to create a mock Amazon S3 service, so access to the real service is needed, which requires access credentials. For the unit tests I've put the credentials in a properties file for now, but that file can't be committed. How to solve that is something to think about.

             

            Interesting to hear about the disk connector. I'll take a look at it and see whether I should port my S3-specific parts over to it instead of basing them on the file system connector; how to handle node types other than nt:file and nt:folder is something I've been thinking about.

             

            /Sverker

            • 3. Re: Amazon S3 Connector
              rhauch

              I'll talk to some other folks in JBoss to see how they test with S3 or if they have any suggestions.

               

              As for credentials, you might consider parameterizing the properties file (in much the same way we do with the JPA connector's "database.properties" file). The "testResources" fragment in the connector's POM file copies this file during builds and replaces all Maven properties with their values before placing it in the target area. This way, you can specify default (perhaps empty) properties in the POM (or maybe even the parent POM) and even use the "-D" option on the command line to override the property values.
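
              A minimal sketch of how a test could consume such a filtered properties file. The property names "s3.accessKey" and "s3.secretKey" and the class name are hypothetical, not part of ModeShape; the idea is only that empty or still-unfiltered values mean "no credentials, skip the S3 tests".

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: reads S3 test credentials from a Maven-filtered properties file.
final class S3TestCredentials {
    final String accessKey;
    final String secretKey;

    private S3TestCredentials(String accessKey, String secretKey) {
        this.accessKey = accessKey;
        this.secretKey = secretKey;
    }

    /** Returns null when a value is empty or is still an unreplaced ${...} placeholder. */
    static S3TestCredentials fromProperties(Properties props) {
        String access = props.getProperty("s3.accessKey", "").trim(); // assumed property name
        String secret = props.getProperty("s3.secretKey", "").trim(); // assumed property name
        if (isUnset(access) || isUnset(secret)) {
            return null;
        }
        return new S3TestCredentials(access, secret);
    }

    private static boolean isUnset(String value) {
        return value.isEmpty() || value.startsWith("${");
    }

    /** Parses properties text; stands in for loading the filtered file from the test classpath. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

              A test's setup method could call fromProperties and skip the S3-backed cases when it returns null, so a plain checkout still builds without credentials.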

               

              Also, you may want to consider subclassing the disk connector to reduce the amount of duplicate code you'd have to maintain. If the methods aren't quite what you need for easy subclassing, feel free to make suggestions.

              • 4. Re: Amazon S3 Connector
                sverker

                I've now had a look at the disk connector and found this statement:

                 

                 

                content stored by this connector is not intended to be accessible to other applications unless they integrate with ModeShape to read the data.

                 

                One of my requirements for this connector is that the data can also be accessed through other tools, such as S3 Organizer, and the media files stored on S3 will also be exposed through CloudFront. That was why I chose the file system connector (also because it was the recommended approach in the reference guide, section 4.3.1). I liked the approach of keeping the files and folders in the standard file system and storing extra properties in a separate file; that way it's even possible to edit them manually if needed.

                 

                With that said, however, I see that the disk connector code is far simpler than the file system connector. It's based on the MapWorkspace, which should be a better match as S3 is also key-based. Would it be possible, without too much effort, to meet this requirement of working with the standard S3 data set in a connector based on the disk connector, or is it tied to using its own data format?

                • 5. Re: Amazon S3 Connector
                  rhauch

                  The connectors work in fundamentally different ways.

                   

                  Firstly, the disk connector stores each node in a file on disk, where the path of the file is derived from the node's UUID and not from the node's path. The file system connector, by contrast, stores each node on disk based upon the node's path: an "nt:folder" node is stored as a directory and an "nt:file" node is stored as a file.

                   

                  Secondly, the disk connector can store nodes of any type, whereas the file system connector can only store "nt:file" and "nt:folder" nodes, albeit with any properties. The "nt:file" node's content (that is, the "jcr:data" property on the "jcr:content" child) is actually stored in the backing file that has the same name. Standard "nt:file" and "nt:folder" properties are derived from the backing file (e.g., last modified, etc.), whereas the remaining properties are stored in another file adjacent to the actual file.

                   

                  Thirdly, the disk connector separates out "large values" and doesn't store them in the backing file for the node. Instead, it computes the SHA-1 of the value and stores these "large values" in a completely separate area keyed by their SHA-1. The file system connector has no need to do this, since it stores the "nt:file" node's content directly on disk.

                   

                  Both connectors can be subclassed to provide exactly the kind of functionality you want, but which you start with depends on what's important to you. For example, if you are just storing files and folders and indeed want to access the underlying files via S3 Organizer and CloudFront using the same logical paths seen in JCR, then the file system connector is probably a better fit. On the other hand, if you are storing content in JCR (and thus in S3) and could actually access the file content (using S3 Organizer and CloudFront) via the file's SHA-1, then the disk connector might actually work really well. I do have to say that I'd wager the file system connector is probably much closer to doing what you are trying to do: simply store/access the same files and directories stored in S3 and that you also are accessing via CloudFront.
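
                  To make the two addressing schemes concrete, here is a small sketch of how an S3 object key might be derived in each case. The class, the method names, and the "largeValues/" prefix are assumptions for illustration only; neither connector is claimed to lay out its storage this way.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch contrasting path-based and content-hash-based S3 keys.
final class S3KeySketch {

    /** File-system style: the key mirrors the JCR path, so other tools see the same logical path. */
    static String pathKey(String jcrPath) {
        // Drop the leading slash of the absolute JCR path.
        return jcrPath.startsWith("/") ? jcrPath.substring(1) : jcrPath;
    }

    /** Disk-connector style: the key is the SHA-1 of the content, independent of the node's path. */
    static String sha1Key(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return "largeValues/" + hex; // prefix is an assumption
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pathKey("/photos/2011/cat.jpg"));
        System.out.println(sha1Key("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

                  With the path-based key a CloudFront URL can be predicted from the JCR path alone; with the hash-based key a client has to obtain the SHA-1 first.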

                   

                  Again, some methods might be private or some functionality might not be extracted to easily overridden methods, but that's all easily changed. (You can make these changes, too; simply follow the ModeShape Development Workflow; I can help guide you if you want to do this.)

                  • 6. Re: Amazon S3 Connector
                    sverker

                    I already have an implementation based on the file system connector and am in the process of making sure all test cases pass. The code is a bit complicated in some places and the disk connector is much simpler, so I was considering whether it might be better to use that as a base for the next version. But I'm not so sure now that I would be able to maintain the data the way I want with that approach.

                    • 7. Re: Amazon S3 Connector
                      sverker

                      In my Workspace implementation (derived directly from FileSystemWorkspace) in the putNode method there are lines like this:

                       

                       

                       

                      if (JcrNtLexicon.FILE.equals(primaryType)) {

                       

                      I.e. it's looking specifically for nt:file. Likewise for the other supported node types nt:folder and nt:resource.

                       

                      Instead I want this code to check whether the primaryType is nt:file or a subtype of it. With the standard JCR API I would do that with javax.jcr.nodetype.NodeType.isNodeType(java.lang.String nodeTypeName), but as far as I can see that is not available to the connector.

                       

                      How can I solve this problem?

                      /Sverker

                      • 8. Re: Amazon S3 Connector
                        rhauch

                        Instead I want this code to check whether the primaryType is nt:file or a subtype of it. With the standard JCR API I would do that with javax.jcr.nodetype.NodeType.isNodeType(java.lang.String nodeTypeName), but as far as I can see that is not available to the connector.

                         

                        This is one of the shortcomings of the graph API - it doesn't know about node types. Of course, anything that's explicitly specified on the "jcr:primaryType" or "jcr:mixinTypes" properties will be readily accessible, but any implicit types (such as supertypes) are not currently accessible at the connector layer.

                         

                        There really are several ways to work around this. None of these is ideal, but I've listed them below in the order that I think is preferable:

                         

                        1. Use the 'nt:file' and 'nt:folder' node types for the primary type, but then extend your functionality through mixins. This is actually perfectly legit and some folks think this may actually be a more desirable design pattern. I tend to like this for a lot of situations, especially when using 'nt:unstructured' for a primary type.
                        2. Set a property on your connector that lists the node types that should be treated as files vs folders.
                        3. Explicitly specify a mixin on each node that gives your connector a hint as to what it is.
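
                        Options 2 and 3 could be combined roughly as follows. This is a sketch only: the class, its constructor arguments, and the example type names are hypothetical, and it relies on nothing more than the explicitly declared "jcr:primaryType" and "jcr:mixinTypes" values, since supertypes are not visible at the connector layer.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical classifier driven by connector configuration rather than node type inheritance.
final class NodeKindClassifier {
    enum Kind { FILE, FOLDER, OTHER }

    private final Set<String> fileTypes;
    private final Set<String> folderTypes;

    /** Each argument is a comma-separated list of type names, e.g. from a connector property. */
    NodeKindClassifier(String fileTypesCsv, String folderTypesCsv) {
        this.fileTypes = parse(fileTypesCsv);
        this.folderTypes = parse(folderTypesCsv);
    }

    private static Set<String> parse(String csv) {
        Set<String> names = new HashSet<>();
        for (String name : csv.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    /** Looks only at explicitly declared types; file types win if both lists match. */
    Kind classify(String primaryType, Collection<String> mixinTypes) {
        Set<String> declared = new HashSet<>(mixinTypes);
        declared.add(primaryType);
        for (String type : declared) {
            if (fileTypes.contains(type)) return Kind.FILE;
        }
        for (String type : declared) {
            if (folderTypes.contains(type)) return Kind.FOLDER;
        }
        return Kind.OTHER;
    }
}
```

                        A check such as JcrNtLexicon.FILE.equals(primaryType) would then become a call like classify(...) == Kind.FILE, with the type lists supplied in the connector's configuration.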

                         

                        Sorry I didn't have better news.

                         

                        Best regards,

                         

                        Randall

                        • 9. Re: Amazon S3 Connector
                          sverker

                          I see..

                           

                          I think I'll use a mix of the above. I'll add most of my extra functionality as mixins, which would actually be a better design pattern, and then I'll add configuration properties for the node types that should be treated as nt:folder, nt:file or nt:resource, where nt:folder will be the default type unless otherwise specified. I was thinking of treating any subtype of nt:hierarchyNode as a folder unless it's specified that it should be nt:file or nt:resource, but since I can't know whether a type is a subtype, that won't work. The approach above will work fine.

                           

                          Any node types other than the above will have their properties stored in files ending with .modeshape.xml or .modeshape.properties (I'm not sure whether the format should be XML or just a plain properties file; maybe I'll make it configurable).
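
                          A rough sketch of how such sidecar names might be derived with the format left configurable. Only the two suffixes come from the post above; the class, enum, and method names are hypothetical.

```java
// Hypothetical helper for naming the extra-properties ("sidecar") objects.
final class SidecarNames {
    enum Format {
        XML(".modeshape.xml"),
        PROPERTIES(".modeshape.properties");

        final String suffix;

        Format(String suffix) {
            this.suffix = suffix;
        }
    }

    /** Key of the sidecar object that holds the extra properties for the given object key. */
    static String sidecarKey(String objectKey, Format format) {
        return objectKey + format.suffix;
    }

    /** True if the key names a sidecar object, so listings can hide it from the JCR view. */
    static boolean isSidecar(String key) {
        for (Format format : Format.values()) {
            if (key.endsWith(format.suffix)) {
                return true;
            }
        }
        return false;
    }
}
```

                          Recognizing both suffixes when listing keeps older sidecars hidden even if the configured format is changed later.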