11 Replies Latest reply on Mar 20, 2008 7:05 AM by Manik Surtani

    Amazon S3 cache loader

    Elias Ross Master

      I've been playing around with the Amazon S3 service. For those of you unfamiliar, it's a distributed, reliable storage solution that's a paid service from Amazon.com.

      I thought it'd be pretty neat to be able to use the JBoss Cache API for accessing and storing objects from the service.

      I'm not sure this would be of any use to JBoss itself but I wouldn't mind putting an implementation together.

      There are both REST and SOAP APIs. The existing REST library Amazon provides is fairly crude but could be used as a basis. Where would this code get checked in? And where would the cache loader itself belong?

        • 1. Re: Amazon S3 cache loader
          Manik Surtani Master

          I'm happy to put this in the distro for now - perhaps later set up a separate repo for contribs like this.

          • 2. Re: Amazon S3 cache loader
            Elias Ross Master


            I committed the code. I'm not sure about the package/artifact naming, as it's really JBoss's version of the Amazon access classes. I'm pretty sure, though, that it doesn't belong under com.amazon, as I changed the "free" library quite a bit.

            • 3. Re: Amazon S3 cache loader
              Elias Ross Master

              I got the basic implementation done, though there are still some problems.

              I don't think the Amazon S3 API aligns well with the CacheLoader interface. First of all, there are too many "merge" operations, such as put(Map) or put(key, value), which require an additional HTTP request to fetch the old Map. I came up with an option to get rid of merging, so put(Map) always overwrites the old values.
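To make the cost of merging concrete, here is a minimal sketch that counts simulated requests for a merging put versus an overwriting put. The class name `PutModes`, the `merge` flag, and the in-memory map standing in for S3 are all assumptions made for illustration; none of them come from the actual loader.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an S3-backed store; not the real cache loader.
final class PutModes {
    private final Map<String, Map<String, Object>> store = new HashMap<>();
    private final boolean merge; // hypothetical switch: true = merge, false = overwrite
    int requests = 0;            // number of simulated HTTP requests

    PutModes(boolean merge) {
        this.merge = merge;
    }

    void put(String key, Map<String, Object> attrs) {
        Map<String, Object> next = new HashMap<>();
        if (merge) {
            requests++; // simulated GET to fetch the old map
            Map<String, Object> old = store.get(key);
            if (old != null) {
                next.putAll(old);
            }
        }
        next.putAll(attrs);
        requests++; // simulated PUT of the resulting map
        store.put(key, next);
    }

    Map<String, Object> get(String key) {
        return store.get(key);
    }
}
```

With merging on, two puts to the same key cost four simulated requests and keep both attributes; with merging off they cost two and only the last map survives.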

              But the biggest problem has to do with getChildrenNames() and how nodes are created. S3 allows keys to be listed by prefix, in lexicographic order. So, here's an example dump of S3 keys after put("/a/b/c"), put("/a/b/d"), etc.:

              3/a/b/c
              3/a/b/d
              4/a/b/c/1
              4/a/b/c/2
              ...
              4/a/b/c/1000

              So, to get children of /a/b, I look for nodes prefixed with the string "3/a/b" ...

              The "3" comes from the depth of the child nodes. Without the depth in the key, a prefix query for /a/b would also return all the deeper nodes /a/b/c/1 ... 1000. This wouldn't be great.
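The depth-prefixed key scheme described above can be sketched roughly as follows. This is an illustration only: the class `DepthKeys`, its method names, the use of plain strings instead of Fqn objects, and the trailing "/" on the child prefix (added so that a sibling such as /a/bc would not match) are assumptions, and the key listing is passed in as a plain collection rather than fetched from S3.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; plain strings stand in for JBoss Cache Fqn objects.
final class DepthKeys {
    private DepthKeys() {}

    // Number of path elements: "/" -> 0, "/a/b/c" -> 3.
    static int depth(String fqn) {
        return fqn.equals("/") ? 0 : fqn.split("/").length - 1;
    }

    // Key for a node: its depth followed by its path, e.g. "3/a/b/c".
    static String key(String fqn) {
        return depth(fqn) + fqn;
    }

    // Prefix shared by the keys of all direct children of fqn.
    static String childPrefix(String fqn) {
        String base = fqn.equals("/") ? "" : fqn;
        return (depth(fqn) + 1) + base + "/";
    }

    // Reduces a key listing to the names of the direct children of fqn.
    static List<String> childrenNames(String fqn, Iterable<String> keys) {
        String prefix = childPrefix(fqn);
        List<String> names = new ArrayList<>();
        for (String k : keys) {
            if (k.startsWith(prefix)) {
                names.add(k.substring(prefix.length()));
            }
        }
        return names;
    }
}
```

Given the keys 3/a/b/c, 3/a/b/d, 4/a/b/c/1 and 4/a/b/c/2, asking for the children of /a/b matches only the two keys at depth 3 and yields c and d.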

              So, the node depth works well, except how can I get the children of root? I look for nodes prefixed with 1/, but there are no such nodes. The solution is to create 1/a and 2/a/b when creating 3/a/b/c. But ensuring they exist means extra HTTP requests for every put. It's doable but sucks. There are also all sorts of possible race conditions for concurrent remove/put.

              I'm wondering if I should just give up on this.
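For reference, the placeholder keys that would have to exist before a node is written can be derived from its path alone. The sketch below assumes plain string paths; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the ancestor keys that must exist for a node.
final class AncestorKeys {
    private AncestorKeys() {}

    static List<String> ancestorKeys(String fqn) {
        List<String> keys = new ArrayList<>();
        String[] parts = fqn.split("/"); // "/a/b/c" -> ["", "a", "b", "c"]
        StringBuilder path = new StringBuilder();
        // Stop before the last element: that one is the node itself.
        for (int depth = 1; depth < parts.length - 1; depth++) {
            path.append('/').append(parts[depth]);
            keys.add(depth + path.toString());
        }
        return keys;
    }
}
```

For /a/b/c this produces 1/a and 2/a/b, so a naive put at depth n would issue n - 1 extra writes unless something remembers which placeholders already exist.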

              • 4. Re: Amazon S3 cache loader
                Elias Ross Master

                I figured out how to optimize it, though there are race conditions when sharing caches, but the whole thing is full of potential race conditions. :-)

                I optimized put() by creating a cheap cache of all the dummy nodes created by previous put() operations. This seemed to work well enough to pass all the CacheLoader tests.
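One way such a cache of already-created nodes might look is sketched below. It is a guess at the idea, not the committed code: the class name, the unbounded HashSet, and the request counter standing in for real HTTP calls are all assumptions. As noted above, a local set like this goes stale if another cache instance removes nodes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: remembers which keys this instance has already written.
final class PlaceholderCache {
    private final Set<String> created = new HashSet<>();
    int requests = 0; // number of simulated HTTP PUTs

    void put(String fqn) {
        String[] parts = fqn.split("/"); // "/a/b/c" -> ["", "a", "b", "c"]
        StringBuilder path = new StringBuilder();
        for (int depth = 1; depth < parts.length; depth++) {
            path.append('/').append(parts[depth]);
            String key = depth + path.toString();
            boolean isTarget = depth == parts.length - 1;
            boolean firstTime = created.add(key);
            // The target node is always written; an ancestor placeholder
            // is written only if this instance has not seen it before.
            if (isTarget || firstTime) {
                requests++;
            }
        }
    }
}
```

After put("/a/b/c") and put("/a/b/d") the counter is at four simulated writes rather than six, because 1/a and 2/a/b are written only once.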

                By the way, to test the darn thing, you have to have an Amazon S3 account.

                • 5. Re: Amazon S3 cache loader
                  Manik Surtani Master

                   

                  "genman" wrote:
                  I figured out how to optimize it, though there are race conditions when sharing caches, but the whole thing is full of potential race conditions. :-)


                  You could mitigate some race conditions by using the StripedLock class in the cache loader, but this will only prevent races within a single cache loader instance. In a cluster, we'd need some sort of distributed locking to do this - either that, or for S3 to support transactions. :-)
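The general idea behind lock striping can be sketched as below. This is a simplified stand-in written for illustration and is not the JBoss Cache StripedLock class or its API; the class name, the method names, and the use of string paths are assumptions. Like the real thing, it only protects callers inside one JVM.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative stand-in for lock striping; not org.jboss.cache's StripedLock.
final class StripedLockSketch {
    private final ReentrantReadWriteLock[] stripes;

    StripedLockSketch(int stripeCount) {
        stripes = new ReentrantReadWriteLock[stripeCount];
        for (int i = 0; i < stripeCount; i++) {
            stripes[i] = new ReentrantReadWriteLock();
        }
    }

    // Maps a path to one of a fixed number of locks via its hash code.
    private ReentrantReadWriteLock stripeFor(String fqn) {
        return stripes[(fqn.hashCode() & 0x7fffffff) % stripes.length];
    }

    void acquire(String fqn, boolean exclusive) {
        ReentrantReadWriteLock lock = stripeFor(fqn);
        if (exclusive) {
            lock.writeLock().lock();
        } else {
            lock.readLock().lock();
        }
    }

    void release(String fqn, boolean exclusive) {
        ReentrantReadWriteLock lock = stripeFor(fqn);
        if (exclusive) {
            lock.writeLock().unlock();
        } else {
            lock.readLock().unlock();
        }
    }

    boolean isWriteLocked(String fqn) {
        return stripeFor(fqn).isWriteLocked();
    }
}
```

A loader would acquire the stripe exclusively around a put or remove and release it in a finally block, so that two threads in the same instance cannot interleave writes to the same path. Unrelated paths may share a stripe, which costs some concurrency but keeps the number of locks fixed.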

                  "genman" wrote:

                  By the way, to test the darn thing, you have to have an Amazon S3 account.


                  Hmm, a good reason then to have the tests disabled by default for automated test runs, etc.

                  • 6. Re: Amazon S3 cache loader
                    Elias Ross Master

                    Replying to the issue comments...

                    I added pooled HTTP connections. Didn't help performance, though.

                    I will add documentation.

                    Not sure I have the time to add an S3 emulator.

                    • 7. Re: Amazon S3 cache loader
                      Elias Ross Master

                      Manik,

                      One big thing is I don't know what to do with the svn:/jbosscache/amazon-s3 library. It needs to be built/tagged and put in the JBoss Maven repository for the JBossCache build to work.

                      Also, we may want to change the package name from "com.amazon.s3" to something else. Yes, the code was originally from Amazon but it's been significantly rewritten.

                      • 8. Re: Amazon S3 cache loader
                        Manik Surtani Master

                        If the tests written for it can run against an emulator, so that they can run along with the JBC-core unit tests, I'm happy to include the cache loader in core. It will then be built/tagged/distributed along with JBC-core.

                        Regarding package names: org.jboss.cache.loader?

                        • 9. Re: Amazon S3 cache loader
                          Elias Ross Master

                          The emulator is a tall order, though I will see if I can come up with a simple one that covers the use cases of the cache loader.

                          The S3 library is about 30 or so classes, and the loader, which is separate, is only 2 or 3 classes. I don't want to bloat the cache distro with all that if possible.

                          Will you be willing to help me with getting the S3 library built/deployed on the JBoss repository once the emulator is done?

                          • 10. Re: Amazon S3 cache loader
                            Elias Ross Master

                            Forget putting my S3 library on the JBoss repository. I got my personal repository set up. And I got a half-decent emulator running and the tests pass. :-)

                            I'm going ahead with the check in.

                            • 11. Re: Amazon S3 cache loader
                              Manik Surtani Master

                              I was going to say, such an emulator would probably find itself quite popular among folk developing stuff for S3. You should host it as a separate library - ping the JBoss Labs folk if you want to host it there, otherwise use one of the myriad other OSS forges around.