7 Replies Latest reply on Aug 6, 2012 10:59 AM by rhauch

    Infinispan Binary Store - Preview

    tfromm

      Hi,

       

      I've agreed rhauch to provide an BinaryStore impl for Infinispan.

      Today I've commited at my branch a first shot:

       

      https://github.com/kosch/modeshape/commit/2cfe15c78b3a60ab82c15dfa46afd01c3e9437f5

       

      To the impl:

      As you can see I've not used GridFS of Infinispan at least due lack of capability of storing additional properties.

      However I use the same concept: One cache for the metadata of the blob, 2nd cache for the blob itself divided into chunks. The default chunk size is now 5 MB, means an entry at the blob cache can have a size up to 5MB. Maybe reduce the size to 1MB? Maybe easiert to calculate when BLOB-Cache has eviction configuration... (until Infinispan provides size-based eviction :-)

       

      Several small TODOs are still placed inside of the source, most of them because I need to know additional details:

      - BinaryStore.storeValue: Should the extraction of text initial also executed here?

      - Has modeshape some kind of central tmp directory except the default on uses in the jvm?

      Most of the others will be answered when MODE-1586 is fixed.

       

      To the tests:

      I've started to implement AbstractBinaryStoreTest to collect all tests and to be able to execute them for different BinaryStore Impls. Since most of the tests are for all BinaryStores the same and at the end, I needed an easy way to test the InfinipspanBinaryStore with different Infinispan configurations. Maybe DatabaseBinaryStore and so on can be migrated to extends also from AbstractBinaryStore...

       

      The biggest point still open: The integration into Modeshape configuration. Maybe this is better to discussed in #modeshape. I need at least 2 cache names.

       

      Since I want to get most of the TODOs and other open points done before I create a pull request, I'd like to have some input from you at the early point.

       

      Any comment appreciated :-)

       

      --tf

        • 1. Re: Infinispan Binary Store - Preview
          hchiorean

          Some questions/answers:

           

          • regarding the testing, we have this: https://issues.jboss.org/browse/MODE-1590 which will hopefully unify all the testing for the various binary stores implementations and provide an easy way to use different configurations
          • what kind of data would the "metadata-cache" store  for a binary value ?
          • storeValue does trigger text extraction, but this is not something that different binary store implementations should worry about. The text extraction process should work the same way, regardless of the implementation, by extending AbstractBinaryStore. The different implementations should only "worry" about the storeExtractedText and getExtractedText methods. I guess this shows that there is a certain degree of "responsibility ambiguity" among the BinaryStore interface and AbstractBinaryStore class.
          • 2. Re: Infinispan Binary Store - Preview
            rhauch

            Thomas Fromm wrote:

             

            Hi,

             

            I've agreed rhauch to provide an BinaryStore impl for Infinispan.

            Today I've commited at my branch a first shot:

             

            https://github.com/kosch/modeshape/commit/2cfe15c78b3a60ab82c15dfa46afd01c3e9437f5

             

            Very nice. BTW, you can create a pull-request and in the description say that you're seeking feedback and that it's not ready to be merged. The pull-request will be automatically updated when you make changes locally and either push new commits to the same branch or amend and push the last commit (effectively rewriting it). The benefit of this is that it's easier to get line-by-line comments (e.g., what GitHub calls "line notes").

             

             

            To the impl:

            As you can see I've not used GridFS of Infinispan at least due lack of capability of storing additional properties.

            However I use the same concept: One cache for the metadata of the blob, 2nd cache for the blob itself divided into chunks. The default chunk size is now 5 MB, means an entry at the blob cache can have a size up to 5MB. Maybe reduce the size to 1MB? Maybe easiert to calculate when BLOB-Cache has eviction configuration... (until Infinispan provides size-based eviction :-)

            That looks good. The Infinispan folks describe GridFS as a "prototype", so I think it's certainly okay to do it the way you did. Regarding eviction, can't we use a LRU strategy for the cache. Yes, it requires a bit of computation, but I think people should plan on tuning the configuration (and the chunk size, the number of entries, eviction strategy, etc.) for their environment. (Once we start performance optimization, we may find that certain values may be optimal for lots of situations, in which case we can change the defaults).

             

             

            The biggest point still open: The integration into Modeshape configuration. Maybe this is better to discussed in #modeshape. I need at least 2 cache names.

            It's okay to have two cache names; we do that when configuring ModeShape to store the Lucene indexes in Infinispan; see here for the lines in our JSON Schema. Any chance that two of those three cache names would work for the Infinispan binary store?

            • 3. Re: Infinispan Binary Store - Preview
              rhauch

              - Has modeshape some kind of central tmp directory except the default on uses in the jvm?

              Considering that both the DatabaseBinaryStore, InfinispanBinaryStore, and the MongodbCacheStore all need some form of temp space to write the content before storing it (e.g., to compute the BinaryKey), perhaps we need a new subclass of AbstractBinaryStore that encapsulates this. Not sure if that would really simplify all three, but it might be worth considering.

              • 4. Re: Infinispan Binary Store - Preview
                rhauch
                • storeValue does trigger text extraction, but this is not something that different binary store implementations should worry about. The text extraction process should work the same way, regardless of the implementation, by extending AbstractBinaryStore. The different implementations should only "worry" about the storeExtractedText and getExtractedText methods. I guess this shows that there is a certain degree of "responsibility ambiguity" among the BinaryStore interface and AbstractBinaryStore class.

                It does seem like we need to refactor the base class. If we really think the methods implemented in AbstractBinaryStore will never need to be overridden, then the easiest approach might be to simply mark them as "final".

                 

                Alternatively, we could extract the basic logic operations (e.g., storeExtractedText and getExtractedText) used in the AbstractBinaryStore to a separate interface/class, which the concrete subclass of AbstractBinaryStore provides in the constructor. This is less straightforward.

                 

                Any other ideas? On one hand, I'd like to improve this framework; on the other, I don't know that we'll really have that many more implementations. (Perhaps I'm wrong.)

                • 5. Re: Infinispan Binary Store - Preview
                  tfromm
                  • regarding the testing, we have this: https://issues.jboss.org/browse/MODE-1590 which will hopefully unify all the testing for the various binary stores implementations and provide an easy way to use different configurations

                   

                  Ok, hm. Nobody works on it, so the AbstractBinaryStoreTest could be used to fix also this issue. Need just to extend the current tests (some are still missing) and port the other BinaryStore Impls.

                   

                   

                  • what kind of data would the "metadata-cache" store  for a binary value ?

                  The Metadata object holt in the cache contains stuff like unused timestamp, last modified, size, mimetype. I assume the cache for the metadata will be in the most cases a replicated cache to provide fast access to the metadata of an binary entry.

                  • 6. Re: Infinispan Binary Store - Preview
                    hchiorean

                    It seems to me as the preemptive text extraction is causing these problems: the storeValue method is something each type of binary store should implement, while on the other hand, the text extraction mechanism should always be triggered. I think we might need to either extract another class (private to the -jcr module) and use that or somehow "minimize" the scope of the BinaryStore interface

                    • 7. Re: Infinispan Binary Store - Preview
                      rhauch
                      • regarding the testing, we have this: https://issues.jboss.org/browse/MODE-1590 which will hopefully unify all the testing for the various binary stores implementations and provide an easy way to use different configurations

                       

                      Ok, hm. Nobody works on it, so the AbstractBinaryStoreTest could be used to fix also this issue. Need just to extend the current tests (some are still missing) and port the other BinaryStore Impls.

                      We just logged the issue this past Thursday. And actually MODE-1590 deals just as much with the Maven environment and being able to specify the repository configuration used for the unified BinaryStore tests as it does refactoring the current tests into a single BinaryStoreTest class (or whatever it's called). Note that each impl may still want some unit tests, but basic functionality should be tested with the BinaryStoreTest class.

                       

                      • what kind of data would the "metadata-cache" store  for a binary value ?

                      The Metadata object holt in the cache contains stuff like unused timestamp, last modified, size, mimetype. I assume the cache for the metadata will be in the most cases a replicated cache to provide fast access to the metadata of an binary entry.

                      I agree - the main reason for breaking the "metadata" cache into a separate cache is so that it can be replicated, whereas the regular "data" cache will have different modes (e.g., perhaps replicated in a small-ish cluster, and distributed for almost every other configuration).