3 Replies Latest reply on Jun 30, 2013 1:26 PM by rhauch

    A connector for a file system with thousands of files in subfolders

    weljo_web

      Hi,

       

      I just want to get your thoughts before we proceed with developing our own custom connector.

       

      We have a legacy file-system-based repository that we want to make accessible as part of our ModeShape repository.  The obvious choice was to use the FileSystemConnector provided out of the box.  However, it is taking too long to access a file in the external repository.  Upon investigating the FileSystemConnector, we found that for each folder in the file path it retrieves all the contents of the folder for caching.  We're talking about 40,000+ files within a subfolder.  We tried setting cacheTtlSeconds, but since the subfolder is cached, a file added to it won't be available until the TTL expires.  Setting inclusion filters did little to improve performance, as most of the items within the subfolder are valid.

       

      I tried creating my own version of the FileSystemConnector, tweaking getDocumentById( String id ) so that it won't traverse the files inside the subfolder, but somewhere up the stack caching takes place, so subsequent calls to the folder return a childless node.  I'm currently looking at how the Pageable interface works (as in the GitConnector) to see if that is the implementation to follow.


      thanks in advance for your suggestions!

       

      regards,

      Joel

        • 1. Re: A connector for a file system with thousands of files in subfolders
          rhauch

          Yes, the connector should implement Pageable, which is the mechanism designed to prevent ModeShape from loading *all* of the children in the parent when the parent node is materialized. We should change our FileSystemConnector to implement Pageable, so feel free to add an enhancement request to our JIRA. We'd welcome any contribution for it, too (if you're interested/able).

           

          The GitConnector is the best example of a Pageable connector, though the concept is pretty straightforward. Just remember that you probably do want to load at least some number of child references when constructing the document for a parent node, although how many is a tradeoff. Loading just a small number (or even 0) child references will make it faster to create the parent node document, but any subsequent access of the child references (e.g., due to client iteration over some children) will require an additional trip to the connector.
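
          If it helps, here is a rough sketch of what a paging getChildren(...) could look like for a file-system connector. Treat it as pseudocode against the federation SPI rather than working code: the helpers fileFor(...), idFor(...), and newPageDocument(...), and some of the exact PageKey/PageWriter method names, are approximations and may differ from the real classes.

          import java.io.File;

          import org.infinispan.schematic.document.Document;
          import org.modeshape.connector.filesystem.FileSystemConnector;
          import org.modeshape.jcr.federation.spi.PageKey;
          import org.modeshape.jcr.federation.spi.PageWriter;
          import org.modeshape.jcr.federation.spi.Pageable;

          // Rough sketch only: a connector that returns child references one page at a time.
          // The fileFor(..), idFor(..), and newPageDocument(..) helpers are *assumed* here and
          // may not match the real FileSystemConnector/Connector methods.
          public class PagedFileSystemConnector extends FileSystemConnector implements Pageable {

              private static final int PAGE_SIZE = 200; // child references per page; tune for your data

              @Override
              public Document getChildren( PageKey pageKey ) {
                  String parentId = pageKey.getParentId();
                  File folder = fileFor(parentId);              // assumed helper: resolve the folder for this id
                  String[] childNames = folder.list();          // names only; the children themselves are not read
                  if (childNames == null) childNames = new String[0];
                  int offset = (int) pageKey.getOffsetInt();

                  PageWriter writer = newPageDocument(pageKey); // assumed helper that starts a page document
                  int end = Math.min(offset + PAGE_SIZE, childNames.length);
                  for (int i = offset; i < end; i++) {
                      File child = new File(folder, childNames[i]);
                      writer.addChild(idFor(child), childNames[i]); // add only a child *reference*
                  }
                  if (end < childNames.length) {
                      // More children remain: record a pointer to the next page instead of loading them all
                      writer.addPage(parentId, end, PAGE_SIZE, childNames.length);
                  }
                  return writer.document();
              }
          }

          The important part is the final addPage(...) call: the parent's document only ever references the next page, so materializing the parent never pulls in all 40,000+ children at once.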

          • 2. Re: A connector for a file system with thousands of files in subfolders
            weljo_web

            Hi Randall,

             

            I created MODE-1982 to make the FileSystemConnector Pageable.  I also created MODE-1983 to suggest that JcrSession.cachedNode use WorkspaceCache.getChildReference(parentKey, nodeKey) as an alternative to node.getChildReferences(cache).getChild(segment) when looking up child nodes.  This is to avoid iterating over thousands of ChildReference instances to find the one matching the segment.

             

            regards,

            Joel

            • 3. Re: A connector for a file system with thousands of files in subfolders
              rhauch

              Thanks for logging those requests. Making the FileSystemConnector implement Pageable is a great idea, and something we'll definitely do.

               

              However, I'm not convinced that MODE-1983 is the correct change, but we'll definitely take a look. Whenever we have the child's key, we can efficiently obtain the child by using `WorkspaceCache.getChildReference(parentKey, nodeKey)`. However, when we only have the name of the child (optionally including the SNS index, which is assumed to be 1 unless another is specified), the only option is to call `node.getChildReferences(cache).getChild(segment)`.

               

              The bottom line is that I'm not sure that the suggestion will actually help. Anytime we have the key we should be directly looking it up by the key, but if we only have the name we have no choice but to use the second form.
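
              To make the two forms concrete (a minimal sketch; how the cache, parentKey, childKey, parent, and segment objects are obtained is assumed):

              // Sketch only: the two lookup forms discussed above, assuming the surrounding
              // objects (cache, parentKey, childKey, parent, segment) are already in hand.

              // 1) We already know the child's key: direct lookup by key.
              ChildReference byKey = cache.getChildReference(parentKey, childKey);

              // 2) We only know the child's name (and optional SNS index): look up by segment.
              ChildReference byName = parent.getChildReferences(cache).getChild(segment);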

               

              BTW, both methods load the child by materializing the parent node, since the parent is where all of the ChildReference instances (each holding the key of the child and its name, without the SNS index) are kept.

               

              • The `WorkspaceCache.getChildReference(parentKey, nodeKey)` method basically gets the parent document and then walks through the ChildReference objects looking for one with a matching key. (This is done to ensure that the node is actually a child of the parent; we used to skip this check, but that resulted in problems when one Session was reading the children of a node while another thread moved or removed a child.)
              • The `CachedNode.getChildReferences(cache)` method merely gets the parent's ChildReferences (plural) container and looks for a ChildReference instance that has a matching name and SNS index.

               

              The representation of the node inside the WorkspaceCache makes these operations more efficient than you might think. First of all, the cached representation (e.g., CachedNode) contains a ChildReferences implementation that, depending upon how the children are stored, will store enough information to quickly find nodes by ID and by name. Most of our implementations are in ImmutableChildReferences, including:

               

              1. the Medium implementation, which has a map of children by key and a multi-map keyed by name. (All of the children with the same name are kept in the multi-map under that name; the first has an SNS index of 1, the second 2, the third 3, etc. This is why we don't actually store the SNS index anywhere: it's implicit in the order of the children. See the sketch after this list.)
              2. a Federated implementation of ChildReferences that is used only for federated nodes (nodes that have both internal and external children)
              3. a Segment implementation of ChildReferences that is used when a connector is Pageable and when an internal node document stores only some of the children on the parent, with the remainder stored in additional segments (e.g., other documents inside the Infinispan cache). This implementation knows how to work efficiently with large numbers of children stored in different segments.
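
              To illustrate the Medium case, here is a simplified, stand-alone sketch of the two indexes using plain Java collections (not the actual ImmutableChildReferences code):

              import java.util.ArrayList;
              import java.util.HashMap;
              import java.util.List;
              import java.util.Map;

              // Conceptual illustration only, not the real ImmutableChildReferences.Medium class.
              // Children are indexed twice: by key for direct lookups, and by name for name/SNS lookups.
              class MediumLikeChildReferences {

                  static class ChildRef {
                      final String key;
                      final String name;
                      ChildRef( String key, String name ) { this.key = key; this.name = name; }
                  }

                  private final Map<String, ChildRef> byKey = new HashMap<String, ChildRef>();
                  private final Map<String, List<ChildRef>> byName = new HashMap<String, List<ChildRef>>();

                  void add( ChildRef ref ) {
                      byKey.put(ref.key, ref);
                      List<ChildRef> sameName = byName.get(ref.name);
                      if (sameName == null) {
                          sameName = new ArrayList<ChildRef>();
                          byName.put(ref.name, sameName);
                      }
                      sameName.add(ref); // insertion order of same-named children defines their SNS indexes
                  }

                  ChildRef getByKey( String key ) {
                      return byKey.get(key); // O(1) when the key is known
                  }

                  ChildRef getByNameAndSns( String name, int snsIndex ) {
                      List<ChildRef> sameName = byName.get(name);
                      // The SNS index is never stored: the first child with this name is index 1,
                      // the second is index 2, and so on.
                      if (sameName == null || sameName.size() < snsIndex) return null;
                      return sameName.get(snsIndex - 1);
                  }
              }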

               

              Right now, all internal nodes are stored with all child references inside the parent's document, and thus the Medium implementation is used for all internal nodes. We have code that can run in a background thread to walk through all persisted documents, find nodes that contain lots of child references, and optimize them by breaking up the child references into multiple segments. When this happens and such a document is materialized, the Segment implementation is used.

               

              However, at this time we've not enabled the optimization algorithm, primarily because we've not yet had time to tune it to see what segment sizes are ideal. If the segments are too small, ModeShape will need to materialize a larger number of small documents. If the segments are too big, ModeShape will materialize fewer documents, but working with each one will be more expensive.
