5 Replies Latest reply on Jun 25, 2009 3:08 PM by rhauch

    JCR interface to local filesystem

    sverker

      I have a question about a use case and whether it would be possible to implement with DNA:

      I have files stored in a local file system, and I want a JCR interface to work with this file system. I believe the File System Connector is intended exactly for that; it's still read-only, but that can be solved.

      More tricky, though, is that I'd like the JCR implementation to be able to access (i.e. read, write and index for search) the metadata of the files. It seems to me that the Sequencing framework is not doing quite that; instead it is intended to be used when files are put into the repository, to extract metadata from the binary content and add it to the repository as JCR properties. Is it possible to instead directly access the metadata of the file and use the sequencer to index this information?

      I also want other applications to be able to work directly on the files in the local file system, adding, changing and removing files and metadata, so the JCR implementation can't create its own data structure to store these files as Jackrabbit does (Jackrabbit can store the files on the local file system, but in an internal data format).

      Would this kind of use case be possible to implement with the DNA JCR implementation?

        • 1. Re: JCR interface to local filesystem
          rhauch

          This use case is indeed one that DNA is targeting: providing a full JCR API onto information stored outside of a JCR repository. In this case the information is stored on your local file system, and the File System Connector plays a central role.

          The File System Connector is currently read-only, and that will hopefully change soon. (BTW, this is a great task for a new member of the community.) Then the question becomes how to expose the file metadata.

          I'm curious as to what kind of file metadata you're interested in. Certainly file-level metadata (the information owned and managed by the file system) should be available natively through the connector; if the connector is missing something, please let us know.

          Alternatively, there's metadata contained in the files themselves. An example we often use is the image metadata associated with JPG, PNG and other image files. The sequencing system is designed to extract this metadata from the content and place it somewhere in the repository (in a place that is configurable).
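          As a rough illustration of the extract-metadata-from-content idea (DNA's real sequencer SPI has a different, richer signature; the class and property names here are invented), a sequencer can be thought of as something that reads a binary stream and emits properties:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified sequencer: reads a PNG header and emits
// width/height as properties. This is NOT DNA's StreamSequencer API;
// it only illustrates extracting metadata from binary content.
public class PngSizeSequencer {

    public Map<String, Object> sequence(InputStream stream) throws IOException {
        byte[] header = new byte[24];
        int read = 0;
        while (read < header.length) {
            int n = stream.read(header, read, header.length - read);
            if (n < 0) throw new IOException("truncated PNG header");
            read += n;
        }
        // The PNG signature is 8 bytes; the IHDR chunk that follows stores
        // width and height as big-endian 32-bit integers at offsets 16 and 20.
        int width = readInt(header, 16);
        int height = readInt(header, 20);
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("image:width", width);
        props.put("image:height", height);
        return props;
    }

    private static int readInt(byte[] b, int off) {
        return ((b[off] & 0xFF) << 24) | ((b[off + 1] & 0xFF) << 16)
             | ((b[off + 2] & 0xFF) << 8) | (b[off + 3] & 0xFF);
    }
}
```

          The framework would then write these properties to a configurable location in the repository.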

          At the moment, sequencing is keyed by changes in the content, and this currently happens when the content is changed through DNA. We'd like to change that so that, as new sources are connected, they are essentially walked to run the appropriate sequencers as well as update the search indexes. We still have some work to do on this front, and it may involve some process that scans/walks the content of the connector and performs sequencing and/or indexing for search.

          However, you also mention that you'd like the metadata to be stored directly on the file system (I presume adjacent to the actual files). With this requirement, an interim solution could be to use a custom connector (perhaps subclassing the File System Connector to inherit the existing functionality). Basically, the extra properties and nodes could be managed within local files. This connector could even run the sequencers to extract meaningful metadata from the files.
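          To make that idea concrete, here is a hypothetical sketch (none of these class names come from DNA) of how such a custom connector could keep extra properties in a .properties file adjacent to each content file, so other applications still see plain files on disk:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical sidecar store a custom connector could use: the extra
// JCR-style properties for "song.mp3" live in an adjacent file named
// "song.mp3.properties", leaving the content files untouched.
public class SidecarPropertyStore {

    static Path sidecarFor(Path file) {
        return file.resolveSibling(file.getFileName() + ".properties");
    }

    public Properties read(Path file) throws IOException {
        Properties props = new Properties();
        Path sidecar = sidecarFor(file);
        if (Files.exists(sidecar)) {
            try (Reader r = Files.newBufferedReader(sidecar)) {
                props.load(r);
            }
        }
        return props;
    }

    public void write(Path file, Properties props) throws IOException {
        try (Writer w = Files.newBufferedWriter(sidecarFor(file))) {
            props.store(w, "extra properties managed by the connector");
        }
    }
}
```

          A real connector would map these properties onto nodes, but the on-disk layout stays readable by other tools.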

          BTW, I mentioned search above. We've already started designing how search functionality will work, and completing it is a major goal of the next release. The requirement is that any content (whether it's served by a connector or produced by a sequencer) would be automatically indexed and used in searches.

          Hopefully this gives you more information and at least a starting point. In the longer term, I'd love for this to "just work" with the File System Connector and sequencers and searching. But in the shorter term, the best bet may be a custom connector (based on the File System Connector). Thoughts?

          • 2. Re: JCR interface to local filesystem
            sverker

            Sounds like the approach I'm looking for.

            The specific use case that I'm looking at is a directory structure of mp3 and m4a files. Like the image files in your example, these have metadata such as ID3 and MP4 tags. I want to be able to work on these files with the normal tools, and any changes should be reflected transparently through the JCR interface.

            That probably means that the File System Connector should hook into operating system events to get notified of changes (jnotify is a library that can do that on Linux and Windows), which would then trigger the sequencing framework to index the metadata.
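            The jnotify library has its own listener interface; the sketch below shows the same watch-a-directory idea using only the standard java.nio.file.WatchService (available since Java 7), as a stand-in for whatever notification library the connector would use:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;

// Watches one directory and reports create/modify/delete events that a
// connector could translate into repository change events for sequencing.
// Uses the standard WatchService as a stand-in for jnotify.
public class DirectoryWatcher {

    private final WatchService watcher;

    public DirectoryWatcher(Path dir) throws IOException {
        this.watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
    }

    /** Blocks until the next batch of events and returns "KIND path" strings. */
    public List<String> nextEvents() throws InterruptedException {
        WatchKey key = watcher.take();
        List<String> events = new ArrayList<>();
        for (WatchEvent<?> event : key.pollEvents()) {
            events.add(event.kind().name() + " " + event.context());
        }
        key.reset(); // re-arm the key so further changes are delivered
        return events;
    }
}
```

            Each reported path could then be fed to the sequencers to re-extract and re-index the tags.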

            • 3. Re: JCR interface to local filesystem
              rhauch

              Yes, I'd love for the File System Connector to generate events based upon changes in the underlying file system. And thanks for the tip about jnotify! Any interest in helping out here? :-)

              Events definitely allow the sequencers to process any changed files. Unfortunately, there is still the issue of what to do with content that may have changed while the repository was shut down (or with the content of a newly-added source). For example, if the repository is _restarted_, then ideally DNA would walk the content to see what has changed, and then create events for anything that has changed since the last time DNA was connected to the source.

              We have to address this for search indexing, and one approach I'm mulling over is for DNA to have a service that would be in charge of walking the content and generating the "missing" events. Any other ideas?
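              A minimal sketch of such a reconciliation service, assuming the repository persists a snapshot of path to last-modified times before shutting down (the class name and event strings are invented for illustration):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical reconciliation pass: compare a snapshot taken before shutdown
// (relative path -> last-modified millis) against the file system at startup,
// and synthesize the ADDED/CHANGED/REMOVED events that were missed while down.
public class StartupReconciler {

    public List<String> missingEvents(Map<String, Long> snapshot, Path root) throws IOException {
        Map<String, Long> current = new TreeMap<>();
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(p -> {
                try {
                    current.put(root.relativize(p).toString(),
                                Files.getLastModifiedTime(p).toMillis());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
        List<String> events = new ArrayList<>();
        for (Map.Entry<String, Long> e : current.entrySet()) {
            Long before = snapshot.get(e.getKey());
            if (before == null) events.add("ADDED " + e.getKey());
            else if (!before.equals(e.getValue())) events.add("CHANGED " + e.getKey());
        }
        for (String path : snapshot.keySet()) {
            if (!current.containsKey(path)) events.add("REMOVED " + path);
        }
        return events;
    }
}
```

              The synthesized events would then flow through the same pipeline as live ones, so the sequencers and indexes catch up automatically.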

              • 4. Re: JCR interface to local filesystem
                sverker

                Well, I just checked out the source, so I'll take a look at it. The challenge is not so much generating the events as knowing where to pass them. As I understand it now, it's the framework that triggers the sequencers, not the connector? How would this look from the connector's perspective?

                About changes in the underlying data while the repository is shut down: that's a general problem for any datastore that can be changed out of context. If you persist the indexes, you can never trust that they are still valid when you restart the repository. Therefore it should be possible to mark that certain datastores need to be reindexed when the repository is reloaded; other types of repositories don't need to be reindexed, as their data storage doesn't change out of context.

                The indexing could be handled by a service that walks through the repository when requested to do so, or that receives events from the connectors when something has changed. Those events should probably be queued for consistency.

                That's probably the right event flow: the indexing service listens for change events from the connectors and invokes the sequencers. Thoughts?
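                A hypothetical sketch of that flow (all names invented): connectors enqueue change events, and the indexing service drains the queue in order and invokes an indexer/sequencer callback for each one:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical event flow: connectors publish change events into a queue,
// and the indexing service drains the queue in FIFO order (for consistency)
// and hands each changed path to a sequencer/indexer callback.
public class IndexingService {

    public interface Indexer { void index(String changedPath); }

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Indexer indexer;

    public IndexingService(Indexer indexer) { this.indexer = indexer; }

    /** Called by connectors when content changes. */
    public void publish(String changedPath) {
        queue.add(changedPath);
    }

    /** Drains queued events in order, indexing each one; returns the count. */
    public int drain() {
        List<String> batch = new ArrayList<>();
        queue.drainTo(batch);
        for (String path : batch) indexer.index(path);
        return batch.size();
    }
}
```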

                • 5. Re: JCR interface to local filesystem
                  rhauch


                  "sverker" wrote:
                  The challenge is not so much generating the events as knowing where to pass them. As I understand it now, it's the framework that triggers the sequencers, not the connector? How would this look from the connector's perspective?

                  The connector can generate events and hand them to the observer (passed to each RepositorySource upon initialization via the RepositoryContext).
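                  A highly simplified sketch of that wiring (DNA's actual RepositoryContext, RepositorySource and observer types have richer interfaces; these names are stand-ins): the context hands the connector an observer at initialization, and the connector pushes its change events to it:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified version of the observer wiring described above:
// the framework passes a context to the source when it is initialized, the
// source keeps the context's observer, and later hands it change events.
public class ConnectorObserverSketch {

    public interface Observer { void notify(String changeEvent); }

    public static class RepositoryContext {
        private final Observer observer;
        public RepositoryContext(Observer observer) { this.observer = observer; }
        public Observer getObserver() { return observer; }
    }

    public static class FileSystemSource {
        private Observer observer;

        /** Called by the framework when the source is set up. */
        public void initialize(RepositoryContext context) {
            this.observer = context.getObserver();
        }

        /** Called when the connector detects a change on disk. */
        public void fileChanged(String path) {
            if (observer != null) observer.notify("CHANGED " + path);
        }
    }
}
```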

                  "sverker" wrote:
                  About changes in the underlying data while the repository is shut down: that's a general problem for any datastore that can be changed out of context. If you persist the indexes, you can never trust that they are still valid when you restart the repository. Therefore it should be possible to mark that certain datastores need to be reindexed when the repository is reloaded; other types of repositories don't need to be reindexed, as their data storage doesn't change out of context.

                  I agree. It's not rocket science, but we'll have to put some thought into it. The goal is that it is all automatic (apart from minimal configuration; maybe a source wants to say "don't index me") and we can keep the indexes up-to-date without having to completely re-index and re-sequence everything.

                  "sverker" wrote:
                  The indexing could be handled by a service that walks through the repository when requested to do so, or that receives events from the connectors when something has changed. Those events should probably be queued for consistency.

                  That's probably the right event flow: the indexing service listens for change events from the connectors and invokes the sequencers. Thoughts?

                  This is the plan: updating the index will be handled automatically via events. So as long as connectors publish events, the search indexes will get updated correctly.