6 Replies Latest reply on Feb 27, 2014 11:28 AM by rhauch

    JDBC and JCR connectors

    scossu

      Hello,

      I am setting up a Fedora 4 implementation, which is based on ModeShape, and I want to integrate external data into the repository.

      One data source is a separate Fedora 4 repo, which I want to connect to using the JCR API; the other is a PostgreSQL database.

       

      Are there some JCR or non-metadata JDBC connectors available that I can use as starting points? They look like something that should be commonly used, but I haven't been able to find any so far.

       

      Thanks,

      Stefano

        • 1. Re: JDBC and JCR connectors
          hchiorean

          We don't have such an implementation yet, but we do have it on the roadmap for 4.1: https://issues.jboss.org/browse/MODE-282 This is a read-only connector which, based on a configured query, would project into the repository the content from the DB.

          • 2. Re: JDBC and JCR connectors
            scossu

            Hi Horia,

            The discussion in that ticket is interesting. I like the idea of the GROUP BY approach.

            Even using that, though, some of the tables in the database have hundreds of thousands of records. I can break up the path based on identifiers, but I prefer keeping a predictable path and not breaking it down too much, so there might still be thousands or tens of thousands of nodes under some containers.

            How will that affect performance in the federated repository?

            Thanks

            s

            • 3. Re: JDBC and JCR connectors
              hchiorean

              For external sources which have large numbers of "children under a given parent", we have a set of SPIs which connectors should implement in order to expose a "paging mechanism".

               

              It all starts with the Pageable interface (modeshape/modeshape-jcr/src/main/java/org/modeshape/jcr/federation/spi/Pageable.java, on master in ModeShape/modeshape).

              Our existing FileSystemConnector (modeshape/modeshape-jcr/src/main/java/org/modeshape/connector/filesystem/FileSystemConnector.java, on master) and GitConnector (modeshape/connectors/modeshape-connector-git/src/main/java/org/modeshape/connector/git/GitConnector.java, on master) are two connectors that use this mechanism.
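To make the paging idea concrete, here is a minimal sketch in plain Java. It does not use the actual Pageable SPI or any ModeShape class: `ChildPage`, `PagedChildSource`, and the offset-based page key are all hypothetical names chosen for illustration, and the in-memory list stands in for whatever the external source really is. The point is only that a parent with many children hands them out one bounded block at a time, together with a pointer to the next block.

```java
import java.util.ArrayList;
import java.util.List;

/** One block of child identifiers plus the offset of the next block (-1 if none). Hypothetical type. */
final class ChildPage {
    final List<String> childIds;
    final int nextOffset;

    ChildPage(List<String> childIds, int nextOffset) {
        this.childIds = childIds;
        this.nextOffset = nextOffset;
    }

    boolean hasNext() {
        return nextOffset >= 0;
    }
}

/** Hypothetical stand-in for an external source with many children under one parent. */
final class PagedChildSource {
    private final List<String> allChildIds;
    private final int pageSize;

    PagedChildSource(List<String> allChildIds, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.allChildIds = new ArrayList<>(allChildIds);
        this.pageSize = pageSize;
    }

    /** Returns at most pageSize children starting at the given offset. */
    ChildPage getChildren(int offset) {
        if (offset < 0 || offset > allChildIds.size()) {
            throw new IndexOutOfBoundsException("offset out of range: " + offset);
        }
        int end = Math.min(offset + pageSize, allChildIds.size());
        List<String> slice = new ArrayList<>(allChildIds.subList(offset, end));
        // Only advertise a next page when there are children left after this block.
        int next = end < allChildIds.size() ? end : -1;
        return new ChildPage(slice, next);
    }
}

final class PagingDemo {
    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 25; i++) {
            ids.add("child-" + i);
        }
        PagedChildSource source = new PagedChildSource(ids, 10);

        // Walk the pages until the source reports there is no next block.
        ChildPage page = source.getChildren(0);
        while (true) {
            System.out.println("page with " + page.childIds.size() + " children");
            if (!page.hasNext()) {
                break;
            }
            page = source.getChildren(page.nextOffset);
        }
    }
}
```

For a database-backed source, the offset and page size would presumably map onto something like a bounded query per block, but that mapping is an assumption and not something the existing connectors are claimed to do.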

               

              On a more general note though, the term "performance" in the context of federated repositories is a bit vague. There are several things to take into account, to name a few:

              * is the connector read-only or read-write?

              * what are the "main use cases" for this repository? The JCR spec is complex, so a specific use case (e.g. creating lots of children under the same parent) can have its own performance characteristics.

              * is the external source "near" or "far" from the repository? By "far" I'm referring to those sources where network trips are necessary to retrieve information.

              * what is the order of importance for the performance criteria, considering that there is (at least): memory consumption, CPU time, network round-trips, etc.

              • 4. Re: JDBC and JCR connectors
                scossu

                Thanks for the explanation Horia.

                I thought about pagination too, but I was just concerned about another type of performance degradation. The federated nodes are not meant to be queried in bulk, but rather accessed as references from local nodes. The local nodes containing the reference will be queried in large quantities, but those are already optimized. Does it make a difference if I query a single node which is in a federated container with another 300K nodes or with just a few hundred?

                 

                Also, I'm surprised that nobody has developed a JCR connector before. I thought that connecting two independent JCR repositories would be a very common use case. Or is there a better way to integrate the contents of one repo into another one?

                • 5. Re: JDBC and JCR connectors
                  hchiorean

                  I originally missed the part about the JCR connector: it's also on our radar ([MODE-1709] New connector to another JCR repository), but not very high. We had such connectors implemented for ModeShape 2.x, but because ModeShape 3.x was basically a re-write, the entire connector architecture has changed and these connectors need to be re-written.

                   

                  Regarding querying: atm in 3.x, ModeShape doesn't store indexes up-front for the nodes of an external system. What it does do though, is create those indexes whenever nodes from the external system are *changed* via the repository. I say *changed* because if you just iterate through the children of an external node, that doesn't mean those nodes will be indexed unless you explicitly configure/use the API to re-index stuff. If you add a property to an external node however, that will trigger indexing for that node.

                  Also, when dealing with external nodes, you have the option of marking either an entire connector or individual nodes as non-queryable, meaning that ModeShape will never index those external nodes.

                   

                  In 4.x however (which is what we're working on atm) the entire indexing architecture will be changed - Randall Hauch can provide more technical insight.

                  This means that tasks like [MODE-1686] Support connectors to systems that contain their own search capabilities - JBoss Issue Tracker will probably be easier to implement and in the case of some connectors (JDBC for example) may offer significant performance advantages.

                  • 6. Re: JDBC and JCR connectors
                    rhauch

                    "In 4.x however (which is what we're working on atm) the entire indexing architecture will be changed - Randall Hauch can provide more technical insight."

                     

                    ModeShape 4.0 will allow you to define indexes for specific properties (or compound indexes for several properties), and then ModeShape will populate them, maintain them, and use them when executing queries. Any criteria that cannot be matched to an index will be processed by ModeShape itself; that means that any query will still work even if no indexes are defined, albeit more slowly, since that essentially means ModeShape will scan the repository. This is just like a traditional database: you should define indexes based upon your queries, and without proper indexes your queries may be slow (but they'll work).

                     

                    Of course some indexes will be built-in:

                    • a path constraint such as "... WHERE ISSAMENODE(type,'/a/b/c/d') ..." or "... type.[jcr:path] = '/a/b/c/d' ..." will be handled not by a full workspace scan but instead by a direct lookup by path.
                    • a child constraint such as "... WHERE ISCHILDNODE(type,'/a/b/c') ..." will be handled not by a full workspace scan but instead by a direct lookup of the parent and navigation of all children.
                    • a descendant constraint such as "... WHERE ISDESCENDANTNODE(type,'/a/b') ..." will be handled not by a full workspace scan but instead by a direct lookup of the ancestor and navigation of all descendants.

                     

                    Additionally, when a single portion of a query plan can use multiple indexes, ModeShape will use the least-expensive index. For example, the implicit path index described above is extremely cheap, whereas the descendant constraint is relatively expensive.
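The "use the least-expensive index" rule can be illustrated with a small sketch in plain Java. This is not ModeShape's query planner: `CandidateIndex`, `IndexChooser`, and the cost numbers are all hypothetical, picked only to mirror the statement that a path lookup is cheap and a descendant lookup is relatively expensive. An empty result stands for the case where no index applies and the engine has to fall back to a scan.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** An index that could satisfy one portion of a query plan, with an estimated cost. Hypothetical type. */
final class CandidateIndex {
    final String name;
    final double estimatedCost;

    CandidateIndex(String name, double estimatedCost) {
        this.name = name;
        this.estimatedCost = estimatedCost;
    }
}

/** Picks the cheapest applicable index; an empty result means "no index applies, scan instead". */
final class IndexChooser {
    static Optional<CandidateIndex> cheapest(List<CandidateIndex> candidates) {
        return candidates.stream()
                .min(Comparator.comparingDouble((CandidateIndex c) -> c.estimatedCost));
    }
}

final class IndexChoiceDemo {
    public static void main(String[] args) {
        List<CandidateIndex> candidates = new ArrayList<>();
        // Illustrative costs only; real costs would come from the query engine.
        candidates.add(new CandidateIndex("descendant-lookup", 100.0));
        candidates.add(new CandidateIndex("path-lookup", 1.0));

        Optional<CandidateIndex> choice = IndexChooser.cheapest(candidates);
        System.out.println(choice.map(c -> "use " + c.name).orElse("fall back to scan"));
    }
}
```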

                     

                    ModeShape 4.0 will also allow you to use external systems for indexes. Out of the box you will be able to define indexes on the local file system, and we hope to add support for Solr and ElasticSearch very soon. Once all that is in place, we can tackle connectors exposing their own indexes and having ModeShape also push down portions of queries (what we call ACCESS points in a query plan) to individual connectors. That's essentially what Horia mentioned here:

                     

                    "This means that tasks like [MODE-1686] Support connectors to systems that contain their own search capabilities - JBoss Issue Tracker will probably be easier to implement and in the case of some connectors (JDBC for example) may offer significant performance advantages."

                    Hope this helps.