2 Replies Latest reply on Feb 11, 2012 5:28 PM by bwallis42

    Subtree Queries

    bwallis42

      I have a model with a tree like so (attribute names in [])

       

      /patient/section/document[docType,uuid]

      /patient/patientId[ns, id]

       

      So, patient has two subtrees, one with patient IDs (multiple) and one with sections (multiple) and documents (again, multiple).

       

      I want to find all the documents for a patient with the namespace and patientId of CPF and 919191. This ID is unique to the patient but a patient can have multiple IDs so the where clause on patientId should select a single patient.

       

      I came up with the following:

       

       

      {code:sql}

      SELECT d.[jcr:uuid],d.[inf:docType]

         FROM [inf:patient] as p join [inf:patientId] as pid on isdescendantnode(pid,p) join [inf:document] as d on isdescendantnode(d,p)

         WHERE pid.[inf:id] = '919191' and pid.[inf.ns] = 'CPF'

      {code}

      and it works but is it a "good" query? Will it be efficient. What happens when there are 915537 patientId nodes across 839532 patient nodes and 10,000,000 documents (about 11 per patient average but some have 0 and some have 100's). These are figures from a current production system of ours and this is typical of the sort of data we are going to be storing in a repository.

       

      I'm a bit concerned about how well queries are going to perform in a live system. I would hate to find out that it is going to be too slow after we move from our current storage scheme to ModeShape.

       

      Most of our queries are likely to be across the nodes for a single patient so I was wondering if there is a way to do a query restricted to a subtree so that once I have located the patient node that I require the rest of my queries are restricted to this subtree?

       

      thanks,

        • 1. Re: Subtree Queries
          rhauch

          I came up with the following:

          
          SELECT d.[jcr:uuid],d.[inf:docType] 
             FROM [inf:patient] as p join [inf:patientId] as pid on isdescendantnode(pid,p) join [inf:document] as d on isdescendantnode(d,p) 
             WHERE pid.[inf:id] = '919191' and pid.[inf.ns] = 'CPF'
          

           

          and it works but is it a "good" query? Will it be efficient?

           

          To get an idea of the efficiency of the query, you can look at the query plan, which completely specifies how ModeShape has executed the query. To do this, cast the 'javax.jcr.query.QueryResult' object to 'org.modeshape.jcr.query.JcrQueryResult' and call 'getPlan()':

           

           

                  Query query = session.getWorkspace().getQueryManager().createQuery(sql, Query.JCR_SQL2);
                  JcrQueryResult jcrResult = (JcrQueryResult)query.execute();
                  String plan = jcrResult.getPlan();
          

           

          BTW, we'll make it easier to get the plan in 2.8 and 3.0 by adding a method to our public API.

           

          Anyway, this is the query plan for this query (from which I've removed the "SOURCE_COLUMN" values for readability):

           

           

          Project [d] <PROJECT_COLUMN_TYPES=[STRING, STRING], PROJECT_COLUMNS=[d.jcr:uuid, d.inf:docType]>
            Join [d,p] <JOIN_TYPE=INNER JOIN, JOIN_CONDITION=ISDESCENDANTNODE(d,p), JOIN_ALGORITHM=NESTED_LOOP>
              Join [p,pid] <JOIN_TYPE=INNER JOIN, JOIN_CONDITION=ISDESCENDANTNODE(pid,p), JOIN_ALGORITHM=NESTED_LOOP>
                Access [p]
                  Project [p] <PROJECT_COLUMN_TYPES=[STRING, STRING, STRING, DOUBLE, STRING, LONG], PROJECT_COLUMNS=[p.jcr:primaryType, p.jcr:path, p.jcr:name, p.jcr:score, p.mode:localName, p.mode:depth]>
                    Select [p] <SELECT_CRITERIA=p.jcr:primaryType = 'inf:patient'>
                      Source [p] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=p, SOURCE_COLUMNS=[...]>
                Access [pid]
                  Project [pid] <PROJECT_COLUMN_TYPES=[STRING, LONG, STRING, STRING, STRING, DOUBLE, STRING, LONG], PROJECT_COLUMNS=[pid.inf:ns, pid.inf:id, pid.jcr:primaryType, pid.jcr:path, pid.jcr:name, pid.jcr:score, pid.mode:localName, pid.mode:depth]>
                    Select [pid] <SELECT_CRITERIA=pid.inf:id = '919191'>
                      Select [pid] <SELECT_CRITERIA=pid.inf.ns = 'CPF'>
                        Select [pid] <SELECT_CRITERIA=pid.jcr:primaryType = 'inf:patientId'>
                          Source [pid] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=pid, SOURCE_COLUMNS=[...]>
              Access [d]
                Project [d] <PROJECT_COLUMN_TYPES=[STRING, STRING, STRING, STRING, STRING, DOUBLE, STRING, LONG], PROJECT_COLUMNS=[d.inf:docType, d.jcr:uuid, d.jcr:primaryType, d.jcr:path, d.jcr:name, d.jcr:score, d.mode:localName, d.mode:depth]>
                  Select [d] <SELECT_CRITERIA=d.jcr:primaryType = 'inf:document'>
                    Source [d] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=d, SOURCE_COLUMNS=[...]>
          
          

           

          There are three "Access" nodes (queries against the Lucene indexes), and then two "Join" nodes that use the "NESTED_LOOP" algorithm to do the actual merge. The plan is actually a tree, and the bottom-most parts are executed first. Let's look at the "Access" nodes first.

           

          The 2nd access node, "Access [pid]" is performing a query to find the nodes with the "pid.[inf:id] = 919191" and "pid.[inf.ns] = 'CPF'" criteria. This will be quite fast.

          The 1st access node, "Access [p]" is performing a query to find the nodes that are of type "inf:patient". This will be highly dependent upon the number of patient objects, as you surmised.

          Above these two access nodes is the "Join [p,pid]", which will be more expensive as the number of patient objects.

           

          So without looking any further at the plan, we can tell that this query is not going to be ideal for large numbers of patients. Ideally, ModeShape could transparently re-write the query plan to perform a dependent join, where the "Access [pid]" query is executed first and the results are converted to a criteria on the "Access [p]" query. Unfortunately, ModeShape's query optimizer isn't capable of doing that yet.

           

          Let's consider a few alternative approaches that you can use right now.

           

          Alternative 1: use multiple queries

           

          As mentioned earlier, the access query that finds the "inf:patientId" node given the criteria will be quite fast, as the criteria you provided will produce a single node in the results. We could leverage that speed and issue a single query to do this:

           

           

          SELECT * FROM [inf:patientId] AS pid WHERE pid.[inf:id] = '919191' and pid.[inf.ns] = 'CPF'
          

           

          Then we could get the node(s) in the result, find the parent, then use a more constrained query to find the documents under the parent node:

           

          SELECT d.[jcr:uuid],d.[inf:docType] FROM [inf:document] AS d WHERE ISDESCENDANTNODE( d, '/path/to/patient')
          

           

          This second query could be changed to use the ModeShape-specific "PATH(...)" dynamic operand:

           

          SELECT d.[jcr:uuid],d.[inf:docType] FROM [inf:document] AS d WHERE PATH( d ) LIKE '/path/to/patient'
          

           

          which then allows using bind variables:

           

          SELECT d.[jcr:uuid],d.[inf:docType] FROM [inf:document] AS d WHERE PATH( d ) LIKE $patientPath
          

           

          These last 3 queries are equivalent and all result in pushing the same path criteria down into the access query. An advantage of using this approach is that you can easily add criteria to limit the documents. Also, this approach works even if the criteria for the first query results in multiple patients (in such a case, the second query would need to use an OR-ed set of path criteria or the ModeShape-specific "IN" criteria. However, if you're going to use all (or even many) of the documents under the patient node anyway, there's another alternative that's probably faster.

           

          Alternative 2: use query and navigation

           

          A second is to use a query to find the patient node(s) using a query:

           

          SELECT * FROM [inf:patientId] AS pid WHERE pid.[inf:id] = '919191' and pid.[inf.ns] = 'CPF'
          

           

          but then use JCR navigation to simply find the parent node(s) and walk to all of the document nodes under that patient(s). This will most likely be a bit faster than Alternative 1, especially if you're going to use all (or most) of the documents under the patient. Why? Because the second query used in Alternative 1 is executed against the indexes and the criteria has to be used to filter out the nodes outside of the subgraph, whereas navigation is extremely fast and direct.

           

           

           

          I hope this helps and gives you some additional tools to help you discover how other queries might be executed.

          • 2. Re: Subtree Queries
            bwallis42

            Thanks Randall, a useful and detailed answer as always. Don't know what I'd do without you!

             

            regards,

            brian...