-
1. Re: Subtree Queries
rhauch Feb 9, 2012 11:14 AM (in response to bwallis42)I came up with the following:
SELECT d.[jcr:uuid],d.[inf:docType] FROM [inf:patient] as p join [inf:patientId] as pid on isdescendantnode(pid,p) join [inf:document] as d on isdescendantnode(d,p) WHERE pid.[inf:id] = '919191' and pid.[inf.ns] = 'CPF'
and it works but is it a "good" query? Will it be efficient?
To get an idea of the efficiency of the query, you can look at the query plan, which completely specifies how ModeShape has executed the query. To do this, cast the 'javax.jcr.query.QueryResult' object to 'org.modeshape.jcr.query.JcrQueryResult' and call 'getPlan()':
Query query = session.getWorkspace().getQueryManager().createQuery(sql, Query.JCR_SQL2); JcrQueryResult jcrResult = (JcrQueryResult)query.execute(); String plan = jcrResult.getPlan();
BTW, we'll make it easier to get the plan in 2.8 and 3.0 by adding a method to our public API.
Anyway, this is the query plan for this query (from which I've removed the "SOURCE_COLUMN" values for readability):
Project [d] <PROJECT_COLUMN_TYPES=[STRING, STRING], PROJECT_COLUMNS=[d.jcr:uuid, d.inf:docType]> Join [d,p] <JOIN_TYPE=INNER JOIN, JOIN_CONDITION=ISDESCENDANTNODE(d,p), JOIN_ALGORITHM=NESTED_LOOP> Join [p,pid] <JOIN_TYPE=INNER JOIN, JOIN_CONDITION=ISDESCENDANTNODE(pid,p), JOIN_ALGORITHM=NESTED_LOOP> Access [p] Project [p] <PROJECT_COLUMN_TYPES=[STRING, STRING, STRING, DOUBLE, STRING, LONG], PROJECT_COLUMNS=[p.jcr:primaryType, p.jcr:path, p.jcr:name, p.jcr:score, p.mode:localName, p.mode:depth]> Select [p] <SELECT_CRITERIA=p.jcr:primaryType = 'inf:patient'> Source [p] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=p, SOURCE_COLUMNS=[...]> Access [pid] Project [pid] <PROJECT_COLUMN_TYPES=[STRING, LONG, STRING, STRING, STRING, DOUBLE, STRING, LONG], PROJECT_COLUMNS=[pid.inf:ns, pid.inf:id, pid.jcr:primaryType, pid.jcr:path, pid.jcr:name, pid.jcr:score, pid.mode:localName, pid.mode:depth]> Select [pid] <SELECT_CRITERIA=pid.inf:id = '919191'> Select [pid] <SELECT_CRITERIA=pid.inf.ns = 'CPF'> Select [pid] <SELECT_CRITERIA=pid.jcr:primaryType = 'inf:patientId'> Source [pid] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=pid, SOURCE_COLUMNS=[...]> Access [d] Project [d] <PROJECT_COLUMN_TYPES=[STRING, STRING, STRING, STRING, STRING, DOUBLE, STRING, LONG], PROJECT_COLUMNS=[d.inf:docType, d.jcr:uuid, d.jcr:primaryType, d.jcr:path, d.jcr:name, d.jcr:score, d.mode:localName, d.mode:depth]> Select [d] <SELECT_CRITERIA=d.jcr:primaryType = 'inf:document'> Source [d] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=d, SOURCE_COLUMNS=[...]>
There are three "Access" nodes (queries against the Lucene indexes), and then two "Join" nodes that use the "NESTED_LOOP" algorithm to do the actual merge. The plan is actually a tree, and the bottom-most parts are executed first. Let's look at the "Access" nodes first.
The 2nd access node, "Access [pid]" is performing a query to find the nodes with the "pid.[inf:id] = 919191" and "pid.[inf.ns] = 'CPF'" criteria. This will be quite fast.
The 1st access node, "Access [p]" is performing a query to find the nodes that are of type "inf:patient". This will be highly dependent upon the number of patient objects, as you surmised.
Above these two access nodes is the "Join [p,pid]", which will be more expensive as the number of patient objects.
So without looking any further at the plan, we can tell that this query is not going to be ideal for large numbers of patients. Ideally, ModeShape could transparently re-write the query plan to perform a dependent join, where the "Access [pid]" query is executed first and the results are converted to a criteria on the "Access [p]" query. Unfortunately, ModeShape's query optimizer isn't capable of doing that yet.
Let's consider a few alternative approaches that you can use right now.
Alternative 1: use multiple queries
As mentioned earlier, the access query that finds the "inf:patientId" node given the criteria will be quite fast, as the criteria you provided will produce a single node in the results. We could leverage that speed and issue a single query to do this:
SELECT * FROM [inf:patientId] AS pid WHERE pid.[inf:id] = '919191' and pid.[inf.ns] = 'CPF'
Then we could get the node(s) in the result, find the parent, then use a more constrained query to find the documents under the parent node:
SELECT d.[jcr:uuid],d.[inf:docType] FROM [inf:document] AS d WHERE ISDESCENDANTNODE( d, '/path/to/patient')
This second query could be changed to use the ModeShape-specific "PATH(...)" dynamic operand:
SELECT d.[jcr:uuid],d.[inf:docType] FROM [inf:document] AS d WHERE PATH( d ) LIKE '/path/to/patient'
which then allows using bind variables:
SELECT d.[jcr:uuid],d.[inf:docType] FROM [inf:document] AS d WHERE PATH( d ) LIKE $patientPath
These last 3 queries are equivalent and all result in pushing the same path criteria down into the access query. An advantage of using this approach is that you can easily add criteria to limit the documents. Also, this approach works even if the criteria for the first query results in multiple patients (in such a case, the second query would need to use an OR-ed set of path criteria or the ModeShape-specific "IN" criteria. However, if you're going to use all (or even many) of the documents under the patient node anyway, there's another alternative that's probably faster.
Alternative 2: use query and navigation
A second is to use a query to find the patient node(s) using a query:
SELECT * FROM [inf:patientId] AS pid WHERE pid.[inf:id] = '919191' and pid.[inf.ns] = 'CPF'
but then use JCR navigation to simply find the parent node(s) and walk to all of the document nodes under that patient(s). This will most likely be a bit faster than Alternative 1, especially if you're going to use all (or most) of the documents under the patient. Why? Because the second query used in Alternative 1 is executed against the indexes and the criteria has to be used to filter out the nodes outside of the subgraph, whereas navigation is extremely fast and direct.
I hope this helps and gives you some additional tools to help you discover how other queries might be executed.
-
2. Re: Subtree Queries
bwallis42 Feb 11, 2012 5:28 PM (in response to rhauch)Thanks Randall, a useful and detailed answer as always. Don't know what I'd do without you!
regards,
brian...