6 Replies Latest reply on Sep 18, 2014 9:31 AM by rhauch

Performance issue with the RowIterator returned by a Query

amischler Sep 17, 2014 4:03 AM

Hi,

I'm experiencing a strange performance issue with the RowIterator returned by a Query.

I initialized a repository with 1000 nodes of type dooapp:myType where [dooapp:myType] > nt:unstructured

These nodes are distributed across the repository tree so that no node has a large number of children (I followed the guidelines of this article: https://modeshape.wordpress.com/2014/08/14/improving-performance-with-large-numbers-of-child-nodes/).

The repository is configured exactly as in the modeshape-filesystem-store-example from the ModeShape/modeshape-examples repository on GitHub (master branch), and I'm using ModeShape 4.0.0.Beta1.

Then I execute the following query:

SELECT [jcr:path] FROM [dooapp:myType]

and I iterate on the result using the RowIterator returned by the QueryResult.

If the nodes of myType do not contain any children, the whole process runs smoothly.

But when I start adding children and properties to the nodes of myType, the first call to rowIterator.hasNext() or rowIterator.nextRow() takes longer and longer, quickly reaching ~20 s (although the query execution itself remains stable at ~300 ms).

I'm surprised that adding children and properties has an impact on iterating over the query result, since I'm only requesting the [jcr:path] column in my query. It looks like the initialization of the row iterator visits or loads the whole tree of the result nodes and their children.

Any idea where this could come from and how I could improve the execution time of the first call to the rowIterator?

Thanks for your help.

1. Re: Performance issue with the RowIterator returned by a Query

rhauch Sep 17, 2014 10:00 AM (in response to amischler)

Did you explicitly add an index for your query? With ModeShape 4.0, this is absolutely required to get decent performance out of your queries. See the 4.0 query and search documentation for details.

Even when using the RowIterator, ModeShape is going to require loading the node before returning it. ModeShape no longer indexes all of the possible values that might be returned in the row, and if your repository is undergoing a lot of change those values might have been a slight bit stale. Instead, ModeShape relies upon the indexes to select the fewest number of nodes in the repository that best match your criteria, and those nodes must then be materialized to further apply any remaining criteria.

For example, in this particular query, with just a FROM clause, your query ends up using a criterion something like:

[jcr:primaryType] = 'dooapp:myType'

though the actual criteria used are more complicated, since they look at "jcr:primaryType" and "jcr:mixinTypes", and the literals are "dooapp:myType" and all subtypes of "dooapp:myType". You can create a "nodetype" index or, if your nodes all use "dooapp:myType" as their primary type, an index on the "dooapp:myType" node type with a "jcr:primaryType(NAME)" column.

Either of these indexes should be eligible. ModeShape still has to load the nodes to get the path, but it only does this for the "dooapp:myType" nodes.

Without an index, the query engine must scan the entire repository looking for nodes of type "dooapp:myType".
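
To make the difference concrete, here is a toy sketch in plain Java. It is not ModeShape code: the class name, the map-based "repository", and the load counter are all assumptions made up for illustration. It only models the idea that a scan has to materialize every node to test its type, while an index hands back just the candidate nodes, which are then still loaded to read their path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only (hypothetical names, not the ModeShape API).
// The "repository" is a map from node path to primary type name.
final class ScanVersusIndexSketch {
    static int nodesLoaded = 0; // counts simulated node materializations

    // Simulates materializing a node and reading its primary type.
    static String loadType(Map<String, String> repo, String path) {
        nodesLoaded++;
        return repo.get(path);
    }

    // No index: every node in the repository is loaded just to check its type.
    static List<String> fullScan(Map<String, String> repo, String type) {
        List<String> hits = new ArrayList<>();
        for (String path : repo.keySet()) {
            if (type.equals(loadType(repo, path))) hits.add(path);
        }
        return hits;
    }

    // With an index: only the candidates listed under the type are loaded.
    static List<String> indexed(Map<String, String> repo,
                                Map<String, List<String>> index, String type) {
        List<String> hits = new ArrayList<>();
        for (String path : index.getOrDefault(type, Collections.<String>emptyList())) {
            loadType(repo, path); // still materialized, e.g. to return its path
            hits.add(path);
        }
        return hits;
    }

    // 3 nodes of dooapp:myType, each with 4 nt:unstructured children.
    static Map<String, String> sampleRepo() {
        Map<String, String> repo = new LinkedHashMap<>();
        for (int i = 0; i < 3; i++) {
            repo.put("/n" + i, "dooapp:myType");
            for (int c = 0; c < 4; c++) {
                repo.put("/n" + i + "/child" + c, "nt:unstructured");
            }
        }
        return repo;
    }

    // Builds a type -> paths lookup, standing in for a node type index.
    static Map<String, List<String>> buildIndex(Map<String, String> repo) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        repo.forEach((path, type) ->
            index.computeIfAbsent(type, t -> new ArrayList<>()).add(path));
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> repo = sampleRepo();
        nodesLoaded = 0;
        fullScan(repo, "dooapp:myType");
        System.out.println("full scan loaded " + nodesLoaded + " nodes");
        nodesLoaded = 0;
        indexed(repo, buildIndex(repo), "dooapp:myType");
        System.out.println("indexed lookup loaded " + nodesLoaded + " nodes");
    }
}
```

In this toy setup the scan touches all 15 nodes while the indexed lookup touches only the 3 matching ones, which is why adding children under the result nodes makes an unindexed query slower even when only [jcr:path] is selected.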
2. Re: Performance issue with the RowIterator returned by a Query

amischler Sep 17, 2014 12:24 PM (in response to rhauch)

Thanks for your answer.

I have updated the configuration file of my repository with the following sections:

"indexProviders" : {
    "myProvider" : {
        "name" : "myProvider",
        "classname" : "org.modeshape.jcr.index.local.LocalIndexProvider",
        "directory" : "target/indexes"
    }
},
"indexes" : {
    "nodesByNodeType" : {
        "kind" : "value",
        "provider" : "myProvider",
        "synchronous" : "true",
        "nodeType" : "dooapp:myType",
        "columns" : "jcr:primaryType(NAME)"
    }
}

The indexes seem to be built at startup (the target/indexes directory is created), but the query does not use the index (I can't see any index operation in the query plan). Am I missing something?

Best regards,
3. Re: Performance issue with the RowIterator returned by a Query

rhauch Sep 17, 2014 3:35 PM (in response to amischler)

Is 'dooapp:myType' a mixin node type that is added to nodes as a mixin? If so, you have to use 'jcr:mixinTypes' rather than 'jcr:primaryType' for the column name.

I just tried a couple of new tests that'll soon go into the 'master' branch, and it seems to work fine for me.
4. Re: Performance issue with the RowIterator returned by a Query

rhauch Sep 17, 2014 4:02 PM (in response to rhauch)

It just occurred to me that you're using 4.0.0.Beta1, while I'm trying the latest from the 'master' branch which has a lot of fixes related to indexes. You might need to use 'master' or wait until Beta2 comes out (hopefully tomorrow).
5. Re: Performance issue with the RowIterator returned by a Query

amischler Sep 18, 2014 8:11 AM (in response to rhauch)

You are right, it works fine with the latest from the 'master' branch. Thanks for your help!

Just for my information: does this mean that, without indexes, the QueryResult returned by the call to execute() on the Query object has not actually executed the query, but is more like a wrapper around the query plan, and that the repository is scanned only when the iterator is used?
6. Re: Performance issue with the RowIterator returned by a Query

rhauch Sep 18, 2014 9:31 AM (in response to amischler)

amischler wrote: "Just for my information: does it mean that without indexes the QueryResult returned after the call to execute() on the Query object, did not actually execute the query but is more like a wrapper around the query plan and that the repository will be scanned only when using the iterator?"

SOURCE operations (in the query plan) that have no used index below them will scan the repository and (literally) return all nodes to the parent operation in the plan. When the SOURCE operation contains a used index, that index will apply some of the criteria and return to the parent operation only those nodes that satisfy those criteria. Of course, there are usually multiple criteria, and the remaining ones get applied higher in the plan.

Have a look at the query plans, since they explain for any query exactly what is going on.

ModeShape's query engine always tries to be lazy and only pull batches through the system when they are needed (e.g., the client is iterating through results). In terms of the query plan, the pull for the first batch starts at the top. If that operation has previously asked for a batch and still has rows in that batch, it will consume them; otherwise, it has no rows and must ask the operation(s) below it for the next batch. Any operation might need to process multiple batches from its children before it can return a single batch to its parent. BTW, much of the time, executing a query does cause the first batch of nodes at the top to be obtained, and this cascades down. And as we explain in the documentation, some operations (like SORT and most JOINs) require pre-fetching all or some of the results before the operation can do its work, and thus some of these do a lot of work just to answer that first batch.
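
The pull model described above can be sketched in plain Java. This is only a toy illustration under assumed names (Operation, Source, EvenFilter, and Sort are hypothetical, not ModeShape's internal classes): each operation asks its child for batches only when it needs rows, a streaming filter can answer after a single child batch, and a sort has to drain its child completely before it can return its first batch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy pull-based plan (hypothetical names, not ModeShape internals).
final class LazyBatchSketch {
    // Returns the next batch of rows, or an empty list when exhausted.
    interface Operation { List<Integer> nextBatch(); }

    // Leaf: produces total-1 down to 0 in batches and counts what it produced.
    static final class Source implements Operation {
        final int total, batchSize;
        int produced = 0;
        Source(int total, int batchSize) { this.total = total; this.batchSize = batchSize; }
        public List<Integer> nextBatch() {
            List<Integer> batch = new ArrayList<>();
            while (batch.size() < batchSize && produced < total) {
                batch.add(total - 1 - produced);
                produced++;
            }
            return batch;
        }
    }

    // Streaming: pulls child batches until it has at least one row to return.
    static final class EvenFilter implements Operation {
        final Operation child;
        EvenFilter(Operation child) { this.child = child; }
        public List<Integer> nextBatch() {
            while (true) {
                List<Integer> in = child.nextBatch();
                if (in.isEmpty()) return in; // child exhausted
                List<Integer> out = new ArrayList<>();
                for (int v : in) if (v % 2 == 0) out.add(v);
                if (!out.isEmpty()) return out;
            }
        }
    }

    // Blocking: must consume every child batch before returning its first one.
    static final class Sort implements Operation {
        final Operation child;
        final int batchSize;
        List<Integer> sorted; // filled on the first pull
        int pos = 0;
        Sort(Operation child, int batchSize) { this.child = child; this.batchSize = batchSize; }
        public List<Integer> nextBatch() {
            if (sorted == null) {
                sorted = new ArrayList<>();
                for (List<Integer> in = child.nextBatch(); !in.isEmpty(); in = child.nextBatch()) {
                    sorted.addAll(in);
                }
                Collections.sort(sorted);
            }
            int end = Math.min(pos + batchSize, sorted.size());
            List<Integer> out = new ArrayList<>(sorted.subList(pos, end));
            pos = end;
            return out;
        }
    }

    public static void main(String[] args) {
        Source s1 = new Source(100, 10);
        new EvenFilter(s1).nextBatch();
        System.out.println("filter only: source produced " + s1.produced + " rows");

        Source s2 = new Source(100, 10);
        new Sort(new EvenFilter(s2), 10).nextBatch();
        System.out.println("with sort: source produced " + s2.produced + " rows");
    }
}
```

In this sketch the first pull through the filter alone makes the source produce 10 rows, while the same pull through the sort makes it produce all 100, which mirrors why the first hasNext() or nextRow() call can be so much more expensive than later ones.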

Hope this helps.