9 Replies Latest reply on Sep 23, 2014 7:59 AM by bes82

    Additional questions regarding indexing

    bes82

      While starting to work with indexes a few more questions have popped up:

       

      I'm using a lot of queries that you normally rather expect in relational dbs combined with a bit of jcr hierarchy, for example.

       

      SELECT A.* from [nt:a] AS A

      JOIN [nt:b] AS B  ON B.[jcr:uuid]=A.refId

      JOIN [nt:c] AS C  ON C.[jcr:uuid]=B.refId

      JOIN [nt:d] AS D  ON D.[jcr:uuid]=A.refId2 AND ISDESCENDANTNODE(D,A)

      WHERE C.sysName = "X"

      AND D.sysName = "Y"

       

      What I do first is to index sysName on C,D, refId on A,B and refId2 on A;

       

      What I'm wondering is, if this is enough to filter nodes completely from the Index?

      Is the Index functionality intelligent enough to use all available indexes and intersect the affected nodesets just from the indexes?

       

      Additionally I read something about jrc:uuid and jcr:path being indexed automatically, but the actual index files on disk don't seem to contain any information about that,

      So do I have to add an index for jcr:uuid for B,C and D? Do I also have to add a jcr:path index for D and A or how is ISCHILDNODE using the index?

       

      Additionall do I have to add a global Index for jcr:primaryType in order to make joining work faster?

       

      My goal is that nothing has to be read from the repo until the affected nodes have been determined. I can't understand how that's going to work until I manually add indexes for path and id?

       

      Could you please enlighten me ?

        • 1. Re: Additional questions regarding indexing
          rhauch

          The join processor does not use indexes; rather each access query within the query is eligible for using an index. Look at the query plan for details. The query you list above will have 4 access queries, but most are "dependent joins" that require the inputs from other access queries or joins. Again, the query plan will show these dependencies.

           

          Generally, you don't need to have an index for 'jcr:uuid' (or 'mode:id'), since ModeShape is already optimized for looking nodes up by identifier. Also, there are implicit indexes for cases where the constraint includes a literal value in the path, child node constraint, or descendant node constraint. However, when non-literal constraints are used (e.g., join constraints), then you might want to use an index on 'jcr:path', for example.

          1 of 1 people found this helpful
          • 2. Re: Additional questions regarding indexing
            bes82

            Thanks for the explanation.

             

            I'm not sure if I completely understand what the query plan is telling me. It shows dependent queries and their used undexs (I guess) But does it tell me when and why nodes have to be fetched?

             

            For instance SELECT nt:a as A JOIN nt:b as B on B.property=A.property WHERE b.anotherProperty ='value';

            Indexes on anotherProperty  and property.

             

            The plan tells my that first B is fetched using the index on 'anotherProperty' so this index delivers id's of B nodes. Then A is fetched using the index on 'property'.

             

            What it doesn't (seem to) tell me is, if node B first has to be loaded from the repo to know Bs value of 'property' or if the value of Bs property 'property' will be obtained from the index (where it could also be found).

             

            Is there a way the plan can tell me if / when / why a node has to be fetched during query execution?

            • 3. Re: Additional questions regarding indexing
              rhauch

              What it doesn't (seem to) tell me is, if node B first has to be loaded from the repo to know Bs value of 'property' or if the value of Bs property 'property' will be obtained from the index (where it could also be found).

               

              Is there a way the plan can tell me if / when / why a node has to be fetched during query execution?

              Remember that each of these nodes is a relational operation that is operating upon a sequence of "rows" (or tuple), where each row/tuple contain one or more nodes. Well, strictly speaking, these tuples contain node keys, but to do any kind of processing or filtering on properties or child references, the node for those keys have to be obtained.

               

              Many of the operations take a single incoming sequence as input and applies its logic to it. PROJECT operations will define the set of "columns" that are needed above it. SELECT operations filter the tuples and emit only those that satisfy the criteria. SORT operations will reorder the sequence of node keys to satisfy the ordering constraints; these can be expensive since the whole sequence has to be known before they all can be ordered, but they are not memory intensive since the tuples are written off-heap. DUP_REMOVAL operations filter the tuples so that any tuple (e.g,. a combination of node keys in that row) is only returned once. LIMIT operations allow only a specified number of tuples throw, possibly ignoring the first operations if an offset is specified.

               

              Other operations take two incoming sequences as input and combines them according to their logic. For joins, the tuples of the output will always have a width that is the sum of the left and right side. A JOIN operation will use the join criteria to figure out which tuples from the left sequence are associated with which tuples on the right sequence, and it will return the "combined" tuples that have keys from the left and from the right if there is a match, or if there is no match then it will include the keys from the left or from the right and null values for side for which no match was found. A DEP_JOIN operation is a special kind of join where one side needs to be executed and its results passed down as criteria on the other side.

               

              A SET_OPERATION takes a left and a right sequence of tuples that are structurally identical, and will find the intersection, union, and or difference between these sequences. Unlike joins, set operations never combine tuples but always simple compare them and filter them.

               

              A SOURCE operation is responsible for finding all node keys of some type, and the tuples always are of size 1. If an INDEX operation with 'USED=true' appears below the SOURCE, then that index is used to return which node (keys) should be returned in the sequence because those nodes satisfy the prescribed criteria. If no INDEX node is present, the SOURCE operation will find the sequence of node keys by scanning the whole workspace to find those nodes that satisfy the criteria. Scanning is always expensive, so you always want an INDEX operation to appear on all branches.

               

              With the exception of INDEX, all operations require the node to be accessible, since the node is where all property values are obtained in order for the operation to perform its work. For example, a SORT operation is always ordering based upon the values of one or more property values, and thus the node must be materialized into memory before the property values can be returned.

               

              The INDEX operation is the sole operation that does not need the node to be materialized. In fact, the whole purpose of the index is to "know" which node keys satisfy the constraints passed down to the index, and these constraints are always listed on each INDEX operation. This is why indexes make querying so much faster, and it's why when multiple INDEX operations are possible ModeShape will always return the one that returns the fewest keys (lowest cardinality) with the lowest cost. If two indexes have the same cost and "know" about the same number of nodes, then the most efficient index will always be the one with the lower selectivity, since it will always return fewer node keys in its output sequence.

               

              For example, let's say your repository has 100k nodes of type 'policy' (that subtypes 'mix:lastModified'), but you're trying to find all automotive policies that have been modified within the last week. Let's assume that you have two indexes: one on the modification timestamp, and the other on the kind of policy.

               

              If 40% of your policies are automotive policies, then the index on the policy type would have a selectivity of 0.40 for a constraint 'type=automotive'. That means that for this query, the INDEX operation will return a sequence of 40k policy node keys, and all 40k nodes would have to be materialized in order for the additional criteria to be applied.

               

              On the other hand, let's say that only 0.5% of all policies are typically modified during any given week. Then the index on the modification timestamp for policy nodes for would, for that criteria, have a selectivity of 0.005 (returning 0.5% of all the nodes the index knows about). That index still returns a sequence of keys for 500 nodes, and all 500 nodes have to be materialized in order for additional criteria to be applied or for any joins to be used.

               

              In this case, it is far better for the last-modified index to be used, because materializing 500 nodes is far better than materializing 40k nodes. So if our assumed statistics hold true, then of those 500 nodes that have been modified during the last week then roughly 40% (or 200) of them will also be automotive policies, and the query would ultimately return roughly 200 nodes.

               

              The INDEX operation in the query plan lists the selectivity, the cardinality, and the total number of nodes known to the index, because these are critically important to understanding how efficient the query is and how much work (including materialization) has to be performed.

              • 4. Re: Additional questions regarding indexing
                bes82

                I noticed something today I don't quite understand. During performance tuning of my app I stumbled upon this rather uncomplicated query.

                 

                SELECT BASE.* from [nt:someTypeThatHasAlsoMixReferencableMixin] WHERE [jcr:uuId] = 'literalvalue'

                 

                Now your first thought might be why not use getNodeByIdentifier: Because this is just a subquery.

                 

                Anyway: If I don't define indexes this query takes about 150ms. I tried to figure out if it takes more time when there is more content in the repo but that doesn't seem to be the case.

                 

                If I define an index on [jcr:uuid] then this Index is used and the query takes about 20ms.

                 

                That I don't understand. I thought search for Id is always fast without Index. Is it possible that "WHERE [jcr:uuId] = ''literalvalue" works totally different than getNodeByIdentifier And a "manual" index is always needed for this query?

                • 5. Re: Additional questions regarding indexing
                  rhauch

                  Have you tried that query on its own? There should be an implicit index that handles this constraint. Perhaps it's not working in a subquery or in a set criteria?? If either of those doesn't work, please log an issue.

                  • 6. Re: Additional questions regarding indexing
                    bes82

                    Yes I have extracted the query, no index showed up in the plan.

                     

                    What I mentioned in the previous post is not 100% accurate, the search value is not a literal, it's:

                     

                    SELECT BASE.* from [nt:someTypeThatHasAlsoMixReferencableMixin] AS BASE WHERE [jcr:uuId] = $id

                     

                    And then query.bindValue. Not sure if that changes anything though.

                     

                     

                    But I'm not sure what exactly to report into an issue. Should there be some info about an implicit index in the plan?

                     

                    Because I've never seen anything else than my own indexes showing up in the plan.

                    • 7. Re: Additional questions regarding indexing
                      rhauch

                      Please log an issue about implicit indexes not working for this query. The query plan should include a built-in index when criteria like this is used, and it will look very similar to the other INDEX operations in the plan. (There is no real index; we know how to quickly find a node given its ID, so we have a special internal "index" implementation that uses the normal storage.)

                      • 8. Re: Additional questions regarding indexing
                        bes82

                        Thanks for the clarification. I'll double check it on Monday and report an issue.

                        • 9. Re: Additional questions regarding indexing
                          bes82

                          Edit, I figured out most if the answers myself.

                           

                          Please see also https://issues.jboss.org/browse/MODE-2316. I reported a feature Request there.

                           

                          In short:

                          It would be nice if reindexing all nodes bacause of a new index is a synchronous process and the call to registerIndex blocks.

                          It would be even nicer if One could tell the INdexManager that no reindexing is needed (because no nodes exist that would have to be indexed)