7 Replies Latest reply on Sep 25, 2013 12:15 PM by Yury Brigadirenko

    Fulltext search in node properties and child nodes

    Yury Brigadirenko Newbie

      Hi,

      I'm trying to do a fulltext search in JCR SQL2.The query should return all nodes which have at least one property containing the search string or they have child nodes with properties containing the same string.

      Query:

      SELECT *
      FROM [test:Parent] as p 
      LEFT OUTER JOIN [test:Child] as c ON ISCHILDNODE(c, p) 
      WHERE 
       contains(p.*, 'name2') or contains(c.*, 'name2')
      

       

      Repository data:

      /p1
      /p1/jcr:primaryType = test:Parent
      /p1/test:parentName = name1
      /p1/c1
      /p1/c1/jcr:primaryType = test:Child
      /p1/c1/test:childName = name2
      
      /p2
      /p2/jcr:primaryType = test:Parent
      /p2/test:parentName = name2
      
      /p3
      /p3/jcr:primaryType = test:Parent
      /p3/test:parentName = name3
      

       

      Expected result:

      /p1 with c1/ (child node contains string 'name2')

      /p2 (this node contains string  'name2')

       

      Actual result:

      /p2 {jcr:primaryType=test:Parent, test:parentName=name2}

      /p1 {jcr:primaryType=test:Parent, test:parentName=name1}

      /p3 {jcr:primaryType=test:Parent, test:parentName=name3}

       

       

      Another Query (nodes does not contains search text):

      SELECT *
      FROM [test:Parent] as p 
      LEFT OUTER JOIN [test:Child] as c ON ISCHILDNODE(c, p) 
      WHERE 
       contains(p.*, 'iddqd') or contains(c.*, 'iddqd')
      

       

      Expected result:

      nothing

       

      Actual result:

      /p2 {jcr:primaryType=test:Parent, test:parentName=name2}

      /p1 {jcr:primaryType=test:Parent, test:parentName=name1}

      /p3 {jcr:primaryType=test:Parent, test:parentName=name3}

       

      Is actual results correct?

       

      more tests on this problem: https://github.com/brig/modeshape-test

       

      I think the problem is here: https://github.com/ModeShape/modeshape/blob/master/modeshape-jcr/src/main/java/org/modeshape/jcr/query/process/SelectCom…

        • 1. Re: Fulltext search in node properties and child nodes
          Horia Chiorean Master

          Can you please open a JIRA for this issue, thanks.

          • 3. Re: Fulltext search in node properties and child nodes
            Randall Hauch Master

            Full text search is tricky for a couple of reasons.

             

            First, ModeShape (via Lucene) tokenizes all of the property values and names of nodes as they are indexed, and each token is further stemmed. The same process happens to the literal terms or phrases in a full-text search criteria (e.g., "CONTAINS"). Stemming changes a word down to its "root" form, and is definitely language-specific. For example, in English stemming will (among other things) remove extremely common words that would result in useless indexes (e.g., "a", "the", "an") and will change plural words into their singluar form (e.g., "cars" to "car") and past tense into present (e.g., "drove" into "drive", or "played" into "play"). The criteria you're using is nonsense from an English perspective, and thus stemming is going to fail pretty spectacularly.

             

            The second thing to consider is that full text searches are not really "positive" or "negative", but rather on a spectrum. This is why every query result includes a score, and why ModeShape has introduced a "jcr:score" pseudocolumn. Unfortunately, the scores are not absolute, so there's no quantitative value for a "good" or "bad" match. Please check your scores - my suspicion is that the scores are actually pretty low but non-zero. I believe they are non-zero because the criteria you used were not English words, and thus they couldn't be stemmed and thus Lucene tries to do "something" with them. Thus, the nodes are simply those that are closest to your criteria, and that if you were to look at the score you'd actually see a pretty low score, meaning they were not very close at all.

             

            Thirdly, remember that full-text searching is very different than LIKE, which is pattern matching within a query that does have certainty: something will either completely satisfy or completely not satisfy the pattern. If you want to find nodes with properties that match some pattern (e.g., "%ington") then be sure to use LIKE. Full-text searching is very different.

             

            I would suggest a couple of things:

             

            1. Look at the scores of good matches and of some bad matches. If your scores fall into some nice range, you can easily impose some limit on the score by using something similar to "[jcr:score] > 0.1".
            2. Rather than test with criteria that contains terms that do not appear in English (and are not acronyms or made up words), use terms that are real words. If you want to test failing to find results, use words that you know do not appear in your nodes.
            3. Use LIKE if that really is what you're doing.
            • 5. Re: Fulltext search in node properties and child nodes
              Randall Hauch Master

              Thanks for the great information! We'll try to get the bugs fixed as soon as possible.

              • 6. Re: Re: Fulltext search in node properties and child nodes
                Randall Hauch Master

                Just a note about the query. As noted in MODE-2054, the proper query is actually a UNION query. This is because putting OR criteria in a WHERE clause of a OUTER JOIN is generally bad practice and will give generally unexpected results, even in SQL. (The correct approach in SQL is to put such criteria into the JOIN criteria.)

                 

                The equivalent UNION query is something like:

                 

                SELECT p.[jcr:path] FROM [test:parent] AS p
                WHERE CONTAINS(p.*, 'name2')
                UNION
                SELECT p.[jcr:path] FROM [test:parent] AS p JOIN [test:Child] AS c ON ISCHILDNODE(c,p)
                WHERE CONTAINS(c.*, 'name2')
                

                 

                Note that you cannot use "p.*" in the SELECT clauses, since each query in the UNION would return a result set with a different structure. A set operation (e.g., UNION, INTERSECT, and EXCEPT) all require the result set from each query to be the same, thus you need to be explicit in the SELECT clauses.

                 

                Unfortunately, the above UNION has a bug (MODE-2057) because one side of the UNION involves a JOIN while the other does not. However, this has already been fixed in the 'master' branch and will be available in ModeShape 3.6.

                 

                Because the actual problem in MODE-2054 has to do with the ORed CONTAINS criteria not properly being applied (again, only during an OUTER JOIN), this will be very difficult to fix with the query engine as it is in 3.x. However, in 4.0 we are overhauling the query engine and indexing functionality to vastly improve the capability and options, and a side effect is that it will be much easier to apply such CONTAINS criteria above the OUTER JOIN rather than being pushed down to the indexes. Therefore, I've retargeted MODE-2054 to be fixed in 4.0.