19 Replies Latest reply on Feb 27, 2013 8:04 AM by rhauch

    Full text search and text extractors

    mmatloka

      Hi,

      I'm looking for information about text extractors and full-text search in ModeShape 3.0. Unfortunately, the documentation pages are empty.

        • 1. Re: Full text search and text extractors
          rhauch

          I've updated the following pages in our documentation:

           

           

          If they aren't sufficient, please let us know.

          • 2. Re: Full text search and text extractors
            mmatloka

            Hi,

             

            To use this extractor, simply make sure the modeshape-extractor-tika JAR and the appropriate required Tika JARs are on the classpath

             

            I'm running ModeShape as a submodule of JBoss AS. Should the JARs then be attached as submodules? And what does the configuration for JBoss AS look like?

            • 3. Re: Full text search and text extractors
              rhauch

              I'm running ModeShape as a submodule of JBoss AS. Should the JARs then be attached as submodules? And what does the configuration for JBoss AS look like?

              If you unzipped the ModeShape kit for AS7, then ModeShape is installed as a subsystem in AS7 and the ModeShape modules (with JARs) will be installed correctly into the AS7 installation. Our documentation contains an entire section on how to install, configure, and use ModeShape inside AS7.

              • 4. Re: Full text search and text extractors
                mmatloka

                Ah, sorry, I hadn't noticed that it is already there. An example is also in standalone-modeshape.xml.

                • 5. Re: Full text search and text extractors
                  mmatloka

                  Is it possible to perform a full-text search on an object and on all objects referenced by it in a multi-valued REFERENCE property (e.g. referenced files and their content)? I'm trying to do it in JCR-QOM, but I don't really know how to formulate the join condition for such a case.

                  • 6. Re: Full text search and text extractors
                    rhauch

                    Is it possible to perform a full-text search on an object and on all objects referenced by it in a multi-valued REFERENCE property (e.g. referenced files and their content)? I'm trying to do it in JCR-QOM, but I don't really know how to formulate the join condition for such a case.

                     

                    You can apply full-text search criteria in any query, so your question boils down to how to find a node and all nodes that contain a REFERENCE to it. I can think of only one way to do this with a single standard JCR-SQL2 (and JCR-QOM) query, and that's to use an equi-join on the REFERENCE property. Unfortunately, this requires that you know the REFERENCE property (or properties). To make it easier to explain, I'm going to assume that the node type of the nodes containing the REFERENCE property "ref" is called "a" and the node type of the referenced nodes is "b". (The nodes can even be of the same node type, but that just makes it harder to explain the query.) Then the query would look something like this:

                     

                    SELECT * FROM a JOIN b ON a.[ref] = b.[jcr:uuid] 
                    WHERE CONTAINS(a,'hello world') AND CONTAINS(b,'hello world')
                    

                     

                    Because I used an 'AND' here, a row is returned only where the 'a' node references the 'b' node and the nodes on both sides of the join satisfy the full-text search. If you use an 'OR' instead, then the results will include 'a' and 'b' nodes where at least one of 'a' or 'b' (or both) satisfies the full-text search and the 'a' node references that 'b' node.
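
To make the AND/OR difference concrete, here is a minimal in-memory sketch in plain Java. It is only a toy model of the join above, not ModeShape code: the `Node` class, the substring-based `contains` helper, and the sample data are all hypothetical, and it assumes that a multi-valued `ref` satisfies the join condition when any one of its values equals the uuid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy in-memory model of the join query; none of this is ModeShape API. */
final class JoinSemanticsSketch {

    /** Hypothetical stand-in for a node: an id (its uuid), some text, and zero or more references. */
    static final class Node {
        final String id;
        final String text;
        final List<String> refs;

        Node(String id, String text, String... refs) {
            this.id = id;
            this.text = text;
            this.refs = Arrays.asList(refs);
        }
    }

    /** Crude stand-in for CONTAINS(): a case-insensitive substring match. */
    static boolean contains(Node node, String phrase) {
        return node.text.toLowerCase().contains(phrase.toLowerCase());
    }

    /** Inner equi-join on a.ref = b.id, filtered with AND or OR; rows are rendered as "a->b". */
    static List<String> join(List<Node> as, List<Node> bs, String phrase, boolean useAnd) {
        List<String> rows = new ArrayList<>();
        for (Node a : as) {
            for (Node b : bs) {
                if (!a.refs.contains(b.id)) {
                    continue; // join condition not met: 'a' does not reference 'b'
                }
                boolean aMatches = contains(a, phrase);
                boolean bMatches = contains(b, phrase);
                if (useAnd ? (aMatches && bMatches) : (aMatches || bMatches)) {
                    rows.add(a.id + "->" + b.id);
                }
            }
        }
        return rows;
    }

    /** Made-up 'a' nodes: a1 matches, a2 does not, a3 matches but references nothing. */
    static List<Node> sampleA() {
        return Arrays.asList(
            new Node("a1", "hello world report", "b1"),
            new Node("a2", "unrelated text", "b1", "b2"),
            new Node("a3", "hello world, no references"));
    }

    /** Made-up 'b' nodes: only b1 matches the phrase. */
    static List<Node> sampleB() {
        return Arrays.asList(
            new Node("b1", "says hello world"),
            new Node("b2", "nothing here"));
    }

    public static void main(String[] args) {
        System.out.println("AND: " + join(sampleA(), sampleB(), "hello world", true));
        System.out.println("OR:  " + join(sampleA(), sampleB(), "hello world", false));
    }
}
```

With this sample data the AND variant yields only the a1/b1 pair, while the OR variant also yields a2/b1 because b1 matches. In both variants a3 is absent: an 'a' node with no references never satisfies an inner join condition, even when its own text matches.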

                     

                    ModeShape's extensions to JCR-SQL2 and JCR-QOM offer an alternative way to do this: use a subquery to first find all of the 'b' nodes that satisfy the full-text search, and then find all the 'a' nodes that reference one of those 'b' nodes and also satisfy the full-text search:

                     

                    SELECT a.* FROM a 
                    WHERE CONTAINS(a,'hello world') 
                    AND a.[ref] IN ( SELECT b.[jcr:uuid] FROM b WHERE CONTAINS(b,'hello world') )
                    

                     

                    However, note that this query only returns the 'a' nodes (since the 'b' nodes are embedded within the subquery), so while it's not exactly equivalent to the JOIN query it may be more useful in some situations.
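
The two-step evaluation described for the subquery can be sketched the same way. Again this is a hypothetical in-memory model in plain Java rather than ModeShape code: `Node`, `contains`, and the sample data are invented for illustration, and the sketch assumes that `IN` on a multi-valued `ref` is satisfied when any one value is in the subquery result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy in-memory model of the subquery form; none of this is ModeShape API. */
final class SubquerySketch {

    /** Hypothetical stand-in for a node: an id (its uuid), some text, and zero or more references. */
    static final class Node {
        final String id;
        final String text;
        final List<String> refs;

        Node(String id, String text, String... refs) {
            this.id = id;
            this.text = text;
            this.refs = Arrays.asList(refs);
        }
    }

    /** Crude stand-in for CONTAINS(): a case-insensitive substring match. */
    static boolean contains(Node node, String phrase) {
        return node.text.toLowerCase().contains(phrase.toLowerCase());
    }

    /** Step 1: collect ids of matching 'b' nodes. Step 2: keep matching 'a' nodes that reference one. */
    static List<String> query(List<Node> as, List<Node> bs, String phrase) {
        Set<String> matchingB = new HashSet<>();
        for (Node b : bs) {
            if (contains(b, phrase)) {
                matchingB.add(b.id);
            }
        }
        List<String> result = new ArrayList<>();
        for (Node a : as) {
            if (!contains(a, phrase)) {
                continue; // 'a' itself must satisfy the search
            }
            for (String ref : a.refs) {
                if (matchingB.contains(ref)) {
                    result.add(a.id); // add each 'a' once, however many 'b' nodes it hits
                    break;
                }
            }
        }
        return result;
    }

    /** Made-up 'a' nodes: a1 references two matching 'b' nodes, a2 a non-matching one, a3 does not match. */
    static List<Node> sampleA() {
        return Arrays.asList(
            new Node("a1", "hello world summary", "b1", "b2"),
            new Node("a2", "hello world draft", "b3"),
            new Node("a3", "other text", "b1"));
    }

    /** Made-up 'b' nodes: b1 and b2 match the phrase, b3 does not. */
    static List<Node> sampleB() {
        return Arrays.asList(
            new Node("b1", "hello world"),
            new Node("b2", "hello world again"),
            new Node("b3", "nothing"));
    }

    public static void main(String[] args) {
        System.out.println(query(sampleA(), sampleB(), "hello world"));
    }
}
```

In this model a1 appears once even though it references two matching 'b' nodes, whereas the equivalent join would produce one row per matching pair. That is one way the two forms differ beyond which columns they return.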

                     

                    I hope this helps. These queries might not exactly match what you're trying to do, but hopefully they give you an idea of how to achieve what you're looking for.

                     

                    Best regards

                    • 7. Re: Full text search and text extractors
                      mmatloka

                      In my case an 'a' object may or may not contain references in this multi-valued field, so I'm looking at OR-based queries. However, I have some problems with joins.

                       

                       

                      These queries do not return any nodes:

                      SELECT a.* FROM [a] JOIN [b] ON a.[ref] = b.[jcr:uuid] WHERE CONTAINS(a.*,'hello world') OR CONTAINS(b.*,'hello world')

                      SELECT a.* FROM [a] LEFT OUTER JOIN [b] ON a.[ref] = b.[jcr:uuid] WHERE CONTAINS(a.*,'hello world') OR CONTAINS(b.*,'hello world')

                      SELECT a.* FROM [a] RIGHT OUTER JOIN [b] ON a.[ref] = b.[jcr:uuid] WHERE CONTAINS(a.*,'hello world') OR CONTAINS(b.*,'hello world')

                       

                       

                      This returns too many nodes:

                      SELECT a.* FROM a FULL OUTER JOIN b ON a.[ref] = b.[jcr:uuid] WHERE CONTAINS(a.*,'hello world') OR CONTAINS(b.*,'hello world')

                       

                       

                      A simple SELECT * FROM a WHERE CONTAINS(a.*,'hello world') returns the valid nodes.

                      • 8. Re: Full text search and text extractors
                        rhauch

                        Without a runnable test case, it's hard for us to help identify the exact query you want to use. So I'd suggest that you reduce your queries to make sure that the various parts of the query work. For example, you stated that "SELECT * FROM a WHERE CONTAINS(a.*,'hello world')" returns valid nodes, so that's good. Can you get the join to work (returning more rows than you ideally want) without the CONTAINS clause? If not, then you need to focus on the join part of the query.

                         

                        But the node types are likely very important, too. What node types are you using?

                        • 9. Re: Full text search and text extractors
                          mmatloka

                          Every node type used is my own custom type. Approximately, I have the following relation between those nodes:

                           

                          [a]

                          - ref (Reference) multiple

                           

                          [b] > mix:referenceable

                           

                          I have checked these queries omitting 'contains'. In theory, when I have only objects of [a] that do not reference any [b], a LEFT OUTER JOIN b should return all objects of [a], but it returns 0 elements. When I have b elements referenced by a, the LEFT OUTER JOIN properly returns all of the objects.

                           

                          Am I right?

                          • 10. Re: Full text search and text extractors
                            rhauch

                            A LEFT OUTER JOIN should indeed return all rows/nodes on the left side of the join even when the join criteria resulted in a null row/result on the right. I wonder if the 'multiple' aspect of the REFERENCE property is messing up the OUTER JOIN behavior.

                            • 11. Re: Full text search and text extractors
                              mmatloka

                              I've checked with another property of type REFERENCE, not 'multiple', and with similar join queries. I received exactly the same results as mentioned before:

                               

                               

                              I have checked these queries omitting 'contains'. In theory, when I have only objects of [a] that do not reference any [b], a LEFT OUTER JOIN b should return all objects of [a], but it returns 0 elements. When I have b elements referenced by a, the LEFT OUTER JOIN properly returns all of the objects.

                               

                              • 12. Re: Full text search and text extractors
                                mmatloka

                                Type A

                                property name

                                property reference 'ref'

                                 

                                Type B

                                property name

                                property reference 'ref'

                                 

                                values in repository


                                type a, name 'something', ref = empty (null)

                                 

                                query:

                                 

                                SELECT a.* FROM [a] LEFT OUTER JOIN [b] ON a.[ref] = b.[jcr:uuid]

                                 

                                 

                                Result of query = empty

                                 

                                In code:

                                In NestedLoopJoinComponent, under case LEFT_OUTER:

                                 

                                Object leftValue = leftSelector.evaluate(leftTuple);
                                if (leftValue == null) {
                                    continue;
                                }

                                 

                                leftValue is null because my ref is empty, so the continue is executed and nothing is added to the results.

                                 

                                The question is, what should the result of this query be? I would like to get this element of type a back from this type of join.

                                Randall Hauch wrote:

                                 

                                A LEFT OUTER JOIN should indeed return all rows/nodes on the left side of the join even when the join criteria resulted in a null row/result on the right. I wonder if the 'multiple' aspect of the REFERENCE property is messing up the OUTER JOIN behavior.

                                  If you confirm it is a bug, I will create a JIRA issue.

                                • 13. Re: Full text search and text extractors
                                  rhauch

                                  I think it is a bug. The behavior of null values is challenging in SQL (since a NULL never matches any other values, including NULL itself), and the JCR 2 specification doesn't seem to explicitly define how null values are to be treated within a join condition. However, I think in the case of a LEFT OUTER JOIN and a null left-tuple value, the left-tuple should just be included automatically with nulls for the right tuple.

                                   

                                  Please log a bug.
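
The difference between the reported behavior and the behavior proposed in this reply can be sketched with a small nested-loop join in plain Java. This is only an illustration, not the actual NestedLoopJoinComponent: the method, the `skipNullLeft` flag, and the sample rows are hypothetical, with the flag modeling the `continue` shortcut quoted earlier in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy nested-loop LEFT OUTER JOIN on left.ref = right.uuid; not ModeShape code. */
final class LeftOuterJoinSketch {

    /**
     * Each left row is {name, ref}; the right side is a list of uuids.
     * When skipNullLeft is true, rows with a null ref are dropped (the reported behavior).
     * When false, they are kept and paired with NULL (the behavior proposed in the thread).
     */
    static List<String> leftOuterJoin(List<String[]> left, List<String> rightIds, boolean skipNullLeft) {
        List<String> rows = new ArrayList<>();
        for (String[] leftRow : left) {
            String name = leftRow[0];
            String ref = leftRow[1];
            if (ref == null && skipNullLeft) {
                continue; // shortcut on a NULL join value: the left row is lost
            }
            boolean matched = false;
            if (ref != null) { // a NULL never equals anything, so only non-null refs are compared
                for (String uuid : rightIds) {
                    if (ref.equals(uuid)) {
                        rows.add(name + " -> " + uuid);
                        matched = true;
                    }
                }
            }
            if (!matched) {
                rows.add(name + " -> NULL"); // outer join: keep the left row with a null right side
            }
        }
        return rows;
    }

    /** Made-up left rows: one with a null ref, one with a matching ref, one with a dangling ref. */
    static List<String[]> sampleLeft() {
        return Arrays.asList(
            new String[] {"something", null},
            new String[] {"linked", "u1"},
            new String[] {"dangling", "u9"});
    }

    public static void main(String[] args) {
        List<String> right = Arrays.asList("u1");
        System.out.println("skip nulls: " + leftOuterJoin(sampleLeft(), right, true));
        System.out.println("keep nulls: " + leftOuterJoin(sampleLeft(), right, false));
    }
}
```

With the shortcut enabled, the 'something' row disappears from the result, which mirrors the empty result reported for the single 'a' node with an empty ref. With it disabled, every left row appears at least once, which is what a LEFT OUTER JOIN is normally expected to guarantee.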

                                  • 14. Re: Full text search and text extractors
                                    mmatloka

                                    I have noticed that there is a code comment in this class:

                                     

                                     

                                    // Note that in SQL joins, a NULL value on one side of the join criteria is not considered equal to
                                    // a NULL value on the other side. Therefore, in the following algorithms, we're shortcutting the
                                    // loops as soon as we get any NULL value for the join criteria.
                                    // see http://en.wikipedia.org/wiki/Join_(SQL)#Inner_join

                                     
