8 Replies Latest reply on Mar 24, 2015 5:52 PM by brmeyer

    Full-text search of binary content when querying against a mixin type

    brmeyer

      Hey guys, apologies if I'm missing something obvious...

       

      I'm trying to implement full-text search for Artificer, which includes binary values (filesystem storage).  Most of our queries use the abstract mixin (sramp:baseArtifactType) that's the base of all our other mixins.  That mixin could be on an nt:hierarchyNode (if it's logical metadata only) or on an nt:file with an nt:resource child (artifact with content).  So, the query looks something like:

       

      SELECT artifact1.* FROM [sramp:baseArtifactType] AS artifact1 WHERE (CONTAINS(artifact1,'%foo%') AND ISDESCENDANTNODE([sramp:baseArtifactType],'/s-ramp'))
      

       

      I have the Tika text extractor set up, but that query is failing to return artifacts whose content should be a match.

       

      1. Is the query expected to fail since I'm using a mixin as the root selector?  Would I need to switch things around to select "nt:hierarchyNode" instead, or should the mixin work?
      2. I have "textExtraction" defined in the JSON config, using the "tika" selector as the classname.  modeshape-extractor-tika is a dependency in the POM.  Are there any other steps necessary?  Am I correct that Tika provides a handful of extractors out of the box, or do I need to explicitly provide all of them I'd need?  In this case, I'm testing simple XML files.

       

      Thanks for any help available!

        • 1. Re: Full-text search of binary content when querying against a mixin type
          hchiorean
          1. Is the query expected to fail since I'm using a mixin as the root selector?  Would I need to switch things around to select "nt:hierarchyNode" instead, or should the mixin work?

          No, you can use the mixin as the root selector, but your FTS criteria isn't correct: the selector should either be node.* or propertyName or node.propertyName (see JCR-SQL2 - ModeShape 4 - Project Documentation Editor). Also % will be treated as-is - an exact match - since this is not a LIKE query. One way to write your selector is: CONTAINS(artifact1.*, 'foo')
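
For illustration, a minimal sketch of that corrected criterion as plain Java string building, with no JCR dependency. `FtsCriteria`, `quote` and `containsAny` are hypothetical names, not part of ModeShape or the JCR API, and doubling single quotes inside the literal is an assumption about JCR-SQL2 literal syntax that should be checked against the spec.

```java
// Hypothetical helper, not part of ModeShape or JCR: builds a CONTAINS
// criterion over all properties of a selector.
final class FtsCriteria {
    private FtsCriteria() {}

    /** Wraps a term in single quotes, doubling embedded quotes (assumed JCR-SQL2 escaping). */
    static String quote(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    /** Returns e.g. CONTAINS(artifact1.*, 'foo'); the term is passed as-is, without % wildcards. */
    static String containsAny(String selector, String term) {
        return "CONTAINS(" + selector + ".*, " + quote(term) + ")";
    }

    public static void main(String[] args) {
        System.out.println(containsAny("artifact1", "foo"));
    }
}
```

The term is deliberately not wrapped in `%`, since the reply above notes that `%` is matched literally in a full-text criterion.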

          2. I have "textExtraction" defined in the JSON config, using the "tika" selector as the classname.  modeshape-extractor-tika is a dependency in the POM.  Are there any other steps necessary?  Am I correct that Tika provides a handful of extractors out of the box, or do I need to explicitly provide all of them I'd need?  In this case, I'm testing simple XML files.

          Dependency & configuration wise this should be enough unless your application has other custom exclusions which would prevent the tika-parsers & tika-core JARs from being available in your CP at runtime. IIRC Tika has an XML extractor bundled among the default parsers, so text should be extracted. If the query still won't return any results, you should try looking at DEBUG logs for org.modeshape.jcr and org.modeshape.extractor.tika for any possible indication as to why this is happening.

          • 2. Re: Full-text search of binary content when querying against a mixin type
            brmeyer

            Thanks, Horia.

            No, you can use the mixin as the root selector, but your FTS criteria isn't correct: the selector should either be node.* or propertyName or node.propertyName (see JCR-SQL2 - ModeShape 4 - Project Documentation Editor). Also % will be treated as-is - an exact match - since this is not a LIKE query. One way to write your selector is: CONTAINS(artifact1.*, 'foo')

            CONTAINS(artifact1,'foo') appears to be working on the metadata properties, but not the file contents.  Should it?  Or should it be expected to fail without node.*, propertyName, or node.propertyName?


            Also, for the life of me, I cannot get QOMF#fullTextSearch to produce node.*


            factory.fullTextSearch("artifact1", null, factory.literal("foo")) --> CONTAINS(artifact1,'foo')

            factory.fullTextSearch("artifact1", "*", factory.literal("foo")) --> CONTAINS(artifact1.[*],'foo')

            factory.fullTextSearch("artifact1.*", null, factory.literal("foo")) --> CONTAINS([artifact1.*],'foo')


            If the propertyName is required, shouldn't QOMF default to .* if only the selector is given, much like how QOMF#column works?


            Dependency & configuration wise this should be enough unless your application has other custom exclusions which would prevent the tika-parsers & tika-core JARs from being available in your CP at runtime. IIRC Tika has an XML extractor bundled among the default parsers, so text should be extracted. If the query still won't return any results, you should try looking at DEBUG logs for org.modeshape.jcr and org.modeshape.extractor.tika for any possible indication as to why this is happening.

            I'm not seeing any errors, etc.  However, here's some relevant bits:

             

            11:22:03,596 DEBUG org.modeshape.extractor.tika is not a valid url

            11:22:04,125 DEBUG Initializing the Tika MIME type detectors

            11:22:04,125 DEBUG  - Found detector: org.gagravarr.tika.OggDetector

            11:22:04,126 DEBUG  - Found detector: org.apache.tika.parser.microsoft.POIFSContainerDetector

            11:22:04,126 DEBUG  - Found detector: org.apache.tika.parser.pkg.ZipContainerDetector

            11:22:04,126 DEBUG  - Found detector: org.apache.tika.mime.MimeTypes

             

            Is the "is not a valid url" of concern?  Note that this specific example is a unit test that's not running within Wildfly, so no Tika JBoss module is available.  However, all the transitive dependencies are on the CP, so that *shouldn't* be a problem (unless I'm missing something).

             

            Is it odd that only a small # of detectors are found?  And that no XML detectors are available?

             

            Thanks for the continued help!

            • 3. Re: Full-text search of binary content when querying against a mixin type
              hchiorean

              I have absolutely no idea how FTS is supposed to look in QOM format, but presumably you can find that information via the JCR spec.

               

              Tika's default XML extractor only extracts text from the value of XML elements; it completely ignores the structure & attributes: http://tika.apache.org/1.2/formats.html#XML_and_derived_formats
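
To make "element values only" concrete, here is a rough sketch that uses the JDK's DOM parser as a stand-in. It is not Tika and may differ from Tika's output in whitespace and other details; it only shows which parts of a document count as element text. `ElementTextDemo` and the sample XML are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Stand-in for illustration only: JDK DOM parsing, not Tika's XML parser.
final class ElementTextDemo {
    private ElementTextDemo() {}

    /** Returns the concatenated text of all elements; attribute values are not included. */
    static String elementText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // "Lawn Mower" is element text; "A17" only appears as an attribute value.
        String xml = "<order id=\"A17\"><item>Lawn Mower</item></order>";
        System.out.println(elementText(xml));
    }
}
```

Under this model a search for "Lawn Mower" could match the extracted text, while a search for the attribute value "A17" could not.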

              We have integration tests which upload files & query those files for extracted text using JCR-SQL2 and they all work fine: https://github.com/ModeShape/modeshape/blob/master/extractors/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java#L160. I also wrote a quick test for an XML file and provided you're only querying for the value of XML elements, it works fine.

               

              I don't think the log messages are relevant, since the DcXMLParser class should always be in classpath.

              • 4. Re: Full-text search of binary content when querying against a mixin type
                brmeyer

                Thanks, Horia!

                Tika's default XML extractor only extracts text from the value of XML elements

                Right, that's exactly what I'm searching for -- text values, not attributes.

                We have integration tests which upload files & query those files for extracted text using JCR-SQL2 and they all work fine

                No other config changes necessary?  You simply use the Tika extractor OOTB and it works?  No need to explicitly add an XML extractor, etc?

                • 6. Re: Full-text search of binary content when querying against a mixin type
                  brmeyer

                  The artifact metadata is on an nt:file node.  That node has an nt:resource child on path "jcr:content". The resource node sets "jcr:mimeType" as "application/xml".  Any other requirements involving the node structure or properties that might affect the extractor?

                   

                  Thanks for the continued help.

                  • 7. Re: Full-text search of binary content when querying against a mixin type
                    hchiorean

                    The only thing here that can influence the behavior is the MIME type: each extractor can be configured explicitly to accept or reject certain MIME types, i.e. to extract content only from files that have an accepted MIME type. If no such explicit configuration exists, the default excluded (not accepted) MIME types are the ones listed in: modeshape/TikaTextExtractor.java at master · ModeShape/modeshape · GitHub. Anything not found in that list will be considered "accepted" and Tika will attempt to perform extraction.
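
The accept/reject rule described above can be modelled roughly as follows. This is a sketch of the described behavior, not ModeShape's actual code: `MimeTypeFilter` is a hypothetical class, and the excluded value in `main` is a placeholder, not an entry from the real default list.

```java
import java.util.Set;

// Hypothetical model of the rule described above; not ModeShape's implementation.
final class MimeTypeFilter {
    private final Set<String> included; // empty means "no explicit accept list configured"
    private final Set<String> excluded;

    MimeTypeFilter(Set<String> included, Set<String> excluded) {
        this.included = included;
        this.excluded = excluded;
    }

    /** Excluded types are always rejected; otherwise accept unless an accept list exists and omits the type. */
    boolean accepts(String mimeType) {
        if (excluded.contains(mimeType)) {
            return false;
        }
        return included.isEmpty() || included.contains(mimeType);
    }

    public static void main(String[] args) {
        // Placeholder exclusion, purely illustrative.
        MimeTypeFilter defaults = new MimeTypeFilter(Set.of(), Set.of("application/x-example-excluded"));
        System.out.println(defaults.accepts("application/xml"));
        System.out.println(defaults.accepts("application/x-example-excluded"));
    }
}
```

With no explicit accept list, "application/xml" passes this model simply because it is not among the exclusions.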

                     

                    The only thing I can suggest is that you debug your local use case and take an in-depth look at what's going on.

                    • 8. Re: Full-text search of binary content when querying against a mixin type
                      brmeyer

                      Argh, sorry Horia, just realized I made a really stupid mistake.  The content child node wasn't joined, so the binary values were not included in the property list.

                       

                      So, in case it's helpful to others, if you have an nt:file node with an nt:resource child, in order to get the Tika extractors to kick in, the query must look something like:

                       

                      SELECT artifact1.* FROM [sramp:baseArtifactType] AS artifact1 LEFT OUTER JOIN [nt:resource] AS content1 ON ISCHILDNODE(content1,artifact1) WHERE CONTAINS(artifact1,'Lawn Mower') OR CONTAINS(content1,'Lawn Mower')
                      

                       

                      Prior to that, I simply had:

                       

                      SELECT artifact1.* FROM [sramp:baseArtifactType] AS artifact1 WHERE CONTAINS(artifact1,'Lawn Mower')
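
For anyone assembling the joined form in code, a minimal sketch of a builder that produces it for an arbitrary search term. `ArtifactContentQuery` is a hypothetical name, the selector aliases are taken from the queries in this thread, and doubling single quotes in the literal is an assumption about JCR-SQL2 escaping.

```java
// Hypothetical helper: builds the joined full-text query shown in this thread.
final class ArtifactContentQuery {
    private ArtifactContentQuery() {}

    /** Searches both the artifact node's properties and its nt:resource child. */
    static String build(String term) {
        // Assumed escaping: a single quote inside the literal is written as two.
        String literal = "'" + term.replace("'", "''") + "'";
        return "SELECT artifact1.* FROM [sramp:baseArtifactType] AS artifact1"
                + " LEFT OUTER JOIN [nt:resource] AS content1 ON ISCHILDNODE(content1,artifact1)"
                + " WHERE CONTAINS(artifact1," + literal + ") OR CONTAINS(content1," + literal + ")";
    }

    public static void main(String[] args) {
        System.out.println(build("Lawn Mower"));
    }
}
```

The LEFT OUTER JOIN keeps metadata-only artifacts (no content child) in the result set, so they can still match on the first CONTAINS.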