-
1. Re: Full-text search of binary content when querying against a mixin type
hchiorean Mar 16, 2015 4:33 AM (in response to brmeyer)
1. Is the query expected to fail since I'm using a mixin as the root selector? Would I need to switch things around to select "nt:hierarchyNode" instead, or should the mixin work?
No, you can use the mixin as the root selector, but your FTS criterion isn't correct: the selector should be either node.*, propertyName, or node.propertyName (see JCR-SQL2 - ModeShape 4 - Project Documentation Editor). Also, % will be treated as-is (an exact match), since this is not a LIKE query. One way to write your selector is: CONTAINS(artifact1.*, 'foo')
2. I have "textExtraction" defined in the JSON config, using the "tika" selector as the classname. modeshape-extractor-tika is a dependency in the POM. Are there any other steps necessary? Am I correct that Tika provides a handful of extractors out of the box, or do I need to explicitly provide all of them I'd need? In this case, I'm testing simple XML files.
Dependency & configuration wise this should be enough unless your application has other custom exclusions which would prevent the tika-parsers & tika-core JARs from being available in your CP at runtime. IIRC Tika has an XML extractor bundled among the default parsers, so text should be extracted. If the query still won't return any results, you should try looking at DEBUG logs for org.modeshape.jcr and org.modeshape.extractor.tika for any possible indication as to why this is happening.
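For anyone following along, here is a rough sketch of the CONTAINS forms mentioned above, written as a small string helper in plain Java. The class and method names are made up for illustration, and "jcr:title" is only an example property name. The bracket quoting and the doubled single quote follow JCR-SQL2 literal syntax as I understand it, so treat this as an assumption to verify rather than as ModeShape's own API.

```java
// Hypothetical helper: builds a JCR-SQL2 CONTAINS constraint as a string.
final class ContainsSketch {

    // JCR-SQL2 string literals escape a single quote by doubling it.
    static String quote(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    // property == null means "all properties of the selector" (selector.*).
    static String contains(String selector, String property, String term) {
        String target = (property == null)
                ? selector + ".*"
                : selector + ".[" + property + "]";
        return "CONTAINS(" + target + ", " + quote(term) + ")";
    }

    public static void main(String[] args) {
        System.out.println(contains("artifact1", null, "foo"));
        // '%' is passed through untouched; it is not a wildcard here.
        System.out.println(contains("artifact1", "jcr:title", "50%"));
    }
}
```

The resulting string would go into the WHERE clause of a JCR-SQL2 statement.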
-
2. Re: Full-text search of binary content when querying against a mixin type
brmeyer Mar 16, 2015 11:41 AM (in response to hchiorean)
Thanks, Horia.
No, you can use the mixin as the root selector, but your FTS criterion isn't correct: the selector should be either node.*, propertyName, or node.propertyName (see JCR-SQL2 - ModeShape 4 - Project Documentation Editor). Also, % will be treated as-is (an exact match), since this is not a LIKE query. One way to write your selector is: CONTAINS(artifact1.*, 'foo')
CONTAINS(artifact1,'foo') appears to be working on the metadata properties, but not the file contents. Should it? Or should it be expected to fail without node.*, propertyName, or node.propertyName?
Also, for the life of me, I cannot get QOMF#fullTextSearch to produce node.*
factory.fullTextSearch("artifact1", null, factory.literal("foo")) --> CONTAINS(artifact1,'foo')
factory.fullTextSearch("artifact1", "*", factory.literal("foo")) --> CONTAINS(artifact1.[*],'foo')
factory.fullTextSearch("artifact1.*", null, factory.literal("foo")) --> CONTAINS([artifact1.*],'foo')
If the propertyName is required, shouldn't QOMF default to .* if only the selector is given, much like how QOMF#column works?
Dependency & configuration wise this should be enough unless your application has other custom exclusions which would prevent the tika-parsers & tika-core JARs from being available in your CP at runtime. IIRC Tika has an XML extractor bundled among the default parsers, so text should be extracted. If the query still won't return any results, you should try looking at DEBUG logs for org.modeshape.jcr and org.modeshape.extractor.tika for any possible indication as to why this is happening.
I'm not seeing any errors, etc. However, here's some relevant bits:
11:22:03,596 DEBUG org.modeshape.extractor.tika is not a valid url
11:22:04,125 DEBUG Initializing the Tika MIME type detectors
11:22:04,125 DEBUG - Found detector: org.gagravarr.tika.OggDetector
11:22:04,126 DEBUG - Found detector: org.apache.tika.parser.microsoft.POIFSContainerDetector
11:22:04,126 DEBUG - Found detector: org.apache.tika.parser.pkg.ZipContainerDetector
11:22:04,126 DEBUG - Found detector: org.apache.tika.mime.MimeTypes
Is the "is not a valid url" of concern? Note that this specific example is a unit test that's not running within Wildfly, so no Tika JBoss module is available. However, all the transitive dependencies are on the CP, so that *shouldn't* be a problem (unless I'm missing something).
Is it odd that only a small # of detectors are found? And that no XML detectors are available?
Thanks for the continued help!
-
3. Re: Full-text search of binary content when querying against a mixin type
hchiorean Mar 17, 2015 4:51 AM (in response to brmeyer)
I have absolutely no idea how FTS is supposed to look in QOM format, but presumably you can find that information in the JCR spec.
Tika's default XML extractor only extracts text from the value of XML elements; it completely ignores the structure & attributes: http://tika.apache.org/1.2/formats.html#XML_and_derived_formats
We have integration tests which upload files & query those files for extracted text using JCR-SQL2 and they all work fine: https://github.com/ModeShape/modeshape/blob/master/extractors/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java#L160. I also wrote a quick test for an XML file and provided you're only querying for the value of XML elements, it works fine.
I don't think the log messages are relevant, since the DcXMLParser class should always be on the classpath.
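To make the "element values only" point concrete, here is a minimal sketch using the JDK's own SAX parser. This is not Tika's implementation, just an illustration of the same idea: character data between tags is collected and attributes are never read. The class name and the sample XML are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: collects element text and ignores attributes.
final class ElementTextSketch {

    static String elementText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Separate the text of adjacent elements with a space.
                text.append(' ');
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        // Collapse runs of whitespace so the result is easy to search.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String xml = "<order sku=\"LM-100\"><item>Lawn Mower</item></order>";
        System.out.println(elementText(xml));
    }
}
```

With input like this, a search for the element text would have something to match, while a search for the attribute value would not.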
-
4. Re: Full-text search of binary content when querying against a mixin type
brmeyer Mar 24, 2015 10:10 AM (in response to hchiorean)
Thanks, Horia!
Tika's default XML extractor only extracts text from the value of XML elements
Right, that's exactly what I'm searching for -- text values, not attributes.
We have integration tests which upload files & query those files for extracted text using JCR-SQL2 and they all work fine
No other config changes necessary? You simply use the Tika extractor OOTB and it works? No need to explicitly add an XML extractor, etc?
-
5. Re: Full-text search of binary content when querying against a mixin type
hchiorean Mar 24, 2015 10:30 AM (in response to brmeyer)
No special configuration is required. You can see the test here: https://github.com/ModeShape/modeshape/blob/master/extractors/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorRepositoryTest.java#L68
-
6. Re: Full-text search of binary content when querying against a mixin type
brmeyer Mar 24, 2015 10:52 AM (in response to brmeyer)
The artifact metadata is on an nt:file node. That node has an nt:resource child at path "jcr:content". The resource node sets "jcr:mimeType" to "application/xml". Are there any other requirements involving the node structure or properties that might affect the extractor?
Thanks for the continued help.
-
7. Re: Full-text search of binary content when querying against a mixin type
hchiorean Mar 24, 2015 11:19 AM (in response to brmeyer)
The only thing here that can influence the behavior is the MIME type: each extractor can be configured explicitly to accept or reject certain MIME types, i.e. to extract content only from files which have an accepted MIME type. If no such explicit configuration exists, the default excluded (not accepted) MIME types are listed in: modeshape/TikaTextExtractor.java at master · ModeShape/modeshape · GitHub. Anything not found in that list will be considered "accepted" and Tika will attempt to perform extraction.
The only thing I can suggest is that you debug your local use case and take an in-depth look at what's going on.
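The accept/reject rule described above can be sketched roughly as follows. This is not the ModeShape source: the class name is invented, the excluded entry in main is a placeholder rather than the real default list, and the precedence between the two sets (an excluded type always loses; an empty included set means "everything else is accepted") is my assumption.

```java
import java.util.Set;

// Hypothetical sketch of a MIME type filter for a text extractor.
final class MimeTypeFilterSketch {
    private final Set<String> included; // empty = no explicit include list
    private final Set<String> excluded;

    MimeTypeFilterSketch(Set<String> included, Set<String> excluded) {
        this.included = included;
        this.excluded = excluded;
    }

    boolean accepts(String mimeType) {
        if (excluded.contains(mimeType)) {
            return false; // explicitly rejected
        }
        // With no include list, anything not excluded is accepted.
        return included.isEmpty() || included.contains(mimeType);
    }

    public static void main(String[] args) {
        // "application/zip" is a placeholder, not the actual default list.
        MimeTypeFilterSketch filter =
                new MimeTypeFilterSketch(Set.of(), Set.of("application/zip"));
        System.out.println(filter.accepts("application/xml"));
        System.out.println(filter.accepts("application/zip"));
    }
}
```

Under a rule like this, "application/xml" would only be skipped if it had been excluded explicitly, which is why the MIME type alone is unlikely to explain a missing result.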
-
8. Re: Full-text search of binary content when querying against a mixin type
brmeyer Mar 24, 2015 5:52 PM (in response to brmeyer)
Argh, sorry Horia, just realized I made a really stupid mistake. The content child node wasn't joined, so the binary values were not included in the property list.
So, in case it's helpful to others, if you have an nt:file node with an nt:resource child, in order to get the Tika extractors to kick in, the query must look something like:
SELECT artifact1.* FROM [sramp:baseArtifactType] AS artifact1 LEFT OUTER JOIN [nt:resource] AS content1 ON ISCHILDNODE(content1,artifact1) WHERE CONTAINS(artifact1,'Lawn Mower') OR CONTAINS(content1,'Lawn Mower')
Prior to that, I simply had:
SELECT artifact1.* FROM [sramp:baseArtifactType] AS artifact1 WHERE CONTAINS(artifact1,'Lawn Mower')
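In case it saves someone some typing, here is a small sketch that assembles the joined query for an arbitrary node type and search term. The class and method names are hypothetical; the selector names artifact1 and content1 are taken from the query above. The resulting string is what you would hand to javax.jcr.query.QueryManager#createQuery together with Query.JCR_SQL2 (not shown, since that needs a live repository).

```java
// Hypothetical helper: builds the nt:file + nt:resource full-text query.
final class ArtifactFullTextQuery {

    // JCR-SQL2 string literals escape a single quote by doubling it.
    static String quote(String term) {
        return "'" + term.replace("'", "''") + "'";
    }

    static String build(String nodeType, String term) {
        String literal = quote(term);
        return "SELECT artifact1.* FROM [" + nodeType + "] AS artifact1"
                // Join the child content node so its binary property is searched too.
                + " LEFT OUTER JOIN [nt:resource] AS content1"
                + " ON ISCHILDNODE(content1,artifact1)"
                + " WHERE CONTAINS(artifact1," + literal + ")"
                + " OR CONTAINS(content1," + literal + ")";
    }

    public static void main(String[] args) {
        System.out.println(build("sramp:baseArtifactType", "Lawn Mower"));
    }
}
```

The LEFT OUTER JOIN keeps artifacts that match only on metadata, while the second CONTAINS covers the extracted file text.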