Yet another full text search question| JBoss.org Content Archive (Read Only)

27 Replies. Latest reply on Oct 16, 2012 4:11 PM by pkutrhari

15. Re: Yet another full text search question

hchiorean Jul 18, 2012 8:09 AM (in response to nl)

Awesome, glad you've found the problem. Can you please open a JIRA feature request for this?

Thanks
16. Re: Yet another full text search question

nl Jul 18, 2012 8:29 AM (in response to hchiorean)

See MODE-1561.

Thanks, Niels
17. Re: Yet another full text search question

pkutrhari Oct 15, 2012 11:27 AM (in response to nl)

Hi - I didn't want to create a new discussion, as I am having similar issues (ModeShape 2.8.3).

I am not able to get any results back from the text in jcr:data. I have the following default extractor configuration -
<mode:textExtractors>
    <mode:textExtractor jcr:name="Tika Text Extractors">
        <mode:description>Text extractors using Tika parsers</mode:description>
        <mode:classname>org.modeshape.extractor.tika.TikaTextExtractor</mode:classname>
        <mode:excludedMimeTypes>
            application/x-archive,application/x-bzip,application/x-bzip2,
            application/x-cpio,application/x-gtar,application/x-gzip,
            application/x-ta,application/zip,application/vnd.teiid.vdb
        </mode:excludedMimeTypes>
    </mode:textExtractor>
</mode:textExtractors>

The following query does not get any results (although the data exists).

SELECT [jcr:path] FROM [nt:resource] WHERE contains([nt:resource].*,'1108290')

The index seems to be built fine (there are no ModeShape exceptions in the logs, and I have logging in debug mode), since a contains query on another field works fine. Please let me know if you think I'm missing something.

Thanks.
Praveen
18. Re: Yet another full text search question

rhauch Oct 15, 2012 11:37 AM (in response to pkutrhari)

Praveen,

Do the logs contain any statements showing that the Tika text extractor is running and extracting the correct text? Have you tried other full-text searches with different criteria (e.g., something with characters rather than numbers)? You could also use Luke to look at the Lucene indexes.
19. Re: Yet another full text search question

pkutrhari Oct 15, 2012 12:09 PM (in response to rhauch)

Hi Randall -
I don't see anything on Tika in the logs. Just this -
<Oct 15 12:03:49 PM EDT> <DEBUG> <RepositoryQueryManager> - Reindexing synchronously (if missing)
<Oct 15 12:03:49 PM EDT> <INFO > <RepositoryQueryManager> - Started rebuilding indexes for repository 'myrepo'
<Oct 15 12:03:49 PM EDT> <DEBUG> <LuceneConfigurations> - Creating index folders for the 'default' workspace at '/Developer/modeshape/modeshape-index/default'
<Oct 15 12:03:49 PM EDT> <DEBUG> <LuceneConfigurations> - Initializing index files for the 'default' workspace indexes under '/Developer/modeshape/modeshape-index/default'
<Oct 15 12:04:07 PM EDT> <INFO > <RepositoryQueryManager> - Completed rebuilding indexes for repository 'myrepo'

Is there anything I need to do besides adding the extractors to the config (and the tika jar to the classpath)?

Yes, I did run a search with text criteria as well, with no results. I'll try Luke and check the index.

Thanks.
Praveen
20. Re: Yet another full text search question

pkutrhari Oct 15, 2012 12:42 PM (in response to pkutrhari)

Luke is throwing an error -
org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
          at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:247)
          at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:71)
          at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
          at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:68)
          at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
          at org.apache.lucene.index.IndexReader.open(IndexReader.java:375)
          at org.getopt.luke.Luke.openIndex(Unknown Source)
          at org.getopt.luke.Luke.openOk(Unknown Source)

I tried rebuilding the index, but got the same error. The index seems to be fine, as it returns results if I do a contains on another non-binary property.
21. Re: Yet another full text search question

pkutrhari Oct 15, 2012 12:50 PM (in response to pkutrhari)

Never mind, I was using an incorrect version of Luke.
22. Re: Yet another full text search question

pkutrhari Oct 15, 2012 1:00 PM (in response to pkutrhari)

I only see the non-binary properties being indexed (looking with Luke). It looks like the extractor is not getting triggered for some reason.

Thanks.
Praveen
23. Re: Yet another full text search question

pkutrhari Oct 15, 2012 1:34 PM (in response to pkutrhari)

If it matters, the MIME type of the content in jcr:data is application/x-director (DCRs from TeamSite). I'm using the File connector to view the filesystem in ModeShape.

Thanks.
24. Re: Yet another full text search question

pkutrhari Oct 15, 2012 5:06 PM (in response to pkutrhari)

I did a little more digging. The MIME types returned by the DefaultParser in the following code in the TikaTextExtractor do not include the application/x-director MIME type that the DCR XML files have -

    protected DefaultParser initialize() {
        if (parser == null) {
            try {
                initLock.lock();
                if (parser == null) {
                    parser = new DefaultParser(this.getClass().getClassLoader());
                }
                Map<MediaType, Parser> parsers = parser.getParsers();
                for (MediaType mediaType : parsers.keySet()) {
                    // Don't use the toString() method, as it may append properties ...
                    String mimeType = mediaType.getType() + "/" + mediaType.getSubtype();
                    supportedMediaTypes.add(mimeType);
                }
            } finally {
                initLock.unlock();
            }
        }
        return parser;
    }

Any idea where I can configure this? The tika-mimetypes.xml in the classpath has application/x-director, but for some reason it looks like ModeShape is ignoring that file.
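
For readers following along, here is a minimal, standalone sketch of the distinction that seems to be in play: the quoted initialize() method builds its supported set from the types that registered parsers claim, which is a separate question from whether a type is merely known for detection purposes. The class name MimeTypeSupportSketch, its methods, and the sample type lists are all hypothetical stand-ins for illustration; none of this is ModeShape or Tika code.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; not part of ModeShape or Tika. */
final class MimeTypeSupportSketch {

    /** Reduces "type/subtype; param=value" to a lower-cased "type/subtype". */
    static String baseType(String mimeType) {
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** Builds the supported set from the types that parsers claim to handle. */
    static Set<String> supportedTypes(String... parserMediaTypes) {
        Set<String> supported = new TreeSet<>();
        for (String type : parserMediaTypes) {
            supported.add(baseType(type));
        }
        return supported;
    }

    public static void main(String[] args) {
        // Assumed sample data: types claimed by some registered parsers.
        Set<String> supported = supportedTypes(
            "application/xml", "text/plain; charset=UTF-8", "application/pdf");

        // A type can be known to the detector and still have no parser.
        System.out.println(supported.contains(baseType("application/x-director")));
        System.out.println(supported.contains(baseType("Text/Plain")));
    }
}
```

Under these assumptions, adding a type to a detection file would not change the supported set; only a parser that claims the type would.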

Thanks.
Praveen
25. Re: Yet another full text search question

rhauch Oct 16, 2012 10:08 AM (in response to pkutrhari)

Tika does indeed know about the "application/x-director" MIME type, and even has a test case to check it. However, I do not see that Tika includes, out of the box, a parser (i.e., an "extractor" in ModeShape parlance) that can extract anything from a Shockwave file.

You can either write this Tika parser yourself, or you can write a custom ModeShape text extractor to do that.
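
As a rough idea of what the core of such a custom extractor might look like when the files are really XML (as the DCRs in this thread are), here is a minimal sketch that pulls the character data out of an XML stream using only the JDK's SAX parser. The class name XmlTextSketch and the method extractText are hypothetical. Wiring it into ModeShape is left out: a real extractor would presumably implement the extractFrom(InputStream, TextExtractorOutput, TextExtractorContext) method and pass the result to output.recordText(...).

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper; the name and shape are assumptions, not ModeShape API. */
final class XmlTextSketch {

    /** Returns the character data of an XML document with whitespace collapsed. */
    static String extractText(InputStream xml) throws IOException {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Ask the JDK parser to apply its secure-processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.newSAXParser().parse(xml, new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    // Keep the text of adjacent elements from running together.
                    text.append(' ');
                }
            });
        } catch (SAXException | ParserConfigurationException e) {
            // Surface parse failures through the same exception type as I/O errors.
            throw new IOException("Could not extract text from XML", e);
        }
        return text.toString().trim().replaceAll("\\s+", " ");
    }
}
```

Attribute values are ignored in this sketch; if searchable values live in attributes, the handler would also need to override startElement.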

Regards,

Randall
26. Re: Yet another full text search question

pkutrhari Oct 16, 2012 12:19 PM (in response to rhauch)

On further research, it looks like the DCR text is being parsed correctly after all (I'm assuming the automatic MIME type detector is at work). The following code in the extractor is called for each node, and the XML text is correctly passed to output.recordText(). But the fts field is never set in the node.

    public void extractFrom( InputStream stream,
                             TextExtractorOutput output,
                             TextExtractorContext context ) throws IOException {
        final DefaultParser parser = initialize();
        Metadata metadata = prepareMetadata(stream, context);

        try {
            ContentHandler textHandler = new BodyContentHandler();
            // Parse the input stream ...
            parser.parse(stream, textHandler, metadata, new ParseContext());

            // Record all of the text in the body ...
            output.recordText(textHandler.toString().trim());
        } catch (IOException e) {
            throw e;
        } catch (Throwable e) {
            context.getProblems().addError(e, TikaI18n.errorWhileExtractingTextFrom, context.getInputPath(), e.getMessage());
        }
    }
27. Re: Yet another full text search question

pkutrhari Oct 16, 2012 4:11 PM (in response to rhauch)

Hi Randall - I went with your suggestion and wrote our own extractor, and it works great.

Thanks for pointing me in the right direction.

Praveen
