27 Replies. Latest reply on Oct 16, 2012 4:11 PM by pkutrhari
      • 15. Re: Yet another full text search question
        hchiorean

        Awesome, glad you've found the problem. Can you please open a JIRA feature request for this?

         

        Thanks

        • 16. Re: Yet another full text search question
          nl

          See MODE-1561.

           

          Thanks, Niels

          • 17. Re: Yet another full text search question
            pkutrhari

            Hi - I didn't want to create a new discussion, as I am having a similar issue (ModeShape 2.8.3).

             

            I am not able to get any results back from the text in jcr:data. I have the following default extractor configuration -

            <mode:textExtractors>
                <mode:textExtractor jcr:name="Tika Text Extractors">
                    <mode:description>Text extractors using Tika parsers</mode:description>
                    <mode:classname>org.modeshape.extractor.tika.TikaTextExtractor</mode:classname>
                    <mode:excludedMimeTypes>
                        application/x-archive,application/x-bzip,application/x-bzip2,
                        application/x-cpio,application/x-gtar,application/x-gzip,
                        application/x-ta,application/zip,application/vnd.teiid.vdb
                    </mode:excludedMimeTypes>
                </mode:textExtractor>
            </mode:textExtractors>

             

            The following query does not get any results (although the data exists).

             

            SELECT [jcr:path] FROM [nt:resource] WHERE contains([nt:resource].*,'1108290')

             

            The index seems to be built correctly: there are no ModeShape exceptions in the logs (logging is in debug mode), and a contains query on another field works fine. Please let me know if you think I'm missing something.

             

            Thanks.

            Praveen

            • 18. Re: Yet another full text search question
              rhauch

              Praveen,

               

              Do the logs contain any statements showing that the Tika text extractor is running and extracting the correct text? Have you tried other full-text searches with different criteria (e.g., something with letters rather than numbers)? You could also use Luke to look at the Lucene indexes.

              • 19. Re: Yet another full text search question
                pkutrhari

                Hi Randall -

                I don't see anything on Tika in the logs. Just this -

                <Oct 15 12:03:49 PM EDT> <DEBUG> <RepositoryQueryManager> - Reindexing synchronously (if missing)
                <Oct 15 12:03:49 PM EDT> <INFO > <RepositoryQueryManager> - Started rebuilding indexes for repository 'myrepo'
                <Oct 15 12:03:49 PM EDT> <DEBUG> <LuceneConfigurations> - Creating index folders for the 'default' workspace at '/Developer/modeshape/modeshape-index/default'
                <Oct 15 12:03:49 PM EDT> <DEBUG> <LuceneConfigurations> - Initializing index files for the 'default' workspace indexes under '/Developer/modeshape/modeshape-index/default'
                <Oct 15 12:04:07 PM EDT> <INFO > <RepositoryQueryManager> - Completed rebuilding indexes for repository 'myrepo'

                 

                Is there anything I need to do besides adding the extractors to the config (and the tika jar to the classpath)?

                 

                Yes, I also ran a search with a text (non-numeric) criterion, again with no results. I'll try Luke and check the index.

                 

                Thanks.

                Praveen

                • 20. Re: Yet another full text search question
                  pkutrhari

                  Luke is throwing an error -

                  org.apache.lucene.index.CorruptIndexException: Unknown format version: -11
                      at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:247)
                      at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:71)
                      at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:704)
                      at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:68)
                      at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
                      at org.apache.lucene.index.IndexReader.open(IndexReader.java:375)
                      at org.getopt.luke.Luke.openIndex(Unknown Source)
                      at org.getopt.luke.Luke.openOk(Unknown Source)

                   

                  I tried rebuilding the index, but got the same error. The index itself seems to be fine, as it returns results if I do a contains on another, non-binary property.

                  • 21. Re: Yet another full text search question
                    pkutrhari

                    Never mind, I was using an incorrect version of Luke.

                    • 22. Re: Yet another full text search question
                      pkutrhari

                      I only see the non-binary properties being indexed (looking with Luke). It looks like the extractor is not getting triggered for some reason.

                       

                      Thanks.

                      Praveen

                      • 23. Re: Yet another full text search question
                        pkutrhari

                        If it matters, the MIME type of the content in jcr:data is application/x-director (DCRs from TeamSite). I'm using the File connector to view the filesystem in ModeShape.

                         

                        Thanks.

                        • 24. Re: Yet another full text search question
                          pkutrhari

                          I did a little more digging. The MIME types returned by the DefaultParser in the following code in the TikaTextExtractor do not include the application/x-director MIME type that the DCR XML files have -

                           

                          protected DefaultParser initialize() {
                              if (parser == null) {
                                  try {
                                      initLock.lock();
                                      if (parser == null) {
                                          parser = new DefaultParser(this.getClass().getClassLoader());
                                      }
                                      Map<MediaType, Parser> parsers = parser.getParsers();
                                      for (MediaType mediaType : parsers.keySet()) {
                                          // Don't use the toString() method, as it may append properties ...
                                          String mimeType = mediaType.getType() + "/" + mediaType.getSubtype();
                                          supportedMediaTypes.add(mimeType);
                                      }
                                  } finally {
                                      initLock.unlock();
                                  }
                              }
                              return parser;
                          }

                           

                          Any idea where I can configure this? The tika-mimetypes.xml on the classpath includes application/x-director, but for some reason it looks like ModeShape is ignoring that file.
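
For context, the loop in the code above records only the media types for which a parser is registered. As far as I can tell, the types declared in tika-mimetypes.xml drive MIME detection, which is a separate list. Below is a minimal sketch of that bookkeeping using plain strings rather than Tika's MediaType and Parser classes. The class name, method names, and sample entries are hypothetical and are not part of ModeShape or Tika.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: mimics how a set of "supported" MIME types
// can be derived from the keys of a parser registry.
final class SupportedTypesSketch {

    /** Keeps only "type/subtype", dropping parameters such as "; charset=UTF-8". */
    static String baseType(String mediaType) {
        int semicolon = mediaType.indexOf(';');
        String base = semicolon < 0 ? mediaType : mediaType.substring(0, semicolon);
        return base.trim();
    }

    /** Collects the base type of every media type that has a parser registered. */
    static Set<String> supportedTypes(Map<String, String> parsersByMediaType) {
        Set<String> supported = new TreeSet<String>();
        for (String mediaType : parsersByMediaType.keySet()) {
            supported.add(baseType(mediaType));
        }
        return supported;
    }
}
```

With a registry that holds parsers for "application/xml; charset=UTF-8" and "text/plain", the resulting set contains "application/xml" and "text/plain" but not "application/x-director", no matter which types the detector knows about.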

                           

                          Thanks.

                          Praveen

                          • 25. Re: Yet another full text search question
                            rhauch

                            Tika does indeed know about the "application/x-director" MIME type, and even has a test case to check it. However, I do not see that Tika includes, out of the box, a parser (i.e., an "extractor" in ModeShape parlance) that can extract anything from a Shockwave file.

                             

                            You can either write this Tika parser yourself, or you can write a custom ModeShape text extractor to do that.
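
As a rough illustration of the second option, here is a minimal sketch of the text-gathering core such an extractor might use. It assumes the content is well-formed XML (as the DCR files in this thread are described) and relies only on the JDK's SAX parser. The class DcrTextSketch and its methods are hypothetical. It does not implement ModeShape's extractor interface, so a real extractor would still have to wrap this and hand the resulting string to the output's recordText(...) method.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the character data of an XML document
// so that it could be handed to a full-text indexer.
final class DcrTextSketch {

    /** Returns all element text in document order, separated by single spaces. */
    static String extractText(InputStream xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Separate the text of adjacent elements so words do not run together.
                text.append(' ');
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(xml, handler);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not parse XML content", e);
        }
        // Collapse runs of whitespace left over from indentation and separators.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Convenience overload for in-memory content, assumed to be UTF-8. */
    static String extractText(String xml) {
        return extractText(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Attribute values are ignored in this sketch. If the searchable values live in attributes, the handler would also need to override startElement.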

                             

                            Regards,

                             

                            Randall

                            • 26. Re: Yet another full text search question
                              pkutrhari

                              On further research, it looks like the DCR text is being parsed correctly after all (I'm assuming the automatic MIME detector is at work). The following code in the extractor is called for each node, and the XML text is correctly passed to output.recordText(). But the fts field is never set on the node.

                               

                              public void extractFrom( InputStream stream,
                                                       TextExtractorOutput output,
                                                       TextExtractorContext context ) throws IOException {
                                  final DefaultParser parser = initialize();
                                  Metadata metadata = prepareMetadata(stream, context);

                                  try {
                                      ContentHandler textHandler = new BodyContentHandler();
                                      // Parse the input stream ...
                                      parser.parse(stream, textHandler, metadata, new ParseContext());

                                      // Record all of the text in the body ...
                                      output.recordText(textHandler.toString().trim());
                                  } catch (IOException e) {
                                      throw e;
                                  } catch (Throwable e) {
                                      context.getProblems().addError(e, TikaI18n.errorWhileExtractingTextFrom, context.getInputPath(), e.getMessage());
                                  }
                              }

                              • 27. Re: Yet another full text search question
                                pkutrhari

                                Hi Randall - I went with your suggestion and wrote our own extractor, and it works great.

                                 

                                Thanks for pointing me in the right direction.

                                 

                                Praveen
