0 Replies Latest reply on Jun 15, 2012 6:37 PM by chris fortescue

    tika text extractor limit

    chris fortescue Newbie

      Greetings.

       

      I've been using modeshape to archive PDF files among other things.  Thank you very much.

       

      I recently noticed a problem that I had missed at first.  Larger PDF files (in my case roughly 15 pages of text) cause modeshape lucene indexing to fail because text extraction fails.  To be precise, a PDF with more than 100K characters of text causes text extraction to fail.

       

      I wrote a test that reveals the problem but I don't really know how to fix it for production.  I used a 2M scanned PDF with an invisible 136K layer of OCR'd text.

       

      The problem emanates from here:

       

      http://tika.apache.org/0.7/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler%28%29

      The default write limit for org.apache.tika.sax.BodyContentHandler is 100K characters.  The SAXException mentioned in that Javadoc is indeed thrown once the limit is reached, but no test exposes it, and in production it gets caught by the text extractor's caller.
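      To make the failure mode concrete without pulling in Tika, here is a minimal sketch of a write-limited SAX handler built only on the JDK's org.xml.sax classes.  LimitedTextHandler and WriteLimitDemo are hypothetical names of mine, not Tika or modeshape code; the sketch only mimics the behaviour described in the Javadoc above (buffer up to the limit, then throw a SAXException), and treating -1 as "no limit" is an assumption of the sketch.

```java
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a write-limited handler; NOT Tika's implementation. */
class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int writeLimit; // assumption of this sketch: -1 means "no limit"

    LimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit >= 0 && buffer.length() + length > writeLimit) {
            // Keep whatever still fits, then signal that the limit was reached.
            buffer.append(ch, start, writeLimit - buffer.length());
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }
        buffer.append(ch, start, length);
    }

    @Override
    public String toString() {
        return buffer.toString();
    }
}

class WriteLimitDemo {
    /** Feeds textLength characters to the handler in chunks; true if the limit was hit. */
    static boolean hitsLimit(int textLength, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] chunk = new char[1000];
        Arrays.fill(chunk, 'x');
        try {
            for (int written = 0; written < textLength; written += chunk.length) {
                handler.characters(chunk, 0, Math.min(chunk.length, textLength - written));
            }
            return false;
        } catch (SAXException e) {
            // This is the exception that production code catches and drops.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(hitsLimit(136_000, 100_000));   // 136K of text, 100K limit
        System.out.println(hitsLimit(136_000, 2_000_000)); // same text, raised limit
    }
}
```

      With 136K of text, the first call hits the limit and the second does not, which matches what I see with the scanned PDF.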

       

      The problem I have is how to get a configurable value into the TikaTextExtractor class (in place of the hardcoded 2000000 below).   I set it arbitrarily in the following diff to confirm the source of the problem.

      {code}

      diff --git a/extensions/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java b/extensions/modeshape-extr

      index b02db7f..c125bc5 100644

      --- a/extensions/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java

      +++ b/extensions/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java

      @@ -133,10 +133,10 @@ public class TikaTextExtractor implements TextExtractor {

               Metadata metadata = prepareMetadata(stream, context);

       

               try {

      -            ContentHandler textHandler = new BodyContentHandler();

      +            ContentHandler textHandler = new BodyContentHandler(2000000);

                   // Parse the input stream ...

                   parser.parse(stream, textHandler, metadata, new ParseContext());

      {code}
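      One possible shape for the configurable value, sketched with JDK classes only: a plain bean property with the current 100K default, which a configuration layer could set by name.  WriteLimitSettings and WriteLimitConfigDemo are hypothetical names, and the assumption that modeshape would inject extractor properties through bean setters is mine and unverified.  In the extractor itself the field would simply replace the hardcoded number, i.e. new BodyContentHandler(getWriteLimit()).

```java
import java.lang.reflect.Method;

/** Hypothetical holder for the limit; in practice this would live on the extractor. */
class WriteLimitSettings {
    static final int DEFAULT_WRITE_LIMIT = 100_000; // same value as the current default

    private int writeLimit = DEFAULT_WRITE_LIMIT;

    public int getWriteLimit() {
        return writeLimit;
    }

    public void setWriteLimit(int writeLimit) {
        // Assumption of this sketch: -1 means "no limit", anything lower is a mistake.
        if (writeLimit < -1) {
            throw new IllegalArgumentException("writeLimit must be -1 or greater: " + writeLimit);
        }
        this.writeLimit = writeLimit;
    }
}

class WriteLimitConfigDemo {
    /** Sets an int bean property by name, the way a config layer might, and returns the limit. */
    static int configure(String propertyName, int value) {
        WriteLimitSettings settings = new WriteLimitSettings();
        String setter = "set" + Character.toUpperCase(propertyName.charAt(0))
                + propertyName.substring(1);
        try {
            Method method = WriteLimitSettings.class.getMethod(setter, int.class);
            method.invoke(settings, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot set property " + propertyName, e);
        }
        return settings.getWriteLimit();
    }

    public static void main(String[] args) {
        System.out.println(new WriteLimitSettings().getWriteLimit());
        System.out.println(configure("writeLimit", 2_000_000));
    }
}
```

      Keeping 100K as the default would leave existing repositories behaving as they do today unless someone opts in to a larger value.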

       

       

      Since all of modeshape's Tika text extraction passes through this code, it follows that nobody can currently make searchable text for a document exceeding this 100K limit, even though it seems impossible for that to be true.  I think it is.

       

      We produce 1000s of documents daily so I will fix it one way or the other for us, but I bet anyone expecting the text extraction feature to work will need a fix for this too.  Either a giant in-memory buffer (ugh), or some kind of tmp-persistence-oriented implementation of org.apache.tika.sax.ContentHandlerDecorator that can accommodate an arbitrarily large document (sometimes we have 100+ page documents) without squeezing process memory too much.   Either way, it would need to be configurable.
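      A rough sketch of the tmp persistence idea, again using only JDK classes so it runs on its own: character events are spooled to a temp file instead of a string buffer, so memory use stays flat regardless of document size.  SpoolingTextHandler and SpoolingDemo are hypothetical names; a real version would presumably extend Tika's ContentHandlerDecorator rather than DefaultHandler, and how the indexer would then read the spooled text back is left open here.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that spools extracted text to a temp file. */
class SpoolingTextHandler extends DefaultHandler implements AutoCloseable {
    private final Path spoolFile;
    private final Writer writer;

    SpoolingTextHandler() throws IOException {
        spoolFile = Files.createTempFile("extracted-text-", ".txt");
        writer = Files.newBufferedWriter(spoolFile, StandardCharsets.UTF_8);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            writer.write(ch, start, length); // goes to disk, not to a growing buffer
        } catch (IOException e) {
            throw new SAXException("could not spool extracted text", e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            writer.flush(); // make the spooled text visible to readers of the file
        } catch (IOException e) {
            throw new SAXException("could not flush spooled text", e);
        }
    }

    Path spoolFile() {
        return spoolFile;
    }

    @Override
    public void close() throws IOException {
        writer.close();
        Files.deleteIfExists(spoolFile); // the temp file lives only as long as the handler
    }
}

class SpoolingDemo {
    /** Spools textLength ASCII characters and returns the spool file size in bytes. */
    static long spool(int textLength) {
        char[] chunk = new char[1000];
        Arrays.fill(chunk, 'x');
        try (SpoolingTextHandler handler = new SpoolingTextHandler()) {
            handler.startDocument();
            for (int written = 0; written < textLength; written += chunk.length) {
                handler.characters(chunk, 0, Math.min(chunk.length, textLength - written));
            }
            handler.endDocument();
            return Files.size(handler.spoolFile());
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("spooling failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(spool(136_000)); // size in bytes of the spooled text
    }
}
```

      The trade-off is extra disk I/O per document, so a threshold that switches from memory to disk only for large documents might be the better production choice.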

       

      Any ideas or comments on a course of action? 

       

      I wrote a simple test to show this. 

       

      {code}

      diff --git a/extensions/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java b/extensions/modeshape-

       

      index 53166f3..6393ce0 100644

      --- a/extensions/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java

      +++ b/extensions/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java

      @@ -129,6 +129,16 @@ public class TikaTextExtractorTest {

               extractTermsFrom("modeshape_gs.pdf");

               assertExtractedMatchesExpected();

           }

      +    @Test

      +    public void shouldExtractTextFromMongoPdf() throws IOException, SAXException, TikaException {

      +        extractTermsFrom("mongo.pdf");

      +        loadExpectedFrom("mongo.txt");

      +    }

      +

      {code}

       

      All you need to trigger the failure is a document with more than 100K characters of text in it.  The exception in production is swallowed in the bowels of modeshape.  Nothing is emitted.  With a test, however, the exception readily causes a failure.

       

      Over&out,

      Chris

       

      Message was edited by: chris fortescue.  I forgot to mention that this was done against 2.8.1.