tika text extractor limit

f4tq Jun 15, 2012 6:37 PM

Greetings.

I've been using modeshape to archive PDF files among other things. Thank you very much.

I recently noticed a problem with larger PDF files. In my case, roughly 15 pages of text is enough to make modeshape's lucene indexing fail because text extraction fails. To be precise, a PDF with more than 100K characters of text causes the text extraction failure.

I wrote a test that reveals the problem, but I don't really know how to fix it for production. I used a 2MB scanned PDF with an invisible 136K layer of OCR'd text.

The problem emanates from here:

http://tika.apache.org/0.7/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler%28%29

The default write limit for org.apache.tika.sax.BodyContentHandler is 100K characters. The SAXException mentioned in that Javadoc is indeed thrown, but no test exposes it, and in production it gets caught by the text extractor's caller.
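To make the mechanism concrete, here is a minimal stand-in written only against the JDK's SAX classes. It is not Tika's source: `LimitedTextHandler` and its `overflows` helper are hypothetical names, and the behavior (keep what fits, then throw a SAXException) is an assumption modeled on the Javadoc linked above.

```java
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a write-limited handler; NOT Tika's implementation. */
final class LimitedTextHandler extends DefaultHandler {
    private final int writeLimit; // max characters to keep; -1 is treated as "no limit" here
    private final StringBuilder text = new StringBuilder();

    LimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit >= 0 && text.length() + length > writeLimit) {
            // Keep the part that still fits, then signal the overflow to the caller.
            text.append(ch, start, writeLimit - text.length());
            throw new SAXException("Text exceeds the limit of " + writeLimit + " characters");
        }
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    /** Feeds textLength characters in 4K chunks; returns true if the limit was hit. */
    static boolean overflows(int writeLimit, int textLength) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] chunk = new char[4096];
        Arrays.fill(chunk, 'x');
        try {
            for (int sent = 0; sent < textLength; sent += chunk.length) {
                handler.characters(chunk, 0, Math.min(chunk.length, textLength - sent));
            }
            return false;
        } catch (SAXException e) {
            return true;
        }
    }
}
```

With these definitions, `LimitedTextHandler.overflows(100_000, 136 * 1024)` reports an overflow for a 136K text layer, while the same text with a limit of 2,000,000 does not.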

The problem I have is how to get a configurable value into the TikaTextExtractor class (in place of the hardcoded 2000000 below). I set it arbitrarily in the following diff to confirm the source of the problem.

{code}
diff --git a/extensions/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java b/extensions/modeshape-extr
index b02db7f..c125bc5 100644
--- a/extensions/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java
+++ b/extensions/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java
@@ -133,10 +133,10 @@ public class TikaTextExtractor implements TextExtractor {
         Metadata metadata = prepareMetadata(stream, context);
 
         try {
-            ContentHandler textHandler = new BodyContentHandler();
+            ContentHandler textHandler = new BodyContentHandler(2000000);
             // Parse the input stream ...
             parser.parse(stream, textHandler, metadata, new ParseContext());
 {code}
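One possible shape for making the limit configurable is a bean-style property, sketched below. This is only an assumption about how a value could be injected: `ExtractorSettings`, `setWriteLimit`, and `parseWriteLimit` are hypothetical names and are not part of modeshape or Tika. Treating -1 as "unlimited" follows what the Tika Javadoc describes for the write limit, which is worth verifying against the Tika version actually in use.

```java
/** Hypothetical settings holder; the property name "writeLimit" is an assumption. */
final class ExtractorSettings {
    /** Mirrors the 100K default described in the BodyContentHandler Javadoc. */
    static final int DEFAULT_WRITE_LIMIT = 100_000;

    private int writeLimit = DEFAULT_WRITE_LIMIT;

    /** Bean-style setter so a configuration layer could inject the value. */
    public void setWriteLimit(int writeLimit) {
        this.writeLimit = validate(writeLimit);
    }

    public int getWriteLimit() {
        return writeLimit;
    }

    /** Parses a raw configuration string, falling back to the default when absent. */
    static int parseWriteLimit(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_WRITE_LIMIT;
        }
        return validate(Integer.parseInt(raw.trim()));
    }

    private static int validate(int value) {
        if (value < -1) {
            throw new IllegalArgumentException("writeLimit must be -1 (unlimited) or >= 0");
        }
        return value;
    }
}
```

The extractor would then build its handler with `new BodyContentHandler(settings.getWriteLimit())` instead of the hardcoded value in the diff above.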

Since all of modeshape's Tika text extraction passes through this code, it follows that nobody can currently produce searchable text for a document exceeding this 100K barrier. That seems impossible, but I think it is true.

We produce 1000s of documents daily, so I will fix it one way or another for us, but I bet anyone expecting the text extraction feature to work will need a fix for this too. The options are either a giant in-memory buffer (ugh), or some kind of temp-file-backed implementation of org.apache.tika.sax.ContentHandlerDecorator that can accommodate an arbitrarily large document (sometimes we have 100+ page documents) without squeezing process memory too much. Either way, it would need to be configurable.
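A rough sketch of the temp-file idea, using only JDK classes so it stands alone. It extends the JDK's `DefaultHandler` rather than Tika's ContentHandlerDecorator, and `SpoolingTextHandler` and `spool` are hypothetical names. It uses Java 7 features (`java.nio.file`, try-with-resources), which is an assumption about the target runtime.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that spools extracted text to a temp file instead of memory. */
final class SpoolingTextHandler extends DefaultHandler implements AutoCloseable {
    private final Path spoolFile;
    private final Writer out;
    private long charCount;

    SpoolingTextHandler() throws IOException {
        spoolFile = Files.createTempFile("extracted-text-", ".txt");
        out = Files.newBufferedWriter(spoolFile, StandardCharsets.UTF_8);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length); // only the writer's buffer stays in memory
            charCount += length;
        } catch (IOException e) {
            throw new SAXException("Could not spool extracted text", e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            out.flush(); // make the spooled text readable from the file
        } catch (IOException e) {
            throw new SAXException("Could not flush spooled text", e);
        }
    }

    long charCount() {
        return charCount;
    }

    Path spoolFile() {
        return spoolFile;
    }

    /** Closes the writer and deletes the temp file. */
    @Override
    public void close() throws IOException {
        out.close();
        Files.deleteIfExists(spoolFile);
    }

    /** Demo helper: spools textLength ASCII characters and returns the file size in bytes. */
    static long spool(int textLength) {
        char[] chunk = new char[4096];
        Arrays.fill(chunk, 'x');
        try (SpoolingTextHandler handler = new SpoolingTextHandler()) {
            for (int sent = 0; sent < textLength; sent += chunk.length) {
                handler.characters(chunk, 0, Math.min(chunk.length, textLength - sent));
            }
            handler.endDocument();
            return Files.size(handler.spoolFile());
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The trade-off is disk I/O per document in exchange for a memory footprint that does not grow with document size. An indexer would read the text back from `spoolFile()` as a stream before calling `close()`.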

Any ideas or comments on a course of action?

I wrote a simple test to show this.

{code}
diff --git a/extensions/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java b/extensions/modeshape-
 
index 53166f3..6393ce0 100644
--- a/extensions/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java
+++ b/extensions/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java
@@ -129,6 +129,16 @@ public class TikaTextExtractorTest {
         extractTermsFrom("modeshape_gs.pdf");
         assertExtractedMatchesExpected();
     }
+    @Test
+    public void shouldExtractTextFromMongoPdf() throws IOException, SAXException, TikaException {
+        extractTermsFrom("mongo.pdf");
+        loadExpectedFrom("mongo.txt");
+    }
+
{code}

All you need to trigger the failure is a document with more than 100K of text in it. In production, the exception is swallowed in the bowels of modeshape and nothing is reported. With a test, however, the exception readily causes a failure.

Over&out,

Chris

Message was edited by: chris fortescue

I forgot to mention that this was done against 2.8.1.