tika text extractor limit
f4tq Jun 15, 2012 6:37 PM
Greetings.
I've been using modeshape to archive PDF files among other things. Thank you very much.
I recently noticed a problem that I had missed at first. Larger PDF files (in my case, roughly 15 pages of text) cause ModeShape's Lucene indexing to fail because text extraction fails. To be precise, a PDF with more than 100K characters of text causes the extraction to fail.
I wrote a test that reveals the problem but I don't really know how to fix it for production. I used a 2M scanned PDF with an invisible 136K layer of OCR'd text.
The problem emanates from here:
http://tika.apache.org/0.7/api/org/apache/tika/sax/BodyContentHandler.html#BodyContentHandler%28%29
The default write limit for org.apache.tika.sax.BodyContentHandler is 100K characters. The exception mentioned in that Javadoc is indeed thrown, but no existing test exposes it, and in production it gets caught by the caller of the text extractor.
The problem I have is how to get a configurable value into the TikaTextExtractor class, in place of the hardcoded 2000000 below. I set it arbitrarily in the following diff to confirm the source of the problem.
{code}
diff --git a/extensions/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java b/extensions/modeshape-extr
index b02db7f..c125bc5 100644
--- a/extensions/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java
+++ b/extensions/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java
@@ -133,10 +133,10 @@ public class TikaTextExtractor implements TextExtractor {
Metadata metadata = prepareMetadata(stream, context);
try {
- ContentHandler textHandler = new BodyContentHandler();
+ ContentHandler textHandler = new BodyContentHandler(2000000);
// Parse the input stream ...
parser.parse(stream, textHandler, metadata, new ParseContext());
{code}
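For discussion, here is a minimal sketch of what a configurable limit could look like. It uses only the JDK's SAX classes so it runs without Tika on the classpath: BoundedTextHandler is a stand-in that imitates a handler with a write limit, and ConfigurableExtractorSketch, its writeLimit property and its extract method are hypothetical names, not existing ModeShape or Tika API. Treating a negative limit as "unlimited" is this sketch's own convention.

```java
import java.util.Arrays;
import java.util.Optional;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical extractor that exposes the write limit as a bean property. */
class ConfigurableExtractorSketch {
    // Assumed default, chosen to match the 100K behaviour described above.
    private int writeLimit = 100_000;

    public int getWriteLimit() { return writeLimit; }

    public void setWriteLimit(int writeLimit) { this.writeLimit = writeLimit; }

    /**
     * Feeds the text through a bounded handler in chunks, the way a parser
     * would deliver character events. Returns an empty Optional if the limit
     * is hit; real code should log or rethrow instead of hiding the failure.
     */
    public Optional<String> extract(String documentText) {
        BoundedTextHandler handler = new BoundedTextHandler(writeLimit);
        char[] chars = documentText.toCharArray();
        try {
            for (int offset = 0; offset < chars.length; offset += 4096) {
                int length = Math.min(4096, chars.length - offset);
                handler.characters(chars, offset, length);
            }
            return Optional.of(handler.toString());
        } catch (SAXException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        char[] big = new char[136_000];
        Arrays.fill(big, 'x');
        String text = new String(big);

        ConfigurableExtractorSketch extractor = new ConfigurableExtractorSketch();
        System.out.println("default limit ok: " + extractor.extract(text).isPresent());
        extractor.setWriteLimit(2_000_000);
        System.out.println("raised limit ok: " + extractor.extract(text).isPresent());
    }
}

/** Stand-in for a text handler with a write limit; not Tika's actual class. */
class BoundedTextHandler extends DefaultHandler {
    private final int writeLimit; // maximum characters to keep; negative means no limit
    private final StringBuilder text = new StringBuilder();

    BoundedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit >= 0 && text.length() + length > writeLimit) {
            throw new SAXException("Write limit of " + writeLimit + " characters exceeded");
        }
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}
```

With a setter like this, the value could come from whatever mechanism the repository configuration already uses to set properties on an extractor, and the constructor call in the diff above would take the property instead of a literal.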
Since all of ModeShape's Tika text extraction passes through this code, it follows that nobody can currently produce searchable text for a document exceeding this 100K barrier. That seems impossible, but I think it is true.
We produce 1000s of documents daily, so I will fix it one way or another for us, but I bet anyone expecting the text extraction feature to work will need a fix for this too. The options are either a giant in-memory buffer (ugh) or some kind of temp-file-backed implementation of org.apache.tika.sax.ContentHandlerDecorator that can accommodate an arbitrarily large document (sometimes we have 100+ page documents) without squeezing process memory too much. Either way, it would need to be configurable.
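To make the temp-file idea concrete, here is a rough sketch using only the JDK's SAX classes. TempFileTextHandler and TempFileSpoolDemo are hypothetical names; a real version would more likely extend Tika's ContentHandlerDecorator, or simply hand a file-backed Writer to the BodyContentHandler constructor that accepts a Writer. The point is only that character events go straight to disk, so memory use stays flat regardless of document size.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Small driver that pushes synthetic text through the spooling handler. */
class TempFileSpoolDemo {
    /** Spools textLength ASCII characters and returns the size of the spool file in bytes. */
    static long spoolAndMeasure(int textLength) {
        char[] chunk = new char[4096];
        Arrays.fill(chunk, 'x');
        try (TempFileTextHandler handler = new TempFileTextHandler()) {
            handler.startDocument();
            for (int remaining = textLength; remaining > 0; remaining -= chunk.length) {
                handler.characters(chunk, 0, Math.min(chunk.length, remaining));
            }
            handler.endDocument(); // flushes buffered text to disk
            return Files.size(handler.spoolFile());
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("Spooling failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("spooled bytes: " + spoolAndMeasure(500_000));
    }
}

/** Hypothetical handler that writes character events to a temporary file. */
class TempFileTextHandler extends DefaultHandler implements AutoCloseable {
    private final Path spoolFile;
    private final Writer writer;
    private long charCount;

    TempFileTextHandler() throws IOException {
        spoolFile = Files.createTempFile("extracted-text-", ".txt");
        writer = Files.newBufferedWriter(spoolFile, StandardCharsets.UTF_8);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            writer.write(ch, start, length);
            charCount += length;
        } catch (IOException e) {
            throw new SAXException("Could not spool extracted text", e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            writer.flush();
        } catch (IOException e) {
            throw new SAXException("Could not flush extracted text", e);
        }
    }

    long charCount() { return charCount; }

    Path spoolFile() { return spoolFile; }

    /** Closes the writer and removes the temporary file. */
    @Override
    public void close() throws IOException {
        writer.close();
        Files.deleteIfExists(spoolFile);
    }
}
```

The indexer would then read the text back from the spool file as a stream instead of a String. Whether the rest of the extraction pipeline can consume a stream is a separate question that this sketch does not answer.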
Any ideas or comments on a course of action?
I wrote a simple test to show this.
{code}
diff --git a/extensions/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java b/extensions/modeshape-
index 53166f3..6393ce0 100644
--- a/extensions/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java
+++ b/extensions/modeshape-extractor-tika/src/test/java/org/modeshape/extractor/tika/TikaTextExtractorTest.java
@@ -129,6 +129,16 @@ public class TikaTextExtractorTest {
extractTermsFrom("modeshape_gs.pdf");
assertExtractedMatchesExpected();
}
+ @Test
+ public void shouldExtractTextFromMongoPdf() throws IOException, SAXException, TikaException {
+ extractTermsFrom("mongo.pdf");
+ loadExpectedFrom("mongo.txt");
+ }
+
{code}
All you need to trigger the failure is a document with more than 100K characters of text in it. In production, the exception is swallowed deep inside ModeShape and nothing is emitted. With a test, however, the exception readily causes a failure.
Over&out,
Chris
Message was edited by: chris fortescue
I forgot to mention that this was done against 2.8.1.