-
1. Re: How to get text extraction to work
hchiorean Jun 11, 2014 3:37 AM (in response to rwoolf)
What does your binary store configuration look like? You need to configure a binary store (see Binary values - ModeShape 3 - Project Documentation Editor). How are you uploading the binary data (PDFs), and how are you searching it?
-
2. Re: How to get text extraction to work
rwoolf Jun 11, 2014 1:43 PM (in response to hchiorean)
I edited my original post to include my entire configuration file, which shows how the binary storage is configured. I have no trouble saving and retrieving binaries from the repository; I just can't get it to parse the PDF documents. All node data is searchable, just not the binary content itself.
Below is the code I use to save a binary file:
...
Node file = folder.addNode(docName, "dms:file");
Node contentNode = file.addNode("jcr:content", "nt:resource");
contentNode.addMixin("mix:versionable");
Binary binary = session.getValueFactory().createBinary(inputStream);
contentNode.setProperty("jcr:data", binary);
session.save();
...
-
3. Re: How to get text extraction to work
hchiorean Jun 12, 2014 4:19 AM (in response to rwoolf)
You can try setting the log level to DEBUG for org.modeshape.extractor to see whether any text is actually extracted via Tika (if it is, it will show up in the log).
Also, make sure org.apache.pdfbox:pdfbox:jar:1.7.1 (a transitive dependency of Tika) is on your classpath.
Other than that, the only thing I can think of is debugging the actual text-extraction code, which is here: modeshape/extractors/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java at 3.x ·…
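For anyone following along, here is a minimal sketch of raising that log category to debug level programmatically. It assumes ModeShape's logging ends up in java.util.logging, where FINE is the closest level to DEBUG; if your deployment routes logging through Log4j or Logback instead, the equivalent change belongs in that framework's configuration file. The class name ExtractorDebugLogging is made up for illustration.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class ExtractorDebugLogging {

    // Logger category named in the post above.
    static final String EXTRACTOR_CATEGORY = "org.modeshape.extractor";

    // java.util.logging holds loggers weakly, so keep a strong reference;
    // otherwise the configured level can be lost when the logger is collected.
    private static Logger extractorLogger;

    /** Raises the extractor category to FINE (assumed equivalent of DEBUG). */
    public static Logger enableDebug() {
        Logger logger = Logger.getLogger(EXTRACTOR_CATEGORY);
        logger.setLevel(Level.FINE);

        // The default console handler only prints INFO and above,
        // so attach a handler that lets FINE records through.
        Handler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        logger.addHandler(handler);

        // Avoid printing INFO+ records twice (once here, once via the root handler).
        logger.setUseParentHandlers(false);

        extractorLogger = logger;
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = enableDebug();
        logger.fine("Debug output is now enabled for " + EXTRACTOR_CATEGORY);
    }
}
```

Call enableDebug() before the repository starts, then upload a PDF and watch the console for extractor messages.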
-
4. Re: How to get text extraction to work
rwoolf Jun 24, 2014 4:25 PM (in response to hchiorean)
The problem is that I am getting this exception:
Parsing exception while extracting text: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit.
How do I increase the limit? I would like there to be no limit on document size.
-
5. Re: How to get text extraction to work
hchiorean Jun 30, 2014 1:25 AM (in response to rwoolf)
Tika's default behavior is to limit text extraction to 100k characters. There is a "writeLimit" attribute you can try to configure:
"tikaExtractor": {
    "name": "Tika content-based extractor",
    "classname": "tika",
    "writeLimit": 100
}
...
but I don't know if Tika will work past 100k chars.
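To make the effect of a write limit concrete, here is a small plain-Java sketch of a character-limited text collector. This is not Tika's or ModeShape's code; LimitedCollector, LimitReachedException and extract are hypothetical names used only for illustration. The sketch treats a negative limit as "no limit", mirroring the way Tika's BodyContentHandler documents -1 as disabling its write limit. Whether ModeShape's "writeLimit" attribute passes -1 through unchanged is not confirmed in this thread.

```java
public class WriteLimitSketch {

    /** Hypothetical stand-in for the "requested limit has been reached" error. */
    static class LimitReachedException extends RuntimeException {
        LimitReachedException(int limit) {
            super("Document contained more than " + limit + " characters");
        }
    }

    /** Collects text up to writeLimit characters; a negative limit means unlimited. */
    static class LimitedCollector {
        private final StringBuilder text = new StringBuilder();
        private final int writeLimit;

        LimitedCollector(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        void append(String chunk) {
            if (writeLimit < 0 || text.length() + chunk.length() <= writeLimit) {
                text.append(chunk);
                return;
            }
            // Keep only the part that still fits, then signal that the limit was hit.
            text.append(chunk, 0, writeLimit - text.length());
            throw new LimitReachedException(writeLimit);
        }

        String text() {
            return text.toString();
        }
    }

    /** Feeds chunks to a collector and returns whatever fit under the limit. */
    static String extract(String[] chunks, int writeLimit) {
        LimitedCollector collector = new LimitedCollector(writeLimit);
        try {
            for (String chunk : chunks) {
                collector.append(chunk);
            }
        } catch (LimitReachedException e) {
            // Text up to the limit is kept; everything after it is dropped.
        }
        return collector.text();
    }

    public static void main(String[] args) {
        String[] chunks = {"abcdef", "ghij"};
        System.out.println(extract(chunks, 8));   // limited to 8 characters
        System.out.println(extract(chunks, -1));  // unlimited
    }
}
```

The point of the sketch is that text beyond the limit never reaches the index, which would explain why searches on content deep inside a large PDF return nothing.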
-
6. Re: How to get text extraction to work
cathyben Aug 31, 2015 10:00 PM (in response to rwoolf)
I found another guide about extracting PDF text. You can refer to it.