What does your binary store configuration look like? (You need to configure a binary store; see "Binary values" in the ModeShape 3 documentation.) How are you uploading the binary data (PDFs), and how are you searching it?
I edited my original post to include my entire configuration file, which shows how the binary storage is configured. I have no trouble saving and retrieving binaries from the repository. I just can't seem to get it to parse the PDF documents. All node data is searchable, just not the binary content itself.
Below is my code that saves a binary file.
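Node file = folder.addNode(docName, "dms:file");
Node contentNode = file.addNode("jcr:content", "nt:resource");
Binary binary = session.getValueFactory().createBinary(inputStream);
// Attach the binary to jcr:content (nt:resource requires jcr:data), then persist.
contentNode.setProperty("jcr:data", binary);
session.save();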
You can try setting the log level to DEBUG for org.modeshape.extractor to see whether any text is actually extracted via Tika (if it is, it will show up in the log).
Also, make sure org.apache.pdfbox:pdfbox:jar:1.7.1 (a transitive dependency of Tika) is on your classpath.
Other than that, the only thing I can think of is debugging the actual text-extraction code which is here: modeshape/extractors/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java at 3.x ·…
The problem is that I am getting this exception:
Parsing exception while extracting text: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit.
How do I increase the limit? I would like there to be no limit on document size.
Tika's default behavior is to limit text extraction to 100k characters. There is a "writeLimit" attribute you can try to configure:
"name": "Tika content-based extractor",
but I don't know if Tika will work past 100k chars.
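For reference, here is a rough sketch of what the extractor entry might look like with the limit raised. Only the "name" value and the "writeLimit" attribute come from this thread; the "tikaExtractor" key, the "classname" alias, and the 500000 value are assumptions for illustration, and where this entry nests inside the repository configuration depends on your ModeShape version, so check it against your own file.

```json
{
    "tikaExtractor": {
        "name": "Tika content-based extractor",
        "classname": "tika",
        "writeLimit": 500000
    }
}
```

Tika's own BodyContentHandler accepts -1 to disable the write limit entirely, but I have not verified that ModeShape passes a negative value through unchanged, so a large positive number is probably the safer first thing to try.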