6 Replies Latest reply on Aug 31, 2015 10:00 PM by cathyben

How to get text extraction to work

rwoolf Jun 11, 2014 12:10 PM

I am unable to search the text of PDF documents stored in ModeShape. According to the documentation, all I have to do is add the text-extraction configuration it describes. I've confirmed that both the Tika core and Tika parsers JARs are present, but I still cannot search on any of the text within a PDF. Full-text search (FTS) works on node data, but not on content inside the binary PDF file. Is there something more I need to do that is not documented? I have been deleting my index and restarting ModeShape, which is configured to rebuild the index on startup. Below is my configuration.

{
    "name" : "Persisted-Repository",
    "monitoring" : {
        "enabled" : false
    },
    "workspaces" : {
        "predefined" : ["otherWorkspace"],
        "default" : "default",
        "allowCreation" : true
    },
    "security" : {
        "anonymous" : {
            "roles" : ["readonly","readwrite","admin"],
            "useOnFailedLogin" : false
        }
    },
    "storage" : {
        "cacheConfiguration" : "infinispan-configuration.xml",
        "cacheName" : "persisted-repository",
        "binaryStorage" : {
            "type" : "file",
            "directory": "/home/ross/testrep/target/binaries",
            "minimumBinarySizeInBytes" : 999
        }
    },
    "query" : {
        "enabled" : true,
        "indexStorage" : {
            "type" : "filesystem",
            "location" : "/home/ross/testrep/target/indexes"
        },
        "textExtracting": {
            "extractors" : {
                "tikaExtractor":{
                    "name" : "General content-based extractor",
                    "classname" : "tika"
                }
            }
        },
        "indexing" : {
            "rebuildOnStartup" : {
                "when" : "if_missing",
                "mode" : "async"
            }
        }
    }
}

1. Re: How to get text extraction to work

hchiorean Jun 11, 2014 3:37 AM (in response to rwoolf)

What does your binary store configuration look like? (You need to configure a binary store; see Binary values - ModeShape 3 - Project Documentation Editor.) How are you uploading binary data (PDFs), and how are you searching it?
2. Re: How to get text extraction to work

rwoolf Jun 11, 2014 1:43 PM (in response to hchiorean)

I edited my original post to include my entire configuration file, which shows how the binary storage is configured. I have no trouble saving and retrieving binaries from the repository. I just can't seem to get it to parse the PDF documents. All node data is searchable, just not the binary content itself.

Below is my code that saves a binary file.
...
Node file = folder.addNode(docName, "dms:file");
Node contentNode = file.addNode("jcr:content", "nt:resource");
contentNode.addMixin("mix:versionable");
Binary binary = session.getValueFactory().createBinary(inputStream);
contentNode.setProperty("jcr:data", binary);
session.save();
...
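
For the search side, here is a minimal sketch of building a JCR-SQL2 full-text query over the content nodes. The class and method names (FullTextQuerySketch, buildQuery) are hypothetical, and the choice of [nt:resource] as the selector is an assumption based on the node type used in the snippet above. Whether text inside the PDF is matched still depends on the extractor having actually run.

```java
// Sketch only: builds a JCR-SQL2 full-text query string.
// FullTextQuerySketch and buildQuery are illustrative names, not ModeShape API.
final class FullTextQuerySketch {

    static String buildQuery(String searchTerm) {
        // JCR-SQL2 string literals escape a single quote by doubling it.
        String escaped = searchTerm.replace("'", "''");
        // Assumption: the binary lives on an nt:resource node, as in the save code above.
        return "SELECT * FROM [nt:resource] AS content "
             + "WHERE CONTAINS(content.*, '" + escaped + "')";
    }

    public static void main(String[] args) {
        // Print the query so it can be pasted into a query tool or test.
        System.out.println(buildQuery("invoice"));
    }
}
```

The resulting string would typically be passed to session.getWorkspace().getQueryManager().createQuery(sql, Query.JCR_SQL2) and then executed with execute(), both of which are part of the standard JCR 2.0 API.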
3. Re: How to get text extraction to work

hchiorean Jun 12, 2014 4:19 AM (in response to rwoolf)

You can try setting the log level to DEBUG for org.modeshape.extractor and see whether any text is actually extracted via Tika (if it is, it will show up in the logging).
Also, make sure org.apache.pdfbox:pdfbox:jar:1.7.1 (a transitive dependency of Tika) is on your classpath.

Other than that, the only thing I can think of is debugging the actual text-extraction code which is here: modeshape/extractors/modeshape-extractor-tika/src/main/java/org/modeshape/extractor/tika/TikaTextExtractor.java at 3.x ·…
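
One quick way to rule out a missing JAR is to try loading a known class from each library. The following is a rough sketch; ClasspathCheckSketch and isOnClasspath are hypothetical names, and the three class names listed are the usual entry points of tika-core, tika-parsers and pdfbox. It only tells you something if it runs with the same classloader as ModeShape (for example, inside the same web application).

```java
// Sketch only: reports whether the classes needed for PDF extraction can be loaded.
// ClasspathCheckSketch and isOnClasspath are illustrative names.
final class ClasspathCheckSketch {

    static boolean isOnClasspath(String className) {
        try {
            // 'false' avoids running static initializers; we only care that the class resolves.
            Class.forName(className, false, ClasspathCheckSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.tika.Tika",                  // tika-core
            "org.apache.tika.parser.pdf.PDFParser",  // tika-parsers
            "org.apache.pdfbox.pdmodel.PDDocument"   // pdfbox
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```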
4. Re: How to get text extraction to work

rwoolf Jun 24, 2014 4:25 PM (in response to hchiorean)

The problem is that I am getting this exception:
Parsing exception while extracting text: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit.

How do I increase the limit? I would like there to be no limit on document size.
5. Re: How to get text extraction to work

hchiorean Jun 30, 2014 1:25 AM (in response to rwoolf)

Tika's default behavior is to limit text extraction to 100k characters. There is a "writeLimit" attribute you can try to configure:
    "tikaExtractor": {
        "name": "Tika content-based extractor",
        "classname": "tika",
        "writeLimit": 100
    }
...
but I don't know if Tika will work past 100k chars.
6. Re: How to get text extraction to work

cathyben Aug 31, 2015 10:00 PM (in response to rwoolf)

I found another guide about extracting PDF text.
You can refer to it.