6 Replies Latest reply on Aug 31, 2015 10:00 PM by cathyben

    How to get text extraction to work

    rwoolf

      I am unable to search the text of a PDF document stored in ModeShape. According to the documentation, all I have to do to enable this is provide the text-extraction configuration described there. I've confirmed that both the Tika core and Tika parsers JARs are present, but I still cannot search on any of the text within the PDF: full-text searches (FTS) match node data, but not content inside the binary PDF file. Is there something more I need to do that is not documented? Between attempts I have been deleting my index and restarting ModeShape, which is configured to rebuild the index on startup. My configuration is below.

       

      {
          "name" : "Persisted-Repository",
          "monitoring" : {
              "enabled" : false
          },
          "workspaces" : {
              "predefined" : ["otherWorkspace"],
              "default" : "default",
              "allowCreation" : true
          },
          "security" : {
              "anonymous" : {
                  "roles" : ["readonly","readwrite","admin"],
                  "useOnFailedLogin" : false
              }
          },
          "storage" : {
              "cacheConfiguration" : "infinispan-configuration.xml",
              "cacheName" : "persisted-repository",
              "binaryStorage" : {
                  "type" : "file",
                  "directory" : "/home/ross/testrep/target/binaries",
                  "minimumBinarySizeInBytes" : 999
              }
          },
          "query" : {
              "enabled" : true,
              "indexStorage" : {
                  "type" : "filesystem",
                  "location" : "/home/ross/testrep/target/indexes"
              },
              "textExtracting" : {
                  "extractors" : {
                      "tikaExtractor" : {
                          "name" : "General content-based extractor",
                          "classname" : "tika"
                      }
                  }
              },
              "indexing" : {
                  "rebuildOnStartup" : {
                      "when" : "if_missing",
                      "mode" : "async"
                  }
              }
          }
      }
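
      For reference, a full-text search that is expected to match extracted binary content is usually written in JCR-SQL2 with CONTAINS. The sketch below only builds such a statement. The class name, the helper names, the nt:resource node type, and the search term are illustrative assumptions, not the exact query used in this post.

```java
public class FullTextQuerySketch {

    // Doubles single quotes so the term can sit inside a JCR-SQL2 string literal.
    static String escapeLiteral(String term) {
        return term.replace("'", "''");
    }

    // Builds a JCR-SQL2 statement that full-text searches all properties
    // of nodes of the given type (hypothetical helper, for illustration only).
    static String buildContainsQuery(String nodeType, String term) {
        return "SELECT * FROM [" + nodeType + "] AS n WHERE CONTAINS(n.*, '"
                + escapeLiteral(term) + "')";
    }

    public static void main(String[] args) {
        // Assumption: the PDF bytes live on an nt:resource node (jcr:content/jcr:data).
        String sql = buildContainsQuery("nt:resource", "invoice");
        System.out.println(sql);
        // With a live javax.jcr.Session (not shown here), the statement would be
        // executed roughly as:
        //   session.getWorkspace().getQueryManager()
        //          .createQuery(sql, javax.jcr.query.Query.JCR_SQL2)
        //          .execute();
    }
}
```

      If a query of this shape matches node properties but never matches words that appear only inside the PDF, that points at the extraction or indexing step rather than at the query itself.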