1 Reply Latest reply on Apr 14, 2016 5:53 AM by hchiorean

    Full-Text search for binary content using ModeShape 4.5.0

    roykaushik

      I am pretty new to JCR and ModeShape, and I am using ModeShape version 4.5.0 in my application for document storage. All the ModeShape jars, including tika-core-1.8.jar, are on my classpath. I am able to persist and retrieve binary data successfully; however, when I perform a full-text search on binary content, I do not get the expected results. My ModeShape repository configuration JSON file looks like this:

       

      {
          "name" : "DocumentRepository",
          "workspaces" : {
              "predefined" : ["otherWorkspace"],
              "default" : "default",
              "allowCreation" : true
          },
          "security" : {
              "anonymous" : {
                  "roles" : ["readonly","readwrite","admin"],
                  "useOnFailedLogin" : false
              }
          },
          "storage" : {
              "cacheName" : "WSDocumentCache",
              "binaryStorage" : {
                  "type" : "database",
                  "driverClass" : "${modeshape.database.driverclass}",
                  "username" : "${modeshape.database.username}",
                  "password" : "${modeshape.database.password}",
                  "url" : "${modeshape.database.url}"
              }
          },
          "textExtraction" : {
              "extractors" : {
                  "tikaExtractor" : {
                      "name" : "Tika content-based extractor",
                      "classname" : "tika"
                  }
              }
          }
      }

       

      My full-text search query and the corresponding Java code to retrieve the data are as follows:

       

      String jql = "SELECT file.* FROM [nt:file] AS file INNER JOIN [nt:resource] AS data " +
              "ON ISCHILDNODE(data, file) WHERE CONTAINS(data.[jcr:data], $searchText)";
      QueryManager queryManager = jcrSession.getWorkspace().getQueryManager();
      Query query = queryManager.createQuery(jql, Query.JCR_SQL2);
      Value tag = jcrSession.getValueFactory().createValue("New screen to be able to add");
      query.bindValue("searchText", tag);

      QueryResult queryResult = query.execute();
      RowIterator rowIter = queryResult.getRows();
      logger.info("Total no of rows returned by the query :: " + rowIter.getSize());
      // further code to get the rows and the row data

       

      With the above query, the RowIterator size is always 0, although I have multiple documents saved in the database that contain the search text "New screen to be able to add". I tried replacing INNER JOIN with LEFT JOIN, but that returns all nt:file nodes regardless of whether the child nt:resource node contains the search text.

       

      I am a bit stuck on this one and need to move ahead, so any quick help would be highly appreciated.

       

      Thanks,

      Kaushik

        • 1. Re: Full-Text search for binary content using ModeShape 4.5.0
          hchiorean

          First, what types of files are you expecting text to be extracted from? Is it something that Tika supports? If yes, you should enable debug logging to make sure ModeShape extracts the text.

           

          Query-wise, you don't need any joins. Just select all the nt:resource nodes and then use getParent to get the actual file. If you still don't get any results, you can do a quick test and search for the hardcoded string instead of a bound variable, to check whether there is a potential bug.
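
The join-free test described above could be sketched roughly as follows. This is only an illustration, not ModeShape API: the class name JoinFreeQuery and the method build are hypothetical, and the statement assumes the binary content lives in the jcr:data property of nt:resource nodes, as in the original query. The search text is inlined as a literal so that the bind variable is taken out of the picture.

```java
// Sketch only: JoinFreeQuery and build(...) are hypothetical names, not ModeShape API.
final class JoinFreeQuery {

    private JoinFreeQuery() {
    }

    /**
     * Builds a JCR-SQL2 statement that searches nt:resource nodes directly,
     * with the search text inlined as a string literal instead of a bind variable.
     */
    static String build(String searchText) {
        // In a JCR-SQL2 string literal, a single quote is escaped by doubling it.
        String escaped = searchText.replace("'", "''");
        return "SELECT * FROM [nt:resource] AS data "
                + "WHERE CONTAINS(data.[jcr:data], '" + escaped + "')";
    }

    public static void main(String[] args) {
        // Print the statement so it can be inspected or pasted into a query tool.
        System.out.println(build("New screen to be able to add"));
    }
}
```

With the standard JCR 2.0 API, the resulting statement would be passed to queryManager.createQuery(statement, Query.JCR_SQL2); since the query has a single selector, the matches can be read with queryResult.getNodes(), and calling getParent() on each returned nt:resource node should give the owning nt:file.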