9 Replies Latest reply on May 8, 2013 7:39 AM by hchiorean

    Full text based search in ModeShape

    m.jawwad

      Hi,

       

            In my application, I have some queries that ran fine on Jackrabbit, which supports text-based search. After switching my application to ModeShape, they no longer work, especially the ones that use excerpt. The query that worked previously was:

       

                  SELECT excerpt(.) FROM nt:resource WHERE  CONTAINS(., 'some search text') order by jcr:title desc.

       

           This query returns all the nodes containing the text given in the CONTAINS clause (using simple SQL). But it does not work in ModeShape using either simple SQL or JCR-SQL2. I then simplified the query to:

       

               SELECT excerpt FROM [nt:resource]

           and

               SELECT * FROM [nt:resource]

       

           These queries return all the nodes in the repository, but I am still unable to extract the excerpt. How can I do that in ModeShape? I want to run a text-based search and then extract the excerpt as well.

       

             Thanks in advance.

        • 1. Re: Full text based search in ModeShape
          rhauch

          "EXCERPT" is a Jackrabbit-specific feature that is not even mentioned in the JSR-283 specification. Feel free to log an enhancement request, but be sure to include links and/or documentation that describe the feature as implemented in Jackrabbit. The more information you provide, the less time we have to spend researching it before we try to implement it. As always, we welcome any assistance in implementing it, too.

          • 2. Re: Full text based search in ModeShape
            m.jawwad

            Hi,

             

                  Thanks a lot for the reply. I wasn't aware that excerpt is a Jackrabbit-specific feature (my bad). So is there any way to perform a full-text search using ModeShape, and is there any way I can extract some content from a text file, PDF, etc. to display in a customized search result? I want the user to enter a string, i.e. some text, or a '*' to search the repo with an empty string. Is there a better way to query based on some metadata about the node in ModeShape?

             

                 Thanks again.

            • 3. Re: Full text based search in ModeShape
              rhauch

              Yes, you can certainly query using the standard full-text criteria (e.g., "CONTAINS" in JCR-SQL2) or using the full-text language (see here). You can even get the score (see here), which is maybe handy if you're ordering the results (the default order is from highest to lowest score).
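As a rough illustration of the standard full-text criteria mentioned above, the following sketch builds a JCR-SQL2 statement that uses CONTAINS and orders explicitly by score. The class and method names are hypothetical, the selector alias `n` is arbitrary, and only SQL string-literal escaping (doubling single quotes) is handled; the full-text expression syntax itself is passed through untouched. The resulting string would then be handed to the standard `QueryManager.createQuery(statement, Query.JCR_SQL2)` call.

```java
final class FullTextQuerySketch {

    // Doubles single quotes so the text is safe inside a JCR-SQL2 string literal.
    static String escapeLiteral(String text) {
        return text.replace("'", "''");
    }

    // Builds a full-text query over all properties of the given node type,
    // ordered from highest to lowest score.
    static String containsQuery(String nodeType, String searchText) {
        return "SELECT * FROM [" + nodeType + "] AS n "
             + "WHERE CONTAINS(n.*, '" + escapeLiteral(searchText) + "') "
             + "ORDER BY SCORE(n) DESC";
    }

    public static void main(String[] args) {
        System.out.println(containsQuery("nt:resource", "correct command"));
    }
}
```

Whether a given node type and property are actually covered by the full-text index depends on the repository configuration, so treat the statement as a starting point rather than a guaranteed match.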

               

              But unfortunately there is no way to obtain an excerpt of the matched text to include in your search results. That would require a feature like EXCERPT.
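Since there is no built-in excerpt, one possible workaround is to compute a snippet in the application. The sketch below is plain Java and does not use any ModeShape API; it assumes the application already has the plain text of a matching node (how that text is obtained is outside this sketch), and the names `ExcerptSketch` and `excerpt` are hypothetical.

```java
final class ExcerptSketch {

    // Returns up to `radius` characters of context on each side of the first
    // case-insensitive occurrence of `term`, or null if the term is absent.
    static String excerpt(String text, String term, int radius) {
        if (text == null || term == null || term.isEmpty()) {
            return null;
        }
        int hit = -1;
        for (int i = 0; i + term.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, term, 0, term.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) {
            return null;
        }
        int start = Math.max(0, hit - radius);
        int end = Math.min(text.length(), hit + term.length() + radius);
        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("...");   // mark text cut off before the snippet
        }
        sb.append(text, start, end);
        if (end < text.length()) {
            sb.append("...");   // mark text cut off after the snippet
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "Run the correct command to start the server.";
        System.out.println(excerpt(text, "correct command", 4));
    }
}
```

This only finds a literal phrase, so it will not line up with every hit the full-text index produces (for example stemmed or tokenized matches).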

              • 4. Re: Full text based search in ModeShape
                m.jawwad

                Hi,

                 

                       Thanks a lot for the reply. I am moving on from the excerpt feature for now, but I am still unable to perform a content-based search. I want to search files based on their text content, so I tried the query:

                 

                     SELECT * FROM [nt:resource]  WHERE CONTAINS([nt:resource],'the') order by [jcr:title] asc

                 

                     I want to find files that contain the text "the" in their content. But the search returns no results, and the query does not throw any exceptions either. My repo contains some PDF and text files. If I search by file name I get results, but a content-based search returns nothing. Can you tell me what is wrong with my query? I have also tried the syntax

                 

                      SELECT * FROM [nt:base] WHERE CONTAINS([nt:base],'full-text-query')

                 

                     I also tried '*' as a wildcard in the above query, but still got no results.

                 

                     Thanks again.

                • 5. Re: Full text based search in ModeShape
                  rhauch

                  First of all, "the" is a stopword, which is removed from the tokenized text because it is so common. That's why doing a full-text search (via CONTAINS) for "the" returns no results. If you need to search for specific words or phrases, you can always use LIKE, such as "...[jcr:title] LIKE '% the %' ..."

                   

                  Next, be sure that you're actually searching for words that will be found as tokens in the uploaded text. Then, be sure that the PDF text extractor is enabled. Personally, I'd start with uploading a simple text file and work your way up.

                   

                  Finally, be sure your application gives some time between save and full-text searching. Extracting text and indexing those tokens takes a bit of time.
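To make the stopword point concrete, here is a small plain-Java sketch that drops stopwords from user input before it is used in a CONTAINS clause, so the application can warn the user when nothing searchable is left. The word list is assumed to mirror the default English stop set used by Lucene's StandardAnalyzer; verify it against the analyzer actually configured for the repository. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopwordSketch {

    // Assumed to match Lucene's default English stop set; check your analyzer.
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with"));

    // Splits the input on non-alphanumeric characters, lower-cases it,
    // and keeps only the terms that are not stopwords.
    static List<String> searchableTerms(String query) {
        List<String> kept = new ArrayList<>();
        for (String term : query.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty() && !STOPWORDS.contains(term)) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(searchableTerms("the"));
        System.out.println(searchableTerms("the correct command"));
    }
}
```

If the returned list is empty, a full-text search is unlikely to match anything, and a LIKE constraint as described above may be the better fallback.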

                  • 6. Re: Full text based search in ModeShape
                    m.jawwad

                    Hi,

                     

                          Thanks again, I wasn't aware that 'the' is a stopword. By PDF text extractor, do you mean adding sequencers? In JBoss, these are the current sequencer entries:

                     

                    --------------------

                    "result" => {

                    "allow-workspace-creation" => true,

                    "anonymous-roles" => undefined,

                    "anonymous-username" => "<anonymous>",

                    "authenticator" => undefined,

                    "cache-container" => "modeshape",

                    "cache-name" => "sample",

                    "cluster-name" => undefined,

                    "cluster-stack" => undefined,

                    "configuration" => undefined,

                    "default-initial-content" => undefined,

                    "default-workspace" => "default",

                    "enable-monitoring" => true,

                    "enable-queries" => true,

                    "garbage-collection-initial-time" => "00:00",

                    "garbage-collection-interval" => 24,

                    "garbage-collection-thread-pool" => "modeshape-gc",

                    "indexing-analyzer-classname" => "org.apache.lucene.analysis.standard.StandardAnalyzer",

                    "indexing-analyzer-module" => undefined,

                    "indexing-async-max-queue-size" => 1,

                    "indexing-async-thread-pool-size" => 1,

                    "indexing-batch-size" => -1,

                    "indexing-mode" => "SYNC",

                    "indexing-reader-strategy" => "SHARED",

                    "indexing-thread-pool" => "modeshape-indexing-workers",

                    "jndi-name" => undefined,

                    "minimum-binary-size" => 4096,

                    "minimum-string-size" => undefined,

                    "node-types" => undefined,

                    "predefined-workspace-names" => undefined,

                    "rebuild-indexes-upon-startup" => "IF_MISSING",

                    "security-domain" => "modeshape-security",

                    "source" => undefined,

                    "system-content-indexing-mode" => "DISABLED",

                    "text-extractor" => undefined,

                    "use-anonymous-upon-failed-authentication" => false,

                    "workspaces-cache-container" => undefined,

                    "workspaces-initial-content" => undefined,

                    "sequencer" => {

                    "delimited-text-sequencer2" => {

                                    "classname" => "org.modeshape.sequencer.text.DelimitedTextSequencer",

                                    "module" => "org.modeshape.sequencer.text",

                                    "path-expressions" => ["/files(//*.txt[*])/jcr:content[@jcr:data] => /derived/text/delimited/$1"],

                                    "properties" => [("splitPattern" => ",")]

                                },

                    "delimited-text-sequencer3" => {

                                    "classname" => "org.modeshape.sequencer.text.DelimitedTextSequencer",

                                    "module" => "org.modeshape.sequencer.text",

                                    "path-expressions" => ["/files(//*.pdf])/jcr:content[@jcr:data] => /derived/text/delimited/$1"],

                                    "properties" => [("splitPattern" => ",")]

                                },

                                "delimited-text-sequencer" => {

                                    "classname" => "org.modeshape.sequencer.text.DelimitedTextSequencer",

                                    "module" => "org.modeshape.sequencer.text",

                                    "path-expressions" => ["/files(//*.csv[*])/jcr:content[@jcr:data] => /derived/text/delimited/$1"],

                                    "properties" => [("splitPattern" => ",")]

                                }

                    }

                    }

                    }

                    ----------------------

                     

                    Is there anything else I should check?

                     

                    and my current query is:

                    SELECT * FROM [nt:resource] WHERE  CONTAINS([nt:resource], 'correct command') order by [jcr:title] desc

                    'correct command' is an actual phrase in the pdf.

                     

                                Thanks again.

                    • 8. Re: Full text based search in ModeShape
                      m.jawwad

                      Thanks. After reading your post, I added this to my standalone-modeshape.xml:

                       

                            <text-extractor name="pdf-extractor" classname="pdfbox" module="org.apache.pdfbox"/>

                       

                      I am not using Maven. I also have the pdfbox jar in my JBoss installation under \modules\org\apache\tika\1.2\. But now I get an error during repository initialization:

                       

                           16:23:43,641 WARN  [org.modeshape.jboss.service] (MSC service thread 1-4) Cannot load module from (from classpath entry) with identifier: org.apache.pdfbox

                           16:23:43,645 ERROR [org.modeshape.jcr.TextExtractors] (MSC service thread 1-4) Unable to initialize the text extractor "pdf-extractor" for repository "sample": Unable to instantiate class pdfbox:           org.infinispan.CacheConfigurationException: Unable to instantiate class pdfbox

                       

                      and also:

                           Caused by: java.lang.ClassNotFoundException: pdfbox

                       


                       

                           Can you tell me what I might be missing? (I am sure I am missing some configuration.)

                       

                      Thanks in advance

                      • 9. Re: Full text based search in ModeShape
                        hchiorean

                        You need to be aware of a couple of things:

                         

                        1) the only text extractor provided out-of-the-box by ModeShape is the Tika text extractor, which must be configured as suggested in the previous link

                        2) if you want to add your own text extractor, you need to extend org.modeshape.jcr.api.text.TextExtractor and implement the required methods. After that, you need to package your text extractor in a jar and deploy it in EAP (see below)

                        3) the <text-extractor name="tika-extractor" classname="tika" module="org.modeshape.extractor.tika"/> element has the following semantics:

                             - name = a symbolic string, acts like an identifier for the extractor

                             - classname = unless it's exactly "tika" (which is preconfigured internally by ModeShape), it must be the fully qualified classname of your own implementation (see above)

                             - module = must be the folder structure, inside %EAP_HOME%/modules, where the text extractor code is located. The ModeShape kit installs %EAP_HOME%/modules/org/modeshape/extractor/tika which depends on /modules/org/apache/tika/1.2 (look at the module.xml file).
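The mapping from the module attribute to a folder can be sketched in plain Java. This is only an illustration: the helper name `moduleDir` is hypothetical, and the optional slot subdirectory (such as the "1.2" seen in /modules/org/apache/tika/1.2) is an assumption about the module layout, so pass null when no slot applies.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class ModulePathSketch {

    // Turns a module identifier such as "org.modeshape.extractor.tika" into
    // the directory expected under the modules root. The slot (for example
    // "1.2") is appended only when it is non-null.
    static Path moduleDir(Path modulesRoot, String moduleName, String slot) {
        Path dir = modulesRoot;
        for (String part : moduleName.split("\\.")) {
            dir = dir.resolve(part);
        }
        return slot == null ? dir : dir.resolve(slot);
    }

    public static void main(String[] args) {
        Path root = Paths.get("modules");
        System.out.println(moduleDir(root, "org.modeshape.extractor.tika", null));
        System.out.println(moduleDir(root, "org.apache.tika", "1.2"));
    }
}
```

Seen this way, a module value of "org.apache.pdfbox" points at modules/org/apache/pdfbox, which is why it cannot be loaded when no such folder has been installed.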

                         

                        My suggestion is that you use Tika exactly as described above. You only need to add that line to your EAP XML configuration. The Tika module includes PDFBox by default and can extract text from PDF files.

                         

                        Otherwise, you'll have to install your own modules in EAP and update ModeShape's module.xml file(s) with dependencies on those modules.