7 Replies Latest reply on Jul 15, 2015 3:11 AM by hchiorean

    modeshape 4.1 searching in jcr:data

    caliskan

      Hi everybody,

      We have uploaded files such as images, Word documents and PDF documents using ModeShape 4.1. Is it possible to search inside the Word and PDF documents using JCR-SQL2?


      Best Regards.

        • 1. Re: modeshape 4.1 searching in jcr:data
          hchiorean

          There are two things you can query about binary values:

          1. the LENGTH of these properties (i.e. the size in bytes, as defined by the JCR spec). See the JCR-SQL2 page of the ModeShape 4 documentation.
          2. the content of these properties if:
            • text extraction is enabled in the repository configuration (see the Tika text extractor page of the ModeShape 4 documentation)
            • you've configured a persistent binary store (i.e. you're persisting the binaries somewhere)
            • the type of binary content is supported by the Tika text extractor (formats like PDF and DOC(x) are supported by default)

          If all of the above are true, then you can search the binary content using full-text search (FTS); see the JCR-SQL2 page of the ModeShape 4 documentation.
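
          As a rough illustration of the two kinds of query described above, here is a minimal Java sketch that only builds the JCR-SQL2 statements as strings. The class and method names (`BinaryQueries`, `largerThan`, `containing`, `quote`) are hypothetical, and the use of `nt:resource` with a `res` selector is an assumption about how the files were stored; adjust it to your own node types.

```java
// Sketch only: builds JCR-SQL2 statements as plain strings; no repository is needed.
// All names here are illustrative and not part of ModeShape or the JCR API.
final class BinaryQueries {

    // Wrap a value in a JCR-SQL2 string literal, doubling embedded single quotes.
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    // Statement selecting nodes whose jcr:data binary is larger than minBytes.
    static String largerThan(long minBytes) {
        return "SELECT * FROM [nt:resource] AS res"
             + " WHERE LENGTH(res.[jcr:data]) > " + minBytes;
    }

    // Statement selecting nodes whose extracted binary text matches the search expression.
    static String containing(String searchExpression) {
        return "SELECT * FROM [nt:resource] AS res"
             + " WHERE CONTAINS(res.[jcr:data], " + quote(searchExpression) + ")";
    }

    public static void main(String[] args) {
        System.out.println(largerThan(1024L * 1024L));
        System.out.println(containing("cats"));
    }
}
```

          Either string can then be passed to `QueryManager.createQuery(statement, Query.JCR_SQL2)`. The `CONTAINS` query only returns matches when the three conditions listed above hold.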

          • 2. Re: modeshape 4.1 searching in jcr:data
            caliskan

            Thank you. We saved the files without any text extraction configuration. Is it possible to search the content of those existing files if we add the text extraction configuration now? Also, our repository is huge (~500 GB). Does it make sense to do this kind of processing in ModeShape, or should we look for another solution, such as writing an external application using Lucene or something similar?

            • 3. Re: modeshape 4.1 searching in jcr:data
              hchiorean
              Thank you. We saved the files without any text extraction configuration. Is it possible to search the content of those existing files if we add the text extraction configuration now?

              If you update your configuration, add the text extractors and restart, text extraction should be performed the next time you execute an FTS query on a property. (By the way, please make sure you're using the latest version of ModeShape, 4.3.0.Final, since we're constantly fixing bugs.)

              Also, our repository is huge (~500 GB). Does it make sense to do this kind of processing in ModeShape, or should we look for another solution, such as writing an external application using Lucene or something similar?

              ModeShape is primarily a JCR implementation, not a database/data store. That being said, we've tried to implement the binary storage feature as optimally as possible, trying not to "overload" regular JCR operations (see the Binary values page of the ModeShape 4 documentation). Size-wise, 500 GB shouldn't really be an issue, since binary values are stored outside of the repository. Performance-wise, I recommend you prototype and test different solutions and decide based on that. As I said, ModeShape is primarily a JCR implementation and, as such, it has a lot of constraints and extra processing steps compared, for example, to a loosely structured document store like MongoDB.

              • 4. Re: modeshape 4.1 searching in jcr:data
                caliskan

                Thanks Horia,

                I will consider all this information.

                • 5. Re: modeshape 4.1 searching in jcr:data
                  caliskan

                  By the way, our implementation uses an external file system repository. Given that, does the situation change?

                  • 6. Re: modeshape 4.1 searching in jcr:data
                    caliskan

                    Hi Horia,

                    I added the following configuration:

                    <text-extractors>
                      <text-extractor name="tika-extractor" classname="tika" module="org.modeshape.extractor.tika"/>
                    </text-extractors>

                    but I am getting the following exception:

                    "org.infinispan.persistence.spi.PersistenceException: java.io.StreamCorruptedException: Unexpected byte found when reading an object"


                    try {
                        String jql = "SELECT * FROM [nt:file] WHERE CONTAINS(*,'cats')";
                        QueryManager queryManager = session.getWorkspace().getQueryManager();
                        Query query = queryManager.createQuery(jql, Query.JCR_SQL2);
                        QueryResult queryResult = query.execute();
                        NodeIterator nodeIter = queryResult.getNodes();
                        while (nodeIter.hasNext()) {
                            Node node = nodeIter.nextNode();
                            String asd = node.getPath();
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }

                    • 7. Re: modeshape 4.1 searching in jcr:data
                      hchiorean

                      Text extraction & Infinispan in general are unrelated. If you can reproduce this in a test case with ModeShape 4.3 & ISPN 7 and you can provide a test-case we can run locally, feel free to open a JIRA. Thanks.