1 2 Previous Next 15 Replies Latest reply on May 1, 2017 5:08 PM by dalbani

    Running text extractors synchronously

    dalbani

      Hello,

       

      My experience with text extractors (in combination with S3 binary storage) is that the extraction is triggered by a search, i.e. not when the nodes are added to the repository.

      Am I wrong in this observation? If not, is it possible to configure ModeShape to run the text extraction "synchronously", at node insertion time?

       

      Thanks.

        • 1. Re: Running text extractors synchronously
          dalbani

          Hum, I suppose that could be related to the default index provider??

          Because using a Lucene index provider does make text extractors run synchronously.

           

          Speaking of the Lucene index provider, why does ModeShape insist on fetching the extracted text from the binary storage when making a full-text search like:

           

          SELECT resource.[jcr:path]

          FROM [nt:resource] AS resource

          WHERE CONTAINS(resource.[jcr:data], 'text')

           

          I suppose that comes from the call to "RowExtractors.extractFullText()" in "ScanningQueryEngine" — but what's the point of fetching this information?

          And that makes queries unnecessarily slow when combined with a remote store like S3.

          • 2. Re: Running text extractors synchronously
            hchiorean

            Text extraction only exists for full text search and is a highly costly performance wise operation. As such, triggering it each time a binary is inserted is not ok.

             

            Therefore, text is extracted and stored in the binary store lazily, only when a FTS query is performed. Now, as mentioned in our docs, not all index providers support FTS:

            • when no indexes are defined, FTS is supported via the ScanningQueryEngine
            • the local index provider (MapDB) does not support FTS indexes
            • Lucene does support FTS
            • ES does support FTS

            I suppose that comes from the call to "RowExtractors.extractFullText()" in "ScanningQueryEngine" — but what's the point of fetching this information?

            And that makes queries unnecessarily slow when combined with a remote store like S3.

             

            text extraction is an optional feature, so if you don't need it, don't enable it in your configuration.

            • 3. Re: Running text extractors synchronously
              dalbani

              I do need FTS, that's a must-have feature in my use case. That's why I want it to perform as good as possible — and the first step is to understand how it actually works in ModeShape

               

              Text extraction only exists for full text search and is a highly costly performance wise operation. As such, triggering it each time a binary is inserted is not ok.

              To be clear, my remark was about the fact that text extraction, if enabled, does happens at insertion time for Lucene / ElasticSearch index providers, but not with the local index provider, which does it on the fly when executing a FTS query.

              But I suppose that comes from the feature set of the different index providers?

               

              text extraction is an optional feature, so if you don't need it, don't enable it in your configuration.

              The point I was trying to make is that even with, say, a Lucene FTS index in place, I see that ModeShape call getExtractedText() on the (S3) binary store.

              I just don't understand why ModeShape does so when running a FTS query which simply ask for the [jcr:path] attribute for example?!

              • 4. Re: Running text extractors synchronously
                hchiorean

                To be clear, my remark was about the fact that text extraction, if enabled, does happens at insertion time for Lucene / ElasticSearch index providers, but not with the local index provider, which does it on the fly when executing a FTS query.

                But I suppose that comes from the feature set of the different index providers?

                as mentioned previously, there is not FTS for the local index provider. So I presume you're referring to the difference between *no indexes* (default ScanningQueryEngine behavior) and Lucene (or other index providers). In the first case, when the query is executed it synchronously (always) asks the binary store for the extracted text and uses that value. In the case of the index providers, when a binary is inserted/updated, the process of writing data to a text index is done via an event, either synchronously or asynchronously based on your individual index configuration.

                 

                There is a fundamental difference in code & logic when running with *no indexes* or *some indexes*.

                 

                The point I was trying to make is that even with, say, a Lucene FTS index in place, I see that ModeShape call getExtractedText() on the (S3) binary store.

                I just don't understand why ModeShape does so when running a FTS query which simply ask for the [jcr:path] attribute for example?!

                So if you have a Lucene text index on a property and you add/edit that property, each time text extraction should be triggered to get the new value and write it to the index (note that this has nothing to do with running the actual query)

                If you're running *any query* that does not include a FTS constraint (i.e. CONTAINS) then at no point during the execution of the query should the binary store getText() method be called. If you have a local test case which shows otherwise, feel free to log a JIRA.

                • 5. Re: Running text extractors synchronously
                  dalbani

                  Thanks for the explanation about indexes.

                   

                  If you're running *any query* that does not include a FTS constraint (i.e. CONTAINS) then at no point during the execution of the query should the binary store getText() method be called. If you have a local test case which shows otherwise, feel free to log a JIRA.

                  Well, let's say I run this query, having a Lucene index on nt:resource / jcr:data(BINARY):

                   

                  SELECT resource.[jcr:path]
                  FROM [nt:resource] AS resource
                  WHERE CONTAINS(resource.[jcr:data], 'text')
                  

                   

                  It does uses a CONTAINS constraint, so that's a FTS query.

                  Given the results returned by ModeShape, I see that the Lucene index is indeed used.

                  But then ModeShape calls getExtractedText() anyway on each result afterwards?! (Imposing a large performance penalty on such a small query...)

                  Is it an expected behavior of ModeShape?

                  • 6. Re: Running text extractors synchronously
                    hchiorean

                    But then ModeShape calls getExtractedText() anyway on each result afterwards?! (Imposing a large performance penalty on such a small query...)

                    Is it an expected behavior of ModeShape?

                    I don't understand what you mean here. If you can post either a stack trace or something I may be able to tell more.

                    • 7. Re: Running text extractors synchronously
                      dalbani

                      For each result return by the FTS query shown above, the following gets called:

                       

                      11:55:03,341 ERROR [stderr] (default task-57) java.lang.Exception: Stack trace
                      11:55:03,341 ERROR [stderr] (default task-57) at java.lang.Thread.dumpStack(Thread.java:1333)
                      11:55:03,341 ERROR [stderr] (default task-57) at org.modeshape.jcr.value.binary.S3BinaryStore.getExtractedText(S3BinaryStore.java:235)
                      11:55:03,341 ERROR [stderr] (default task-57) at org.modeshape.jcr.value.binary.AbstractBinaryStore.getText(AbstractBinaryStore.java:116)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.RowExtractors.extractTextFromSingleProperty(RowExtractors.java:209)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.RowExtractors.extractFullTextFrom(RowExtractors.java:200)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.engine.ScanningQueryEngine$15.getValueInRow(ScanningQueryEngine.java:1628)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.engine.ScanningQueryEngine$DynamicOperandFilter.isCurrentRowValid(ScanningQueryEngine.java:2611)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.NodeSequence$23.findNext(NodeSequence.java:1879)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.NodeSequence$23.hasNext(NodeSequence.java:1862)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.NodeSequence$LimitBatch.hasNext(NodeSequence.java:835)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.NodeSequence$23.findNext(NodeSequence.java:1876)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.NodeSequence$23.hasNext(NodeSequence.java:1862)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.NodeSequence$SingleWidthBatch.<init>(NodeSequence.java:1933)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.NodeSequence.copy(NodeSequence.java:1920)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.engine.process.RestartableSequence.loadBatch(RestartableSequence.java:213)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.engine.process.RestartableSequence$1.nextBatch(RestartableSequence.java:92)
                      11:55:03,342 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.engine.process.RestartableSequence.nextBatch(RestartableSequence.java:119)
                      11:55:03,343 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.JcrQueryResult$QueryResultIterator.findNextBatch(JcrQueryResult.java:257)
                      11:55:03,343 ERROR [stderr] (default task-57) at org.modeshape.jcr.query.JcrQueryResult$QueryResultIterator.hasNext(JcrQueryResult.java:241)
                      11:55:03,343 ERROR [stderr] (default task-57) at org.modeshape.web.server.JcrServiceImpl.query(JcrServiceImpl.java:490)
                      11:55:03,343 ERROR [stderr] (default task-57) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                      11:55:03,343 ERROR [stderr] (default task-57) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                      11:55:03,343 ERROR [stderr] (default task-57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                      11:55:03,343 ERROR [stderr] (default task-57) at java.lang.reflect.Method.invoke(Method.java:498)
                      11:55:03,343 ERROR [stderr] (default task-57) at com.google.gwt.user.server.rpc.RPC.invokeAndEncodeResponse(RPC.java:561)
                      11:55:03,343 ERROR [stderr] (default task-57) at com.google.gwt.user.server.rpc.RemoteServiceServlet.processCall(RemoteServiceServlet.java:208)
                      11:55:03,343 ERROR [stderr] (default task-57) at com.google.gwt.user.server.rpc.RemoteServiceServlet.processPost(RemoteServiceServlet.java:248)
                      11:55:03,343 ERROR [stderr] (default task-57) at com.google.gwt.user.server.rpc.AbstractRemoteServiceServlet.doPost(AbstractRemoteServiceServlet.java:62)
                      

                       

                      By the way, I ran the query via the ModeShape Explorer GUI: that makes no difference, doesn't it?

                      • 8. Re: Running text extractors synchronously
                        hchiorean

                        Running via the GUI makes no difference.

                         

                        The stack trace indicates that for this particular query, no indexes are being used and the default ScanningQueryEngine is used. Since that is the case, this is 100% the correct behavior as ModeShape cannot tell if a particular node matches your criteria without a) reading the node first and b) extracting the text for each node.

                         

                        Were indexes being used (i.e. Lucene) this stack trace would look completely different for your query. So either you don't have indexes or your index definitions are not used for this query (if you look at the query plan you should be able to tell if/what indexes are being used)

                        • 9. Re: Running text extractors synchronously
                          dalbani

                          Then there's probably an issue of why the index doesn't get used.

                           

                          Here's the configuration that I use:

                                          <indexes>
                                              <index name="nt-resource_jcr-name" provider-name="lucene" kind="value" synchronous="true" node-type="nt:resource" columns="jcr:name(NAME)"/>
                                              <index name="nt-resource_jcr-data" provider-name="lucene" kind="text" synchronous="true" node-type="nt:resource" columns="jcr:data(BINARY)"/>
                                          </indexes>
                          

                           

                          And I see this messages at startup:

                          12:11:16,435 DEBUG [org.modeshape.jcr.index.lucene.LuceneIndexProvider] (ServerService Thread Pool -- 68) Index provider 'lucene' is creating index in workspace 'default': nt-resource_jcr-name@lucene nodeType=nt:resource column=jcr:name(NAME) kind=VALUE sync=true workspaces=*
                          12:11:16,574 DEBUG [org.modeshape.jcr.index.lucene.LuceneIndexProvider] (ServerService Thread Pool -- 68) Index provider 'lucene' is creating index in workspace 'default': jcr_data@lucene nodeType=nt:resource column=jcr:data(BINARY) kind=TEXT sync=true workspaces=*
                          

                           

                          I have even removed the Lucene index files and re-indexed everything:

                          12:24:35,725 DEBUG [org.modeshape.jcr.index.lucene.LuceneIndexProvider] (modeshape-reindexing-6-thread-1) Enabled index 'nt-resource_jcr-name' from provider 'lucene' in workspace 'default' after reindexing has completed
                          

                           

                          But I still the same behavior as previously with regard to getExtractedText().

                          • 10. Re: Running text extractors synchronously
                            dalbani

                            And I see this strange (to me, at least) double occurences of jcr:data indexes:

                             

                                        Index [resource] <INDEX_SPECIFICATION=jcr_data, provider=lucene, cost~=100, cardinality~=233, selectivity~=0.17598188, constraints=[CONTAINS(resource.[jcr:data],'text')]>
                                        Index [resource] <INDEX_SPECIFICATION=nt-resource_jcr-data, provider=lucene, cost~=100, cardinality~=233, selectivity~=0.17598188, constraints=[CONTAINS(resource.[jcr:data],'text')]>
                            
                            12:46:28,481 TRACE [org.modeshape.jcr.query] (default task-8) Computed output columns for query 83: {
                            resource.[jcr:name]{STRING}
                            }
                            12:46:28,481 TRACE [org.modeshape.jcr.query] (default task-8) Start executing query 83
                            12:46:28,487 TRACE [org.modeshape.jcr.query] (default task-8) The execution function for 83: (limit 2 (filtered width=1 org.modeshape.jcr.query.engine.ScanningQueryEngine$16@64da4297 (filtered width=1 (filter resource.[jcr:primaryType] = 'nt:resource') (from-index jcr_data with [CONTAINS(resource.[jcr:data],'text')]))) )
                            12:46:28,489 TRACE [org.modeshape.jcr.query] (default task-8) Stopped executing query 83: 61.596 ms (plan=222.094 usec, optim=56.145 ms, resultform=5.228 ms)
                            

                             

                            Yet, I do have a single index for this column?!

                            • 11. Re: Running text extractors synchronously
                              hchiorean

                              Can you post the full query plan ?

                              • 12. Re: Running text extractors synchronously
                                hchiorean

                                actually, I had a look at one of our internal test cases and even if the index is used as expected, the `getText` method is called after getting the node keys from the index.

                                 

                                Since the code in that area is highly complex, I'm not sure if it can be improved in this case or not (i.e. to not call `getText` during each row iteration), but feel free to open an enhancement JIRA and hopefully someone will be able to investigate it as some point

                                • 13. Re: Running text extractors synchronously
                                  dalbani

                                  Here's the full query plan:

                                   

                                  12:46:28,476 TRACE [org.modeshape.jcr.query] (default task-8) Computed optimized query plan for query 83:
                                  Limit <LIMIT_COUNT=2>
                                    Access [resource]
                                      Project [resource] <PROJECT_COLUMNS=[resource.[jcr:name]], PROJECT_COLUMN_TYPES=[STRING]>
                                        Select [resource] <SELECT_CRITERIA=CONTAINS(resource.[jcr:data],'text')>
                                          Select [resource] <SELECT_CRITERIA=resource.[jcr:primaryType] = 'nt:resource'>
                                            Source [resource] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=resource, SOURCE_COLUMNS=[jcr:isCheckedOut(BOOLEAN), jcr:defaultPrimaryType(STRING), mode:uri(STRING), zip:crc(LONG), file:ghost(BOOLEAN), zip:isUnixSymlink(BOOLEAN), mode:lockingSession(STRING), jcr:etag(STRING), jcr:created(DATE), zip:centralDirectoryExtra(BINARY), jcr:sameNameSiblings(BOOLEAN), jcr:encoding(STRING), zip:centralDirectoryLength(LONG), jcr:workspace(STRING), ns1:alternativeName(STRING), jcr:activityTitle(STRING), jcr:mergeFailed(STRING), jcr:frozenMixinTypes(STRING), jcr:isQueryOrderable(BOOLEAN), ns1:uploadDate(DATE), jcr:lastModifiedBy(STRING), ns1:username(STRING), ns1:title(STRING), jcr:childVersionHistory(STRING), ns1:closingDate(DATE), zip:lastModifiedTime(DATE), mode:synchronous(BOOLEAN), jcr:repository(STRING), mode:derivedAt(DATE), ns1:name(STRING), jcr:lastModified(DATE), zip:localFileDataData(BINARY), jcr:successors(STRING), jcr:availableQueryOperators(STRING), ns1:url(STRING), mode:derivedFrom(STRING), jcr:primaryItemName(STRING), ns1:dateEnded(DATE), mode:isSessionScoped(BOOLEAN), jcr:frozenUuid(STRING), zip:localFileDataLength(LONG), jcr:isFullTextSearchable(BOOLEAN), zip:usesUTF8ForNames(BOOLEAN), jcr:defaultValues(STRING), mode:sharedUuid(STRING), ns1:software(STRING), jcr:predecessors(STRING), ns1:timezone(STRING), jcr:content(STRING), zip:creationTime(DATE), zip:usesStrongEncryption(BOOLEAN), file:hash(STRING), zip:usesEncryption(BOOLEAN), zip:headerId(LONG), mode:projectedNodeKey(STRING), jcr:language(STRING), jcr:valueConstraints(STRING), mode:sha1(STRING), jcr:activity(STRING), ns1:identifier(STRING), ns1:dateStarted(DATE), jcr:requiredPrimaryTypes(STRING), jcr:baseVersion(STRING), zip:method(LONG), ns1:publicationDate(DATE), jcr:requiredType(STRING), jcr:autoCreated(BOOLEAN), jcr:isAbstract(BOOLEAN), mode:externalNodeKey(STRING), jcr:hasOrderableChildNodes(BOOLEAN), mode:lockToken(STRING), mode:kind(STRING), ns1:successful(BOOLEAN), mode:alias(STRING), mode:columnTypeName(STRING), jcr:text(STRING), jcr:mimeType(STRING), zip:lastAccessTime(DATE), mode:workspace(STRING), ns1:enabled(BOOLEAN), jcr:data(BINARY), jcr:port(STRING), ns1:description(STRING), jcr:protected(BOOLEAN), jcr:primaryType(STRING), zip:comment(STRING), zip:size(LONG), mode:expirationDate(DATE), jcr:id(STRING), privileges(STRING), zip:centralDirectoryData(BINARY), jcr:host(STRING), zip:isDirectory(BOOLEAN), jcr:lifecyclePolicy(STRING), jcr:versionHistory(STRING), jcr:mixinTypes(STRING), jcr:isQueryable(BOOLEAN), jcr:mandatory(BOOLEAN), jcr:lockIsDeep(BOOLEAN), zip:usesDataDescriptor(BOOLEAN), ns1:log(STRING), zip:unixMode(LONG), jcr:currentLifecycleState(STRING), jcr:nodeTypeName(STRING), jcr:isMixin(BOOLEAN), zip:internalAttributes(LONG), name(STRING), jcr:messageId(STRING), jcr:uuid(STRING), ns1:password(STRING), jcr:multiple(BOOLEAN), zip:externalAttributes(LONG), jcr:title(STRING), jcr:onParentVersion(STRING), mode:nodeTypeName(STRING), jcr:supertypes(STRING), jcr:protocol(STRING), mode:propertyName(STRING), ns1:reference(STRING), jcr:frozenPrimaryType(STRING), zip:platform(LONG), jcr:lockOwner(STRING), jcr:createdBy(STRING), jcr:isDeep(BOOLEAN), zip:localFileDataExtra(BINARY), jcr:hold(STRING), jcr:configuration(STRING), zip:rawName(BINARY), jcr:description(STRING), jcr:statement(STRING), mode:isHeldBySession(BOOLEAN), mode:workspaces(STRING), jcr:copiedFrom(STRING), jcr:retentionPolicy(STRING), jcr:root(STRING), jcr:versionableUuid(STRING), zip:compressedSize(LONG), mode:generated(BOOLEAN), mode:localName(STRING), mode:id(STRING), mode:depth(LONG), jcr:score(DOUBLE), jcr:path(STRING), jcr:name(STRING)]>
                                              Index [resource] <INDEX_SPECIFICATION=jcr_data, provider=lucene, cost~=100, cardinality~=233, selectivity~=0.17598188, constraints=[CONTAINS(resource.[jcr:data],'text')]>
                                              Index [resource] <INDEX_SPECIFICATION=nt-resource_jcr-data, provider=lucene, cost~=100, cardinality~=233, selectivity~=0.17598188, constraints=[CONTAINS(resource.[jcr:data],'text')]>
                                  

                                   

                                  Why is there such a long list of properties by the way?

                                  • 14. Re: Running text extractors synchronously
                                    dalbani

                                    And I see this strange (to me, at least) double occurences of jcr:data indexes.

                                    In fact, that only came from the fact than other indexes I defined in the past were still recorderd in "/jcr:system/mode:indexes".

                                    But now I get yet another different behavior: the Lucene indexes don't get automatically built at startup of ModeShape where there are not present?!

                                    I'm pretty sure it did until I removed all indexes via the IndexManager and started from scratch.

                                    1 2 Previous Next