10 Replies Latest reply on Sep 24, 2011 1:35 PM by penkween

    How to update the Query index without restarting the repository ?

    penkween

      Hi,

       

      For the Filesystem Connector used within an app server, let's say the repository already has the following directory structure:

       

      /

      /folder1

      /folder2

       

            When the JcrEngine and the repository are started and the index is built, the query "SELECT * FROM [nt:base]" lists the three nodes as expected. What happens when a new file, e.g. log1.txt, is generated or added to the repository by the operating system (not via ModeShape's Node.addNode()), so that the repository structure becomes the following:

       

      /

      /folder1

      /folder2

      /log1.txt

       

           A query will not include "log1.txt" unless the repository is restarted so that the index is rebuilt (assuming rebuildQueryIndexOnStartup="always"). Without restarting the repository, how do we add a new entry such as "/log1.txt" to the index so that queries return it? Is it possible to access the underlying "SearchEngineIndexer.indexSubgraph()" to update the index, and what would the code look like?

       

      Thanks.

        • 1. Re: How to update the Query index without restarting the repository ?
          rhauch

          What happens when a new file, e.g. log1.txt, is generated or added to the repository by the operating system (not via ModeShape's Node.addNode()), so that the repository structure becomes the following:

           

          /

          /folder1

          /folder2

          /log1.txt

           

          The file system connector does not currently support monitoring the file system for changes and translating these changes into ModeShape events, and so the indexes don't know they need to be changed. We don't know of an easy way to do this in JDK6 without polling or use of a native library, but if you do please let us know. We'd definitely like to see some suggestions. (BTW, JDK7 offers new support for this.)
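The polling approach mentioned above can be sketched as follows. This is only a minimal illustration, not part of ModeShape: the class name `DirectoryPoller` and its methods are hypothetical, and it uses nothing beyond `java.io.File`, so it also works on JDK6. It snapshots the directory tree and reports paths that have appeared since the previous poll. (On JDK7 and later, `java.nio.file.WatchService` is the built-in alternative to polling.)

```java
import java.io.File;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: polls a directory tree and reports paths added since the last poll. */
final class DirectoryPoller {
    private final File root;
    private Set<String> lastSeen;

    DirectoryPoller(File root) {
        this.root = root;
        this.lastSeen = snapshot(); // baseline: everything present now counts as already known
    }

    /** Returns repository-style paths (e.g. "/log1.txt") that are new since the previous call. */
    Set<String> pollForNewPaths() {
        Set<String> current = snapshot();
        Set<String> added = new TreeSet<String>(current);
        added.removeAll(lastSeen);
        lastSeen = current;
        return added;
    }

    private Set<String> snapshot() {
        Set<String> paths = new TreeSet<String>();
        collect(root, "", paths);
        return paths;
    }

    private static void collect(File dir, String prefix, Set<String> out) {
        File[] children = dir.listFiles();
        if (children == null) return; // not a directory, or an I/O error occurred
        for (File child : children) {
            String path = prefix + "/" + child.getName();
            out.add(path);
            if (child.isDirectory()) collect(child, path, out);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DirectoryPoller poller = new DirectoryPoller(new File(args.length > 0 ? args[0] : "."));
        Thread.sleep(2000); // a real application would poll on a schedule
        for (String path : poller.pollForNewPaths()) {
            // An application would trigger reindexing for this path at this point.
            System.out.println("new: " + path);
        }
    }
}
```

The sketch only detects additions; removals and renames would need the reverse set difference as well.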

           

          There are 3 methods on JcrSession to reindex the content, and so your application can force a reindexing (at a particular subgraph if known). However, I just noticed that although the methods are public, the JcrSession class is not public and this hides the methods. I've logged MODE-1269 as a critical bug to make these public in the upcoming 2.6.0.FINAL release (see the issue for what we plan to change, and feel free to comment on whether this would suit your needs).

          • 2. Re: How to update the Query index without restarting the repository ?
            penkween

            Hi Randall,

             

            I tested with the latest ModeShape source, including the MODE-1269 updates, and found a few issues:

             

             

            Issue 1 : Should we use Session or JcrSession ?

            ======================================================================================

             

            A.) Using javax.jcr.Session, I can't access the 4 functions reindex(), reindexAsync(), ... via the code below:

             

            javax.jcr.Session session = repository.login();       

            session.getWorkspace().?  <================ Can't access to reindex(),reindexAsync() etc..

             

             

            B.) But using org.modeshape.jcr.JcrSession, I can access the 4 functions via the code below, provided my app's package is the same as that of the JcrSession class, org.modeshape.jcr (since the JcrSession class is not public):

             

            javax.jcr.Session session = repository.login();       

            session.getWorkspace().reindex() <=========  Can access now.

             

             

             

            Issue 2 : Of the 4 functions, only reindex() works OK so far.

            ======================================================================================

             

            I tested a number of scenarios using a standalone app with the config options below:

             

            <mode:option jcr:name="rebuildQueryIndexOnStartup" mode:value="ifMissing" />

            <mode:option jcr:name="queryIndexDirectory" mode:value="./repository/indexes" />

             

             

            Let's say we have the following file system structure:

            /

            /folder1

            /folder1/dir1

            /folder1/file1.txt

             

             

            Below are the test results for the 4 reindexing functions:

             

            reindex()                              // Reindexing is OK for all nodes.

            reindexAsync()                         // Reindexing is OK for all nodes, BUT my app's process hangs and never quits.

            reindexAsync(java.lang.String, int)    // Reindexing is OK for all nodes, BUT my app's process hangs and never quits.

             

            reindex(String path, int depth)        // Reindexing is OK only if started from the root node, i.e. reindex("/", 2).

             

                                                   // Reindexing FAILS if started from any child node, e.g. reindex("/folder1", 2). If we rename "dir1" to "dir2" and run reindex("/folder1", 2), the query does not return the new "/folder1/dir2"; only the old "/folder1/dir1" shows up.

             

                                                   // Furthermore, with reindex("/folder1", 3), where the "3" exceeds the actual depth, the new "dir2" appears alongside the old "dir1" as a duplicate index entry: the query returns both "/folder1/dir1" and "/folder1/dir2". The same duplication happens with files: if we rename "file1.txt" to "file2.txt", the query returns both.

             

             

             

            * None of the above issues show up in the unit @Test methods in JcrRepositoryTest.java, possibly because:


            - JcrRepositoryTest is in the same package as the JcrSession class, i.e. org.modeshape.jcr

            - All the @Test methods reindex only from the root node, i.e. reindex("/", 2)

            - All the @Test methods may store indexes in memory (not persisted to the file system), so the indexes are rebuilt by default on every run.

             


            • 3. Re: How to update the Query index without restarting the repository ?
              rhauch

              Issue 1 : Should we use Session or JcrSession ?

              ======================================================================================

               

              A.) Using javax.jcr.Session, I can't access the 4 functions reindex(), reindexAsync(), ... via the code below:

               

              javax.jcr.Session session = repository.login();       

              session.getWorkspace().?  <================ Can't access to reindex(),reindexAsync() etc..

               

               

              The methods are not defined on the JCR API, so you're going to have to cast to ModeShape-specific classes. Now, the new methods are defined as public methods on the new public interface "org.modeshape.jcr.api.Workspace", which extends "javax.jcr.Workspace" by adding these new methods. Note that the "org.modeshape.jcr.api" package (and its few subpackages) are defined as a public API for ModeShape - meaning we treat them as formal APIs, guaranteeing no adverse changes for the rest of the 2.x series. And although we do reserve the right to change the public API in a major release, we currently have no intentions of doing so.

               

              So, simply cast the javax.jcr.Workspace to 'org.modeshape.jcr.api.Workspace', and you'll see the public reindexing methods.
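The cast can be sketched as below. Because a standalone snippet has neither the JCR API nor ModeShape on its classpath, the two nested interfaces are stand-ins (an assumption of this sketch): `Workspace` plays the role of `javax.jcr.Workspace` and `ModeShapeWorkspace` plays the role of `org.modeshape.jcr.api.Workspace`. The helper `reindexIfSupported` is likewise hypothetical. In application code you would import the real interfaces and apply the same check to `session.getWorkspace()`.

```java
final class WorkspaceCastSketch {
    /** Stand-in for javax.jcr.Workspace (the standard JCR interface). */
    interface Workspace {
        String getName();
    }

    /** Stand-in for org.modeshape.jcr.api.Workspace, which adds the reindexing methods. */
    interface ModeShapeWorkspace extends Workspace {
        void reindex();
    }

    /** Reindexes if the workspace is the ModeShape-specific type; returns false otherwise. */
    static boolean reindexIfSupported(Workspace workspace) {
        if (workspace instanceof ModeShapeWorkspace) {
            ((ModeShapeWorkspace) workspace).reindex(); // the downcast exposes the extra methods
            return true;
        }
        return false; // some other JCR implementation: no reindexing API available
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        Workspace workspace = new ModeShapeWorkspace() {
            public String getName() { return "default"; }
            public void reindex() { calls[0]++; }
        };
        System.out.println(reindexIfSupported(workspace) + ", reindex calls: " + calls[0]);
    }
}
```

Guarding the cast with `instanceof` keeps the code portable across JCR implementations instead of failing with a ClassCastException.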

               

              Let's now talk about your second issue:

              reindex()                              // Reindexing is OK for all nodes.

              reindexAsync()                         // Reindexing is OK for all nodes, BUT my app's process hangs and never quits.

              reindexAsync(java.lang.String, int)    // Reindexing is OK for all nodes, BUT my app's process hangs and never quits.

               

              reindex(String path, int depth)        // Reindexing is OK only if started from the root node, i.e. reindex("/", 2).

               

                                                     // Reindexing FAILS if started from any child node, e.g. reindex("/folder1", 2). If we rename "dir1" to "dir2" and run reindex("/folder1", 2), the query does not return the new "/folder1/dir2"; only the old "/folder1/dir1" shows up.

               

              How are you dealing with the Future<Boolean> returned from the async methods?

               

              If you need to block and wait for the indexing to complete, you can just call Future<Boolean>.get(). Note that if you call this right after the 'reindexAsync' invocation, you may as well just call the 'reindex' method. Typically, you'd call 'reindexAsync', get the future, pass the future off to some other thread (often via a queue or some other collection mechanism), and continue on in the thread. Meanwhile, some other thread might be periodically checking the future(s) and discarding it if Future.isDone() returns true or re-enqueuing them if not done. It also allows you to cancel the process, which should interrupt the indexing thread.
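The hand-off pattern described above might look roughly like this. It is a sketch under stated assumptions: `fakeReindexAsync` is a hypothetical stand-in for `reindexAsync()` that merely sleeps on a background thread, and the executor belongs to the sketch, not to ModeShape. Only `java.util.concurrent` is used.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ReindexFutures {
    /** Hypothetical stand-in for reindexAsync(): "indexes" on a background thread, then reports success. */
    static Future<Boolean> fakeReindexAsync(ExecutorService executor, final long millis) {
        return executor.submit(new Callable<Boolean>() {
            public Boolean call() throws Exception {
                Thread.sleep(millis); // pretend to index
                return Boolean.TRUE;
            }
        });
    }

    /** One monitoring pass: discards finished futures, re-enqueues the rest; returns how many finished. */
    static int sweep(Queue<Future<Boolean>> pending) {
        int completed = 0;
        for (int i = pending.size(); i > 0; i--) {
            Future<Boolean> future = pending.remove();
            if (future.isDone()) completed++;
            else pending.add(future); // not done yet: check again on the next pass
        }
        return completed;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Queue<Future<Boolean>> pending = new ArrayDeque<Future<Boolean>>();
        pending.add(fakeReindexAsync(executor, 200));
        int done = 0;
        while (done < 1) { // the calling thread is free to do other work between sweeps
            done += sweep(pending);
            Thread.sleep(50);
        }
        System.out.println("completed: " + done);
        executor.shutdown(); // this sketch's executor uses a non-daemon thread, so shut it down to let the JVM exit
    }
}
```

A pending future could also be cancelled with `Future.cancel(true)` instead of being re-enqueued.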

               

              I will add additional tests this morning to see if I can replicate the conditions you mention above.

              • 4. Re: How to update the Query index without restarting the repository ?
                penkween

                Hi Randall,

                 

                Thanks for your guidance. The cast makes it work. Regarding the async methods, I will test further; the problem could be due to my handling of the Future. I just tested again after applying the cast, and the reindex("/folder1", 2) and reindex("/folder1", 3) issues still happen.

                • 5. Re: How to update the Query index without restarting the repository ?
                  rhauch

                  BTW, I think the depth issue is related to the file system connector. Could you test with in-memory and let us know if the behavior is different?

                   

                  I should be able to look into this pretty soon.

                  • 6. Re: How to update the Query index without restarting the repository ?
                    penkween

                    Hi Randall,

                     

                    I have just tested with the in-memory connector, and both reindex("/folder1", 2) and reindex("/folder1", 3) seem to work fine. In fact, manual reindexing with reindex() is hardly needed there, and I am not sure in what situation it would be, because all updates have to go through the ModeShape session and the index is updated upon session.save(). The disk connector behaves the same way and works fine too.

                     

                    I wonder how to create a similar situation with the in-memory or disk connector, to reproduce what the Filesystem Connector faces when its repository is modified by an external app.

                    • 7. Re: How to update the Query index without restarting the repository ?
                      rhauch

                      I added several new integration tests (see this pull-request), and was able to duplicate the reindexing problem using the file system connector. And with those tests I was able to debug the problem, which appears to be some incorrect math in the SearchEngineIndexer logic changed in MODE-1263. Therefore, I reopened that issue, fixed the problem, and verified the re-indexing is working and now properly respecting the depth.

                       

                      The new integration tests also verify the asynchronous indexing of the entire content using "reindexAsync()" and of a subgraph to some depth using "reindexAsync(path,depth)". Both work fine, so I'm going to close MODE-1263 again.

                       

                      If you get a chance, please try the very latest code in our 'master' branch, and let us know whether the behavior is now correct.

                      • 8. Re: How to update the Query index without restarting the repository ?
                        penkween

                        Hi Randall,

                         

                        I tested with the latest source, with the index logic changed in MODE-1263, and I still face the duplicate indexes issue. Let's say the existing file system repository structure is as below:

                        /

                        /folder1

                        /folder1/dir1

                         

                        1. First, run reindex(); the query returns the structure above, as expected.

                        2. Then rename the folder "/folder1/dir1" to "/folder1/dir2" using Windows Explorer.

                        3. Run reindex("/folder1", 2).

                        4. Run "SELECT * FROM [nt:base]".

                         

                        Then I got "duplicate indexes": both the old "/folder1/dir1" and the new "/folder1/dir2" show up in the query results, as shown below. The same behavior appeared in the previous test with the older code using reindex("/folder1", 3).

                         

                         

                        +---+-----------------+---------------+----------+-----------+----------------+------------+-------------------+----------------+
                        | # | jcr:primaryType | jcr:path      | jcr:name | jcr:score | mode:localName | mode:depth | location(nt:base) | Score(nt:base) |
                        +---+-----------------+---------------+----------+-----------+----------------+------------+-------------------+----------------+
                        | 1 | mode:root       | /             |          | 1.0       |                | 0          | </ && [{http://www.modeshape.org/1.0}
                        | 2 | nt:folder       | /folder1/dir1 | dir1     | 1.0       | dir1           | 2          | /{}folder1/{}dir1 | 1.0            |
                        | 3 | nt:folder       | /folder1      | folder1  | 1.0       | folder1        | 1          | /{}folder1        | 1.0            |
                        | 4 | nt:folder       | /folder1/dir2 | dir2     | 1.0       | dir2           | 2          | /{}folder1/{}dir2 | 1.0            |
                        +---+-----------------+---------------+----------+-----------+----------------+------------+-------------------+----------------+

                         

                         

                        The old "/folder1/dir1" should not show up in the query results.

                         

                        Thanks.

                        • 9. Re: How to update the Query index without restarting the repository ?
                          rhauch

                          Thanks again, Danny! I really appreciate you spending the time to put this new capability through its paces. In fact, your use case pointed out a small flaw in the way the indexes were being updated. So that's been fixed.

                           

                          But interestingly it also pointed out a complexity with the maximum depth parameter in the reindex methods. While the maximum depth is more optimal in some situations, it is also extremely easy to mess up and corrupt the indexes.

                           

                          Consider a scenario with the following directory structure:

                           

                          /

                          /folder1

                          /folder1/file.txt

                          /folder1/subfolder1

                          /folder1/subfolder1/file2.txt

                          /folder1/subfolder2

                           

                          This would correspond to the following node structure:

                           

                          /

                          /folder1

                          /folder1/file.txt

                          /folder1/file.txt/jcr:content

                          /folder1/subfolder1

                          /folder1/subfolder1/file2.txt

                          /folder1/subfolder1/file2.txt/jcr:content

                          /folder1/subfolder2

                           

                          Now, imagine that "subfolder1" is renamed to "subfolder3". If the content at "/folder1" were reindexed to its full depth, all would be fine. However, if the content at "/folder1" were reindexed to a depth of 2, "subfolder1" would be properly removed from the indexes and "subfolder3" would properly appear, but because these subfolders are at the maximum depth, their children (e.g., "file2.txt") are not reindexed. The children are therefore left in the indexes under "subfolder1" rather than appearing in the correct location under "subfolder3". Thus the indexes would contain this structure:

                           

                          /

                          /folder1

                          /folder1/file.txt

                          /folder1/file.txt/jcr:content

                          /folder1/subfolder1/file2.txt

                          /folder1/subfolder1/file2.txt/jcr:content

                          /folder1/subfolder2

                          /folder1/subfolder3

                           

                          (For reasons that are very difficult to explain, our algorithm doesn't know "subfolder1" was removed and therefore all nodes below it should be removed. It merely knows what the new content looks like.)
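The stale-entry effect can be reproduced with a toy model. This is not ModeShape's indexing algorithm, only an illustration under two assumptions: the index is a plain set of paths, and a depth of 2 covers the node itself plus its direct children, as the description above implies. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of depth-limited reindexing over a set of paths (not ModeShape's implementation). */
final class DepthLimitedReindex {
    /** Level of path relative to root: the root is 1, its children 2, and so on; 0 if outside the subtree. */
    static int level(String root, String path) {
        if (path.equals(root)) return 1;
        String prefix = root.endsWith("/") ? root : root + "/";
        if (!path.startsWith(prefix)) return 0;
        return 1 + path.substring(prefix.length()).split("/").length;
    }

    /** Replaces index entries within maxDepth levels of root with the current content at those levels. */
    static void reindex(Set<String> index, Set<String> content, String root, int maxDepth) {
        for (Iterator<String> it = index.iterator(); it.hasNext();) {
            int l = level(root, it.next());
            if (l >= 1 && l <= maxDepth) it.remove(); // inside the window: drop the old entry
        }
        for (String path : content) {
            int l = level(root, path);
            if (l >= 1 && l <= maxDepth) index.add(path); // inside the window: add the current entry
        }
        // Entries below the window are never touched, so children of a renamed node go stale.
    }

    public static void main(String[] args) {
        Set<String> index = new TreeSet<String>(Arrays.asList(
            "/", "/folder1", "/folder1/file.txt", "/folder1/file.txt/jcr:content",
            "/folder1/subfolder1", "/folder1/subfolder1/file2.txt",
            "/folder1/subfolder1/file2.txt/jcr:content", "/folder1/subfolder2"));
        // Content after renaming subfolder1 to subfolder3.
        Set<String> content = new TreeSet<String>(Arrays.asList(
            "/", "/folder1", "/folder1/file.txt", "/folder1/file.txt/jcr:content",
            "/folder1/subfolder3", "/folder1/subfolder3/file2.txt",
            "/folder1/subfolder3/file2.txt/jcr:content", "/folder1/subfolder2"));
        reindex(index, content, "/folder1", 2);
        for (String path : index) System.out.println(path);
    }
}
```

In this model, calling `reindex` with `Integer.MAX_VALUE` as the depth leaves the index identical to the content, which is the full-depth case.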

                           

                          In short, the maximum depth parameter will likely cause many problems that are difficult to track down, and it is more efficient than indexing to the full depth in only a few cases. Since keeping the depth parameter is extremely risky with very little benefit, I've removed the depth parameter from the public methods (which have not yet appeared in a release).

                           

                          We may indeed discover that the depth is very useful, and if that's the case we can always add additional methods with a depth parameter. Adding methods to a public API is easy; removing methods is nearly impossible.

                           

                          I've merged in additional changes, so now the only reindexing methods on org.modeshape.jcr.api.Workspace are as follows:

                           

                          reindex();

                          reindexAsync();

                          reindex( String path);

                          reindexAsync( String path);

                           

                          As always, please let us know if you find any problems with these methods!

                           

                          Best regards,

                           

                          Randall

                          • 10. Re: How to update the Query index without restarting the repository ?
                            penkween

                            Hi Randall,

                             

                            I think I know what you mean by the "Max Depth" issue, since "depth" here simply refers to the node depth and not the "nt:folder" depth.

                             

                            Without the "Max Depth", the immediate concern is whether we will face performance issues, especially if the changes happen near the root node and the repository tree is deep. Without the "Max Depth", I guess that even if we make sure all updates to the repository go through the ModeShape API, e.g. node.addNode(), rebuilding the index will also take longer. For a write-intensive app, this could be very challenging unless we organize the repository tree to be as flat as possible.

                             

                             

                             

                                  Thanks.