12 Replies Latest reply on Feb 14, 2013 9:49 AM by rhauch

    Search in ModeShape

    patno

      I'm evaluating ModeShape as a potential replacement for an existing non-JCR content repository, probably using federation to make existing content available using JCR. So I have a bunch of questions about ModeShape, of which this is one...

       

      As I understand search in ModeShape, it uses Hibernate Search, but there doesn't seem to be a way to customize the indexing or search quite as much as you could in a "normal" implementation of Hibernate Search - I see no way to customize the analyzer, for instance, except by using a default analyzer.  It also seems that, because you can't modify types, there is no way to change the way something is indexed except by creating a new type and modifying all content that needs reindexing.

       

      If these two assumptions are true, we would need another search engine (probably Solr, since we already use that) for our search needs. We also need to make sure all changes to content are reflected in the index in a timely fashion, i.e. within seconds. I assume the best way to do that would be to somehow listen to events occurring in ModeShape, but I'd be happy to see other suggestions.

       

      Is there a way to get changes in ModeShape to an external process that could update the index? I guess JCR observation probably isn't transaction safe, i.e. I'm not guaranteed to see a change if the system happens to go down.  The EventJournal stuff looks like a great fit for what I need (in fact we already have an indexer that works using a similar mechanism), but that's not implemented in ModeShape - are there any plans to add it? Are there any other ways I can listen for changes, where I'm guaranteed to see the change? I don't need it to be transactional - it would be okay to see changes that are then rolled back.

        • 1. Re: Search in ModeShape
          rhauch

          Hi, Patrik.

           

          Can you describe how you might expect to deploy ModeShape? The two major ways are:

          • via JBoss AS7, where AS7 manages ModeShape's engine, or
          • as normal JARs (either within a Java SE application or packaged into WAR/EAR files in a web/application server), where your system starts and controls ModeShape's engine

           

          One reason I ask is that the configuration mechanism is very different. When installing ModeShape as a subsystem within JBoss AS7, configuration is done via the AS7 mechanism; when using ModeShape's engine directly, you use a different JSON configuration file for each repository.

           

          As I understand search in ModeShape, it uses Hibernate Search, but there doesn't seem to be a way to customize the indexing or search quite as much as you could in a "normal" implementation of Hibernate Search - I see no way to customize the analyzer, for instance, except by using a default analyzer.

          That is the case with 3.0 in AS7, since we simply haven't exposed the ability to configure a custom analyzer via the AS7 configuration mechanism. Exposing it would be very easy to do.

           

          However, if you're using the ModeShape engine and JSON configuration files, you can specify a custom analyzer (see one of our test configurations).

           

          BTW, we're using Hibernate Search internally as a clustering-aware layer above Lucene. The cluster-related parts of Hibernate Search (e.g., where the indexes are stored, which backend is used, etc.) and the language-specific aspects (e.g., analyzer, etc.) are exposed via our configuration. But how we use Hibernate Search (e.g., the "entities", how many indexes) is an implementation detail and is not exposed to clients, so the related parts of the Hibernate Search configuration are not exposed.

           

          It also seems that, because you can't modify types, there is no way to change the way something is indexed except by creating a new type and modifying all content that needs reindexing.

           

          Can you explain what you mean by "can't modify types"? You can absolutely change the node types on any node, and this will cause the nodes to be reindexed. You can even dynamically add/change/remove node type definitions, although that doesn't cause reindexing since there is no need to do that - the indexes contain only the node information (including node type names), whereas the node types themselves (e.g., the properties and supertypes) are used dynamically by our query system during query planning and processing.

           

          Consider an example. Let's say that we have a custom node type defined, and we've already added content that uses that particular node type (and the content has already been indexed) yet also contains extra properties not defined by that node type (either because the content also uses other node types via mixins, or because the nodes allow "residual" properties). Now we want to change the node type to add a property definition for a property that the content already uses. Using the NodeTypeManager, we can update this node type definition, and then, without changing any of the content or reindexing, we can immediately query that node type to select the newly-added property. Any nodes that already contain this property will be evident in the results.

           

          (This is because all of a node's properties are included in the index when that node is indexed, regardless of whether a node type defines such a property. However, when the property definition is added to a node type, that effectively changes the SQL-like view for that node type by adding a "column" to that view, which of course is visible to any query executed from that point forward.)
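
          To make that concrete, here is a rough sketch of the idea as a toy model. It is not ModeShape's index or query code: the class NodeTypeViewSketch, its methods, and the "ex:" property names are all hypothetical, and rejecting an undefined column is simply an assumption of the model. It only shows why adding a property definition exposes values that were indexed earlier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only: not ModeShape's index or query implementation. */
public class NodeTypeViewSketch {

    // Every property of an indexed node is stored, whether or not a node type defines it.
    private final List<Map<String, String>> indexedNodes = new ArrayList<>();

    // The node type definition: property names exposed as "columns" of the type's view.
    private final Set<String> columns = new LinkedHashSet<>();

    /** Indexes a node once, keeping all of its properties. */
    void indexNode(Map<String, String> properties) {
        indexedNodes.add(new LinkedHashMap<>(properties));
    }

    /** Adds a property definition to the node type; the index is not touched. */
    void addPropertyDefinition(String propertyName) {
        columns.add(propertyName);
    }

    /** Returns the column's values from every indexed node that has the property. */
    List<String> select(String column) {
        if (!columns.contains(column)) {
            // Assumption of this model: an undefined column cannot be selected.
            throw new IllegalArgumentException("Column not defined on the node type: " + column);
        }
        List<String> values = new ArrayList<>();
        for (Map<String, String> node : indexedNodes) {
            String value = node.get(column);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        NodeTypeViewSketch view = new NodeTypeViewSketch();
        view.addPropertyDefinition("ex:title");

        // "ex:author" is a residual property: present on the node, not defined by the type.
        Map<String, String> node = new LinkedHashMap<>();
        node.put("ex:title", "Search in ModeShape");
        node.put("ex:author", "patno");
        view.indexNode(node);

        // Updating the type definition exposes the already-indexed value without reindexing.
        view.addPropertyDefinition("ex:author");
        System.out.println(view.select("ex:author"));
    }
}
```

          The node is indexed exactly once; only the set of columns changes between the two states.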

           

          If these two assumptions are true, we would need another search engine (probably Solr, since we already use that) for our search needs.

           

          We knew that using Lucene for our search engine may not be appropriate for all users, so we designed the code to allow a choice of engines. However, this work is by no means complete, since we really just had the resources to focus on a single implementation that happened to satisfy our query/search needs. And all that code is in the bowels of ModeShape; it is not something that can be added in by client applications easily at the moment. Perhaps it's time to start considering adding support for other engines like Solr or Elastic Search.

           

          We also need to make sure all changes to content are reflected in the index in a timely fashion, i.e. within seconds. I assume the best way to do that would be to somehow listen to events occurring in ModeShape, but I'd be happy to see other suggestions.

           

          Is there a way to get changes in ModeShape to an external process that could update the index? I guess JCR observation probably isn't transaction safe, i.e. I'm not guaranteed to see a change if the system happens to go down.  The EventJournal stuff looks like a great fit for what I need (in fact we already have an indexer that works using a similar mechanism), but that's not implemented in ModeShape - are there any plans to add it? Are there any other ways I can listen for changes, where I'm guaranteed to see the change? I don't need it to be transactional - it would be okay to see changes that are then rolled back.

           

          It would be possible to add another engine on top of ModeShape, and use that for all searching functionality. You can even completely disable ModeShape's internal index maintenance, though this would disable the ability to query for content using ModeShape's query languages.

           

          An external engine's indexes can be kept up-to-date with content changes using ModeShape events. Simple observation is surprisingly useful for this -- in fact, that's all we use internally. To use it, simply implement the EventListener interface. Each time ModeShape calls an event listener, the supplied EventIterator will contain all of the changes (that the listener asked for) in a single transaction.

           

          As for the EventJournal mechanism, it is conceptually just a (limited size) queue of events. This would be pretty easy to accomplish for a subset of the possible configurations, but becomes much more work to manage this in a generic way for clustered systems (e.g., disk storage vs memory, consistency of journal duration, etc.). If you think it's important, please add a feature request via our JIRA.

           

          While implementing a general-purpose journal suitable to lots of configurations and use cases is not straightforward, implementing something like it but tailored to your own needs would be trivial. Simply add a listener that accumulates the events into a queue, and then process that queue as needed. Note that our events are immutable, which means they can be shared/accumulated and will never change.
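
          A minimal sketch of that queueing idea might look like the following. It deliberately uses only the JDK so it can run on its own: ChangeEvent is a hypothetical immutable stand-in for a JCR event, and QueueingListener.onEvents is where the body of a real EventListener.onEvent(EventIterator) implementation would go. None of these names come from ModeShape.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch only: a listener that queues each transaction's events for later processing. */
public class EventQueueSketch {

    /** Hypothetical immutable stand-in for a JCR event (event type plus node path). */
    static final class ChangeEvent {
        final String type;
        final String path;

        ChangeEvent(String type, String path) {
            this.type = type;
            this.path = path;
        }

        @Override
        public String toString() {
            return type + " " + path;
        }
    }

    static final class QueueingListener {
        // One queue entry per callback, i.e. per transaction's worth of events.
        private final BlockingQueue<List<ChangeEvent>> batches = new LinkedBlockingQueue<>();

        /** In a real listener this would be called from onEvent(...) with one transaction's events. */
        void onEvents(List<ChangeEvent> transactionEvents) {
            // Copy the list; the events themselves are immutable and safe to share.
            batches.add(Collections.unmodifiableList(new ArrayList<>(transactionEvents)));
        }

        /** Removes and returns all batches queued so far, in arrival order. */
        List<List<ChangeEvent>> drain() {
            List<List<ChangeEvent>> drained = new ArrayList<>();
            batches.drainTo(drained);
            return drained;
        }
    }

    public static void main(String[] args) {
        QueueingListener listener = new QueueingListener();
        listener.onEvents(Arrays.asList(
                new ChangeEvent("NODE_ADDED", "/docs/a"),
                new ChangeEvent("PROPERTY_ADDED", "/docs/a/title")));
        listener.onEvents(Arrays.asList(new ChangeEvent("NODE_REMOVED", "/docs/b")));

        // An external indexer (for example, a Solr updater) would consume the batches here.
        for (List<ChangeEvent> batch : listener.drain()) {
            System.out.println("batch of " + batch.size() + ": " + batch);
        }
    }
}
```

          Keeping one queue entry per callback preserves the transaction boundaries, so the consumer can apply each batch to the external index as a unit. Note that an in-memory queue like this does not survive a crash; making it durable is the part you would tailor to your own needs.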

           

          I hope this helps. As always, please continue the thread if you have more questions.

           

          Best regards,

           

          Randall

          • 2. Re: Search in ModeShape
            patno

            Can you describe how you might expect to deploy ModeShape? The two major ways are:

            • via JBoss AS7, where AS7 manages ModeShape's engine, or
            • as normal JARs (either within a Java SE application or packaged into WAR/EAR files in a web/application server), where your system starts and controls ModeShape's engine

            One reason I ask is that the configuration mechanism is very different. When installing ModeShape as a subsystem within JBoss AS7, configuration is done via the AS7 mechanism; when using ModeShape's engine directly, you use a different JSON configuration file for each repository.

            We would likely deploy it using normal JARs, not using AS7.

            However, if you're using the ModeShape engine and JSON configuration files, you can specify a custom analyzer (see one of our test configurations).

            I wasn't clear - yes, you can use a custom analyzer, but we need to be able to do this per field. There are also other useful Solr features which I don't think I could get to in ModeShape, like being able to copy certain fields into other fields (so e.g. 'title' is indexed both on its own and into a generic 'text' field which contains all text), or boosting fields.

            BTW, we're using Hibernate Search internally as a clustering-aware layer above Lucene. The cluster-related parts of Hibernate Search (e.g., where the indexes are stored, which backend is used, etc.) and the language-specific aspects (e.g., analyzer, etc.) are exposed via our configuration. But how we use Hibernate Search (e.g., the "entities", how many indexes) is an implementation detail and is not exposed to clients, so the related parts of the Hibernate Search configuration are not exposed.

            Does this also mean there is no way to do Lucene queries against the index, only JCR full text queries? Actually I guess since the index is designed to work for JCR searches it might not be suitable for our use anyway.

             


            Can you explain what you mean by "can't modify types"? You can absolutely change the node types on any node, and this will cause the nodes to be reindexed. You can even dynamically add/change/remove node type definitions, although that doesn't cause reindexing since there is no need to do that - the indexes contain only the node information (including node type names), whereas the node types themselves (e.g., the properties and supertypes) are used dynamically by our query system during query planning and processing.

            In the Javadoc for JcrNodeTypeManager, all the node type registration methods say

             

            @throws UnsupportedRepositoryOperationException if {@code allowUpdate} is true; ModeShape does not allow updating node types at this time.

            I took that to mean you can't ever update node types - I guess maybe it's from some earlier time when it wasn't possible?

            We knew that using Lucene for our search engine may not be appropriate for all users, so we designed the code to allow a choice of engines. However, this work is by no means complete, since we really just had the resources to focus on a single implementation that happened to satisfy our query/search needs. And all that code is in the bowels of ModeShape; it is not something that can be added in by client applications easily at the moment. Perhaps it's time to start considering adding support for other engines like Solr or Elastic Search.

            I think a bigger deal than search engine choice is the ability to customize search and indexing (like I mentioned above, copying and boosting are very useful, as well as per-field configurations). At least for us that's the bigger issue, although we prefer Solr because we're familiar with it and it's easy to use from outside Java.

            An external engine's indexes can be kept up-to-date with content changes using ModeShape events. Simple observation is surprisingly useful for this – in fact, that's all we use internally. To use it, simply implement the EventListener interface. Each time ModeShape calls an event listener, the supplied EventIterator will contain all of the changes (that the listener asked for) in a single transaction.

            What worries me about event listeners is that (according to the JCR spec) they happen asynchronously, which to me sounds like if the node the listener is running on fails, we might miss updates. If the event listener was part of the transaction, it would work fine (and I could build my own event journal). With event journalling, though, I could pick up where I left off in case of failure, so I'd be guaranteed not to miss any events.

            • 3. Re: Search in ModeShape
              patno

              An external engine's indexes can be kept up-to-date with content changes using ModeShape events. Simple observation is surprisingly useful for this – in fact, that's all we use internally. To use it, simply implement the EventListener interface. Each time ModeShape calls an event listener, the supplied EventIterator will contain all of the changes (that the listener asked for) in a single transaction.

              I took a look at the implementation of indexing. It looks like you're using an internal transaction-aware event listening mechanism (SessionEnvironment.Monitor). If there was a way to hook into that mechanism, that would solve my problem. Obviously that wouldn't be JCR compliant, but I'd be happy to use a ModeShape-specific API for this just like for federation if it makes life easier.

              Event journalling could also work, but the fact that it uses timestamps (which I had forgotten) makes me less of a fan of it than I thought I was - that's going to make it even more of a pain to implement in a distributed system like ModeShape, and we'll still have to deal with duplicate events after a restart.

              • 4. Re: Search in ModeShape
                rhauch

                Yes, that mechanism is transaction aware, mostly because of how the JCR specification stipulates that calling Session.save() within and outside of a transaction behaves. (When Session.save() is called outside of a transaction, ModeShape creates a transaction, persists the changes, commits the transaction, and forwards all of the accumulated changes. When Session.save() is called within a transaction, ModeShape simply captures what needs to be persisted and watches the transaction so that when the transaction commit starts, we persist the information and then forward the accumulated changes.)
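
                The two save paths can be pictured with a small toy model. This is only a sketch of the behavior described above, not ModeShape's session or transaction code; ToySession, ChangeSink, and their methods are hypothetical names, and changes are represented as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: not ModeShape's session or transaction code. */
public class SaveForwardingSketch {

    /** Receives all of the changes of one transaction at once (e.g. an indexer). */
    interface ChangeSink {
        void changesCommitted(List<String> changes);
    }

    static final class ToySession {
        private final ChangeSink sink;
        private final List<String> unsaved = new ArrayList<>();   // changed, not yet saved
        private final List<String> captured = new ArrayList<>();  // saved inside an open transaction
        private boolean inTransaction;

        ToySession(ChangeSink sink) {
            this.sink = sink;
        }

        void beginTransaction() {
            inTransaction = true;
        }

        void change(String description) {
            unsaved.add(description);
        }

        void save() {
            if (unsaved.isEmpty()) {
                return;
            }
            if (inTransaction) {
                // Inside a transaction: only capture; forwarding waits for the commit.
                captured.addAll(unsaved);
            } else {
                // Outside a transaction: persist and forward immediately as one unit.
                sink.changesCommitted(new ArrayList<>(unsaved));
            }
            unsaved.clear();
        }

        void commitTransaction() {
            inTransaction = false;
            if (!captured.isEmpty()) {
                sink.changesCommitted(new ArrayList<>(captured));
                captured.clear();
            }
        }
    }

    public static void main(String[] args) {
        ToySession session = new ToySession(changes -> System.out.println("forwarded: " + changes));

        session.change("add /a");
        session.save();                 // forwarded right away

        session.beginTransaction();
        session.change("add /b");
        session.save();                 // captured only
        session.change("add /c");
        session.save();                 // captured only
        session.commitTransaction();    // both forwarded together
    }
}
```

                In both paths the sink sees one call per transaction, which matches the point that all of a transaction's changes arrive together.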

                 

                BTW, those indexing changes are indeed sent to our internal query engine (which uses Hibernate Search and Lucene), and all of the changes of a transaction are sent at once. However, all of the events describing the changes in a transaction are also fired at once; this is what EventIterator iterates over.

                 

                It would likely be possible to change ModeShape so that you could implement the QueryIndexing interface directly and ModeShape would notify an instance as needed. This object would also be called when ModeShape is asked to reindex all or some of the content. But that capability wouldn't be available until 3.1. If you think this is useful, please file an enhancement in our JIRA and describe how you'd like to configure it (e.g., a classname that we instantiate, or does an existing object need to be injected in).

                • 5. Re: Search in ModeShape
                  orsonek

                  Randall,

                   

                  you wrote that changing a node type in the repository works OK, but it looks like it doesn't.

                   

                  I created a type in the repository:

                   

                  [my:user] > mix:created, mix:lastModified

                    - my:workspace (STRING) COPY

                   

                  I used NodeTypeManager.registerNodeTypes(InputStream) to register the CND file and it worked as I expected. After some tests, I added some properties and now the node definition looks like:

                   

                  [my:user] > mix:created, mix:lastModified

                    - my:workspace (STRING) COPY

                    - my:lastLoggedIn (DATE) COPY

                    - my:locale (STRING) = 'en_EN' autocreated mandatory COPY < 'en_EN'

                   

                  I registered the CND file again and tried to set the my:lastLoggedIn property, but it failed with the error:

                   

                  Cannot find a definition for the property named 'my:lastLoggedIn' on the node at '/sec/cane' with primary type 'my:user' and mixin types: []

                   

                  I deleted all my:user nodes, added new ones and:

                   

                  - there is no autocreated my:locale property in nodes

                  - I can set my:workspace property

                  - I CAN'T set my:locale or my:lastLoggedIn properties - still got the same error.

                   

                  I did one more test, adding a completely new type, and it works as expected: the type is visible and nodes can be created. Both types are of course in the same CND file.

                   

                  Configuration: ModeShape 3.0.1 as a JBoss AS 7.1.1.Final subsystem, repositories configured from jboss-cli. I checked several times and found no solution; the repository was of course restarted many times when doing it on the fly failed, but that didn't help.

                   

                  Besides all - great job!

                   

                  Best regards,

                  Bart

                  • 6. Re: Search in ModeShape
                    rhauch

                    I use NodeTypeManager.registerNodeTypes(InputStream) to register CND file and it worked as I expected.

                    That doesn't exactly match the method's signature; you're missing the second boolean parameter that defines whether updates to node type definitions should be allowed.

                     

                     

                    I registered CND file again...

                    Did you pass 'true' for the second parameter? If not, then give this a try.

                     

                    If you did pass 'true' for the second parameter of 'registerNodeTypes', then there's likely a problem. If you haven't already, please try to create a new Session after you've (re)registered the updated node type. This shouldn't be a factor, but if the new Session does work then (a) this is a workaround, and (b) the problem will likely be in the caching of the old node type definition within the Session's nodes.

                     

                    The other thing to do is to verify that the NodeType was indeed updated in the NodeTypeManager.

                    • 7. Re: Search in ModeShape
                      orsonek

                      Did you pass 'true' for the second parameter? If not, then give this a try.

                      Yes, I tried it with both false and true. It doesn't work. When false is passed, an exception is thrown saying that the node types are already registered.

                       

                      I'm trying to set up an environment in which restarts will occur rarely, so in fact I created a simple shell to execute commands inside the JEE environment, so you may consider that a new session for tests has been created right after the previous one, when the CND has been applied to the current node types, but in a different container TX. Call hierarchy is: CLI -> remote EJB -> CDI bean -> ModeShape (with JCR API). Another story is that the EJB and CDI beans see the user's credentials, but ModeShape does not. (I checked the debug output and there is JAAS configured on the repo; I can successfully connect to it through the modeshape-rest client, passing credentials from the configured security domain - not the default one.)

                      • 8. Re: Search in ModeShape
                        rhauch

                        Please log a bug regarding the node type registration problem, and add as much detail as you can.

                        I'm trying to set up an environment in which restarts will occur rarely, so in fact I created a simple shell to execute commands inside the JEE environment, so you may consider that a new session for tests has been created right after the previous one, when the CND has been applied to the current node types, but in a different container TX. Call hierarchy is: CLI -> remote EJB -> CDI bean -> ModeShape (with JCR API).

                        And I presume the transaction did commit. The node type registration is a workspace-level method, which means that even though you access it via an authenticated session, the registration itself (as with all other workspace-level methods) should be part of the transaction and will take effect only upon transaction commit (per Section 21.3 of the JSR-283 specification). If the transaction did commit, then this could be another bug that happens only with JTA transactions and should be logged as a bug. A simple test case that shows this failure would be very helpful so that we know exactly what the conditions are.

                        Another story is that the EJB and CDI beans see the user's credentials, but ModeShape does not. (I checked the debug output and there is JAAS configured on the repo; I can successfully connect to it through the modeshape-rest client, passing credentials from the configured security domain - not the default one.)

                        This sounds like an issue with the JAAS configuration.

                        • 9. Re: Search in ModeShape
                          orsonek

                          And I presume the transaction did commit.

                          Yes, the transaction was properly committed while registering the CND; I tested with both FULL_XA and NON_XA setups on the cache for the repo (caches are stored in the filesystem). I will create an issue tomorrow.

                           

                          This sounds like an issue with the JAAS configuration.

                          Can you confirm that it works when the JCR API is called from an EJB directly, with no CDI bean in the middle?

                          • 10. Re: Search in ModeShape
                            orsonek

                            Hmmm, yesterday modifying registered node types didn't work, today it works. I wonder if such behaviour may be related to Eclipse and its JBoss management - the only thing that changed is that I turned off the computer in the evening. And yesterday I worked without a single Eclipse restart, and all JBoss restarts were managed by JBoss Tools.

                             

                            So no bug here for now, but as I am quite sure of what I did yesterday, I will try to reproduce it, possibly in January.

                             

                            Principals are still not pushed to ModeShape - repo is configured as follows:

                             

                            ..., "security" : { "jaas" : { "policyName" : "quark" } , "anonymous" : { "username" : "<anonymous>" , "useOnFailedLogin" : false } , "providers" : [ { "classname" : "servlet" , "name" : "Authenticator that uses the Servlet context" } ] } , ....

                             

                            I see "servlet" as a provider - what should be here to support JAAS authentication when no web tier is used and client connects to remote-ejb directly?


                              "jcr:lastModified": "2012-12-21T14:37:04.298+01:00",

                              "jcr:lastModifiedBy": "",

                             

                            As you can see, there is no user in the autocreated field when repo.login( "testworkspace") is used. Passing simple credentials solves the problem, but it's an unacceptable solution. As I mentioned above, modeshape-rest uses the configured 'quark' security domain; I updated the record through it just now and:

                             

                              ...

                              "jcr:lastModified": "2012-12-21T15:19:28.349+01:00", 

                              "jcr:lastModifiedBy": "cane",

                               ...

                            • 11. Re: Search in ModeShape
                              orsonek

                              When testing, I created a repository with its caches switched to FULL_XA mode, which caused problems under ModeShape 3.0.1.Final. I then modified the caches, but the repo wasn't recreated, so that was probably the reason for the errors. ModeShape 3.1.1.Final creates correct repositories in such a configuration, so the problem is over. I will dig into the JAAS problem soon; currently I believe it's not a JAAS configuration issue, as it's only ModeShape which can't access authorization information (I didn't test with 3.1.1.Final yet).

                              • 12. Re: Search in ModeShape
                                rhauch

                                Great to hear that it's working for you. Let us know if you think you found a bug in ModeShape.