8 Replies Latest reply on Aug 12, 2015 10:49 AM by ma6rl

    How to keep cluster indexes synchronized

    folch

      Hi,

       

      We have a cluster with 2 nodes (ModeShape 3.8.0). Infinispan is configured with DB persistence (Oracle), and each node has its own index (stored in the file system).

      It's working fine but we've realized that the indexes are not synchronized when one of the nodes is down for a while and is restarted after some minutes.

      If I'm not wrong the default behaviour about re-indexing at startup (rebuildOnStartup) is: "if_missing", and the different options we have are:

      • "if_missing": Only rebuilds the indexes if the index files are not present.
      • "always": Always rebuilds the indexes at startup. This means that old indexes are removed and we rebuild everything from scratch.
      • "never": never is never
      • "fail_if_missing": don't really know..

       

      The main issue is how to proceed in case a cluster node is down for a period of time.

      If we use "if_missing" we have to manually delete all indexes to ensure the new content added during the shutdown is visible. If we don't do it the content is not indexed and new content is only available from the cluster node where it was added.

      If we remove the indexes manually, the index is rebuilt. However, with around 50k-60k nodes this can take a long time (more than one hour).

       

      If we use "always" we are deleting and rebuilding the index on every restart. With a high volume of data this could take a couple of hours.

       

      What's the best approach to manage indexes in a cluster setup when we have large amount of data?

       

      Thanks in advance.

        • 1. Re: How to keep cluster indexes synchronized
          hchiorean

          The other option in this case (apart from "always") is to manually (via your own code) trigger a reindexing for each affected workspace, using the ModeShape API: workspace.reindex(...)
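A minimal sketch of the manual option above: loop over the affected workspaces and trigger a reindex on each one. `ReindexableWorkspace` and `ClusterRejoinReindexer` are hypothetical stand-ins so that the snippet compiles without the ModeShape jars; in a real application the call would be the `workspace.reindex(...)` method mentioned above, made on the workspace of a logged-in session. The error handling is an assumption too.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the ModeShape workspace type; it exposes only what this sketch needs. */
interface ReindexableWorkspace {
    String getName();

    /** In real code this would delegate to the ModeShape workspace reindex call. */
    void reindex();
}

final class ClusterRejoinReindexer {
    private ClusterRejoinReindexer() {
    }

    /**
     * Reindexes every given workspace in order and keeps going if one of them fails.
     * Returns the names of the workspaces whose reindexing threw an exception.
     */
    static List<String> reindexAll(List<ReindexableWorkspace> workspaces) {
        List<String> failed = new ArrayList<>();
        for (ReindexableWorkspace workspace : workspaces) {
            try {
                workspace.reindex();
            } catch (RuntimeException e) {
                // Remember the failure so the caller can retry or raise an alert.
                failed.add(workspace.getName());
            }
        }
        return failed;
    }
}
```

Something like this could be run once after a node that was down rejoins the cluster, instead of deleting the index files by hand.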

          You can also investigate the option of configuring Hibernate Search (aka. the indexing) to be clustered via a durable JMS queue. It's not something we really tested, but we support it via Hibernate Search. You can look at quickstart/standalone-modeshape-master-jms.xml at 3.x · ModeShape/quickstart · GitHub for an example of how such a clustering config would look in JBoss AS.

          • 2. Re: How to keep cluster indexes synchronized
            wesssel

            Copying the indexes from the live machine to the machine that was down is not a very friendly solution, but it does ensure consistency. However, any nodes added during the copy won't be indexed ... :-)

            • 3. Re: How to keep cluster indexes synchronized
              mkotsur

              Are there any new options in Modeshape 4.3 or 4.4 with regards to this question?

              • 4. Re: How to keep cluster indexes synchronized
                hchiorean

                ModeShape 4 has an optional feature called Journaling (Journaling - ModeShape 4 - Project Documentation Editor). This essentially means storing locally (for each node in a cluster) all the changes that took place across that cluster (i.e. the global changes) since the node became alive. When a node leaves the cluster and then rejoins, it attempts to reconcile the data with other members from the cluster - essentially it tries to get the data that it missed while it was down.

                 

                This is not tied yet with indexing, but we do have on the roadmap [MODE-1903] Rebuild indexes from a point in time - JBoss Issue Tracker which presumably would keep indexes up to date if journaling is enabled.

                • 5. Re: How to keep cluster indexes synchronized
                  ma6rl

                  hchiorean What are the current options in Modeshape 4 (4.3.0.Final) for managing indexes across a cluster of nodes?

                   

                  We plan on deploying our application across a cluster of instances running on AWS EC2. The number of active instances will be managed via the AWS Auto Scaler which will create and destroy EC2 instances with our application as needed. Each instance in the cluster sees the same set of nodes which are stored in an Infinispan replicated cache backed by a shared JDBC store. We use cache eviction to only keep a subset of JCR nodes in memory while the JDBC store acts as the truth source for all JCR nodes.

                   

                  Given this, we need to support the following:

                   

                  - When a node that requires indexing is added to the application, the indexes are updated on each active instance, not just the instance that created the node.

                  - When a new instance is started it has the same set of existing indexes that the other nodes have as the instance will need to service the same JCR queries as the other active nodes.

                   

                  The application can handle the indexes being inconsistent across the instances for a window of time (the shorter the better of course) but eventual consistency is required.

                   

                  Given these requirements, what are our current options using ModeShape 4.3.0?

                   

                  How does Modeshape build the indexes for new instances added to the cluster? I no longer see the options that were in 3.x to rebuild the indexes on startup?

                  • 6. Re: How to keep cluster indexes synchronized
                    hchiorean

                    - When a node is added to application that requires indexing the indexes are created on each active instance, not just the instance that created the node.

                    If the cluster nodes are live and see each other via JGroups, then when a node is added a remote event is sent across the cluster, and all cluster nodes should react to it and update their local indexes to reflect the change. In other words, regardless of the originating node, as long as the cluster is formed correctly, all node operations are broadcast across the cluster in order to keep local indexes consistent across all nodes.

                    - When a new instance is started it has the same set of existing indexes that the other nodes have as the instance will need to service the same JCR queries as the other active nodes.

                    This is a bit more tricky: first and foremost you need to keep in mind that indexes in ModeShape 4 are just a mechanism for optimizing queries. They are not mandatory for queries to work since the default behavior for queries (in the absence of indexes) is to scan the entire repository data (you can read more here: https://docs.jboss.org/author/display/MODE40/Query+and+search).

                    Now, if you enable the local index provider & local indexes (which is the only option atm. in ModeShape 4) there are 2 different possibilities:

                    1. you bring up a new cluster node for the first time, for which there is no previous data. In this case, the new cluster node will start reindexing asynchronously all the data from the repository. You have the option via the ModeShape API to query the status of the re-indexing - see [MODE-2432] Enhance IndexManager SPI to offer information about the re-index status of all (enabled) indexes - JBoss Iss…. The end result in this case, is that after a certain amount of time, the indexes on the new cluster node will be synced up with the rest of the cluster.
                    2. you bring up a new cluster node which already has data stored for a given index. This may be either up-to-date with the rest of the cluster, or completely out-of-sync - in case of a crash. In this case, no re-indexing is performed explicitly out-of-the-box and also there is no automatic mechanism for bringing the indexes up-to-date. In this case you have to manually use the workspace API and trigger a reindexing (modeshape/Workspace.java at modeshape-4.3.0.Final · ModeShape/modeshape · GitHub), in order to bring the indexes up-to-date. My previous comment: Re: How to keep cluster indexes synchronized mentions a feature enhancement that we have open which once implemented, could automate this process with the help of the journal.
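As a rough illustration of waiting for the asynchronous re-indexing described above, here is a minimal polling sketch. The `reindexingInProgress` callback is a hypothetical hook: real code would implement it with the re-index status information from MODE-2432, whose exact API is not shown in this thread. `ReindexGate`, the timeout, and the poll interval are assumptions as well.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class ReindexGate {
    private ReindexGate() {
    }

    /**
     * Polls the given status callback until it reports that no re-indexing is running,
     * or until the timeout elapses. Returns true if the indexes became ready in time,
     * false on timeout or if the waiting thread was interrupted.
     */
    static boolean awaitIndexesReady(BooleanSupplier reindexingInProgress,
                                     long timeoutMillis,
                                     long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (reindexingInProgress.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // still re-indexing when the timeout expired
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }
}
```

A new instance could call this before reporting itself as healthy, so that traffic only reaches it once its local indexes have caught up.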

                    How does Modeshape build the indexes for new instances added to the cluster? I no longer see the options that were in 3.x to rebuild the indexes on startup?

                    With the new index provider architecture, this is something which is delegated behind the scenes to each index provider. i.e. at repository startup each index provider is responsible for deciding if it wants to rebuild its indexes or not. If it decides to do so, it will always do it asynchronously. The current LocalIndexProvider (MapDB based) will rebuild each index if it does not exist. If it does exist - i.e. there is some data on disk - no reindexing is performed for that index.
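Given that rule, an application could decide at startup whether it needs to trigger a manual reindex itself. The sketch below is only an approximation from the outside: it assumes the local index provider writes its files under a single directory known to the application, and it treats an empty or missing directory as "no index data". `StartupReindexPolicy` and its `Action` values are hypothetical names, not part of ModeShape.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class StartupReindexPolicy {
    enum Action {
        /** No index data on disk: the provider is expected to rebuild asynchronously by itself. */
        RELY_ON_AUTOMATIC_REBUILD,
        /** Index data exists and may be stale: the application should trigger a reindex. */
        TRIGGER_MANUAL_REINDEX
    }

    private StartupReindexPolicy() {
    }

    /** Decides what to do based on whether the (assumed) index directory already holds files. */
    static Action decide(Path indexDirectory) {
        if (!Files.isDirectory(indexDirectory)) {
            return Action.RELY_ON_AUTOMATIC_REBUILD;
        }
        try (Stream<Path> entries = Files.list(indexDirectory)) {
            return entries.findAny().isPresent()
                    ? Action.TRIGGER_MANUAL_REINDEX
                    : Action.RELY_ON_AUTOMATIC_REBUILD;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This cannot tell whether existing index data is actually out of date, so it errs on the side of reindexing whenever data is present.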


                    rhauch can correct me or provide more info, since he designed & implemented the new indexing mechanism.

                    • 7. Re: How to keep cluster indexes synchronized
                      ma6rl

                      hchiorean, thanks for the information above it is helpful.

                       

                      I've been experimenting more with indexes and was able to observe all of the behaviors described above.

                       

                      The biggest issue currently (and I don't appear to be alone) is the time it takes for a new instance to rebuild its indexes from a large data set. I took a look at [MODE-1903] Rebuild indexes from a point in time - JBoss Issue Tracker and it seems like a good solution to the problem. Although queries will still work while the indexes are being rebuilt, they may be very slow, which means I would not want to direct traffic to a new instance until after the indexes are rebuilt. If this takes a long time, it is hard to auto-scale quickly as load increases, because new instances will not be useful until the index is rebuilt.

                       

                      The other issue is that the index does not start building until ModeShape starts. Currently, when using the sub-system, ModeShape does not start until a call is made to get the repository, which in my case is when the first query comes in. Is there a way to force the ModeShape sub-system to start when Wildfly starts instead of when the repository is accessed?

                      • 8. Re: How to keep cluster indexes synchronized
                        hchiorean

                        The biggest issue currently (and I don't appear to be alone) is the time it takes for a new instance to rebuild it's indexes from a large data set.

                        If there's lots of data which needs indexing, then I don't think there's a solution for this. The only real solution is to limit the amount of data that's "indexable". I can think of a couple of ways in which this can be done: based on the index definitions (i.e. restrict the node types that are indexed to only a subset of the repository data), or by using the JCR noquery attribute for node definitions. Both achieve the same result in the end: limiting the amount of data that needs to be indexed.

                         

                        Is there a way to force the Modeshape sub-system to start when Wildfly starts instead of when the repository is accessed?

                        right now no, repositories are started lazily when the first repository.login call is performed. I can understand in your case why auto-starting the repository on server startup would make sense, but you would still need to run the same code essentially if you wanted to wait until re-indexing has finished. You can open an enhancement request to make reindexing synchronicity configurable. In other words, we could make reindexing synchronous as well, in which case the client code would "hang" on repository.login until it finished. Auto-starting repositories is not a good idea IMO if there are lengthy background-operations because that could slow down the server startup sequence significantly, impacting other services not just ModeShape related ones.

                        • 9. Re: How to keep cluster indexes synchronized
                          ma6rl

                          The only real solution is to limit the amount of data that's "indexable". I can think of a couple of ways in which this can be done: based on the index definitions - i.e. restrict for example the nodes types that are indexed to only a subset of the repository data or by using the JCR noquery attribute for node definitions, achieving in the end the same result: limiting the amount of data that needs to be indexed.

                           

                          We already limit what we index to critical properties that we search on, but with millions of assets that we need to query, the index can still grow pretty big

                          Auto-starting repositories is not a good idea IMO if there are lengthy background-operations because that could slow down the server startup sequence significantly, impacting other services not just ModeShape related ones.

                          I see your point, and I agree that auto-starting by default is not a good idea for the reasons you give above, but it may be nice to offer a configuration option to do so, assuming there aren't too many technical limitations. In many cases these days people only run a single application per server. If that application depends on ModeShape being up and running, having the server not start until ModeShape is up and running may be desirable, especially compared to the alternative, which is having a user wait for the first repository.login to complete before they get a response to their request.

                           

                          At the moment I attempt to work around the lazy start by using an EJB with a PostConstruct method. The only problem is that I can't log in using JAAS in a PostConstruct method, which means I have to spawn an EJB asynchronous thread and use that to log in. This works OK as long as no other process attempts to log in to the repository until the login I start in my Async EJB method completes.
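The shape of this workaround can be sketched in plain `java.util.concurrent` rather than with EJB annotations (a deliberate simplification, not the code I actually run). The start action is a hypothetical callback that would perform the first repository login, and `EagerStart` is an illustrative name. Routing every caller through `await()` is one possible way to avoid a second login racing with the warm-up.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Supplier;

/** Runs a slow start action once in the background and lets callers wait for its result. */
final class EagerStart<R> {
    private final CompletableFuture<R> started;

    /** Kicks off the start action immediately on the given executor. */
    EagerStart(Supplier<R> startAction, Executor executor) {
        this.started = CompletableFuture.supplyAsync(startAction, executor);
    }

    /** Blocks until the background start has finished, then returns its result. */
    R await() {
        return started.join();
    }

    /** True once the start action has completed successfully. */
    boolean isReady() {
        return started.isDone() && !started.isCompletedExceptionally();
    }
}
```

The start action runs exactly once, and `isReady()` could back a health check so that the load balancer only sends requests after the first login has completed.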