7 Replies Latest reply on Feb 21, 2011 7:37 AM by mircea.markus

    Behavior of infinispan when node is down

    henk53

      I was wondering about the following use case:

       

      Using infinispan, I have a replicated cache. At some moment a node goes down. Some time after that, there will an update to some content in the cache. A while after that the node comes back up.

       

      Will the replicated cache evict the node that is down, and when this node comes back up, will this node automatically sync again with the other nodes?

       

      I could not really find anything about this in the infinispan documentation.

        • 1. Behavior of infinispan when node is down
          feci

          Hi,

           

          I'm not an infinispan developer - just user, but I find out about most functionality through config options:

           

          http://docs.jboss.org/infinispan/4.2/apidocs/config.html#ce_clustering_stateRetrieval

           

          and than some testing, as default values need to be tweaked sometimes...

          • 2. Behavior of infinispan when node is down
            henk53

            Thanks for the reply Tomas!

             

            I'll take a better look at that section. We've seen that it's possible for a node to let its state depend on other nodes (let it initialize its state with the data from another node). The thing is that seems to be true for a node that is newly added to a cluster. It's still not entirely clear to me what happens if a node goes down.

             

            I guess a good amount of testing and twiddling with the settings shall reveal a good deal of this info, but an authoritative answer would still be appreciated

            • 3. Behavior of infinispan when node is down
              brenuart

              When a node goes down and is restarted, it rejoins the cluster (as you say it is newly added to the cluster).

              What this node contains at that moment depends on your configuration:

              • (state retrieval) It can "synchronize" itself with the content currently held in the cluster. This is called "state retrieval". In other words, it will fetch data from other cluster members to have exactly the same content as the others.
              • (loader) It can reload data from its local storage. In this case, the node will revert to the state it has before it crashed. Of course, this content may be obsolete and different from what is currently in the cluster.
              • (cluster loader) You can also ask the node to lazily fetch content from other cluster members when needed. In this case, you don't use the state retrieval feature, but let the node restart with an empty cache. When a request is made for data, since the cache is empty, the request flows to the cluster loader and is retrieved from other members. Everything looks like if the cache was preloaded at that startup.

               

              You can go for a combination of these features to match your needs.

              For example, if you would configure your nodes with stateRetrieval and local persistence (eg. file loader) if you want HA and make sure you can restore the cluster content even if all nodes are down.

               

              Ad Tomas said, look at the documentation - it is not easy as the information is spraid in various documents. But it is definitly worth reading all of them...

               

              Good luck.

               

              /bertrand

              • 4. Behavior of infinispan when node is down
                galder.zamarreno

                Also, when Infinispan is configured with distribution and a node joins or leaves, rehashing occurs which is the process of rebalancing owners of data so that same number of copies are maintained in the cluster. It's implemented in a kinda similar way to state transfer.

                • 5. Behavior of infinispan when node is down
                  henk53

                  Thanks for your elaborate reply. I understand the behavior better now (also after studying the documentation), but there's still one use case I can't wrap my head around:

                   

                  Suppose I have a node A and a Node B in a cluster with some cached content. First node B leaves the cluster. After that event, a few updates are being made to the cache (now consisting only of node A) whereafter node A also leaves the cluster.

                   

                  A little later, node B is restarted and reloads data from its local storage. Then node A is started, which also reloads its data from local storage and synchronizes with the cluster.

                   

                  After some experimenting, it seems that "synchronizing" means that updates or inserts made on B (the cluster) while A was down are successfully synched to A, but the updates A had received before it went down are not being pushed to the cluster. The integrity of the cluster now seems to be broken, as A will see different values for the same key as other members in the cluster.

                   

                  Is their already anything to remedy this or am I doing something wrong?

                  • 6. Behavior of infinispan when node is down
                    brenuart

                    Let's consider another (similar) example:

                    1. Node A and B are in a cluster - they both have the same content - they are replicated;
                    2. Key k1 is created and persisted by both nodes;
                    3. Then B leaves the cluster;
                    4. Key k2 is created and persisted only by node A (which is the only one remaining in the cluster);
                    5. Then node A comes down;
                    6. A while after, you restart only node B - the cluster content is made of k1 only;
                    7. Then comes a request to create a key k2. The request is handled by the only node B;
                    8. Finaly, node A is restarted...

                     

                    What do you expect for the content of key k2? Node A has a version in its persistent store which is not the same as the one held by the custer... Do you discard? Merge? Or else?

                     

                    What I mean is that your example illustrates the case where the entire cluster is bring down. When you restart it, you have to take special care - since with the configuration you expose, the first node will determine the content of your cluster data (in other words, your cluster will restart at the time this node stopped.

                    This is an extreme situation since you will usually make sure that a least one node survive in your cluster... In this case, everything will happen "correctly/as expected" during a node restart.

                    • 7. Re: Behavior of infinispan when node is down
                      mircea.markus
                      What do you expect for the content of key k2? Node A has a version in its persistent store which is not the same as the one held by the custer... Do you discard? Merge? Or else?

                      This would be handled as A joining an exising cluster and (possibly) fetching state from there.

                       

                      This is an extreme situation since you will usually make sure that a least one node survive in your cluster... In this case, everything will happen "correctly/as expected" during a node restart. 

                      yes, agreed that this is a extreme situation. I think at this point it's up to the applicatio to decide what "stale state" means, as there's no way for the system to decide that by itself. Or increse the redundacy and delay the problem