
    Handling cluster state when network partitions occur

    brian.stansberry

      Forum thread to capture some thoughts from a phone discussion today and to continue the conversation. Topic is how stateful clustered applications should deal with network partitions. See necessary background reading at http://www.jgroups.org/javagroupsnew/docs/manual/html/user-advanced.html#d0e2915 for an overview of the problem.

For brevity, I'll refer to three of the ideas mentioned in the JGroups doc by shorthand names:

      1) "Merge state". Discussed in 5.6.1 -- JGroups user manages a process to merge the state from the merging partitions.
      2) "Primary partition". Discussed in 5.6.2. Nodes that realize upon merge that they weren't in the primary partition flush their state, ensuring that state must be reacquired from a shared persistent store.
      3) "Read-only". Nodes realize at the time the partition occurs that they are not in the primary partition, and shift into read-only mode, throwing exceptions if an attempt is made to write. When the partition heals, they get the current state from the primary partition.

      Some things mentioned in the discussion today:

      * Using timestamps for "merge state" is problematic due to clock synchronization and clock granularity issues.

      * "Merge state" is tricky for PojoCache, because the stored pojo doesn't have a natural timestamp or counter that can be used to determine which state to retain. JBC/PojoCache can compare state between nodes, but if there is a conflict there probably needs to be a callback to the application to allow it to decide what to do.

      * "Merge state" is more useful to http sessions that don't use FIELD granularity (i.e. don't use PojoCache). Web sessions are supposed to be managed by only one node, so the difficulty of deciding which node's state is correct is reduced. The stored sessions have a timestamp and counter stored in the root node for the sessions subtree. If JBC provided a callback to the JBossCacheManager, the JBCM could easily determine which server's data to retain for each session. Bela pointed out that this reconciliation process could also be done "off-line" via a special service that multiplexes on the cache's channel -- i.e. JBC doesn't have to be directly involved.

      * "Read-only" is a possible option for some webapps that don't mind throwing exceptions. E.g. a blog site wants people to be able to continue reading the blog, and doesn't mind if update functions are disrupted during the network partition.

      * Entity caching is a tough problem. At first, something like the "primary partition" approach seems good; just dump cached state when the merge occurs and force Hibernate to repopulate the cache from the db. Problem is that during the partition itself the caches can become invalid. Two sub-partitions of {A, B, C} and {D, E} will not know about each others' updates, and thus Hibernate can read stale data from the cache. Variations on the "read-only" approach where {D, E} no longer provides cached data don't solve the problem, as Hibernate on D or E can still update the database, in which case the {A, B, C} caches are invalid.

      For this kind of situation, having the application become unavailable on {D, E} is something we need to consider.
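To tie the web-session point above to code, here is a sketch of the per-session comparison the JBCM could make in such a callback (SessionVersion and pickSurvivor are illustrative names, not JBC API):

class SessionVersion {
    final long timestamp;   // last update time from the session's root node
    final int version;      // monotonically increasing update counter

    SessionVersion(long timestamp, int version) {
        this.timestamp = timestamp;
        this.version = version;
    }
}

class SessionMergePolicy {
    // Decide which copy of a session to retain after a merge: the higher
    // version counter wins; the timestamp only breaks ties, since clock
    // sync and granularity issues make it unreliable as the main criterion.
    static SessionVersion pickSurvivor(SessionVersion local, SessionVersion remote) {
        if (remote.version != local.version)
            return remote.version > local.version ? remote : local;
        return remote.timestamp >= local.timestamp ? remote : local;
    }
}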


This is just a quick summary; I'm sure I missed stuff.

        • 1. Re: Handling cluster state when network partitions occur
          brian.stansberry

          http://www.jboss.com/index.html?module=bb&op=viewtopic&t=117402 is a container JIRA with links to related JIRAs for web sessions, sfsb, etc.

          • 2. Re: Handling cluster state when network partitions occur
            belaban

(1) The problem with merging state is that it is very dependent on the application and also, in most cases, non-trivial to do right. As Brian mentioned, merging web session state is simple, but what happens if an HTTP session has a Pojo as a value? Or, even worse, an entity bean which has a primary key and is persistent? Then anyone can get a ref to that bean and change it, so we cannot assume the session is the only way to access an entity bean.

            (2) Primary partition
This is generally good, as we avoid a merge because the non-primary partition(s) discard their state and re-initialize themselves from the primary partition. However, this means that all state changes that happened in the non-primary partition are discarded! This is probably not acceptable for most applications.

            (3) Primary partitions detected when they happen
If we can say that (assuming we have 5 nodes in the cluster) a primary partition needs to have 3 or more nodes and - if it has fewer than 3 - becomes read-only, then we avoid state changes in a non-primary partition.
            We could define a primary partition to be the majority, requiring an odd number of nodes, or come up with a different *deterministic* scheme.
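As a sketch, such a deterministic test could be as simple as the following, assuming the expected full cluster size is known (e.g. from configuration):

import org.jgroups.View;

class PrimaryPartitionPolicy {
    static final int EXPECTED_SIZE = 5;   // e.g. read from configuration

    // Strict majority rule: with 5 expected nodes, {A,B,C} (3 members)
    // is primary; {D,E} (2 members) is not, and would go read-only.
    static boolean isPrimary(View view) {
        return view.size() > EXPECTED_SIZE / 2;
    }
}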
Problem is: if we have {A,B,C,D,E} and then 2 partitions P1={A,B,C} and P2={D,E}, and they're all connected to a DB, then changes within P1 are replicated within P1 and persisted to the DB, but *not* replicated to P2. This means that applications connected to P2 will read stale data!
            In this case, the only solution I see is to shut down P2 and redirect clients if that's possible. Shutting down could mean
            - Leave the cluster (P2), so D and E disconnect from the cluster
- Suspend (or close) the AJP connector (if we come in via Apache mod_jk), or *somehow* let the load balancer know that it should redirect requests to P1 and not P2 anymore. This of course requires the load balancer to have visibility into P1 and P2.

            My 2 cents

            • 3. Re: Handling cluster state when network partitions occur
              belaban

(3) Another solution, rather than shutting down, is to cease caching values and pass all requests directly on to the underlying DB. Of course, this only makes sense if we *have* a DB; otherwise shutting down is probably our only option
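A sketch of that pass-through mode (the wrapper class and flag are made up for illustration, not JBC API):

import java.util.HashMap;
import java.util.Map;

class PassThroughCache {
    private final Map<Object, Object> cache = new HashMap<Object, Object>();
    volatile boolean partitioned;   // flipped from viewAccepted()

    Object get(Object key) {
        if (partitioned)
            return loadFromDb(key);   // ignore the possibly stale cache
        Object v = cache.get(key);
        return v != null ? v : loadFromDb(key);
    }

    private Object loadFromDb(Object key) {
        return null;   // JDBC lookup elided in this sketch
    }
}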

              • 4. Re: Handling cluster state when network partitions occur
                galder.zamarreno

(1) Re: Merge state for Pojos. A callback seems the most reasonable option here. Maybe we could instruct users to implement a method that looks like:

                void merge(Object o)

where Object o is merged into the Pojo on which the method is called. Users would then merge the individual instance attributes and call the Pojo's setters, which would be intercepted. We could also say that this method would only be called on the Pojo version in the primary partition (any node in the primary partition would do, since all nodes in that partition would have the same state).

If the network partition resulted in N subnetworks, merge() could get a Collection of N Pojos, one per subnetwork.
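As an interface, the proposed contract might look like this (hypothetical names, not PojoCache API):

import java.util.Collection;

// The cache would invoke this on the primary partition's copy of the Pojo,
// handing it the diverged copies from the other subgroup(s).
public interface Mergeable {
    // Two-way split: merge the single foreign copy into this instance,
    // using the Pojo's setters so the writes are intercepted.
    void merge(Object other);

    // N-way split: one diverged copy per subnetwork.
    void mergeAll(Collection<?> others);
}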

                • 5. Re: Handling cluster state when network partitions occur
                  brian.stansberry


                  "bela@jboss.com" wrote:
(3) Another solution, rather than shutting down, is to cease caching values and pass all requests directly on to the underlying DB. Of course, this only makes sense if we *have* a DB; otherwise shutting down is probably our only option


This would need to happen on all partitions, though; otherwise you have the problem of the minor partitions updating the database, and the primary partition then reading stale data from its cache.

                  Ceasing caching on all partitions is a problem though, as it's not clear how the primary partition distinguishes the cluster split from normal operation.

                  • 6. Re: Handling cluster state when network partitions occur
                    belaban

You're right, so it looks like the only solution here is to shut down that node and fail clients over

                    • 7. Re: Handling cluster state when network partitions occur
                      galder.zamarreno

It'd be cool if JGroups could hook into network monitoring tools and be notified of network partitions or cluster split situations.

                      • 8. Re: Handling cluster state when network partitions occur
                        belaban

How do these tools determine whether we have a partition, or whether some members just crashed? Nothing different from what we want to do: they need a deterministic decision, e.g. through an independent arbiter (shared file system, DB, or a separate network), or through a majority decision as discussed before.
                        In other words, those tools won't solve the problem for us, either.
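For the DB variant, the arbiter could be as crude as a single lease row that partition coordinators try to grab atomically; whoever wins the UPDATE is primary. A sketch (the CLUSTER_LEASE table and all names are made up for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class DbArbiter {
    // CLUSTER_LEASE(OWNER, EXPIRES) holds exactly one row. The UPDATE only
    // matches if we already own the lease or the old lease has expired, so
    // at most one contender can win it.
    boolean tryAcquireLease(Connection con, String nodeName, long leaseMillis)
            throws SQLException {
        long now = System.currentTimeMillis();
        PreparedStatement ps = con.prepareStatement(
            "UPDATE CLUSTER_LEASE SET OWNER = ?, EXPIRES = ? " +
            "WHERE OWNER = ? OR EXPIRES < ?");
        try {
            ps.setString(1, nodeName);
            ps.setLong(2, now + leaseMillis);
            ps.setString(3, nodeName);
            ps.setLong(4, now);
            return ps.executeUpdate() == 1;   // we own (or renewed) the lease
        } finally {
            ps.close();
        }
    }
}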

                        • 9. Re: Handling cluster state when network partitions occur
                          galder.zamarreno

Yeah, you're right on that. There's no guarantee that such an independent arbiter would be available, either.

                          • 10. Re: Handling cluster state when network partitions occur
                            brian.stansberry

                            The "merge state" approach gets another wrinkle when you factor in buddy replication"

1) Good. If you're holding buddy backup data for a node that wasn't in your subpartition (i.e. D has A's backup data when the split was {A B C} {D E}), then when you get the MergeView and you see A was still there, you should be able to safely discard your old backup of A's data (see the sketch at the end of this post).

                            2) Bad. With total replication, the subpartition coordinators can handle the state merging task, since they have a copy of all the data (e.g. A and D could compare state). With buddy replication, everyone has to reconcile with all nodes in the foreign partition, since any one of them could have become an owner of data during the split (D reconciles with A,B,C; E reconciles with A,B,C).


The above is not the same when data partitioning is used instead of buddy replication. Since the goal is to replace BR with data partitioning, maybe we will end up not having to worry about the BR wrinkles. :)
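Point 1 could look roughly like this (BuddyBackupCleaner and the backup map are made-up names, not the real BuddyManager internals):

import java.util.HashMap;
import java.util.Map;
import org.jgroups.Address;
import org.jgroups.MergeView;
import org.jgroups.View;

class BuddyBackupCleaner {
    // data owner -> buddy backup tree we hold for it
    private final Map<Address, Object> buddyBackups = new HashMap<Address, Object>();

    void onMerge(MergeView merge, Address localAddress) {
        for (Object o : merge.getSubgroups()) {   // one View per pre-merge partition
            View sub = (View) o;
            if (sub.getMembers().contains(localAddress))
                continue;   // our own subgroup: backups we hold for these stayed current
            for (Object owner : sub.getMembers()) {
                // The owner survived in another partition, so the backup we
                // hold for it is stale and can be safely discarded.
                buddyBackups.remove(owner);
            }
        }
    }
}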

                            • 11. Re: Handling cluster state when network partitions occur
                              manik


                              "galder.zamarreno@jboss.com" wrote:
(1) Re: Merge state for Pojos. A callback seems the most reasonable option here. Maybe we could instruct users to implement a method that looks like:

                              void merge(Object o)


I think merging is nasty, no matter which way you slice it. Even if a callback is used, and even if timestamps are recorded, you still have clock sync issues, plus the problem where some nodes have used stale data as input to other transactions, creating a GIGO situation when the partition heals.

                              There may be a few specific applications where merging may work, but I can't see how we can devise a generic solution around merging.

                              • 12. Re: Handling cluster state when network partitions occur
                                manik

I think the primary partition approach is best. Caches not in the primary partition purging their in-memory state is probably the wrong path, though, since, as a generic solution, not all installations will be backed by shared databases.

                                Caches shutting down would be my preferred option. Perhaps block for a short period, hoping the network would heal, and then throw an exception after a timeout. Perhaps a specific exception - SplitBrainException or something - so that cache users such as HTTP Replication can react by forcing an HTTP response like 410 (don't know if this is possible - Brian?) such that the load balancer will treat the node as unavailable. Once the partition heals the cache is made available to requests again after performing a state transfer to come up to speed with the primary partition.
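A sketch of that gate (SplitBrainException and PartitionGate are names I'm inventing here, not existing JBC API):

class SplitBrainException extends RuntimeException {
    SplitBrainException(String msg) {
        super(msg);
    }
}

class PartitionGate {
    private boolean inPrimaryPartition = true;
    private final Object lock = new Object();

    // Called at the start of each cache operation: block briefly in the
    // hope that the partition heals, then fail fast so callers (e.g. the
    // HTTP session layer) can signal the load balancer.
    void checkPartition(long timeoutMillis) throws InterruptedException {
        synchronized (lock) {
            long deadline = System.currentTimeMillis() + timeoutMillis;
            while (!inPrimaryPartition) {
                long wait = deadline - System.currentTimeMillis();
                if (wait <= 0)
                    throw new SplitBrainException("not in the primary partition");
                lock.wait(wait);
            }
        }
    }

    void setInPrimaryPartition(boolean primary) {   // driven from viewAccepted()
        synchronized (lock) {
            inPrimaryPartition = primary;
            if (primary)
                lock.notifyAll();
        }
    }
}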

Even the impact of incorrectly identifying a primary partition is low, since in the worst case the larger partition is unresponsive while the smaller one remains available. I guess the real problem is more than one partition thinking it is primary. :-)

                                • 13. Re: Handling cluster state when network partitions occur
                                  belaban

So what are you guys going to do? The reason I brought this up is that we need to implement merging policies in *certain* apps. For example, Brian said that he thinks a simple union-merge would be sufficient for HTTP session replication (don't know the effect on Buddy Replication, though). For the Hibernate 2nd-level cache we might go into a non-operational state, and so on.
                                  I'd like to identify the cases where we *can* do something and implement them.
For the other cases, we should provide a number of policies that a user can choose from. In some cases, a user might also want to handle viewChange(MergeView) himself, so we need to document what can be done.

                                  • 14. Re: Handling cluster state when network partitions occur
                                    manik


                                    Ok, how about:

                                    * Provide a MergeHandler configuration
                                    * Defaults to IgnoreOnMerge, which does nothing. FlushOnMerge also provided, which flushes in-memory state on merge
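As a sketch, the pluggable policy could look like this (hypothetical names, not a shipped API):

import java.util.List;
import org.jgroups.View;

// Configurable merge policy; the cache invokes it when a MergeView arrives.
public interface MergeHandler {
    void onMerge(List<View> preMergeViews);
}

// Default: ignore the merge entirely.
class IgnoreOnMerge implements MergeHandler {
    public void onMerge(List<View> preMergeViews) { /* no-op */ }
}

// Flush in-memory state on merge so it is re-read from the store or the
// primary partition.
class FlushOnMerge implements MergeHandler {
    public void onMerge(List<View> preMergeViews) {
        // cache.clear() / evict the local tree -- elided in this sketch
    }
}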


                                    Good. This would allow Brian to write his union-merge policy for HTTP session merging.


But more important than this: how can an instance tell that there has been a split, and that it is not in the primary partition, *before* a merge takes place? E.g., in your example of {A,B,C} and {D,E}, how can instance E decide to shut down and not process requests?


                                    Like I explained in http://www.jgroups.org/javagroupsnew/docs/manual/html/user-advanced.html#d0e2956 (5.6.2 and 5.6.3).
