Handling cluster state when network partitions occur
brian.stansberry Aug 29, 2007 3:13 PMForum thread to capture some thoughts from a phone discussion today and to continue the conversation. Topic is how stateful clustered applications should deal with network partitions. See necessary background reading at http://www.jgroups.org/javagroupsnew/docs/manual/html/user-advanced.html#d0e2915 for an overview of the problem.
For discussion brevity, I'll refer to 3 of the ideas mentioned in the JGroups doc by shorted:
1) "Merge state". Discussed in 5.6.1 -- JGroups user manages a process to merge the state from the merging partitions.
2) "Primary partition". Discussed in 5.6.2. Nodes that realize upon merge that they weren't in the primary partition flush their state, ensuring that state must be reacquired from a shared persistent store.
3) "Read-only". Nodes realize at the time the partition occurs that they are not in the primary partition, and shift into read-only mode, throwing exceptions if an attempt is made to write. When the partition heals, they get the current state from the primary partition.
Some things mentioned in the discussion today:
* Using timestamps for "merge state" is problematic due to clock synchronization and clock granularity issues.
* "Merge state" is tricky for PojoCache, because the stored pojo doesn't have a natural timestamp or counter that can be used to determine which state to retain. JBC/PojoCache can compare state between nodes, but if there is a conflict there probably needs to be a callback to the application to allow it to decide what to do.
* "Merge state" is more useful to http sessions that don't use FIELD granularity (i.e. don't use PojoCache). Web sessions are supposed to be managed by only one node, so the difficulty of deciding which node's state is correct is reduced. The stored sessions have a timestamp and counter stored in the root node for the sessions subtree. If JBC provided a callback to the JBossCacheManager, the JBCM could easily determine which server's data to retain for each session. Bela pointed out that this reconciliation process could also be done "off-line" via a special service that multiplexes on the cache's channel -- i.e. JBC doesn't have to be directly involved.
* "Read-only" is a possible option for some webapps that don't mind throwing exceptions. E.g. a blog site wants people to be able to continue reading the blog, and doesn't mind if update functions are disrupted during the network partition.
* Entity caching is a tough problem. At first, something like the "primary partition" approach seems good; just dump cached state when the merge occurs and force Hibernate to repopulate the cache from the db. Problem is that during the partition itself the caches can become invalid. Two sub-partitions of {A, B, C} and {D, E} will not know about each others' updates, and thus Hibernate can read stale data from the cache. Variations on the "read-only" approach where {D, E} no longer provides cached data don't solve the problem, as Hibernate on D or E can still update the database, in which case the {A, B, C} caches are invalid.
For this kind of situation, having the application become unavailable on {D, E} is something we need to consider.
This is just a quick summary; I'm sure I missed stuf.