      • 15. Re: Handling cluster state when network partitions occur
        brian.stansberry

         

        "manik.surtani@jboss.com" wrote:
        Ok, how about:

        * Provide a MergeHandler configuration
        * Defaults to IgnoreOnMerge, which does nothing. FlushOnMerge also provided, which flushes in-memory state on merge
        * Allows users to write their own MergeHandlers, which are basically called when a MergeView is received.
        * Users could also do something similar in a CacheListener's viewChange() callback.


        To make any kind of custom MergeHandler useful, JBC would need to expose an API to let the handler get the info it needs from around the cache. That's likely to be a pretty complex API.
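        To make the concern concrete, the kind of API a custom handler would need might look something like the sketch below. All of the names here (MergeHandler, MergeContext) are hypothetical; nothing like this exists in JBC today, and MergeContext is exactly the "complex API" in question:

        // Hypothetical SPI -- illustrative names only, not an existing JBC API.
        public interface MergeHandler {

            /**
             * Called when the underlying JGroups channel delivers a MergeView,
             * i.e. two or more subgroups are rejoining after a partition.
             */
            void onMerge(MergeContext ctx);
        }

        // The "complex API" the handler would need: membership information
        // around the merge plus access to the local cache state.
        public interface MergeContext {
            org.jgroups.View newView();                    // the merged view
            java.util.List<org.jgroups.View> subgroups();  // the partitions being merged
            org.jboss.cache.Cache<?, ?> cache();           // local state
        }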


        But more important than this, how can an instance tell that there has been a split, and that it is not in the primary partition, *before* a merge takes place? E.g., in your example of {A,B,C} and {D,E}, how can instance E decide to shut down and not process requests?


        One way is the kind of static configuration Bela mentioned. The user expects a 5-node cluster and declares 3 as the minimum primary partition size.

        Another, flawed, way is to track the view size and how views change. As the view size increases, keep recalculating your primary partition size. As the view size decreases, check how rapidly it is decreasing. A rapid decrease below the primary partition size == a cluster split; a slow decrease == nodes leaving normally, so recalculate the primary partition size.
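
        To make the first option concrete, here is a minimal sketch of a guard built on a statically configured minimum primary partition size (class and method names are made up; only org.jgroups.View is real). The dynamic variant above would simply keep recalculating the minimum from recent views instead of fixing it in configuration:

        import org.jgroups.View;

        /**
         * Sketch of the static approach: the minimum primary partition size is
         * fixed up front; everything else stays dynamic. Wire onViewChange() to
         * a JGroups MembershipListener.viewAccepted() callback or a cache
         * view-change listener.
         */
        public class PrimaryPartitionGuard {

            private final int minPrimaryPartitionSize; // e.g. 3 for an expected 5-node cluster

            public PrimaryPartitionGuard(int minPrimaryPartitionSize) {
                this.minPrimaryPartitionSize = minPrimaryPartitionSize;
            }

            public void onViewChange(View view) {
                if (view.size() < minPrimaryPartitionSize) {
                    // Probably in a minority partition: stop serving requests, e.g.
                    // disconnect the channel and/or disable the web connector.
                    System.out.println("View " + view + " is below the primary partition size ("
                            + minPrimaryPartitionSize + "); going offline");
                }
            }
        }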

        • 16. Re: Handling cluster state when network partitions occur
          belaban

           

          "bstansberry@jboss.com" wrote:


          To make any kind of custom MergeHandler useful, JBC would need to expose an API to let the handler get the info it needs from around the cache. That's likely to be a pretty complex API.



          This does not necessarily need to be provided by JBossCache; you could have an external service (e.g. an RpcDispatcher/ClusterPartition running on a Multiplexer) and fetch the necessary information *outside of JBossCache*.
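
          For example, such an external service could look roughly like this, written against the plain JGroups 2.x RpcDispatcher API (the class and method names other than the JGroups ones are made up for illustration):

          import org.jgroups.JChannel;
          import org.jgroups.blocks.GroupRequest;
          import org.jgroups.blocks.RpcDispatcher;
          import org.jgroups.util.RspList;

          /** Sketch of an info-gathering service that lives outside JBossCache. */
          public class ClusterInfoService {

              private final JChannel channel;
              private final RpcDispatcher dispatcher;

              public ClusterInfoService(String props, String clusterName) throws Exception {
                  channel = new JChannel(props);
                  // 'this' is the server object whose public methods peers can invoke
                  dispatcher = new RpcDispatcher(channel, null, null, this);
                  channel.connect(clusterName);
              }

              /** Invoked remotely on every member; return whatever local state matters. */
              public String getLocalInfo() {
                  return channel.getLocalAddress() + ": view=" + channel.getView();
              }

              /** Ask all current members for their local info. */
              public RspList fetchClusterInfo() {
                  return dispatcher.callRemoteMethods(null, "getLocalInfo",
                          new Object[0], new Class[0], GroupRequest.GET_ALL, 5000);
              }
          }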



          Another, flawed, way is to track the view size and how views change. As the view size increases, keep recalculating your primary partition size. As the view size decreases, check how rapidly it is decreasing. A rapid decrease below the primary partition size == a cluster split; a slow decrease == nodes leaving normally, so recalculate the primary partition size.



          That's a bit non-deterministic: what is *slow*? It is as sloppy as using timeouts...
          Any policy works to pick the primary partition, as long as it is deterministic. Note that here we're not just talking about *MergeViews* but regular *Views* as well!
          For example, a SplitBrainPolicy for primary partition could be
          - 3 members or more
          OR
          - Host A is part of the view

          I think this works pretty well for views and merge views. When not in the primary partition, we could shut down the node, i.e. disconnect from the channel, disable the Connector (in JBossWeb) and/or send 404 or some HTTP error codes back to mod_jk so Apache knows not to send any more requests to this node.
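
          In code, that policy could look roughly like the following sketch (hypothetical class names; only the JGroups View/Channel/Address types are real, and since MergeView extends View the same check covers both cases):

          import org.jgroups.Address;
          import org.jgroups.Channel;
          import org.jgroups.View;

          /** Primary partition = 3 or more members, OR the designated "host A" is in the view. */
          public class SimpleSplitBrainPolicy {

              private final Address hostA; // designated tie-breaker member

              public SimpleSplitBrainPolicy(Address hostA) {
                  this.hostA = hostA;
              }

              public boolean isPrimaryPartition(View view) {
                  return view.size() >= 3 || view.getMembers().contains(hostA);
              }

              /** Example reaction when we find ourselves outside the primary partition. */
              public void onViewChange(View view, Channel channel) {
                  if (!isPrimaryPartition(view)) {
                      channel.disconnect();
                      // ...then disable the JBossWeb Connector / tell mod_jk to stop
                      // routing requests to this node (not shown).
                  }
              }
          }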

          • 17. Re: Handling cluster state when network partitions occur
            brian.stansberry

            I said right off it was flawed, didn't I? :)

            The only benefit to it is that it's dynamic. But I suppose if people want things to be completely dynamic, they shouldn't expect this kind of functionality; they should be using redundant power supplies, IP bonding and redundant networks.

            Re: disabling access if this occurs, there's no standard HTTP status code (e.g. 410) to tell a load balancer to fail over; all the HTTP status codes propagate to the client. With mod_jk, they're working on a much richer communication between the Tomcat side and the Apache side, so all sorts of status information about the cluster could be passed. With other load balancers from major vendors we could send a custom HTTP header, but we'd need to coordinate with the vendors to get them to understand the header. If none of those options are available, we'd have to shut down the connector.

            I want to have pluggable policies for configuring this kind of thing, expanding on the existing "useJK" flag.

            • 18. Re: Handling cluster state when network partitions occur
              belaban

              Yes, if Apache intercepted special codes like 410, that'd be great. The current workaround would be to invoke http://apache-host:port/status/&mode=disabled, and that is a kludge.
              What's the status of this richer AJP protocol?

              • 19. Re: Handling cluster state when network partitions occur
                brian.stansberry

                When I discussed it with Mladen in June it was a work in progress, but at a lower priority than AS 5 tasks.

                • 20. Re: Handling cluster state when network partitions occur
                  manik

                  I inherently don't like using a static cluster size configuration. With larger clusters, nodes joining and leaving the cluster should be seen as "normal", and the cluster should be able to scale up or down without reconfiguration.

                  Correct me if I am wrong, but if you have {A, B, C} on switch S1 and {D, E} on switch S2, and if S2 fails, does JGroups deliver 2 view changes to {A, B, C} (one without D and one without both D and E) or just a single view change, without D and E? Assuming FD_SOCK is used for immediate switch failure detection?

                  If this is the case, can't a split brain threshold be used? I.e., if a view change is received in which more than N% of members are removed, assume a split brain as opposed to normal drop-offs?
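
                  To make that threshold idea concrete, a sketch along these lines (illustrative names only; Bela's reply below explains why a single view change cannot be relied on to drop all unreachable members at once):

                  import org.jgroups.View;

                  /** Suspect a split if more than 'threshold' of the previous membership vanishes in one view change. */
                  public class SplitThresholdDetector {

                      private final double threshold; // e.g. 0.25 == more than 25% gone at once
                      private int lastViewSize = -1;

                      public SplitThresholdDetector(double threshold) {
                          this.threshold = threshold;
                      }

                      public synchronized boolean viewAccepted(View newView) {
                          boolean suspectedSplit = false;
                          if (lastViewSize > 0 && newView.size() < lastViewSize) {
                              double dropped = (lastViewSize - newView.size()) / (double) lastViewSize;
                              suspectedSplit = dropped > threshold;
                          }
                          lastViewSize = newView.size();
                          return suspectedSplit;
                      }
                  }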

                  • 21. Re: Handling cluster state when network partitions occur
                    manik

                    Basically, this is pretty much the same as a static configuration, except that the cluster size used is the last known number of members.

                    Better still, how about maintaining an average number of members over time and using this as the cluster size? It's unlikely that a cluster would grow by _doubling_ the number of servers or anything like that...
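
                    As a sketch of what that averaging could look like (purely illustrative, an exponentially weighted average fed from each view change):

                    /** Keep a running average of the membership size and use it as the "expected" cluster size. */
                    public class AverageClusterSize {

                        private static final double ALPHA = 0.1; // weight given to the newest sample
                        private double average = -1;

                        /** Feed this with the view size on every view change. */
                        public synchronized void sample(int currentViewSize) {
                            average = (average < 0) ? currentViewSize
                                                    : ALPHA * currentViewSize + (1 - ALPHA) * average;
                        }

                        /** Treat the view as primary if it still holds a majority of the averaged size. */
                        public synchronized boolean isPrimary(int currentViewSize) {
                            return average > 0 && currentViewSize > average / 2;
                        }
                    }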

                    • 22. Re: Handling cluster state when network partitions occur
                      belaban

                       

                      "manik.surtani@jboss.com" wrote:
                      I inherently don't like using a static cluster size configuration. With larger clusters, nodes joining and leaving the cluster should be seen as "normal", and the cluster should be able to scale up or down without reconfiguration.


                      This is not a static configuration of the membership, e.g. {A,B,C,D,E}. The only thing that's static is the primary partition size.


                      Correct me if I am wrong, but if you have {A, B, C} on switch S1 and {D, E} on switch S2, and if S2 fails, does JGroups deliver 2 view changes to {A, B, C} (one without D and one without both D and E) or just a single view change, without D and E? Assuming FD_SOCK is used for immediate switch failure detection?


                      Depends. Usually you get 2 view changes, sometimes 1. Remember that if a switch goes down, the TCP connection in FD_SOCK will *not* be closed, so we'd have to rely on FD here.


                      If this is the case, can't a split brain threshold be used? I.e., if a view change is received in which more than N% of members are removed, assume a split brain as opposed to normal drop-offs?


                      No, because you can't rely on JGroups excluding all 'dead' members in *1* view. This depends on the failure detection and group membership protocol implementations, and they might all be different.

                      • 23. Re: Handling cluster state when network partitions occur
                        manik

                        Decided to provide a policy-based approach here. When we get a JGroups notification of a merge, provide plugins that allow for the following (see the sketch after this list):

                        1) Evict entire cluster state (useful with Hibernate)
                        2) Evict cache state of the losing sub-cluster
                        3) Throw exception, stop cache (useful if we need to force some manual cleanup)
                        4) Perform merging of disjoint subtrees. (Maybe later; this "plugin" could be released in the future)
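
                        A rough sketch of what such a plugin point might look like (the names are hypothetical, not an existing JBoss Cache API; only the JGroups MergeView and JBC Cache types are real):

                        import org.jgroups.MergeView;

                        /** Hypothetical plugin point; one implementation per built-in policy above. */
                        public interface MergePolicy {

                            /** The built-in behaviours listed above. */
                            enum BuiltIn { EVICT_ALL, EVICT_LOSING_PARTITION, FAIL_AND_STOP_CACHE, MERGE_SUBTREES }

                            /**
                             * Invoked when the channel delivers a MergeView after a partition heals.
                             * The winning/losing subgroups can be derived from mergeView.getSubgroups().
                             */
                            void handleMerge(org.jboss.cache.Cache<?, ?> cache, MergeView mergeView);
                        }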
