HASingleton election and communication with mod_cluster
brian.stansberry Jul 16, 2008 10:57 PM

See http://wiki.jboss.org/wiki/en/ModClusterDesign
and http://wiki.jboss.org/wiki/Mod-Cluster_AS_integration for background.
HASingleton election refers to the process by which one of a set of nodes in a cluster is chosen as the one that will act as the singleton master. Upon receiving events generated by the AS's DistributedReplicantManager (DRM) service, each node in the cluster independently decides who will be the master using an HASingletonElectionPolicy impl. The key requirement of the policy impl is that its decision must be deterministic, so every node makes the same decision.
An HASingletonElectionPolicy impl can use whatever data it wants to make its decision. The AS ships with a couple of impls, the most basic of which uses a node's ordinal position in the current JGroups view to pick the master.
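For illustration, a deterministic view-position policy might look something like the sketch below; the interface and the ClusterNode type here are simplified stand-ins for the real AS API, and all names are hypothetical.

import java.util.List;

// Simplified stand-in for the real policy interface; every node runs the
// same policy over the same candidate list, so the result is deterministic.
interface SimpleElectionPolicy {
    ClusterNode elect(List<ClusterNode> candidates);
}

// Picks the candidate at a fixed position in the list derived from the
// current JGroups view (e.g. 0 = first member, -1 = last member).
class ViewPositionPolicy implements SimpleElectionPolicy {
    private final int position;

    ViewPositionPolicy(int position) {
        this.position = position;
    }

    public ClusterNode elect(List<ClusterNode> candidates) {
        if (candidates.isEmpty()) {
            return null;
        }
        int index = (position < 0) ? candidates.size() + position : position;
        if (index < 0) index = 0;
        if (index >= candidates.size()) index = candidates.size() - 1;
        return candidates.get(index);
    }
}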
For the ModClusterService use case, there is an additional important factor in the election decision: a node's ability to communicate with the httpd servers that are proxying requests. A ModClusterService instance that can only communicate with one of four httpd servers is less desirable as the singleton master than one that can communicate with all four.
An added twist is that certain httpd servers are more valuable than others. An httpd server that has received configuration messages from the JBoss side is more valuable than one that hasn't; I call such a server "established". Imagine a scenario where there are 4 httpd servers, with only #1 and #2 established. Node A can communicate with #1 and #2, while B can communicate with #2 as well as a newly discovered #3 and #4. Node A should be the master even though B can communicate with more httpd servers: B has no ability to inform httpd #1 of future config changes or load balance factor changes, potentially leaving it in an invalid state.
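To make that weighting concrete, a hypothetical ordering of candidates could let the number of reachable established servers dominate and use total reachable servers only as a tie-breaker; the CandidateInfo type below is invented for this sketch.

import java.util.Comparator;

// Invented holder for what a node reports about its httpd connectivity.
class CandidateInfo {
    final int reachableEstablished; // established servers this node can reach
    final int reachableTotal;       // all servers this node can reach

    CandidateInfo(int reachableEstablished, int reachableTotal) {
        this.reachableEstablished = reachableEstablished;
        this.reachableTotal = reachableTotal;
    }
}

// Orders candidates best-first: established connectivity wins, total
// connectivity only breaks ties.
class MasterPreference implements Comparator<CandidateInfo> {
    public int compare(CandidateInfo a, CandidateInfo b) {
        if (a.reachableEstablished != b.reachableEstablished) {
            return b.reachableEstablished - a.reachableEstablished;
        }
        return b.reachableTotal - a.reachableTotal;
    }
}

// In the scenario above, A reports (established=2, total=2) and B reports
// (established=1, total=3); A sorts first and should be elected master.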
The solution I see for managing the election process is to include information about a node's ability to communicate with the httpd servers inside the object it registers with the DistributedReplicantManager service. The object registered with DRM is available to HASingletonElectionPolicy impls, so they can use this information in their election decisions. So that's fairly straightforward.
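As a sketch of what such a registered object might carry (the class and field names are assumptions, not an existing DRM replicant format):

import java.io.Serializable;
import java.net.InetSocketAddress;
import java.util.Set;

// Illustrative replicant a node could register with the
// DistributedReplicantManager so election policies can see its view of
// the httpd topology.
public class ModClusterReplicant implements Serializable {
    private static final long serialVersionUID = 1L;

    // address:port of each httpd server this node can currently reach
    public final Set<InetSocketAddress> reachableServers;
    // subset of reachableServers that are "established" (have received config)
    public final Set<InetSocketAddress> establishedServers;
    // set if this node failed the master's status RPC (see failure handling below)
    public final boolean errorState;

    public ModClusterReplicant(Set<InetSocketAddress> reachable,
                               Set<InetSocketAddress> established,
                               boolean errorState) {
        this.reachableServers = reachable;
        this.establishedServers = established;
        this.errorState = errorState;
    }
}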
What's trickier is ensuring that timing effects in the recognition of httpd server topology changes don't lead to spurious changes of the HASingleton master. For example:
a) httpd #5 comes online and advertises itself. JBoss node B notices this slightly before master node A and takes over as singleton master, only to have A become the master again shortly afterward.
b) communication with httpd #3 is disrupted. JBoss node A (the master) notices this slightly before node B, and thus gets unelected as master, only to be re-elected shortly thereafter once B notices the trouble.
A possible protocol for handling this:
1) All nodes can "discover" httpd servers. "Discover" means learning the address:port of the httpd server, either by parsing a config at startup, by detecting one via an AdvertiseListener, or possibly by having a new address:port added by an administrative tool.
2) All nodes delegate to the current HASingleton master to manage the process of including new httpd servers in the available pool or removing leaving ones. So, if non-master node B discovers a new httpd server, it sends a message to master node A advising of the new server. Node B does nothing further to initiate communication with the new httpd server. It does keep a reference to the new/removed server address:port so it can resend the message to a new master if the current master fails.
3) The master node updates its MCMPHandler with the address/port of all discovered servers, whether discovered locally or via a message from another node. (The MCMPHandler has to be able to deal with duplicate adds).
4) During the regular status processing (a rough code sketch of this cycle follows the list):
a) the master performs a status check on its own MCMPHandler -- i.e. adds newly discovered httpd servers, removes any it has been told no longer exist, tries to recover any in ERROR state via sending (cluster-wide) configuration data.
b) the master can now check how many httpd servers it knows about, how many are healthy, how many are established.
c) master sends an RPC to the cluster providing the address/port of all the currently known servers, along with a flag as to whether each is established. The other nodes use this to add/remove servers from their MCMPHandler, and then have their MCMPHandler perform a status check. This status check *does not* involve sending configuration messages to recover any in ERROR state; instead an innocuous message like INFO should be sent. The goal is simply to test the ability to communicate with the httpd server. Only the master node sends configuration requests to the httpd side.
d) each non-master node checks how many httpd servers it knows about, how many are healthy, how many are established and responds to the RPC with this information.
e) the master now has all the information that would go into an HASingleton election. If the state of any of the nodes has changed, that node will need to update its object stored in the DistributedReplicantManager service. Any time the DRM is updated, a new election occurs. The question is in what order nodes should make their updates. If the master node itself has changed state, it needs to decide whether to update the DRM before telling other nodes to do so, or to let the other nodes update first and then make its own update. If one order or the other will result in the current master being re-elected, the master should use that order for the next two steps:
f) master sends a message telling the other nodes to update the DRM with the state they sent in step d).
g) master updates the DRM with its current state.
h) upon receipt of the message from f), any reference to a new/removed server the node may have cached in step 2) above is cleared (since steps d/e are complete, all nodes in the cluster now know about the new httpd server).
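Putting steps a) through g) together, the master's side of one status pass might look roughly like the following; every type and method name here is invented scaffolding, not the real MCMPHandler, RPC, or DRM API.

import java.util.Map;

// Everything below is invented to show the flow of one master status pass.
public class MasterStatusCycle {

    interface McmpHandler {
        void status();              // (a) add/remove servers, recover ERROR via config
        ProxySnapshot snapshot();   // (b) servers known / healthy / established
    }

    interface ProxySnapshot { }

    interface ClusterRpc {
        // (c)+(d): push server list + established flags; each node syncs its
        // MCMPHandler, probes with INFO only, and replies with its counts
        Map<String, NodeReport> statusCheck(ProxySnapshot servers);
        void updateDrm();           // (f): tell others to update the DRM with their d) state
    }

    interface NodeReport { }

    private final McmpHandler mcmpHandler;
    private final ClusterRpc rpc;

    public MasterStatusCycle(McmpHandler mcmpHandler, ClusterRpc rpc) {
        this.mcmpHandler = mcmpHandler;
        this.rpc = rpc;
    }

    public void run() {
        mcmpHandler.status();                                       // (a)
        ProxySnapshot local = mcmpHandler.snapshot();               // (b)
        Map<String, NodeReport> replies = rpc.statusCheck(local);   // (c)+(d)

        // (e) the master now has everything an election would use; pick the
        // DRM update order that keeps the current master in place, if possible
        boolean selfFirst = keepsCurrentMaster(local, replies);
        if (selfFirst) {
            updateLocalReplicant(local);                            // (g)
            rpc.updateDrm();                                        // (f)
        } else {
            rpc.updateDrm();                                        // (f)
            updateLocalReplicant(local);                            // (g)
        }
        // (h) runs on the non-masters when they receive the f) message:
        // they clear any new/removed servers cached in step 2)
    }

    private boolean keepsCurrentMaster(ProxySnapshot local, Map<String, NodeReport> replies) {
        return true; // placeholder: run the election policy over both orderings
    }

    private void updateLocalReplicant(ProxySnapshot local) {
        // placeholder: convert the snapshot to a replicant and register it with the DRM
    }
}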
Failure handling in the above:
1) Step 2 above: the problem of dropped discovery messages. See 4.h. If a new master is elected before 4.h occurs, the node resends the discovery message to the new master (a sketch of this bookkeeping follows this list).
2) A node doesn't respond to the 4.d RPC or responds with an exception. In 4.g, the master instructs any such node to report itself to the DRM as being in an error state; the election policy should treat such a node as being last in line for election as master.
3) Master fails after 4.e. The failure is detected, so a new master is chosen, but the election is based on outdated information. Whoever is elected master, on the next cycle the DRM will be updated with current information and a new election can occur, possibly resulting in yet another change of master. Need to decide how to deal with that; i.e. how should the "temporary" new master react? Depends on what a new master does anyway. If taking over as master doesn't involve any heavy work (i.e. refreshing all the configs on all the httpd servers), then this situation is harmless. At the moment I don't think taking over as master will involve heavy work.
4) Master fails after 4.f, doesn't complete 4.g. An election will occur after 4.f, using outdated data for the master node. Election can have two results: i) the master is re-elected, in which case the cluster just waits for the master to be suspected, forcing a new election. This is fine. ii) the master is not re-elected and a new node takes over as master, which is fine.
5) Master chooses 4.g before 4.f but fails after 4.g. Same as 3) above.
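For failure case 1, the non-master bookkeeping could be as simple as the following sketch; all names are invented.

import java.net.InetSocketAddress;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Tracks proxies this node has discovered (or seen removed) but that the
// master has not yet confirmed as known cluster-wide (step 4.h).
public class PendingDiscoveries {

    // Invented callback for sending the discovery message to the master.
    public interface MasterNotifier {
        void notifyMaster(InetSocketAddress proxy);
    }

    private final Set<InetSocketAddress> pending =
            Collections.synchronizedSet(new HashSet<InetSocketAddress>());
    private final MasterNotifier notifier;

    public PendingDiscoveries(MasterNotifier notifier) {
        this.notifier = notifier;
    }

    // step 2): forward a discovery to the current master and remember it
    public void discovered(InetSocketAddress proxy) {
        pending.add(proxy);
        notifier.notifyMaster(proxy);
    }

    // step 4.h): the master's message confirmed cluster-wide knowledge
    public void acknowledged() {
        pending.clear();
    }

    // failure case 1): a new master was elected before 4.h, so replay
    // anything that was never acknowledged
    public void masterChanged() {
        synchronized (pending) {
            for (InetSocketAddress proxy : pending) {
                notifier.notifyMaster(proxy);
            }
        }
    }
}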