Handling Cluster Merges -- HASingleton and the "Split-Brain" Syndrome
Split-Brain Syndrome and Merging
"Split-Brain" syndrome refers to a condition where some or all of the nodes in a cluster lose the ability to communicate with each other, with the result that two or more subclusters form. Each of these subclusters believes it is the entire cluster and that all the other nodes in the cluster have failed.
A "split brain" condition is usually due to some sort of network failure preventing communication between members, aka a "network partition". In JBoss AS, such a failure would result in the JGroups failure detection protocols assuming unreachable members have failed and excluding them from the group, with each set of nodes that can still communicate forming a subcluster. (In a case of switch failure, you could end up with multiple subclusters with only one member each.)
The standard JGroups configurations in JBoss AS include the MERGE2 protocol, whose purpose is to ensure that once the communication problem is resolved any subclusters become aware of each other and once again form a single group. When two or more subclusters form a single cluster, this is referred to as a "cluster merge".
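The split and the later merge can be pictured with a small toy model. This is only a sketch and does not use JGroups: the class SplitBrainDemo, its split and merge methods, and the idea of numbering network segments are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of cluster views; hypothetical, not JGroups code. */
final class SplitBrainDemo {

    // Splits the full membership into one view per network segment.
    // segmentOf maps each node name to the segment it can still reach.
    static List<List<String>> split(List<String> members, Map<String, Integer> segmentOf) {
        Map<Integer, List<String>> views = new LinkedHashMap<>();
        for (String node : members) {
            views.computeIfAbsent(segmentOf.get(node), k -> new ArrayList<>()).add(node);
        }
        return new ArrayList<>(views.values());
    }

    // Models a cluster merge: the subcluster views are combined into one view.
    static List<String> merge(List<List<String>> subclusters) {
        List<String> merged = new ArrayList<>();
        for (List<String> view : subclusters) {
            for (String node : view) {
                if (!merged.contains(node)) {
                    merged.add(node);
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList("A", "B", "C", "D");
        Map<String, Integer> segmentOf = new HashMap<>();
        segmentOf.put("A", 1);
        segmentOf.put("B", 1);
        segmentOf.put("C", 2);
        segmentOf.put("D", 2);

        List<List<String>> subclusters = split(members, segmentOf);
        System.out.println("During partition: " + subclusters);
        System.out.println("After merge: " + merge(subclusters));
    }
}
```

Note that the model only combines membership lists. Like JGroups itself, it knows nothing about any application state held by the nodes.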
An important thing to understand about a cluster merge is that JGroups has no way of understanding the internal state of the application that uses it. Therefore, when a merge occurs, JGroups cannot by itself integrate any application state that may have become inconsistent between the subclusters. JGroups provides a callback via its MembershipListener interface to notify an interested application that a merge has occurred; it's then up to the application to decide what to do.
Effect on an HASingleton
For an HASingleton service, the effect of "split-brain" syndrome will be that as long as the condition lasts, there will be more than one node providing the service. Each subcluster will have a "master" that will provide the service. When a cluster merge occurs, all but one of these nodes will stop providing the service, and the surviving master will carry on providing the service undisturbed.
This approach to merging is adequate for services that do not maintain in-memory state, or whose in-memory state wouldn't become inconsistent while there is more than one node providing the service. But it's not adequate when the surviving master's in-memory state needs to be updated to reflect any activity performed by the other masters (see http://jira.jboss.com/jira/browse/JBAS-4229).
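To make the inconsistency concrete, here is a minimal sketch of a singleton that keeps its state only in memory. The JobTracker class and the job ids are hypothetical and stand in for any stateful service; nothing here is JBoss AS code.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrates how two masters drift apart during a split; hypothetical code. */
final class DivergentStateDemo {

    // A stand-in singleton that remembers handled job ids in memory only.
    static final class JobTracker {
        final Set<String> handled = new HashSet<>();

        void handle(String jobId) {
            handled.add(jobId);
        }
    }

    public static void main(String[] args) {
        // During the split each subcluster has its own master.
        JobTracker masterA = new JobTracker();
        JobTracker masterC = new JobTracker();
        masterA.handle("job-1");
        masterC.handle("job-2");

        // After the merge only masterA keeps running. Its in-memory state
        // was never told about the work done by masterC.
        System.out.println("Survivor knows job-2? " + masterA.handled.contains("job-2"));
    }
}
```

If the surviving master simply carries on, the work recorded by the other master is invisible to it, which is the situation the JIRA issue above describes.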
To help an HASingleton better deal with cluster merges, beginning in JBoss AS 4.2.0.GA the core class used for managing singletons, org.jboss.ha.singleton.HASingletonSupport, is provided with information that makes it aware when a cluster merge has occurred. When a merge occurs, if a node is the surviving master, HASingletonSupport can optionally restart the singleton service. The assumption here is that when the service is started, it will properly configure its in-memory state.
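The decision described above can be sketched roughly as follows. This is not the HASingletonSupport source: the class RestartOnMergeSketch, the method onViewChange and the counters are hypothetical names, and the logic is a simplified reading of the behavior described in this article.

```java
/** Simplified, hypothetical model of the restart-on-merge decision. */
final class RestartOnMergeSketch {

    private boolean restartOnMerge = true; // mirrors the 4.2.0.GA default
    private boolean master;                // whether this node currently runs the service

    int startCount; // how often the service was started
    int stopCount;  // how often the service was stopped

    void setRestartOnMerge(boolean restartOnMerge) {
        this.restartOnMerge = restartOnMerge;
    }

    // Called on every membership change; merge is true when the change
    // was caused by subclusters merging.
    void onViewChange(boolean nowMaster, boolean merge) {
        boolean wasMaster = master;
        master = nowMaster;

        if (wasMaster && nowMaster) {
            // Surviving master: restart so that startup rebuilds in-memory state.
            if (merge && restartOnMerge) {
                stopSingleton();
                startSingleton();
            }
        } else if (nowMaster) {
            startSingleton(); // newly elected master
        } else if (wasMaster) {
            stopSingleton();  // lost mastership
        }
    }

    private void startSingleton() {
        startCount++;
    }

    private void stopSingleton() {
        stopCount++;
    }
}
```

With the flag set to false, a surviving master ignores the merge and keeps running, which matches the pre-4.2.0 behavior described earlier.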
Any MBean based on HASingletonSupport can now configure whether this service restart should occur upon a cluster merge. The MBean descriptor would include the following:
<mbean code="....
   <attribute name="RestartOnMerge">false</attribute>
   ...
</mbean>
NOTE: In 4.2.0.GA and later, the default value for RestartOnMerge is true. This means any HASingleton will by default restart if a cluster merge occurs.
Impact on the deploy-hasingleton directory and the HASingleton Barrier controller
By default, RestartOnMerge is set to true in the deploy-hasingleton-service.xml file. This means that following a cluster merge, all contents of the deploy-hasingleton directory will be undeployed and redeployed, and any beans that depend on jboss.ha:service=HASingletonDeployer,type=Barrier will be stopped and restarted.
This redeploy of deploy-hasingleton content is needed for proper functioning of HA-JBossMQ.
If you don't want some of your content redeployed, the best solution is to create differently named duplicate versions of the beans in deploy-hasingleton-service.xml and create a duplicate of the deploy-hasingleton directory. Have the duplicate version of the jboss.ha:service=HASingletonDeployer bean use your new directory as the argument to TargetStartMethod and TargetStopMethod. On the duplicate bean, set RestartOnMerge to false.
Using RestartOnMerge in JBoss AS 4.0.x
The RestartOnMerge functionality described above has been ported to the 4.0.x branch for inclusion in any 4.2.0 or later release. It will also be included in the Cumulative Patch (CP) releases that subscription customers receive: the April 2007 CP for AS 4.0.3.SP1 and the May CPs for 4.0.4 and 4.0.5.
However, for the 4.0.x releases, using true as the default value for RestartOnMerge was considered too large a change for a micro release, and certainly too big a change for a CP. In addition, the getter/setter methods for the RestartOnMerge property were not added to HASingletonControllerMBean, on the slight chance that someone has created an independent implementation of that interface. Instead, a new interface, RestartOnMergeHASingletonController, which extends HASingletonControllerMBean, was added to expose the getter/setter, and a trivial RestartOnMergeHASingletonController class was added that exposes the new interface.
In 4.0.x, to take advantage of the RestartOnMerge behavior, the configuration of any HASingletonController needs to be changed to set the RestartOnMerge attribute to true. Users may have multiple instances of HASingletonController deployed; the one that is there by default is in the server/all/deploy/deploy-hasingleton-service.xml file. To update that file, change the existing configuration from:
<mbean code="org.jboss.ha.singleton.HASingletonController" name="jboss.ha:service=HASingletonDeployer">
   ...
</mbean>
to:
<mbean code="org.jboss.ha.singleton.RestartOnMergeHASingletonController" name="jboss.ha:service=HASingletonDeployer">
   <attribute name="RestartOnMerge">true</attribute>
   ...
</mbean>
Making this change is definitely recommended if HA-JBossMQ is used.
An equivalent change can be made to any other HASingletonController deployment.