mod_cluster, hot standby, fail-over and failback
evgeniy-khist Dec 4, 2015 7:41 AM

Hello,
There is a legacy application based on the Spring Framework that works as an HA singleton. The cluster singleton implementation is custom.
There is an active server (SERVER-1) and a hot standby server (SERVER-2).
SERVER-2 (the hot standby) returns an HTTP 404 error for all requests.
This behavior is incompatible with mod_cluster: SERVER-2 reports an OK status to the Apache httpd reverse proxy but is actually unable to service requests, so the proxy sends half of the requests to SERVER-2, which results in 404 errors on the client side.
mod_cluster offers a "failonstatus" feature: it is possible to list the HTTP error codes that should trigger a failover.
If mod_cluster is configured with failonstatus=404, it will disable SERVER-2 for "some time" and redirect all requests to SERVER-1.
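For illustration, a minimal httpd sketch of this mechanism (plain mod_proxy_balancer shown; the balancer name, hosts, and path are placeholders, and a real mod_cluster setup registers workers dynamically via MCPM instead of static BalancerMember lines):

<Proxy "balancer://appcluster">
    # retry=60 is the "some time" after which a failed worker is re-enabled
    BalancerMember "http://server1:8080" retry=60
    BalancerMember "http://server2:8080" retry=60
</Proxy>
# failonstatus=404 puts a worker into the error state
# as soon as it answers a request with HTTP 404
ProxyPass "/app" "balancer://appcluster" failonstatus=404
ProxyPassReverse "/app" "balancer://appcluster"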
However, this does not provide an acceptable solution, because:
1) the first request that hits SERVER-2 cannot be recovered and will result in a 404 error on the client side,
2) after "some time", SERVER-2 will be re-enabled and will make some requests fail again.
There is no known solution to this issue (http://serverfault.com/questions/414024/apache-httpd-workers-retry)
The general way to make a server a hot standby is to use <simple-load-provider factor="0"/>.
WildFly 9.0.1.Final standalone.xml:
<mod-cluster-config proxies="modcluster-proxy-main modcluster-proxy-backup"
                    balancer="${jboss.modcluster.balancer}"
                    advertise="false" sticky-session="true" ping="300"
                    load-balancing-group="${jboss.modcluster.balancer}"
                    connector="default">
    <simple-load-provider factor="${modcluster.lbfactor}"/>
</mod-cluster-config>
For the active server (SERVER-1): modcluster.lbfactor=1
For the hot standby server (SERVER-2): modcluster.lbfactor=0
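For reference, one way those per-server values can be supplied is via <system-properties> in each server's standalone.xml (or with -Dmodcluster.lbfactor=... on the command line); the snippet below is a sketch, not the actual configuration:

<!-- SERVER-1 (active) standalone.xml -->
<system-properties>
    <property name="modcluster.lbfactor" value="1"/>
</system-properties>

<!-- SERVER-2 (hot standby) standalone.xml -->
<system-properties>
    <property name="modcluster.lbfactor" value="0"/>
</system-properties>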
It handles fail-over perfectly, but it is static: failback, or a second fail-over, is not supported.
When SERVER-1 crashes, fail-over occurs and SERVER-2 starts serving requests. But when SERVER-1 is started again, it cannot serve requests, because the HA singleton is already running on SERVER-2. Since SERVER-1 has factor=1, mod_cluster redirects all requests to it, and they all result in HTTP 404 errors, because after the fail-over the HA singleton service is running on SERVER-2. So failback is not supported by this approach.
Another solution: in order to implement fail-over, it is possible to design a "custom load metric" for mod_cluster.
This load metric will report a minimal load (1) when the server is the active node and a full load (100) when it is the passive node.
As a result, the reverse proxy will redirect all requests to the active node.
Snowdrop is used to access the JBoss MBean server (http://docs.jboss.org/snowdrop/4.0.0.Final-docs/SnowdropGuide.html#_accessing_the_default_jboss_mbean_server).
A Spring JMX integration bean is used to expose the load metric in the JBoss MBean server (http://docs.spring.io/autorepo/docs/spring/3.1.x/spring-framework-reference/html/jmx.html).
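A minimal Spring XML sketch of that export, assuming annotation-based metadata; the bean ids and the class name com.example.ModClusterMasterSlaveLoadMetric are illustrative, only the JMX ObjectName has to match the mod_cluster configuration below:

<bean id="mbeanExporter" class="org.springframework.jmx.export.MBeanExporter">
    <!-- the "server" property can additionally be pointed at the JBoss
         MBean server obtained via Snowdrop -->
    <property name="assembler">
        <bean class="org.springframework.jmx.export.assembler.MetadataMBeanInfoAssembler">
            <property name="attributeSource">
                <bean class="org.springframework.jmx.export.annotation.AnnotationJmxAttributeSource"/>
            </property>
        </bean>
    </property>
    <property name="beans">
        <map>
            <entry key="example:name=modClusterMasterSlaveLoadMetric"
                   value-ref="loadMetric"/>
        </map>
    </property>
</bean>
<bean id="loadMetric" class="com.example.ModClusterMasterSlaveLoadMetric"/>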
WildFly 9.0.1.Final standalone.xml:
<subsystem xmlns="urn:jboss:domain:modcluster:2.0">
    <mod-cluster-config proxies="modcluster-proxy-main modcluster-proxy-backup"
                        balancer="${jboss.modcluster.balancer}"
                        advertise="false" sticky-session="true" ping="300"
                        load-balancing-group="${jboss.modcluster.balancer}"
                        connector="default">
        <dynamic-load-provider history="0">
            <custom-load-metric class="org.jboss.modcluster.load.metric.impl.MBeanAttributeLoadMetric">
                <property name="pattern" value="example:name=modClusterMasterSlaveLoadMetric"/>
                <property name="attribute" value="MasterSlaveLoad"/>
            </custom-load-metric>
        </dynamic-load-provider>
    </mod-cluster-config>
</subsystem>
modClusterMasterSlaveLoadMetric is an MBean exported by the application at startup.
It is a simple Java class with a method returning an int load value:
/**
 * @return a load of 1 if the node is the master, and 100 otherwise.
 */
@ManagedAttribute(description = "The load is 1 if the node is the master, and 100 otherwise")
public int getMasterSlaveLoad() {
    if (clusteredSingletonRunner.isMaster()) {
        logger.debug("ModClusterMasterSlaveLoadMetric reports the node as being in master mode");
        return 1;
    } else {
        logger.debug("ModClusterMasterSlaveLoadMetric reports the node as being in slave mode (stand-by)");
        return 100;
    }
}
It does not work, possibly for two reasons:
1) a race condition within mod_cluster: at startup, mod_cluster tries to get the status of the cluster before the application is fully deployed (a possible guard against this is sketched after this list),
2) for some reason, mod_cluster redirects to the wrong node for about 1% of the requests.
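If the first reason is the culprit, a possible guard (a sketch only, assuming the clusteredSingletonRunner API from the snippet above; isElectionComplete() is a hypothetical flag, not part of the original code) would be to report the stand-by load until the singleton election has completed, so that an early mod_cluster query never sees a half-initialized node:

@ManagedAttribute(description = "1 if the node is the elected master, 100 otherwise")
public int getMasterSlaveLoad() {
    // Until the custom singleton election has finished, behave like a
    // stand-by node, so that an early STATUS query from mod_cluster cannot
    // route traffic to a node that is not yet able to serve requests.
    if (!clusteredSingletonRunner.isElectionComplete()) { // hypothetical flag
        return 100;
    }
    return clusteredSingletonRunner.isMaster() ? 1 : 100;
}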
Please advise a solution that supports both fail-over and failback.
Thanks in advance.