mod_cluster, hot standby, fail-over and failback
evgeniy-khist Dec 4, 2015 7:41 AM

Hello,
There is a legacy application based on the Spring Framework that works as an HA singleton. The cluster singleton implementation is custom.
There is an active server (SERVER-1) and a hot standby server (SERVER-2).
SERVER-2 (the hot standby) returns an HTTP 404 error for all requests.
This behavior is incompatible with mod_cluster: SERVER-2 reports an OK status to the Apache httpd reverse proxy but is actually unable to service requests, so the proxy sends half of the requests to SERVER-2, which results in 404 errors on the client side.
mod_cluster offers a "failonstatus" feature: it is possible to list the HTTP error codes that should trigger a failover.
If mod_cluster is configured with failonstatus=404, it will disable SERVER-2 for "some time" and redirect all requests to SERVER-1.
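For illustration, a minimal httpd sketch of this mechanism (plain mod_proxy_balancer shown; the balancer name, hosts, and path are placeholders, and a real mod_cluster setup registers workers dynamically via MCPM instead of static BalancerMember lines):

<Proxy "balancer://appcluster">
    # retry=60 is the "some time" after which a failed worker is re-enabled
    BalancerMember "http://server1:8080" retry=60
    BalancerMember "http://server2:8080" retry=60
</Proxy>
# failonstatus=404 puts a worker into the error state
# as soon as it answers a request with HTTP 404
ProxyPass "/app" "balancer://appcluster" failonstatus=404
ProxyPassReverse "/app" "balancer://appcluster"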
However, this does not provide an acceptable solution, because:
1) the first request that hits SERVER-2 cannot be recovered and will result in a 404 error on the client side,
2) after "some time", SERVER-2 will be re-enabled and will make some requests fail again.
There is no known solution to this issue (http://serverfault.com/questions/414024/apache-httpd-workers-retry)
The general way to make a server a hot standby is to use <simple-load-provider factor="0"/>.
WildFly 9.0.1.Final standalone.xml:
<mod-cluster-config proxies="modcluster-proxy-main modcluster-proxy-backup"
                    balancer="${jboss.modcluster.balancer}"
                    advertise="false" sticky-session="true" ping="300"
                    load-balancing-group="${jboss.modcluster.balancer}"
                    connector="default">
    <simple-load-provider factor="${modcluster.lbfactor}"/>
</mod-cluster-config>
For the active server (SERVER-1): modcluster.lbfactor=1
For the hot standby server (SERVER-2): modcluster.lbfactor=0
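For reference, one way those per-server values can be supplied is via <system-properties> in each server's standalone.xml (or with -Dmodcluster.lbfactor=... on the command line); the snippet below is a sketch, not the actual configuration:

<!-- SERVER-1 (active) standalone.xml -->
<system-properties>
    <property name="modcluster.lbfactor" value="1"/>
</system-properties>

<!-- SERVER-2 (hot standby) standalone.xml -->
<system-properties>
    <property name="modcluster.lbfactor" value="0"/>
</system-properties>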
It handles fail-over perfectly, but it is static: failback, or a second fail-over, is not supported.
When SERVER-1 crashes, fail-over occurs and SERVER-2 starts serving requests. But when SERVER-1 is started again, it cannot serve requests, because the HA singleton is already running on SERVER-2. Since SERVER-1 has factor=1, mod_cluster redirects all requests to it, and they all result in HTTP 404 errors, because after the fail-over the HA singleton service is running on SERVER-2. So failback is not supported by this approach.
Another solution: in order to implement fail-over, it is possible to design a "custom load metric" for mod_cluster.
This load metric will report a minimal load (1) when the server is the active node and a full load (100) when it is the passive node.
As a result, the reverse proxy will redirect all requests to the active node.
Snowdrop is used to access the JBoss MBean server (http://docs.jboss.org/snowdrop/4.0.0.Final-docs/SnowdropGuide.html#_accessing_the_default_jboss_mbean_server).
A Spring JMX integration bean is used to expose the load metric in the JBoss MBean server (http://docs.spring.io/autorepo/docs/spring/3.1.x/spring-framework-reference/html/jmx.html).
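A minimal Spring XML sketch of that export, assuming annotation-based metadata; the bean ids and the class name com.example.ModClusterMasterSlaveLoadMetric are illustrative, only the JMX ObjectName has to match the mod_cluster configuration below:

<bean id="mbeanExporter" class="org.springframework.jmx.export.MBeanExporter">
    <!-- the "server" property can additionally be pointed at the JBoss
         MBean server obtained via Snowdrop -->
    <property name="assembler">
        <bean class="org.springframework.jmx.export.assembler.MetadataMBeanInfoAssembler">
            <property name="attributeSource">
                <bean class="org.springframework.jmx.export.annotation.AnnotationJmxAttributeSource"/>
            </property>
        </bean>
    </property>
    <property name="beans">
        <map>
            <entry key="example:name=modClusterMasterSlaveLoadMetric"
                   value-ref="loadMetric"/>
        </map>
    </property>
</bean>
<bean id="loadMetric" class="com.example.ModClusterMasterSlaveLoadMetric"/>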
WildFly 9.0.1.Final standalone.xml:
<subsystem xmlns="urn:jboss:domain:modcluster:2.0">
    <mod-cluster-config proxies="modcluster-proxy-main modcluster-proxy-backup"
                        balancer="${jboss.modcluster.balancer}"
                        advertise="false" sticky-session="true" ping="300"
                        load-balancing-group="${jboss.modcluster.balancer}"
                        connector="default">
        <dynamic-load-provider history="0">
            <custom-load-metric class="org.jboss.modcluster.load.metric.impl.MBeanAttributeLoadMetric">
                <property name="pattern" value="example:name=modClusterMasterSlaveLoadMetric"/>
                <property name="attribute" value="MasterSlaveLoad"/>
            </custom-load-metric>
        </dynamic-load-provider>
    </mod-cluster-config>
</subsystem>
modClusterMasterSlaveLoadMetric is an MBean exported by the application at startup.
It is a simple Java class with a method returning an int load value:
/**
 * @return a load of 1 if the node is the master, and 100 otherwise.
 */
@ManagedAttribute(description = "The load is 1 if the node is the master, and 100 otherwise")
public int getMasterSlaveLoad() {
    if (clusteredSingletonRunner.isMaster()) {
        logger.debug("ModClusterMasterSlaveLoadMetric reports the node as being in master mode");
        return 1;
    } else {
        logger.debug("ModClusterMasterSlaveLoadMetric reports the node as being in slave mode (stand-by)");
        return 100;
    }
}
It does not work, possibly for two reasons:
1) a race condition within mod_cluster: at startup, mod_cluster tries to get the status of the cluster before the application is fully deployed (a possible guard against this is sketched after this list),
2) for some reason, mod_cluster redirects to the wrong node for about 1% of the requests.
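If the first reason is the culprit, a possible guard (a sketch only, assuming the clusteredSingletonRunner API from the snippet above; isElectionComplete() is a hypothetical flag, not part of the original code) would be to report the stand-by load until the singleton election has completed, so that an early mod_cluster query never sees a half-initialized node:

@ManagedAttribute(description = "1 if the node is the elected master, 100 otherwise")
public int getMasterSlaveLoad() {
    // Until the custom singleton election has finished, behave like a
    // stand-by node, so that an early STATUS query from mod_cluster cannot
    // route traffic to a node that is not yet able to serve requests.
    if (!clusteredSingletonRunner.isElectionComplete()) { // hypothetical flag
        return 100;
    }
    return clusteredSingletonRunner.isMaster() ? 1 : 100;
}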
Please advise a solution that supports both fail-over and failback.
Thanks in advance.