0 Replies Latest reply on Dec 4, 2015 7:41 AM by evgeniy-khist

    mod_cluster, hot standby, fail-over and failback

    evgeniy-khist

      Hello,

       

      We have a legacy application based on the Spring Framework that runs as an HA singleton. The cluster singleton implementation is custom.

      There is an active server (SERVER-1) and a hot standby server (SERVER-2).

      SERVER-2 (hot standby) returns an HTTP 404 error for all requests.

       

      This behavior is incompatible with mod_cluster: SERVER-2 reports an OK status to the Apache httpd reverse proxy but cannot actually service requests, so the proxy sends half of the requests to SERVER-2, and those requests fail with 404 errors on the client side.

       

      mod_cluster offers a "failonstatus" feature: it is possible to list the HTTP error codes that should trigger a failover.

      If mod_cluster is configured with failonstatus=404, it disables SERVER-2 for "some time" and redirects all requests to SERVER-1.

      However, this is not an acceptable solution because:

      1) the first request that hits SERVER-2 cannot be recovered and results in a 404 error on the client side,

      2) after "some time", SERVER-2 is re-enabled and again causes some requests to fail.

      There is no known solution to this issue (http://serverfault.com/questions/414024/apache-httpd-workers-retry)
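      The two drawbacks above can be illustrated with a toy model. It is a simplification, not mod_cluster or mod_proxy code: it assumes strict round-robin between the two nodes, assumes SERVER-2 answers every request with 404, and measures the "some time" disable period as a number of requests (the hypothetical `disabledFor` parameter).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of "failonstatus"-style temporary disabling; not real proxy code. */
final class FailOnStatusModel {

    /**
     * Sends {@code requests} requests round-robin to SERVER-1 (0) and SERVER-2 (1).
     * SERVER-2, the standby, always answers 404. After a 404 it is disabled for the
     * next {@code disabledFor} requests and then re-enabled.
     * Returns the indices of the requests whose 404 reached the client.
     */
    static List<Integer> clientVisible404s(int requests, int disabledFor) {
        List<Integer> failed = new ArrayList<>();
        int standbyDisabledUntil = -1; // last request index for which SERVER-2 is skipped
        int next = 0;                  // round-robin pointer: 0 = SERVER-1, 1 = SERVER-2
        for (int i = 0; i < requests; i++) {
            int node = next;
            next = 1 - next;
            if (node == 1 && i <= standbyDisabledUntil) {
                node = 0; // SERVER-2 is disabled, so the proxy uses SERVER-1 instead
            }
            if (node == 1) {
                // The 404 has already been returned: this request is lost (drawback 1).
                failed.add(i);
                standbyDisabledUntil = i + disabledFor;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println(clientVisible404s(20, 5)); // [1, 7, 13, 19]
    }
}
```

      Every time SERVER-2 is re-enabled in this model, exactly one more client request fails, which is drawback 2) in miniature; a longer disable period only spaces the failures out.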

       

      The usual way to make a server a hot standby is to use <simple-load-provider factor="0"/>.

       

      WildFly 9.0.1.Final standalone.xml:

      <mod-cluster-config proxies="modcluster-proxy-main modcluster-proxy-backup" balancer="${jboss.modcluster.balancer}" advertise="false" sticky-session="true" ping="300" load-balancing-group="${jboss.modcluster.balancer}" connector="default">
          <simple-load-provider factor="${modcluster.lbfactor}"/>
      </mod-cluster-config>
      
      

       

      For the active server (SERVER-1): modcluster.lbfactor=1
      For the hot standby server (SERVER-2): modcluster.lbfactor=0

       

      This handles fail-over perfectly, but the configuration is static: neither failback nor a second fail-over is supported.

       

      When SERVER-1 crashes, fail-over occurs and SERVER-2 starts serving requests. But when SERVER-1 is started again, it cannot serve requests, because the HA singleton is already running on SERVER-2. Since SERVER-1 has factor=1, mod_cluster routes all requests to it, and every one of them results in an HTTP 404 error. So this approach does not support failback.
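      The failback problem can be reproduced in a small toy model. It is not mod_cluster code: the routing rule (prefer any live node with a positive factor, use a factor-0 node only when nothing else is available) is a simplification of hot-standby behavior, and the rule that the custom singleton stays on SERVER-2 after SERVER-1 rejoins is taken from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of static lbfactor hot standby; not mod_cluster code. */
final class StaticFactorModel {
    // Static factors as configured: SERVER-1 = 1, SERVER-2 = 0 (hot standby).
    private final Map<String, Integer> factor = new LinkedHashMap<>();
    private final Map<String, Boolean> up = new LinkedHashMap<>();
    private String singletonHost; // node on which the HA singleton currently runs

    StaticFactorModel() {
        factor.put("SERVER-1", 1);
        factor.put("SERVER-2", 0);
        up.put("SERVER-1", true);
        up.put("SERVER-2", true);
        singletonHost = "SERVER-1";
    }

    /** Simplified routing: a live node with factor > 0 wins; otherwise fall back to a hot standby. */
    String route() {
        String standby = null;
        for (Map.Entry<String, Integer> e : factor.entrySet()) {
            if (!up.get(e.getKey())) continue;
            if (e.getValue() > 0) return e.getKey();
            if (standby == null) standby = e.getKey();
        }
        return standby; // null only if every node is down
    }

    void crash(String node) {
        up.put(node, false);
        if (node.equals(singletonHost)) {
            // The singleton migrates to the first surviving node.
            for (String n : up.keySet()) {
                if (up.get(n)) { singletonHost = n; break; }
            }
        }
    }

    void restart(String node) {
        up.put(node, true); // the singleton does NOT move back
    }

    /** Status a client sees: 200 only on the node that runs the singleton. */
    int request() {
        return route().equals(singletonHost) ? 200 : 404;
    }

    public static void main(String[] args) {
        StaticFactorModel m = new StaticFactorModel();
        System.out.println("normal:    " + m.request()); // 200, SERVER-1 serves
        m.crash("SERVER-1");
        System.out.println("fail-over: " + m.request()); // 200, SERVER-2 serves
        m.restart("SERVER-1");
        System.out.println("failback:  " + m.request()); // 404, routed to SERVER-1
    }
}
```

      In the model, the proxy's choice depends only on the static factors, while the singleton's location depends on history; once the two disagree, nothing brings them back in line.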

       

      Another option for implementing fail-over is to design a custom load metric for mod_cluster.

      This load metric will report a load of 0% when the server is the active node and 100% when the server is the passive node.

      As a result, the reverse proxy will redirect all requests to the active node.
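      The intended effect of such a metric can be sketched with a simplified model. Two assumptions are made here and are not taken from mod_cluster's source: a node reporting a load of L% gets a load factor of 100 - L, and the proxy splits traffic in proportion to the load factors. The real dynamic-load-provider also applies metric capacity, weight and history/decay smoothing, which this sketch ignores.

```java
/** Toy model: how a per-node load percentage could translate into a traffic share. */
final class LoadShareModel {

    /** ASSUMED mapping: a node reporting loadPercent gets a load factor of 100 - loadPercent. */
    static int loadFactor(int loadPercent) {
        int clamped = Math.max(0, Math.min(100, loadPercent));
        return 100 - clamped;
    }

    /** Fraction of traffic each node receives under purely proportional balancing. */
    static double[] shares(int... loadPercents) {
        double[] result = new double[loadPercents.length];
        int total = 0;
        for (int load : loadPercents) {
            total += loadFactor(load);
        }
        for (int i = 0; i < loadPercents.length; i++) {
            result[i] = total == 0 ? 0.0 : (double) loadFactor(loadPercents[i]) / total;
        }
        return result;
    }

    public static void main(String[] args) {
        // Active node reports 0 %, passive node reports 100 %: the passive node gets nothing.
        System.out.println(java.util.Arrays.toString(shares(0, 100)));
        // Passive node reports 99 % instead of 100 %: it still gets a small share.
        System.out.println(java.util.Arrays.toString(shares(0, 99)));
    }
}
```

      In this model only a factor of exactly 0 keeps the passive node out of rotation; a passive node reporting 99% instead of 100% still receives about 1% of the requests (1/101).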

       

      Snowdrop is used to access the JBoss MBean server (http://docs.jboss.org/snowdrop/4.0.0.Final-docs/SnowdropGuide.html#_accessing_the_default_jboss_mbean_server).

      A Spring JMX integration bean is used to expose the load metric in the JBoss MBean server (http://docs.spring.io/autorepo/docs/spring/3.1.x/spring-framework-reference/html/jmx.html).

       

      WildFly 9.0.1.Final standalone.xml:

      <subsystem xmlns="urn:jboss:domain:modcluster:2.0">
          <mod-cluster-config proxies="modcluster-proxy-main modcluster-proxy-backup" balancer="${jboss.modcluster.balancer}" advertise="false" sticky-session="true" ping="300" load-balancing-group="${jboss.modcluster.balancer}" connector="default">
              <dynamic-load-provider history="0">
                  <custom-load-metric class="org.jboss.modcluster.load.metric.impl.MBeanAttributeLoadMetric">
                      <property name="pattern" value="example:name=modClusterMasterSlaveLoadMetric"/>
                      <property name="attribute" value="MasterSlaveLoad"/>
                  </custom-load-metric>
              </dynamic-load-provider>
          </mod-cluster-config>
      </subsystem>
      
      

       

      modClusterMasterSlaveLoadMetric is an MBean exported by the application at startup.
      It is a simple Java class with a getter that returns an int depending on whether the node is the master:

      /**
       * 
       * @return a load of 1 if the node is the master, and 100 otherwise.
       */
      @ManagedAttribute(description = "The load is 1 if the node is the master, and 100 otherwise")
      public int getMasterSlaveLoad() {
          if (clusteredSingletonRunner.isMaster()) {
              logger.debug("ModClusterMasterSlaveLoadMetric reports the node as being in master mode");
              return 1;
          } else {
              logger.debug("ModClusterMasterSlaveLoadMetric reports the node as being in slave mode (stand-by)");
              return 100;
          }
      }
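      For comparison, here is the same metric as a plain JMX standard MBean, without Spring or Snowdrop, which makes the link between the configuration and the bean explicit: the ObjectName must match the `pattern` property, and the getter `getMasterSlaveLoad()` is what produces the `MasterSlaveLoad` attribute. The class names and the `AtomicBoolean` standing in for `clusteredSingletonRunner.isMaster()` are hypothetical, and `main` uses the JVM's platform MBean server; whether that is the server mod_cluster reads inside WildFly is an assumption.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicBoolean;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Plain-JMX sketch of the load-metric MBean; the Spring exporter is deliberately left out. */
final class MasterSlaveJmx {

    /** Standard MBean interface (must be public): getMasterSlaveLoad() exposes "MasterSlaveLoad". */
    public interface MasterSlaveLoadMetricMBean {
        int getMasterSlaveLoad();
    }

    /** Hypothetical stand-in for the Spring bean; the flag replaces clusteredSingletonRunner.isMaster(). */
    public static final class MasterSlaveLoadMetric implements MasterSlaveLoadMetricMBean {
        private final AtomicBoolean master;

        public MasterSlaveLoadMetric(AtomicBoolean master) {
            this.master = master;
        }

        @Override
        public int getMasterSlaveLoad() {
            return master.get() ? 1 : 100; // same values as the Spring getter
        }
    }

    /** Must match the "pattern" property of the custom-load-metric. */
    static final String OBJECT_NAME = "example:name=modClusterMasterSlaveLoadMetric";

    static ObjectName register(MBeanServer server, AtomicBoolean master) {
        try {
            ObjectName name = new ObjectName(OBJECT_NAME);
            server.registerMBean(new MasterSlaveLoadMetric(master), name);
            return name;
        } catch (JMException e) {
            throw new IllegalStateException("MBean registration failed", e);
        }
    }

    /** Reads the value by attribute name, as a generic JMX client would. */
    static Object readLoad(MBeanServer server, ObjectName name) {
        try {
            return server.getAttribute(name, "MasterSlaveLoad");
        } catch (JMException e) {
            throw new IllegalStateException("attribute read failed", e);
        }
    }

    public static void main(String[] args) {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        AtomicBoolean master = new AtomicBoolean(true);
        ObjectName name = register(server, master);
        System.out.println(readLoad(server, name)); // 1
        master.set(false);
        System.out.println(readLoad(server, name)); // 100
    }
}
```

      Reading the attribute back through the MBeanServer with the exact name `MasterSlaveLoad` is a quick way to check that the exported bean matches what the custom-load-metric configuration expects.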
      
      

       

      It does not work, possibly for two reasons:
      1) a race condition within mod_cluster: at startup, mod_cluster tries to get the status of the cluster before the application is fully deployed,
      2) for some reason, mod_cluster redirects 1% of the requests to the wrong node.
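      Reason 1) can be made concrete with a toy example. The reader below is hypothetical and is not mod_cluster's implementation: it sums the attribute over all MBeans whose name matches the configured pattern and, as an unverified assumption, returns 0 when nothing matches. Under that assumption, a standby node that has not yet registered its MBean would report a load of 0, i.e. it would look like the active node.

```java
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

/** Toy illustration of the startup race; the reader below is hypothetical, not mod_cluster code. */
final class StartupRaceDemo {

    public interface StandbyLoadMBean {
        int getMasterSlaveLoad();
    }

    public static final class StandbyLoad implements StandbyLoadMBean {
        @Override
        public int getMasterSlaveLoad() {
            return 100; // this node is the hot standby
        }
    }

    /**
     * Hypothetical reader: sums "MasterSlaveLoad" over every MBean matching the pattern.
     * ASSUMPTION: when nothing is registered yet, the sum is simply 0.
     */
    static int readLoad(MBeanServer server, String pattern) {
        try {
            int sum = 0;
            for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
                sum += (Integer) server.getAttribute(name, "MasterSlaveLoad");
            }
            return sum;
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Simulates the application finishing deployment and exporting its MBean. */
    static void deploy(MBeanServer server, String pattern) {
        try {
            server.registerMBean(new StandbyLoad(), new ObjectName(pattern));
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String pattern = "example:name=modClusterMasterSlaveLoadMetric";
        MBeanServer server = MBeanServerFactory.newMBeanServer();
        System.out.println("before deployment: " + readLoad(server, pattern)); // 0
        deploy(server, pattern);
        System.out.println("after deployment:  " + readLoad(server, pattern)); // 100
    }
}
```

      If the real metric behaved like this reader during deployment, the proxy would treat the standby as active until the application finished deploying; whether it does is exactly what reason 1) suspects.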

       

      Please advise a solution that supports both fail-over and failback.

       

      Thanks in advance.