10 Replies - Latest reply on Apr 28, 2010 10:15 AM by pferraro

    mod_cluster detection of server failure

    akarl

      I've got a scenario where one of the nodes in my cluster fails.  Specifically, it fails when it runs out of heap space.  Obviously that needs to be solved, but this was a perfect opportunity for mod_cluster to handle a node failure, and my installation/configuration fell flat on its face.  What we saw is that as soon as the "bad node" failed, it poisoned the entire cluster's load balancing so that no requests were succeeding.  I'll share my current configuration below.  I've got some additional ideas to try out as well, but I'd be interested to hear whether anyone else has seen a similar situation, and even more interested in how you solved it.

       

      • mod_cluster 1.0.3.GA on the load balancer and on 2 JBoss.org 5.1 nodes.
      • Using HAModClusterService - note that the "bad node" held the elected HA singleton
      • Using DynamicLoadBalanceFactorProvider
        • AverageSystemLoadMetric
        • BusyConnectorsLoadMetric

       

      I plan to try a few things and then deliberately reproduce the problem.

      1. Switch to the non-HA service configuration.
      2. Add the heap space usage load balancing factor.
      3. Add some other load balancing factor which would fail when the server's web services become unresponsive.

       

       

      I don't know that switching to a non-HA configuration will solve the problem, because in my scenario HA singleton fail-over does occur, but the original master node is so brain-dead that it doesn't realize the fail-over has happened.  I don't know whether the load balancing factor providers are able to keep running on the dead node, but at the very least MCMP commands continue to flow even though the node is unable to respond to web requests.

       

      I can certainly avoid this specific heap space problem, but what I'd really like is to verify that the services I expect the nodes to provide are actually working, and to route traffic based on that before any other factor.  Obviously the reason I'm using mod_cluster is that I want my JBoss nodes to serve HTTP, so that's the service I'd like to test as my primary load metric.  The trick is either to create a new load balance factor provider or to see whether I can use the JMX provider to query my server's ability to respond to HTTP requests.
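      To make the idea concrete, here is a rough, self-contained sketch of the probing logic I have in mind.  This is plain Java, not the actual mod_cluster LoadMetric SPI (whose class and method names may differ between 1.0.x and 1.1.x), and the health URL, timeout, and 0-100 scale are just placeholder values.  Wrapped inside a custom load metric, something like this would report maximum load as soon as the node's own connector stops answering:

      import java.net.HttpURLConnection;
      import java.net.URL;

      /**
       * Hypothetical helper for a custom load metric: probes a local HTTP endpoint
       * and converts the result into a load value.  Endpoint, timeout, and the
       * 0-100 scale are illustrative only.
       */
      public class HttpSelfProbe {

          private static final String HEALTH_URL = "http://localhost:8080/ping"; // placeholder URL
          private static final int TIMEOUT_MS = 2000;                            // placeholder timeout

          /** Returns 0 (healthy/idle) when the probe succeeds, 100 (unusable) when it fails. */
          public int probeLoad() {
              HttpURLConnection conn = null;
              try {
                  conn = (HttpURLConnection) new URL(HEALTH_URL).openConnection();
                  conn.setConnectTimeout(TIMEOUT_MS);
                  conn.setReadTimeout(TIMEOUT_MS);
                  conn.setRequestMethod("GET");
                  int status = conn.getResponseCode();
                  // Any 2xx answer counts as "healthy"; anything else as "do not route here".
                  return (status >= 200 && status < 300) ? 0 : 100;
              } catch (Exception e) {
                  // Connection refused, timeouts, OOM-induced stalls, etc. -> report maximum load.
                  return 100;
              } finally {
                  if (conn != null) {
                      conn.disconnect();
                  }
              }
          }

          public static void main(String[] args) {
              System.out.println("current load estimate: " + new HttpSelfProbe().probeLoad());
          }
      }

      That way the load factor would track the service I actually care about - HTTP responsiveness - rather than only indirect measures like system load.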

        • 1. Re: mod_cluster detection of server failure
          brettcave

          This definitely seems not to work as expected.

           

          I am using 1.1.0.CR1, with 2 AS nodes added to 1 httpd.  I then kill one of them:

          kill -9 <pid of jboss>

           

          manager shows:

          Dead node: Balancer: mycluster,Domain: ,Flushpackets: Off,Flushwait: 10000,Ping: 10000000,Smax: 1,Ttl: 60000000,Status: NOTOK,Elected: 0,Read: 0,Transferred: 0,Connected: 0,Load: 100

          Alive node: Balancer: mycluster,Domain: ,Flushpackets: Off,Flushwait: 10000,Ping: 10000000,Smax: 1,Ttl: 60000000,Status: OK,Elected: 0,Read: 0,Transferred: 0,Connected: 0,Load: -1

           

          My session was initially on the dead node; the routing is not updated, and browsing via httpd results in 503 errors.

           

          If the active node is shut down cleanly (e.g. via the jboss shutdown script), it is successfully de-registered from mod_cluster on httpd.

          • 2. Re: mod_cluster detection of server failure
            brettcave

            The dead node was just removed from the config, after about 5 minutes.

             

            However, I am still routed to the dead node.  Going to adjust stickiness settings, as the backend servers are replicating sessions (force_stickiness off).

            • 3. Re: mod_cluster detection of server failure
              pferraro


              Do you see the following INFO message in the log of your master node?

              "Removing jvm route [...] from proxy [...] on behalf of crashed member: ..."

               

              This should not be the case - even if your HA singleton master node thinks the other node left the cluster (a false positive, due to its own OOM), this should not cause the healthy node to be removed from the load balancer, i.e. you should still be servicing requests on that node.

               

              I would expect to see both nodes continuing to service requests - although the requests routed to the OOM'ed master node will return 50x responses.  After a period of time, I would expect the OOM'ed master to be timed out of the healthy node's jgroups membership view, and the healthy node to assume master status.  That should trigger the removal of the old master node, since it would fail a PING MCMP command.

               

              Can you post any INFO+ messages in your log coming from either mod_cluster or jgroups?

               


              Since your application is clearly stressing your heap, I would encourage you to use a heavily weighted HeapMemoryUsageLoadMetric.
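              For context, that metric is essentially driven by how full the heap is relative to its maximum.  A minimal standalone sketch of the same measurement using the standard platform MemoryMXBean (only an illustration of the quantity involved, not the mod_cluster class itself):

              import java.lang.management.ManagementFactory;
              import java.lang.management.MemoryUsage;

              /**
               * Illustration of the measurement behind a heap-usage load metric:
               * current heap usage as a fraction of the maximum heap.
               */
              public class HeapUsageProbe {

                  /** Returns heap usage in the range 0.0 (empty) to 1.0 (at -Xmx). */
                  public static double heapUsageRatio() {
                      MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
                      // getMax() can be -1 if the maximum is undefined; fall back to the committed size.
                      long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
                      return (double) heap.getUsed() / max;
                  }

                  public static void main(String[] args) {
                      System.out.printf("heap usage: %.0f%%%n", heapUsageRatio() * 100);
                  }
              }

              Giving that metric a large weight relative to the others means a node whose heap is nearly exhausted advertises a correspondingly high load, so the proxy shifts traffic away from it before it falls over.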

              • 4. Re: mod_cluster detection of server failure
                pferraro

                Actually, I just saw that you are using version 1.0.3.GA.  In 1.0.x, the OOM'ed node won't be removed until the node itself is killed.

                Any chance you can upgrade to 1.1.0.CR1?

                • 5. Re: mod_cluster detection of server failure
                  pferraro

                  Brett Cave wrote:

                   

                  The dead node was just removed from the config, after about 5 minutes.

                   

                  However, I am still routed to the dead node.  Going to adjust stickiness settings, as the backend servers are replicating sessions (force_stickiness off).

                  If you're using mod_cluster's HAModClusterService and its HAPartition's jgroups stack uses FD_SOCK (it will by default), the removal should be fairly immediate - FD_SOCK detects a killed process as soon as its socket closes, rather than waiting for a heartbeat timeout.

                   

                  If you find that httpd is still routing requests to a node whose status is NOTOK, then please file a bug report.

                  • 6. Re: mod_cluster detection of server failure
                    brettcave

                    Will log a bug.

                     

                    I have tried 2 configuration sets for the listener: the 1st was all defaults, and the 2nd (configured in server.xml) is:

                     

                    advertise="false"
                    socketTimeout="5000"
                    stopContextTimeout="5"
                    maxAttempts="10"
                    stickySessionForce="false"
                    balancer="loadbalancer"
                    workerTimeout="30"
                    nodeTimeout="10"

                     

                     

                    The dead node stays registered with httpd for 5 minutes after being killed, with a status of NOTOK.

                     

                    If JBoss is shut down cleanly, it is removed within 10 seconds.  Enabled contexts are disabled as the associated application is shut down.

                     

                    The same behaviour takes place with stickySessionRemove set to true (although the session should be maintained on the client but routed to the new backend node).

                    • 7. Re: mod_cluster detection of server failure
                      brettcave
                      • 8. Re: mod_cluster detection of server failure
                        akarl

                        Going back to the logs...

                         

                        Master App Server Node

                        - No mod_cluster logging at the time of the problem, but ~15 minutes after the heap ran out I see the following:

                        "2010-04-21 08:20:33,592 INFO  [org.jboss.modcluster.mcmp.impl.DefaultMCMPHandler] IO error sending command STATUS to proxy 10.183.2.33:10000"

                        - No logging of "new cluster view"

                        - There was plenty of JGroups logging at the time of the problem.  It certainly detected the issue, but the master node was not able to establish a new cluster view.

                         

                        Non-Master App Server Node

                        - Nearly immediately following the heap exception, a "new cluster view" was logged, removing the old master.

                         

                        Load Balancer

                        - Errors don't begin until ~30 minutes after the issue.

                         

                         

                        I am in the process of updating to 1.1.0.CR1, which has some other bug fixes I was looking for anyway, and I'll re-test my scenario there.  I am also adding the heap space metric; however, in this case heap usage was high on both nodes because a background thread was holding more heap than expected on all nodes.  I'll update this thread with new test results once I get the chance to kill a node off.

                        • 9. Re: mod_cluster detection of server failure
                          brettcave

                          If I remember correctly, heap is one of the default metrics used for determining load in 1.1.0.CR1.

                          • 10. Re: mod_cluster detection of server failure
                            pferraro

                            Yes - by default the BusyConnectorsLoadMetric and HeapUsageLoadMetric are used with respective weight ratios of 2:3 and 1:3.
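                            To illustrate what those weights mean, here is a small standalone sketch of a weighted average of two normalized metric readings.  The real DynamicLoadBalanceFactorProvider does more than this (capacity normalization and, if I recall correctly, a decay/history of past readings), so treat it only as the core idea:

                            /**
                             * Simplified illustration of combining two load metrics with weights 2 and 1
                             * (the 2:3 / 1:3 ratios mentioned above).  Readings are normalized to 0.0-1.0,
                             * where 1.0 means fully loaded.  Not the actual mod_cluster implementation.
                             */
                            public class WeightedLoadExample {

                                static double combine(double busyConnectors, double heapUsage) {
                                    int busyWeight = 2; // default weight of the busy-connectors metric
                                    int heapWeight = 1; // default weight of the heap metric
                                    return (busyWeight * busyConnectors + heapWeight * heapUsage)
                                            / (busyWeight + heapWeight);
                                }

                                public static void main(String[] args) {
                                    // A node with 10% busy connectors but 95% heap usage:
                                    System.out.printf("combined load: %.2f%n", combine(0.10, 0.95)); // ~0.38
                                }
                            }

                            With the default 2:1 weighting, a nearly exhausted heap only moves the combined figure modestly, which is why weighting the heap metric more heavily makes sense for the OOM scenario described above.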