I've got a scenario where one of the nodes in my cluster fails. Specifically, it fails when it runs out of heap space. Obviously that needs to be solved, but it was a perfect opportunity for mod_cluster to handle a node failure, and my installation/configuration fell flat on its face. As soon as the "bad node" failed, it poisoned load balancing for the entire cluster, so that no requests succeeded. I'll share my current configuration below. I've got some additional ideas to try out as well, but I'd be interested to hear whether anyone else has seen a similar situation, and even more interested in how you solved it.
- mod_cluster 1.0.3.GA on load balancer and on 2 JBoss.org 5.1 nodes.
- Using HAModClusterService. Note that the "bad node" was hosting the elected HA singleton.
- Using DynamicLoadBalanceFactorProvider
I plan on trying a few things and then forcefully reproducing the problem.
- Switch to the non-HA service configuration.
- Add the heap space usage load balancing factor.
- Add some other load balancing factor which would fail when the server's web services become unresponsive.
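For reference, the kind of figure a heap usage factor would be based on can be read from the standard JVM management API. This is only a minimal sketch of the measurement itself; the class name `HeapUsage` is made up for illustration, and it is not how mod_cluster's own heap metric is implemented.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not part of mod_cluster): reports how full the heap is.
public class HeapUsage {

    // Returns used heap as a fraction between 0.0 and 1.0.
    public static double usedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();
        // getMax() returns -1 when the maximum is undefined; fall back to committed.
        long limit = (max > 0) ? max : heap.getCommitted();
        if (limit <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) heap.getUsed() / (double) limit);
    }

    public static void main(String[] args) {
        System.out.println("Heap used fraction: " + usedFraction());
    }
}
```

A value close to 1.0 would be the warning sign in my scenario, although a node that is already out of heap may not be able to report anything reliably.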
I don't know that switching to a non-HA configuration will solve the problem, because in my scenario HA singleton fail-over does occur, but the original master node is so brain dead that it doesn't realize fail-over has happened. I don't know whether the load balancing factor providers keep running on the dead node, but at the very least MCMP commands continue to flow even though the node is unable to respond to web requests.
I can certainly avoid this specific problem with heap space, but what I'd really like is to ensure that the services I expect the nodes to provide are actually working, and to route traffic based on that before any other factor. Obviously the reason I'm using mod_cluster is that I want my JBoss nodes to serve HTTP, so that's the service I'd like to test as my primary load metric. The trick is either to create a new load balance factor provider or to see if I can use the JMX provider to query the ability of my server to respond to HTTP requests.
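To make the idea concrete, here is a rough sketch of the probe such a metric could be built on. It uses only the standard `HttpURLConnection` API. The class name `HttpHealthProbe`, the example URL, the timeout, and the mapping of the result onto a 0.0/1.0 load value are all my own assumptions; how the result would be plugged into a custom load metric or exposed over JMX is not shown and depends on the mod_cluster version.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical probe (not part of mod_cluster): checks whether a URL answers at all.
public class HttpHealthProbe {

    // Returns true if the URL answers with a 2xx or 3xx status within the timeout.
    public static boolean isResponsive(String url, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            int status = conn.getResponseCode();
            return status >= 200 && status < 400;
        } catch (IOException | IllegalArgumentException e) {
            // Refused connection, timeout, or malformed URL: treat as unresponsive.
            return false;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    // Assumed mapping for illustration only: 0.0 = healthy, 1.0 = fully loaded.
    public static double toLoad(boolean responsive) {
        return responsive ? 0.0 : 1.0;
    }

    public static void main(String[] args) {
        // Example URL is an assumption; pass the node's own connector address.
        String url = args.length > 0 ? args[0] : "http://localhost:8080/";
        boolean ok = isResponsive(url, 2000);
        System.out.println(url + " responsive=" + ok + " load=" + toLoad(ok));
    }
}
```

One open question with this approach is where the probe should run. If it runs inside the failing node, it is subject to the same heap exhaustion as everything else, so it may be safer to run it from a separate process or from another node.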