1 Reply Latest reply on May 7, 2013 5:17 AM by sannegrinovero

    Proper cluster behavior on machine failure

    moia

      In our environment we run an Infinispan cluster on two machines.

      This is a distributed, synchronous cluster (ISPN 5.1.6), and we access it over the Hot Rod protocol.

      Unfortunately, from time to time, one machine suffers from a hardware issue (which is being investigated, but I doubt there will be a prompt solution). When this happens, all processes on that machine become extremely slow, as they wait a very long time on IO operations. Because of this, most calls to the cluster end up with an

      org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key...

      exception.

       

      What I am looking for is a configuration where, on such an occasion, the nodes on the malfunctioning machine would be considered dead and removed from the cluster. Is this possible? Is this an Infinispan or rather a JGroups configuration issue?

      If not, I am considering setting up a node monitoring mechanism that would discover faulty nodes and kill them. Any suggestions on what to monitor to detect such a situation? I am thinking of JMX, but which MBean and which method would be best?

       

      Best regards,

      Mikolaj

        • 1. Re: Proper cluster behavior on machine failure
          sannegrinovero

          Hi,

          it is JGroups that controls which nodes are in or out of the group, through its failure detection protocols. The ones I'm aware of, however, don't trigger on such exceptions, as JGroups might not be aware of what's going on at the higher layer. You could extend one of the JGroups FD protocols, then catch these TimeoutExceptions, and if they occur more often than some threshold, grab a reference to your custom FD protocol and force it to kick the misbehaving node out of the group.
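
          The threshold part of that idea could be sketched roughly as below. This is only an illustration: TimeoutTracker is a hypothetical helper, not an Infinispan or JGroups class, and the threshold and window values are assumptions you would need to tune. It only decides when "too many" timeouts have happened; what you do next (for example, asking your custom FD protocol to suspect the node) is left to your own code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not part of Infinispan or JGroups): counts timeouts
 * inside a sliding time window and reports when a threshold is reached.
 */
class TimeoutTracker {
    private final int threshold;       // assumed value, tune for your cluster
    private final long windowMillis;   // assumed value, tune for your cluster
    private final Deque<Long> timestamps = new ArrayDeque<>();

    TimeoutTracker(int threshold, long windowMillis) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
    }

    /**
     * Records one timeout observed at nowMillis.
     * Returns true if at least 'threshold' timeouts fall within the window.
     */
    synchronized boolean recordTimeout(long nowMillis) {
        timestamps.addLast(nowMillis);
        // Drop entries that are older than the window.
        while (!timestamps.isEmpty() && nowMillis - timestamps.peekFirst() > windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size() >= threshold;
    }
}
```

          You would call recordTimeout(System.currentTimeMillis()) from the catch block that handles the TimeoutException, and react only when it returns true, so that a single slow operation does not get a healthy node removed.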

           

          Totally unrelated: assuming your "hardware problem" might be caused by very high stress, it might be useful to try upgrading to Infinispan 5.2.6.Final, as it performs significantly fewer lock operations and so might be more "gentle" on your hardware and avoid the problem altogether.