Failure Detection (FD) protocols


    There are essentially 2 configurations for failure detection (FD) in JGroups: FD and FD_SOCK.




    • Based on heartbeats (are-you-alive messages).

    • Each member Q periodically sends an are-you-alive message to its neighbor P.

      • If P doesn't respond within some time, P will be suspected.

    • Q sends a SUSPECT message to the group

    • Only the current coordinator handles the SUSPECT message.

    Usually it double-checks whether P is really dead

    • If still no response from P, the coordinator will exclude P and install a new view (without P)

    (Note that Q does not exclude P, it merely suspects P)


    Configuration of FD

    • timeout: number of milliseconds to wait for a response to are-you-alive

    • max_tries: number of missed are-you-alive responses from P until P is suspected

    • shun: a member will be shunned (see section on shunning)


    Note that regular traffic from P counts as if was a heartbeat




    • Failure detection based on TCP connections between 2 members

    • In membership {A,B,C}, A connects to B, B connects to C, and C connects back to A

    • A SUSPECT message is sent to the group only when the TCP connection is closed abnormally (ie. a crash)

    • If a member is about to leave gracefully, it lets its pinger know, so that no SUSPECT message is sent



    Configuration of FD_SOCK

    • srv_sock_bind_addr (or -Dbind.address): interface to which the server socket should bind to




    Difference between FD and FD_SOCK


    • FD

      • Overloaded machine might be slow in sending are-you-alive responses

      • False suspicions

      • Member will be suspected when suspended in a debugger/profiler

      • Low timeouts lead to higher probability of false suspicions

      • High timeouts will not detect and remove crashed members for some time


    • FD_SOCK

      • Suspended in a debugger is no problem

      • High load no problem either

      • Members will only be suspected when TCP connection breaks

        • So hung members will not be detected

        • Also, a crashed switch will not be detected until the connection runs into the TCP timeout (between 2-20 minutes, depending on TCP/IP stack implementation)



    Enabling TCP KEEP_ALIVE

    This was enabled by default in JGroups 2.3. If enabled, TCP sends a heartbeat on socket on which no traffic has been received

    in 2 hours. So, for FD_SOCK, if a host crashed (or an intermediate switch or router crashed) without closing the

    TCP connection properly, we would detect this after 2 hours (plus a few minutes). This is of course better than never closing the connection (if KEEP_ALIVE is off), but may not be of much help.


    There are 2 solutions: the first one is to lower the timeout value for KEEP_ALIVE. This can only be done for the entire kernel in most operating systems, so if this is lowered to 15 minutes, this will affect all TCP sockets.


    The second solution is to combine FD_SOCK and FD; the timeout in FD can be set such that it is much lower than the TCP timeout, and this can be configured individually per process. In most (regular) cases, FD_SOCK will generate the suspect message because sockets are usually closed normally. However, in the case of a crashed switch or host, FD will make sure the socket is eventually closed and the suspect message generated.



      <FD_SOCK down_thread="false" up_thread="false"></FD_SOCK>
      <FD timeout="10000" max_tries="5" shun="true" down_thread="false" up_thread="false" ></FD>


    This suspects a member when the socket to the neighbor has been closed abonormally (e.g. process crash, because the OS closes all sockets). However, f a host or switch crashes, then the sockets won't be closed, therefore, as a seond line of defense, FD will suspect the neighbor after 50 seconds. Note that with this example, if you have your system stopped in a breakpoint in the debugger, the node you're debugging will be suspected after ca 50 seconds.


    The JIRA issue regarding KEEP_ALIVE is http://jira.jboss.com/jira/browse/JGRP-195.



    Cause of missing heartbeats in FD


    Sometimes a member is suspected by FD because a hearbeat ack has not been received for some time T (defined by timeout and max_tries). This can have multiple reasons, e.g. in a cluster of A,B,C,D; C can be suspected if (note that A pings B, B pings C, C pings D and D pings A):

    • B or C are running at 100% CPU for more than T seconds. So even if C sends a heartbeat ack to B, B may not be able to process it because it is at 100%

    • B or C and garbage collecting, same as above.

    • A combination of the 2 cases above

    • The network loses packets. This usually happens when there is a lot of traffic on the network, and the switch starts dropping packets (usually broadcasts first, then IP multicasts, TCP packets last).

    • B or C are processing a callback. Let's say C received a remote method call (e.g. via RpcDispatcher), and takes T+1 seconds to process it. During this time, C will not process any other messages, including heartbeats, and therfore B will not receive the heartbeat ack and suspect C. This will change in JGroups 2.5 with the threadless stack, out-of-band messages and priority messages. As a workaround for the time being, consider running long tasks in a callback on a separate thread