Failure Detection (FD) protocols
There are essentially 2 configurations for failure detection (FD) in JGroups: FD and FD_SOCK.
FD
Based on heartbeats (are-you-alive messages).
Each member Q periodically sends an are-you-alive message to its neighbor P.
If P doesn't respond within the configured time (see timeout and max_tries below), Q will suspect P.
Q sends a SUSPECT message to the group
Only the current coordinator handles the SUSPECT message.
Usually it double-checks whether P is really dead
If still no response from P, the coordinator will exclude P and install a new view (without P)
(Note that Q does not exclude P, it merely suspects P)
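The suspicion logic on Q's side can be sketched as a simple counter. This is only an illustration and not the JGroups implementation: the class HeartbeatMonitor and its method names are hypothetical, and the sketch assumes that P is suspected once a configurable number of consecutive are-you-alive messages have gone unanswered.

```java
/** Hypothetical sketch of how Q could track missed are-you-alive responses from P. */
final class HeartbeatMonitor {
    private final int maxTries; // missed responses tolerated before P is suspected
    private int missed;         // consecutive are-you-alive messages without a response

    HeartbeatMonitor(int maxTries) {
        if (maxTries < 1) {
            throw new IllegalArgumentException("maxTries must be >= 1");
        }
        this.maxTries = maxTries;
    }

    /** Called when P answered an are-you-alive message: start counting from zero again. */
    void onResponse() {
        missed = 0;
    }

    /** Called when the wait for a response timed out; returns true if P should now be suspected. */
    boolean onTimeout() {
        missed++;
        return missed >= maxTries;
    }
}
```

In this sketch, a true result from onTimeout() is the point at which Q would send the SUSPECT message; excluding P remains the coordinator's job.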
Configuration of FD
timeout: number of milliseconds to wait for a response to are-you-alive
max_tries: number of missed are-you-alive responses from P until P is suspected
shun: whether a member that has been excluded from the group but is still alive will be shunned (see section on shunning)
Note that regular traffic from P counts as if it were a heartbeat
FD_SOCK
Failure detection based on TCP connections between 2 members
In membership {A,B,C}, A connects to B, B connects to C, and C connects back to A
A SUSPECT message is sent to the group only when the TCP connection is closed abnormally (i.e. a crash)
If a member is about to leave gracefully, it lets its pinger know, so that no SUSPECT message is sent
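The ring of connections described above can be computed from the ordered membership list. The following is a minimal sketch under the assumption that each member connects to the next member in the list and the last one wraps around to the first; RingNeighbor and neighborOf are hypothetical names, not part of JGroups.

```java
import java.util.List;

/** Hypothetical helper that picks the member a given member would connect to. */
final class RingNeighbor {
    /** Returns the next member after 'self' in the list, wrapping around at the end. */
    static String neighborOf(List<String> members, String self) {
        int index = members.indexOf(self);
        if (index < 0) {
            throw new IllegalArgumentException(self + " is not a member");
        }
        return members.get((index + 1) % members.size());
    }
}
```

For the membership {A,B,C}, this yields B for A, C for B, and A for C. With a single member, the sketch returns the member itself, which a real implementation would have to treat as "nobody to monitor".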
Configuration of FD_SOCK
srv_sock_bind_addr (or -Dbind.address): interface to which the server socket should bind
Difference between FD and FD_SOCK
FD
Overloaded machine might be slow in sending are-you-alive responses
False suspicions
Member will be suspected when suspended in a debugger/profiler
Low timeouts lead to higher probability of false suspicions
High timeouts will not detect and remove crashed members for some time
FD_SOCK
Being suspended in a debugger is no problem
High load is no problem either
Members will only be suspected when TCP connection breaks
So hung members will not be detected
Also, a crashed switch will not be detected until the connection runs into the TCP timeout (between 2 and 20 minutes, depending on the TCP/IP stack implementation)
Enabling TCP KEEP_ALIVE
This was enabled by default in JGroups 2.3. If enabled, TCP sends a heartbeat on a socket on which no traffic has been received
in 2 hours. So, for FD_SOCK, if a host crashed (or an intermediate switch or router crashed) without closing the
TCP connection properly, we would detect this after 2 hours (plus a few minutes). This is of course better than never closing the connection (if KEEP_ALIVE is off), but may not be of much help.
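For reference, this is how keep-alive is switched on for a plain Java socket. The sketch uses only the standard java.net.Socket API; the class KeepAliveSketch and its methods are hypothetical names for illustration and do not show how JGroups itself sets the option. Note that setKeepAlive only turns the option on or off; the sketch does not change how long the operating system waits before probing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.Socket;

/** Hypothetical sketch: enabling SO_KEEPALIVE on a java.net.Socket. */
final class KeepAliveSketch {
    /** Turns on SO_KEEPALIVE for the given socket; the probe interval is still decided by the OS. */
    static void enableKeepAlive(Socket socket) throws IOException {
        socket.setKeepAlive(true);
    }

    /** Creates an unconnected socket, enables keep-alive and reports whether the option is set. */
    static boolean demo() {
        try (Socket socket = new Socket()) {
            enableKeepAlive(socket);
            return socket.getKeepAlive();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```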
There are 2 solutions: the first one is to lower the timeout value for KEEP_ALIVE. This can only be done for the entire kernel in most operating systems, so if this is lowered to 15 minutes, this will affect all TCP sockets.
The second solution is to combine FD_SOCK and FD; the timeout in FD can be set such that it is much lower than the TCP timeout, and this can be configured individually per process. In most (regular) cases, FD_SOCK will generate the suspect message, because the operating system closes the sockets of a crashed process. However, in the case of a crashed switch or host, the sockets are not closed, and FD will make sure that the suspect message is eventually generated.
Example:
<FD_SOCK down_thread="false" up_thread="false"></FD_SOCK>
<FD timeout="10000" max_tries="5" shun="true" down_thread="false" up_thread="false"></FD>
This suspects a member when the socket to the neighbor has been closed abnormally (e.g. process crash, because the OS closes all sockets). However, if a host or switch crashes, then the sockets won't be closed; therefore, as a second line of defense, FD will suspect the neighbor after 50 seconds (a timeout of 10000 ms times max_tries of 5). Note that with this example, if you have your system stopped at a breakpoint in the debugger, the node you're debugging will be suspected after about 50 seconds.
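The 50 seconds in this example are the product of the two FD settings. As a small worked sketch, assuming (as the text above states) that a member is suspected after max_tries consecutive waits of timeout milliseconds each; FdDetectionTime and worstCaseMillis are hypothetical names:

```java
/** Hypothetical helper for estimating how long FD needs to suspect a dead neighbor. */
final class FdDetectionTime {
    /** Returns timeoutMillis * maxTries, i.e. the time until suspicion in this simplified model. */
    static long worstCaseMillis(long timeoutMillis, int maxTries) {
        if (timeoutMillis <= 0 || maxTries <= 0) {
            throw new IllegalArgumentException("timeout and max_tries must be positive");
        }
        return timeoutMillis * maxTries;
    }
}
```

Calling worstCaseMillis(10000, 5) gives 50000 ms. The same calculation shows the trade-off: lowering either value detects crashes sooner, but also makes false suspicions under load or in a debugger more likely.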
The JIRA issue regarding KEEP_ALIVE is http://jira.jboss.com/jira/browse/JGRP-195.
Cause of missing heartbeats in FD
Sometimes a member is suspected by FD because a heartbeat ack has not been received for some time T (defined by timeout and max_tries). There can be several reasons for this, e.g. in a cluster of A,B,C,D; C can be suspected if (note that A pings B, B pings C, C pings D and D pings A):
B or C are running at 100% CPU for more than T seconds. So even if C sends a heartbeat ack to B, B may not be able to process it because it is at 100%
B or C are garbage collecting; same as above.
A combination of the 2 cases above
The network loses packets. This usually happens when there is a lot of traffic on the network, and the switch starts dropping packets (usually broadcasts first, then IP multicasts, TCP packets last).
B or C are processing a callback. Let's say C received a remote method call (e.g. via RpcDispatcher), and takes T+1 seconds to process it. During this time, C will not process any other messages, including heartbeats, and therefore B will not receive the heartbeat ack and will suspect C. This will change in JGroups 2.5 with the threadless stack, out-of-band messages and priority messages. As a workaround for the time being, consider running long tasks in a callback on a separate thread.
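The workaround of moving long tasks off the callback thread can be sketched with a standard java.util.concurrent thread pool. This is a generic illustration and does not use the JGroups API: OffloadingHandler, onRequest, process and join are hypothetical names, the pool size of 2 is arbitrary, and the sleep merely stands in for slow application work.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical callback handler that hands slow work to a worker pool. */
final class OffloadingHandler {
    // Worker pool for long-running tasks; the size is an arbitrary choice for this sketch.
    private final ExecutorService workers = Executors.newFixedThreadPool(2);

    /** Stand-in for a callback: it only submits the work and returns immediately. */
    Future<String> onRequest(String request) {
        Callable<String> task = () -> process(request);
        return workers.submit(task);
    }

    /** Placeholder for the slow work that would otherwise block the callback thread. */
    private String process(String request) throws InterruptedException {
        Thread.sleep(50);
        return "done: " + request;
    }

    /** Waits for a result and wraps checked exceptions; used by callers that need the value. */
    static String join(Future<String> future) {
        try {
            return future.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Stops accepting new tasks; tasks already submitted still run to completion. */
    void shutdown() {
        workers.shutdown();
    }
}
```

Because onRequest returns as soon as the task is queued, the thread that invoked the callback is free to handle further messages, including heartbeats, while the work runs elsewhere. The cost is that the result arrives asynchronously, so a caller that needs a reply has to be given it later.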