Definition
Failure detection based on heartbeat messages. A member sends 'are-you-alive' messages with a periodicity of 'timeout' milliseconds. After the first missing heartbeat response, the initiating member sends 'max_tries' more heartbeat messages, and the target member is declared suspect only after all of them go unanswered.
In the worst case, when the target member dies immediately after answering a heartbeat, the failure takes timeout + timeout + max_tries * timeout = (max_tries + 2) * timeout milliseconds to detect.
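As a quick check of the arithmetic above, here is a minimal sketch that evaluates the worst-case formula. The class and method names are hypothetical and not part of JGroups; the sample values (timeout=2000, max_tries=3) are the ones used in the configuration example on this page.

```java
// Hypothetical helper, not part of JGroups: evaluates the worst-case
// detection time (max_tries + 2) * timeout described in the text.
final class FdWorstCase {

    // timeoutMillis and maxTries correspond to the FD 'timeout' and 'max_tries' parameters.
    static long worstCaseDetectionMillis(long timeoutMillis, int maxTries) {
        return (maxTries + 2L) * timeoutMillis;
    }

    public static void main(String[] args) {
        // timeout=2000, max_tries=3 gives (3 + 2) * 2000 = 10000
        System.out.println(worstCaseDetectionMillis(2000, 3) + " ms");
    }
}
```

With those values, a member that dies right after answering a heartbeat could go undetected for up to 10 seconds.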
Once a member is declared suspected it will be excluded by GMS. SUSPECT event handling is also subject to interaction with VERIFY_SUSPECT. If we use FD_SOCK instead, then we don't send heartbeats, but establish TCP sockets and declare a member dead only when a socket is closed.
Configuration Example
<FD timeout="2000" max_tries="3" shun="true"></FD>
Configuration Parameters
Name | Description |
---|---|
id | Give the protocol a different ID if needed so we can have multiple instances of it in the same stack |
level | Sets the logger level (see javadocs) |
max_tries | Number of times to send an are-you-alive message |
name | Give the protocol a different name if needed so we can have multiple instances of it in the same stack |
stats | Determines whether to collect statistics (and expose them via JMX). Default is true |
timeout | Timeout to suspect a node P if neither a heartbeat nor data were received from P. Default is 3000 msec |
See also Protocol Configuration Common Parameters.
Advanced
Each member sends a message containing a "FD" - HEARTBEAT header to its neighbor to the right (identified by the address). The heartbeats are sent by an inner class of the protocol.
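The "neighbor to the right" can be pictured as the next entry in the ordered membership list, wrapping around at the end. The following is only an illustrative sketch under that assumption; `RingNeighbor` and `rightNeighbor` are hypothetical names, and members are represented as plain strings rather than JGroups addresses.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration, not JGroups code: picks the member each node pings.
final class RingNeighbor {

    // Returns the member after 'self' in the list, wrapping around to the first one.
    // Empty if 'self' is not a member or there is nobody else to ping.
    static <T> Optional<T> rightNeighbor(List<T> members, T self) {
        int index = members.indexOf(self);
        if (index < 0 || members.size() < 2) {
            return Optional.empty();
        }
        return Optional.of(members.get((index + 1) % members.size()));
    }

    public static void main(String[] args) {
        List<String> members = List.of("A", "B", "C", "D");
        for (String member : members) {
            System.out.println(member + " pings " + rightNeighbor(members, member).orElse("nobody"));
        }
    }
}
```

For a cluster of A, B, C, D this yields A pings B, B pings C, C pings D and D pings A.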
When the neighbor receives the HEARTBEAT, it replies with a message containing a "FD" - HEARTBEAT_ACK header. The first member watches for "FD" - HEARTBEAT_ACK replies from its neighbor. For each received reply, it resets the timestamp (sets it to the current time) and the counter (sets it to 0).
The same instance that sends heartbeats watches the difference between the current time and the timestamp of the last reply. If this difference grows beyond the timeout, it cycles several more times (until max_tries is reached) and then sends a SUSPECT message for the neighbor's address. The SUSPECT message is sent down the stack, is addressed to all members, and is sent as a regular message whose header marks it as a SUSPECT message.
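The timestamp and counter bookkeeping described above can be sketched as a small state machine. This is a simplified model under stated assumptions, not the FD source code: `HeartbeatMonitor`, `onHeartbeatAck` and `tick` are hypothetical names, time is passed in explicitly so the logic stays deterministic, and `tick` is assumed to be called once per heartbeat interval.

```java
// Simplified, hypothetical model of the monitoring logic; not the JGroups implementation.
final class HeartbeatMonitor {
    private final long timeoutMillis; // corresponds to the 'timeout' parameter
    private final int maxTries;       // corresponds to the 'max_tries' parameter
    private long lastAckMillis;       // timestamp of the last HEARTBEAT_ACK
    private int tries;                // unanswered cycles counted so far

    HeartbeatMonitor(long timeoutMillis, int maxTries, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.maxTries = maxTries;
        this.lastAckMillis = startMillis;
    }

    // A reply arrived: reset the timestamp to the current time and the counter to 0.
    void onHeartbeatAck(long nowMillis) {
        lastAckMillis = nowMillis;
        tries = 0;
    }

    // One monitoring cycle. Returns true when the neighbor should be suspected.
    boolean tick(long nowMillis) {
        if (nowMillis - lastAckMillis <= timeoutMillis) {
            return false; // heard from the neighbor recently enough
        }
        if (tries >= maxTries) {
            return true;  // all extra attempts went unanswered
        }
        tries++;
        return false;
    }

    public static void main(String[] args) {
        // Neighbor answers at t=0 and then dies; one cycle every 2000 ms.
        HeartbeatMonitor monitor = new HeartbeatMonitor(2000, 3, 0);
        for (long now = 2000; ; now += 2000) {
            if (monitor.tick(now)) {
                System.out.println("suspected at t=" + now + " ms");
                break;
            }
        }
    }
}
```

In this model, with timeout=2000 and max_tries=3, the neighbor is suspected at t=10000 ms, which agrees with the worst-case bound of (max_tries + 2) * timeout.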
Cause of missing heartbeats in FD
Sometimes a member is suspected by FD because a heartbeat ack has not been received for some time T (defined by timeout and max_tries). There can be multiple reasons for this. For example, in a cluster of A, B, C, D, where A pings B, B pings C, C pings D and D pings A, C can be suspected if:
- B or C is running at 100% CPU for longer than T. Even if C sends a heartbeat ack to B, B may not be able to process it because it is at 100%.
- B or C is garbage collecting; the effect is the same as above.
- A combination of the two cases above.
- The network loses packets. This usually happens when there is a lot of traffic on the network and the switch starts dropping packets (usually broadcasts first, then IP multicasts, TCP packets last).
- B or C is processing a callback. Let's say C received a remote method call (e.g. via RpcDispatcher) and takes longer than T to process it. During this time, C will not process any other messages, including heartbeats, and therefore B will not receive the heartbeat ack and will suspect C. This will change in JGroups 2.5 with the threadless stack, out-of-band messages and priority messages. As a workaround for the time being, consider running long tasks in a callback on a separate thread.
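The workaround mentioned in the last item, running long tasks on a separate thread, could look roughly like the sketch below. It deliberately avoids JGroups types: `OffloadingHandler` and `handle` are hypothetical names standing in for whatever callback the application implements, and the pool size is an arbitrary assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: hands long-running work to a worker pool so the thread
// that delivers messages (including heartbeats) is not blocked.
final class OffloadingHandler {
    private final ExecutorService workers;

    OffloadingHandler(int threads) {
        this.workers = Executors.newFixedThreadPool(threads);
    }

    // Intended to be called from the callback; returns immediately while
    // the task runs on a worker thread.
    CompletableFuture<Void> handle(Runnable longTask) {
        return CompletableFuture.runAsync(longTask, workers);
    }

    void shutdown() {
        workers.shutdown();
    }

    public static void main(String[] args) {
        OffloadingHandler handler = new OffloadingHandler(1);
        CompletableFuture<Void> result = handler.handle(
                () -> System.out.println("long task running on " + Thread.currentThread().getName()));
        result.join();      // only the demo waits; a real callback would return right away
        handler.shutdown();
    }
}
```

The trade-off is that results and errors from the task now arrive asynchronously, so the application has to handle them through the returned future rather than in the callback itself.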
For more details, refer to Failure Detection.