JGroupsFD

Version 14

Created by ovidiu.feodorov on Jan 11, 2005 11:28 PM. Last modified by ovidiu.feodorov on Dec 31, 2010 6:07 PM.

Definition

Failure detection based on heartbeat messages. A member sends 'are-you-alive' messages with a periodicity of 'timeout' milliseconds. After the first missing heartbeat response, the initiating member sends 'max_tries' more heartbeat messages, and the target member is declared suspect only after all of these heartbeat messages go unanswered.

In the worst case, when the target member dies immediately after answering a heartbeat, the failure takes timeout + timeout + max_tries * timeout = (max_tries + 2) * timeout milliseconds to detect.
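
As a quick check of the arithmetic above, the following Java sketch evaluates the worst-case bound from the two parameters. The class and method names are made up for illustration and are not part of JGroups.

```java
// Hypothetical helper, not part of JGroups: evaluates the worst-case bound above.
final class FdWorstCase {
    private FdWorstCase() {}

    /** Returns (max_tries + 2) * timeout, in milliseconds. */
    static long worstCaseDetectionMillis(long timeoutMillis, int maxTries) {
        if (timeoutMillis <= 0 || maxTries < 0) {
            throw new IllegalArgumentException("timeout must be > 0 and max_tries >= 0");
        }
        // 2L forces long arithmetic so large timeouts do not overflow int
        return (maxTries + 2L) * timeoutMillis;
    }

    public static void main(String[] args) {
        // timeout="2000", max_tries="3"
        System.out.println(worstCaseDetectionMillis(2000, 3)); // prints 10000
    }
}
```

With timeout="2000" and max_tries="3", a failure can therefore take up to 10 seconds to detect.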

Once a member is suspected, it will be excluded by GMS. SUSPECT event handling is also subject to interaction with VERIFY_SUSPECT. If we use FD_SOCK instead, then we don't send heartbeats, but establish TCP sockets and declare a member dead only when a socket is closed.

Configuration Example

    <FD timeout="2000" max_tries="3" shun="true"></FD>

Configuration Parameters

Name	Description
id	Give the protocol a different ID if needed so we can have multiple instances of it in the same stack
level	Sets the logger level (see javadocs)
max_tries	Number of times to send an are-you-alive message
name	Give the protocol a different name if needed so we can have multiple instances of it in the same stack
stats	Determines whether to collect statistics (and expose them via JMX). Default is true
timeout	Timeout to suspect a node P if neither a heartbeat nor data were received from P. Default is 3000 msec

Advanced

Each member sends a message containing a "FD" - HEARTBEAT header to its neighbor to the right (identified by the address). The heartbeats are sent by the inner class

When the neighbor receives the HEARTBEAT, it replies with a message containing a "FD" - HEARTBEAT_ACK header. The first member watches for "FD" - HEARTBEAT_ACK replies from its neighbor. For each received reply, it resets the timestamp (sets it to current time) and counter (sets it to 0).
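
One way to picture the "neighbor to the right" is as the next member in the membership list, wrapping around at the end, so that in a cluster of A,B,C,D the members ping each other in a ring. The Java sketch below assumes members are plain strings held in an ordered list; the class and method names are hypothetical and this is not how JGroups itself represents addresses or views.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: models the ping target as the next entry in an ordered member list.
final class RingNeighborSketch {
    private RingNeighborSketch() {}

    /** Returns the member that 'self' would ping, or null if 'self' is alone. */
    static String rightNeighbor(List<String> members, String self) {
        int index = members.indexOf(self);
        if (index < 0) {
            throw new IllegalArgumentException("not a member: " + self);
        }
        if (members.size() < 2) {
            return null; // a single member has nobody to ping
        }
        // The last member wraps around to the first one
        return members.get((index + 1) % members.size());
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList("A", "B", "C", "D");
        for (String m : members) {
            System.out.println(m + " pings " + rightNeighbor(members, m));
        }
    }
}
```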

The same instance that sends heartbeats watches the difference between the current time and the timestamp. If this difference grows over timeout, it cycles several more times (until max_tries is reached) and then sends a SUSPECT message for the neighbor's address. The SUSPECT message is sent down the stack, is addressed to all members, and is sent as a regular message with a header.
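
The timestamp and counter bookkeeping can be modeled roughly as follows. This is a simplified, single-threaded Java sketch under stated assumptions, not the JGroups implementation: the class, its fields and its methods are hypothetical, time is passed in explicitly so the behavior is easy to trace, and the exact cycle timing of the real protocol may differ.

```java
// Simplified model of the monitor logic described above.
// Not the JGroups implementation; all names are hypothetical.
final class HeartbeatMonitorSketch {
    private final long timeoutMillis;
    private final int maxTries;
    private long lastAckMillis; // timestamp, reset on every HEARTBEAT_ACK
    private int missedTries;    // counter, reset to 0 on every HEARTBEAT_ACK
    private boolean suspected;

    HeartbeatMonitorSketch(long timeoutMillis, int maxTries, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.maxTries = maxTries;
        this.lastAckMillis = startMillis;
    }

    /** Called when a HEARTBEAT_ACK arrives from the neighbor. */
    void onHeartbeatAck(long nowMillis) {
        lastAckMillis = nowMillis;
        missedTries = 0;
    }

    /**
     * Called once per monitor cycle. Returns true once the neighbor should be
     * suspected, which is the point where a SUSPECT message would be sent.
     */
    boolean onCycle(long nowMillis) {
        if (!suspected && nowMillis - lastAckMillis > timeoutMillis) {
            missedTries++;
            suspected = missedTries >= maxTries;
        }
        return suspected;
    }

    public static void main(String[] args) {
        // timeout=2000 ms, max_tries=3, last ack at t=0, one cycle every 2000 ms
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(2000, 3, 0);
        for (long now = 2000; now <= 8000; now += 2000) {
            System.out.println("t=" + now + " suspected=" + monitor.onCycle(now));
        }
    }
}
```

In this model a single acknowledgment at any point resets both values, so the neighbor is suspected only after max_tries consecutive cycles without a reply.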

Cause of missing heartbeats in FD

Sometimes a member is suspected by FD because a heartbeat ack has not been received for some time T (defined by timeout and max_tries). This can have multiple causes, e.g. in a cluster of A,B,C,D; C can be suspected if (note that A pings B, B pings C, C pings D and D pings A):

B or C are running at 100% CPU for more than T seconds. So even if C sends a heartbeat ack to B, B may not be able to process it because it is at 100%
B or C are garbage collecting; same as above.
A combination of the 2 cases above
The network loses packets. This usually happens when there is a lot of traffic on the network, and the switch starts dropping packets (usually broadcasts first, then IP multicasts, TCP packets last).
B or C are processing a callback. Let's say C received a remote method call (e.g. via RpcDispatcher), and takes T+1 seconds to process it. During this time, C will not process any other messages, including heartbeats, and therefore B will not receive the heartbeat ack and will suspect C. This will change in JGroups 2.5 with the threadless stack, out-of-band messages and priority messages. As a workaround for the time being, consider running long tasks in a callback on a separate thread.
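
The workaround in the last item can be sketched in plain Java as follows. The callback here is a stand-in method, not an actual JGroups or RpcDispatcher API, and the class name, pool size and task duration are assumptions chosen only for illustration. The point is that the callback hands the long task to a worker thread and returns at once, so the thread that delivers messages is free to handle heartbeats.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: 'onRequest' stands in for a message or RPC callback.
final class OffloadingCallbackSketch {
    // Pool size of 4 is an arbitrary choice for this sketch
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    /** Invoked on the receiving thread; returns immediately. */
    Future<String> onRequest(String request) {
        // Hand the long-running work to a worker thread instead of doing it here
        return workers.submit(() -> slowTask(request));
    }

    /** Simulates a long-running task. */
    private static String slowTask(String request) throws InterruptedException {
        Thread.sleep(200);
        return "done:" + request;
    }

    /** Waits for a result, converting checked exceptions to unchecked ones. */
    static String join(Future<String> future) {
        try {
            return future.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Lets queued tasks finish, then releases the worker threads. */
    void shutdown() {
        workers.shutdown();
    }

    public static void main(String[] args) {
        OffloadingCallbackSketch sketch = new OffloadingCallbackSketch();
        Future<String> pending = sketch.onRequest("job-1");
        System.out.println("callback returned, task still running: " + !pending.isDone());
        System.out.println(join(pending));
        sketch.shutdown();
    }
}
```

This only helps when the caller does not need the result as the return value of the callback itself; otherwise the result has to be delivered some other way, for example in a separate message.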

For more details refer to Failure Detection
