Cluster Stability Issues in a Replicated Cache High Availability Use Case
cbo_ Jun 30, 2010 9:57 PM
Hi,
We have a use case involving high availability between 2 VMs running on 2 separate machines and using a replicated cache. We have moved into some more advanced testing and are discovering some issues with the stability of the JGroups cluster.
The first issue occurred while cluster coordination was switching between the 2 VMs as we simulated bringing VMs down and back up. The 2 VMs appeared to cooperate in one direction (while one was the coordinator), but not when the other became coordinator. We did not notice at first, but later realized that, in our use of TCPPING for cluster discovery, we had an incorrect IP address on one side in our jgroups file. We had recently re-IP'd this machine and forgot to go back and fix the setting, since we had been using MPING before. The main symptom was that we were only successful if we first started the VM that had the correct TCPPING setting for the other VM (but not vice versa). Once we fixed this, we were able to start the VMs in either order and the cluster joins occurred correctly.
Then we noticed that we had num_initial_members set to 1, which kept us from being able to switch the cluster coordinator back and forth while keeping both VMs in the cluster (one VM would choose to isolate itself). We set it to 2, and cluster coordination and merging worked well.
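A minimal sketch of how the two discovery settings described above might look in the jgroups file. The host names hostA and hostB are placeholders, the port is assumed to match the bind_port of the TCP element, and the timeout and port_range values are illustrative assumptions rather than values from our configuration:

```xml
<!-- Sketch only: hostA and hostB are placeholder names for the two machines. -->
<!-- Both VMs should list the same, current addresses. A stale address on one
     side would fit the symptom where the cluster formed in only one start order. -->
<TCPPING timeout="3000"
         initial_hosts="hostA[7820],hostB[7820]"
         port_range="1"
         num_initial_members="2"/>
```

With num_initial_members="2", discovery waits for responses from both members (or until the timeout) instead of settling after the first one.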
Until......
We started to push a lot of work to our primary VM. At some point that VM runs out of memory (old gen) and goes into at least one lengthy garbage collection (GC). The logs seem to indicate that the cluster split was caused by pings from the other VM not being answered promptly by the VM stuck in GC. We set the logging level to TRACE for the classes mentioned below, and the detailed logging supported our theory: the pings from one VM to the VM in GC go unanswered beyond the tolerance levels, and the VM then isolates itself and becomes (in its mind) the coordinator. Our concern is that the cluster never recovers from this scenario, even though the GC finishes a bit later. The (temporary) solution was to set the ping tolerances above the expected duration of an "outage" from things like GC. Network issues would be another concern. The ping settings that were changed are:
<FD timeout="15000" max_tries="3"/>
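As a rough reading of these values, assuming FD suspects a member only after max_tries consecutive heartbeats each go unanswered for timeout milliseconds, the tolerated pause works out to about 15000 ms x 3 = 45 seconds. The annotated fragment below repeats the same setting with that arithmetic spelled out; the 45 second figure is derived from the settings, not measured:

```xml
<!-- timeout:   milliseconds to wait for a heartbeat response -->
<!-- max_tries: unanswered heartbeats tolerated before the member is suspected -->
<!-- Approximate tolerance: 15000 ms * 3 = 45000 ms, so a GC pause would need
     to stay under roughly 45 seconds to avoid a suspicion. -->
<FD timeout="15000" max_tries="3"/>
```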
Any thoughts on why the cluster can never recover once the separation occurs? The logs are pretty cumbersome right now; we are sifting through them and may add those details tomorrow. For now we can abbreviate the log details as follows (B is one VM and A is the other):
1. B maintains a view of A and B.
2. During the GC, A declares B to be dead.
3. After the GC, a MERGE is requested.
4. A declares itself the MERGE leader.
5. The FLUSH fails because B is discarding the request since it is not in A’s view.
6. The MERGE fails.
And, as mentioned above, here are the classes that we set logging to TRACE while investigating this:
Just in case it proves useful to anyone, the complete jgroups xml now has the following contents:
<config xmlns="urn:org:jgroups"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">
    <TCP bind_port="7820"
         loopback="false"
         port_range="30"
         recv_buf_size="20000000"
         send_buf_size="640000"
         discard_incompatible_packets="true"
         max_bundle_size="64000"
         max_bundle_timeout="30"
         enable_bundling="false"
         tcp_nodelay="true"
         use_send_queues="true"
         sock_conn_timeout="300"
         enable_diagnostics="false"
         thread_pool.enabled="true"
         thread_pool.min_threads="2"
         thread_pool.max_threads="30"
         thread_pool.keep_alive_time="5000"
         thread_pool.queue_enabled="false"
         thread_pool.queue_max_size="100"
         thread_pool.rejection_policy="Discard"
         oob_thread_pool.enabled="true"
         oob_thread_pool.min_threads="2"
         oob_thread_pool.max_threads="30"
         oob_thread_pool.keep_alive_time="5000"
         oob_thread_pool.queue_enabled="false"
         oob_thread_pool.queue_max_size="100"
         oob_thread_pool.rejection_policy="Discard"
    />