JGroups fail to merge two nodes
rafalr May 19, 2015 7:50 AMI have two nodes with Infinispan cache working in REPL_ASYNC mode. One of the nodes sometimes receives new view for the channel, afterwards jgroups tries to merge but instead of two nodes the MergedView contains subgroups. This can happen multiple times, new MergedView views contain 2, 4, 6 etc nodes.
Below there are parts of the logs from two instances. Invalid merge happened at 2015-05-08 02:03, 2015-05-18 02:03 and 2015-05-18 09:41 . I've also pasted jgroups configuration in the project.
Please help me to determine cause of this behaviour.
### NODE A
2015-05-08 02:03:51,095 WARN [org.jgroups.protocols.FD] nodeA: I was suspected by nodeB; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2015-05-08 02:03:57,266 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|3] (2) [nodeB, nodeA], 2 subgroups: [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]
2015-05-08 02:03:57,270 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version
2015-05-08 02:03:57,270 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map
2015-05-08 02:03:57,270 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message
2015-05-08 02:04:02,275 WARN [org.jgroups.protocols.pbcast.GMS] nodeA: failed to collect all ACKs (expected=1) for view [nodeB|3] after 5000ms, missing 1 ACKs from (1) nodeB
2015-05-08 09:06:04,947 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version
2015-05-08 09:06:04,949 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map
2015-05-08 09:06:04,949 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message
...
2015-05-13 02:03:31,152 WARN [org.jgroups.protocols.FD] nodeA: I was suspected by nodeB; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2015-05-13 02:03:59,329 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|5] (2) [nodeB, nodeA], 4 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]
2015-05-13 02:03:59,336 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version
2015-05-13 02:03:59,336 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map
2015-05-13 02:03:59,336 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message
2015-05-13 02:04:04,331 WARN [org.jgroups.protocols.pbcast.GMS] nodeA: failed to collect all ACKs (expected=1) for view [nodeB|5] after 5000ms, missing 1 ACKs from (1) nodeB
2015-05-13 10:07:26,446 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version
2015-05-13 10:07:26,446 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map
2015-05-13 10:07:26,446 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message
2015-05-13 10:11:00,729 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] ISPN000136: Execution error
org.infinispan.util.concurrent.TimeoutException: Node nodeB timed out
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:195)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:566)
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:283)
at org.infinispan.interceptors.distribution.BaseDistributionInterceptor.handleNonTxWriteCommand(BaseDistributionInterceptor.java:273)
at org.infinispan.interceptors.distribution.NonTxDistributionInterceptor.visitPutKeyValueCommand(NonTxDistributionInterceptor.java:93)
...
at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:251)
...
Caused by: org.jgroups.TimeoutException: timeout waiting for response from nodeB, request: org.jgroups.blocks.UnicastRequest@6513524f, req_id=211, mode=GET_ALL, target=nodeB
at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:429)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:374)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:188)
... 90 more
...
2015-05-18 09:40:38,141 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeA|6] (1) [nodeA]
2015-05-18 09:41:48,976 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|7] (2) [nodeB, nodeA], 6 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|6] (1) [nodeA], [nodeB|6] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]
2015-05-18 09:41:48,982 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version
2015-05-18 09:41:48,982 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map
2015-05-18 09:41:48,982 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message
### NODE B
2015-05-08 02:02:38,748 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeB|2] (1) [nodeB]
2015-05-08 02:03:57,257 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|3] (2) [nodeB, nodeA], 2 subgroups: [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]
2015-05-08 02:07:57,263 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] ISPN000196: Failed to recover cluster state after the current node became the coordinator
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:432)
...
2015-05-18 09:40:38,140 WARN [org.jgroups.protocols.FD] nodeB: I was suspected by nodeA; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
2015-05-18 09:40:38,610 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeB|6] (1) [nodeB]
2015-05-13 02:02:52,288 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeB|4] (1) [nodeB]
2015-05-13 02:03:59,331 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|5] (2) [nodeB, nodeA], 4 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]
2015-05-13 02:07:59,333 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] ISPN000196: Failed to recover cluster state after the current node became the coordinator
java.util.concurrent.TimeoutException
...
2015-05-18 09:41:48,978 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|7] (2) [nodeB, nodeA], 6 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|6] (1) [nodeA], [nodeB|6] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]
2015-05-18 09:45:48,981 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] ISPN000196: Failed to recover cluster state after the current node became the coordinator
java.util.concurrent.TimeoutException
at java.util.concurrent.FutureTask.get(FutureTask.java:205)
### JGROUPS CONFIG
<?xml version="1.0" encoding="UTF-8"?>
<config xmlns="urn:org:jgroups"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-3.6.xsd">
<TCP bind_addr="${cluster.replication.bindAddr}"
bind_port="${cluster.replication.bindPort}"
external_addr="${cluster.replication.externalAddr}"
external_port="${cluster.replication.externalPort}"
thread_pool.enabled="true"
thread_pool.min_threads="4"
thread_pool.max_threads="16"
thread_pool.keep_alive_time="8000"
thread_pool.queue_enabled="false"
thread_pool.queue_max_size="100"
thread_pool.rejection_policy="run"
oob_thread_pool.enabled="true"
oob_thread_pool.min_threads="4"
oob_thread_pool.max_threads="16"
oob_thread_pool.keep_alive_time="8000"
oob_thread_pool.queue_enabled="false"
oob_thread_pool.queue_max_size="100"
oob_thread_pool.rejection_policy="run"
recv_buf_size="20000000"
send_buf_size="640000"
max_bundle_size="64000"
max_bundle_timeout="30"
use_send_queues="true"
sock_conn_timeout="300" />
<TCPPING initial_hosts="${cluster.replication.discovery.initialHosts}"
port_range="0"/>
<FD timeout="2000"
max_tries="3"/>
<FD_SOCK bind_addr="${cluster.replication.bindAddr}"
start_port="${cluster.replication.failureDetection.bindPort}" port_range="0"
external_addr="${cluster.replication.externalAddr}"
external_port="${cluster.replication.failureDetection.externalPort}"/>
<MERGE2 max_interval="100000" min_interval="20000"/>
<VERIFY_SUSPECT timeout="1500"/>
<BARRIER/>
<pbcast.NAKACK
use_mcast_xmit="false"
retransmit_timeout="300,600,1200,2400,4800"
discard_delivered_msgs="false"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400k"/>
<pbcast.GMS print_local_addr="true"
join_timeout="3000"
view_bundling="true"
view_ack_collection_timeout="5000" />
<ENCRYPT sym_init="56" sym_algorithm="Blowfish" asym_init="512" asym_algorithm="RSA"/>
<FRAG2 frag_size="60k"/>
<pbcast.STATE_TRANSFER/>
</config>