1 Reply Latest reply on May 20, 2015 9:21 AM by dan.berindei

JGroups fail to merge two nodes

rafalr May 19, 2015 7:50 AM

I have two nodes with Infinispan cache working in REPL_ASYNC mode. One of the nodes sometimes receives new view for the channel, afterwards jgroups tries to merge but instead of two nodes the MergedView contains subgroups. This can happen multiple times, new MergedView views contain 2, 4, 6 etc nodes.

Below there are parts of the logs from two instances. Invalid merge happened at 2015-05-08 02:03, 2015-05-18 02:03 and 2015-05-18 09:41 . I've also pasted jgroups configuration in the project.

Please help me to determine cause of this behaviour.

### NODE A

2015-05-08 02:03:51,095 WARN [org.jgroups.protocols.FD] nodeA: I was suspected by nodeB; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

2015-05-08 02:03:57,266 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|3] (2) [nodeB, nodeA], 2 subgroups: [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

2015-05-08 02:03:57,270 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

2015-05-08 02:03:57,270 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

2015-05-08 02:03:57,270 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

2015-05-08 02:04:02,275 WARN [org.jgroups.protocols.pbcast.GMS] nodeA: failed to collect all ACKs (expected=1) for view [nodeB|3] after 5000ms, missing 1 ACKs from (1) nodeB

2015-05-08 09:06:04,947 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

2015-05-08 09:06:04,949 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

2015-05-08 09:06:04,949 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

...

2015-05-13 02:03:31,152 WARN [org.jgroups.protocols.FD] nodeA: I was suspected by nodeB; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

2015-05-13 02:03:59,329 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|5] (2) [nodeB, nodeA], 4 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

2015-05-13 02:03:59,336 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

2015-05-13 02:03:59,336 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

2015-05-13 02:03:59,336 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

2015-05-13 02:04:04,331 WARN [org.jgroups.protocols.pbcast.GMS] nodeA: failed to collect all ACKs (expected=1) for view [nodeB|5] after 5000ms, missing 1 ACKs from (1) nodeB

2015-05-13 10:07:26,446 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

2015-05-13 10:07:26,446 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

2015-05-13 10:07:26,446 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

2015-05-13 10:11:00,729 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] ISPN000136: Execution error

org.infinispan.util.concurrent.TimeoutException: Node nodeB timed out

at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:195)

at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:566)

at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:283)

at org.infinispan.interceptors.distribution.BaseDistributionInterceptor.handleNonTxWriteCommand(BaseDistributionInterceptor.java:273)

at org.infinispan.interceptors.distribution.NonTxDistributionInterceptor.visitPutKeyValueCommand(NonTxDistributionInterceptor.java:93)

...

at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:251)

...

Caused by: org.jgroups.TimeoutException: timeout waiting for response from nodeB, request: org.jgroups.blocks.UnicastRequest@6513524f, req_id=211, mode=GET_ALL, target=nodeB

at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:429)

at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:374)

at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:188)

... 90 more

...

2015-05-18 09:40:38,141 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeA|6] (1) [nodeA]

2015-05-18 09:41:48,976 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|7] (2) [nodeB, nodeA], 6 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|6] (1) [nodeA], [nodeB|6] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

2015-05-18 09:41:48,982 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

2015-05-18 09:41:48,982 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

2015-05-18 09:41:48,982 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

### NODE B

2015-05-08 02:02:38,748 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeB|2] (1) [nodeB]

2015-05-08 02:03:57,257 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|3] (2) [nodeB, nodeA], 2 subgroups: [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

2015-05-08 02:07:57,263 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] ISPN000196: Failed to recover cluster state after the current node became the coordinator

java.util.concurrent.TimeoutException

at java.util.concurrent.FutureTask.get(FutureTask.java:205)

at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:432)

...

2015-05-18 09:40:38,140 WARN [org.jgroups.protocols.FD] nodeB: I was suspected by nodeA; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

2015-05-18 09:40:38,610 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeB|6] (1) [nodeB]

2015-05-13 02:02:52,288 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeB|4] (1) [nodeB]

2015-05-13 02:03:59,331 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|5] (2) [nodeB, nodeA], 4 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

2015-05-13 02:07:59,333 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] ISPN000196: Failed to recover cluster state after the current node became the coordinator

java.util.concurrent.TimeoutException

...

2015-05-18 09:41:48,978 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|7] (2) [nodeB, nodeA], 6 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|6] (1) [nodeA], [nodeB|6] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

2015-05-18 09:45:48,981 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] ISPN000196: Failed to recover cluster state after the current node became the coordinator

java.util.concurrent.TimeoutException

at java.util.concurrent.FutureTask.get(FutureTask.java:205)

### JGROUPS CONFIG

<?xml version="1.0" encoding="UTF-8"?>

<config xmlns="urn:org:jgroups"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-3.6.xsd">

<TCP bind_addr="${cluster.replication.bindAddr}"

bind_port="${cluster.replication.bindPort}"

external_addr="${cluster.replication.externalAddr}"

external_port="${cluster.replication.externalPort}"

thread_pool.enabled="true"

thread_pool.min_threads="4"

thread_pool.max_threads="16"

thread_pool.keep_alive_time="8000"

thread_pool.queue_enabled="false"

thread_pool.queue_max_size="100"

thread_pool.rejection_policy="run"

oob_thread_pool.enabled="true"

oob_thread_pool.min_threads="4"

oob_thread_pool.max_threads="16"

oob_thread_pool.keep_alive_time="8000"

oob_thread_pool.queue_enabled="false"

oob_thread_pool.queue_max_size="100"

oob_thread_pool.rejection_policy="run"

recv_buf_size="20000000"

send_buf_size="640000"

max_bundle_size="64000"

max_bundle_timeout="30"

use_send_queues="true"

sock_conn_timeout="300" />

<TCPPING initial_hosts="${cluster.replication.discovery.initialHosts}"

port_range="0"/>

<FD timeout="2000"

max_tries="3"/>

<FD_SOCK bind_addr="${cluster.replication.bindAddr}"

start_port="${cluster.replication.failureDetection.bindPort}" port_range="0"

external_addr="${cluster.replication.externalAddr}"

external_port="${cluster.replication.failureDetection.externalPort}"/>

<VERIFY_SUSPECT timeout="1500"/>

<pbcast.NAKACK

use_mcast_xmit="false"

retransmit_timeout="300,600,1200,2400,4800"

discard_delivered_msgs="false"/>

<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400k"/>

<pbcast.GMS print_local_addr="true"

join_timeout="3000"

view_bundling="true"

view_ack_collection_timeout="5000" />

<pbcast.STATE_TRANSFER/>

</config>

1. Re: JGroups fail to merge two nodes

dan.berindei May 20, 2015 9:21 AM (in response to rafalr)

Looks like some messages are ignored:

2015-05-08 02:03:57,270 WARN [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

2015-05-08 02:03:57,270 WARN [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

2015-05-08 02:03:57,270 WARN [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

Since you don't have UNICAST3 in the stack, those discarded messages are not retransmitted. I think adding UNICAST3 should fix your problem, if not, you may have encountered an ENCRYPT issue.

BTW, your protocol list looks pretty old (e.g. NAKACK, BARRIER, STATE_TRANSFER), you should definitely try to rebase on a more recent Infinispan default configuration.
Actions