1 Reply Latest reply on May 20, 2015 9:21 AM by dan.berindei

    JGroups fail to merge two nodes

    rafalr

      I have two nodes with Infinispan cache working in REPL_ASYNC mode. One of the nodes sometimes receives new view for the channel, afterwards jgroups tries to merge but instead of two nodes the MergedView contains subgroups. This can happen multiple times, new MergedView views contain 2, 4, 6 etc nodes.

       

      Below there are parts of the logs from two instances. Invalid merge happened at 2015-05-08 02:03, 2015-05-18 02:03 and 2015-05-18 09:41 . I've also pasted jgroups configuration in the project.

       

      Please help me to determine cause of this behaviour.

       

      ### NODE A

       

      2015-05-08 02:03:51,095  WARN  [org.jgroups.protocols.FD] nodeA: I was suspected by nodeB; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      2015-05-08 02:03:57,266  INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|3] (2) [nodeB, nodeA], 2 subgroups: [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

      2015-05-08 02:03:57,270  WARN  [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

      2015-05-08 02:03:57,270  WARN  [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

      2015-05-08 02:03:57,270  WARN  [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

      2015-05-08 02:04:02,275  WARN  [org.jgroups.protocols.pbcast.GMS] nodeA: failed to collect all ACKs (expected=1) for view [nodeB|3] after 5000ms, missing 1 ACKs from (1) nodeB

      2015-05-08 09:06:04,947  WARN  [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

      2015-05-08 09:06:04,949  WARN  [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

      2015-05-08 09:06:04,949  WARN  [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

      ...

      2015-05-13 02:03:31,152  WARN  [org.jgroups.protocols.FD] nodeA: I was suspected by nodeB; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      2015-05-13 02:03:59,329  INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|5] (2) [nodeB, nodeA], 4 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

      2015-05-13 02:03:59,336  WARN  [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

      2015-05-13 02:03:59,336  WARN  [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

      2015-05-13 02:03:59,336  WARN  [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

      2015-05-13 02:04:04,331  WARN  [org.jgroups.protocols.pbcast.GMS] nodeA: failed to collect all ACKs (expected=1) for view [nodeB|5] after 5000ms, missing 1 ACKs from (1) nodeB

      2015-05-13 10:07:26,446  WARN  [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

      2015-05-13 10:07:26,446  WARN  [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

      2015-05-13 10:07:26,446  WARN  [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

      2015-05-13 10:11:00,729  ERROR [org.infinispan.interceptors.InvocationContextInterceptor] ISPN000136: Execution error

      org.infinispan.util.concurrent.TimeoutException: Node nodeB timed out

        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:195)

        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:566)

        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:283)

        at org.infinispan.interceptors.distribution.BaseDistributionInterceptor.handleNonTxWriteCommand(BaseDistributionInterceptor.java:273)

        at org.infinispan.interceptors.distribution.NonTxDistributionInterceptor.visitPutKeyValueCommand(NonTxDistributionInterceptor.java:93)

        ...

        at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:251)

        ...

      Caused by: org.jgroups.TimeoutException: timeout waiting for response from nodeB, request: org.jgroups.blocks.UnicastRequest@6513524f, req_id=211, mode=GET_ALL, target=nodeB

        at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:429)

        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:374)

        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:188)

        ... 90 more

      ...

      2015-05-18 09:40:38,141  INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeA|6] (1) [nodeA]

      2015-05-18 09:41:48,976  INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|7] (2) [nodeB, nodeA], 6 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|6] (1) [nodeA], [nodeB|6] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

      2015-05-18 09:41:48,982  WARN  [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

      2015-05-18 09:41:48,982  WARN  [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

      2015-05-18 09:41:48,982  WARN  [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

       

      ### NODE B

       

      2015-05-08 02:02:38,748  INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeB|2] (1) [nodeB]

      2015-05-08 02:03:57,257  INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|3] (2) [nodeB, nodeA], 2 subgroups: [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

      2015-05-08 02:07:57,263  ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] ISPN000196: Failed to recover cluster state after the current node became the coordinator

      java.util.concurrent.TimeoutException

        at java.util.concurrent.FutureTask.get(FutureTask.java:205)

        at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:432)

      ...

      2015-05-18 09:40:38,140  WARN  [org.jgroups.protocols.FD] nodeB: I was suspected by nodeA; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      2015-05-18 09:40:38,610  INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeB|6] (1) [nodeB]

      2015-05-13 02:02:52,288  INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000094: Received new cluster view for channel ISPN: [nodeB|4] (1) [nodeB]

      2015-05-13 02:03:59,331  INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|5] (2) [nodeB, nodeA], 4 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

      2015-05-13 02:07:59,333  ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] ISPN000196: Failed to recover cluster state after the current node became the coordinator

      java.util.concurrent.TimeoutException

      ...

      2015-05-18 09:41:48,978  INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] ISPN000093: Received new, MERGED cluster view for channel ISPN: MergeView::[nodeB|7] (2) [nodeB, nodeA], 6 subgroups: [nodeB|4] (1) [nodeB], [nodeB|3] (2) [nodeB, nodeA], [nodeB|2] (1) [nodeB], [nodeA|6] (1) [nodeA], [nodeB|6] (1) [nodeB], [nodeA|1] (2) [nodeA, nodeB]

      2015-05-18 09:45:48,981  ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] ISPN000196: Failed to recover cluster state after the current node became the coordinator

      java.util.concurrent.TimeoutException

        at java.util.concurrent.FutureTask.get(FutureTask.java:205)

       

      ### JGROUPS CONFIG

       

      <?xml version="1.0" encoding="UTF-8"?>

      <config xmlns="urn:org:jgroups"

              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

              xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups-3.6.xsd">

        <TCP bind_addr="${cluster.replication.bindAddr}"

             bind_port="${cluster.replication.bindPort}"

             external_addr="${cluster.replication.externalAddr}"

             external_port="${cluster.replication.externalPort}"

       

       

             thread_pool.enabled="true"

             thread_pool.min_threads="4"

             thread_pool.max_threads="16"

             thread_pool.keep_alive_time="8000"

             thread_pool.queue_enabled="false"

             thread_pool.queue_max_size="100"

             thread_pool.rejection_policy="run"

       

       

             oob_thread_pool.enabled="true"

             oob_thread_pool.min_threads="4"

             oob_thread_pool.max_threads="16"

             oob_thread_pool.keep_alive_time="8000"

             oob_thread_pool.queue_enabled="false"

             oob_thread_pool.queue_max_size="100"

             oob_thread_pool.rejection_policy="run"

       

       

             recv_buf_size="20000000"

             send_buf_size="640000"

             max_bundle_size="64000"

             max_bundle_timeout="30"

             use_send_queues="true"

             sock_conn_timeout="300" />

       

       

        <TCPPING initial_hosts="${cluster.replication.discovery.initialHosts}"

                 port_range="0"/>

       

       

        <FD timeout="2000"

            max_tries="3"/>

        <FD_SOCK bind_addr="${cluster.replication.bindAddr}"

                 start_port="${cluster.replication.failureDetection.bindPort}" port_range="0"

                 external_addr="${cluster.replication.externalAddr}"

                 external_port="${cluster.replication.failureDetection.externalPort}"/>

       

       

        <MERGE2 max_interval="100000" min_interval="20000"/>

        <VERIFY_SUSPECT timeout="1500"/>

        <BARRIER/>

       

       

        <pbcast.NAKACK

            use_mcast_xmit="false"

            retransmit_timeout="300,600,1200,2400,4800"

            discard_delivered_msgs="false"/>

        <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000" max_bytes="400k"/>

       

       

        <pbcast.GMS print_local_addr="true"

                    join_timeout="3000"

                    view_bundling="true"

                    view_ack_collection_timeout="5000" />

        <ENCRYPT sym_init="56" sym_algorithm="Blowfish" asym_init="512" asym_algorithm="RSA"/>

       

       

        <FRAG2 frag_size="60k"/>

        <pbcast.STATE_TRANSFER/>

      </config>

        • 1. Re: JGroups fail to merge two nodes
          dan.berindei

          Looks like some messages are ignored:

          2015-05-08 02:03:57,270  WARN  [org.jgroups.protocols.ENCRYPT] attempting to use stored cipher as message does not use current encryption version

          2015-05-08 02:03:57,270  WARN  [org.jgroups.protocols.ENCRYPT] unable to find a matching cipher in previous key map

          2015-05-08 02:03:57,270  WARN  [org.jgroups.protocols.ENCRYPT] unrecognised cipher; discarding message

           

          Since you don't have UNICAST3 in the stack, those discarded messages are not retransmitted. I think adding UNICAST3 should fix your problem, if not, you may have encountered an ENCRYPT issue.

           

          BTW, your protocol list looks pretty old (e.g. NAKACK, BARRIER, STATE_TRANSFER), you should definitely try to rebase on a more recent Infinispan default configuration.