3 Replies Latest reply on Jun 6, 2011 11:08 AM by mircea.markus

pbcast.NAKACK dropped message after network reconnect

ghostho May 27, 2011 4:53 AM

Hi,

I have Infinispan running on two servers. I put some entries into the cache and then disconnected the network. After reconnecting, both caches find each other, but I get this message:

2011-05-27 10:45:27,768 | WARN | ,PC1-41273 | groups.protocols.pbcast.NAKACK 788 | PC1-41273: dropped message from PC1-25543 (not in table [PC1-41273]), view=[PC1-41273|2] [PC1-41273]

and this message

2011-05-27 10:45:27,783 | INFO | ,PC1-41273 | n.util.logging.AbstractLogImpl 20 | Received new, MERGED cluster view: MergeView::[PC1-25543|3] [PC1-25543, PC1-41273], subgroups=[[PC1-25543|2] [PC1-25543], [PC1-41273|2] [PC1-41273]]

After the merge, the keys in the two caches are not the same. The data is not replicated.

How can I fix this?

My config:

<config xmlns="urn:org:jgroups" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="urn:org:jgroups file:schema/JGroups-2.8.xsd">

<!--

TCP based stack, with flush, flow control and message bundling. This

is usually used when IP multicasting cannot be used in a network, e.g.

because it is disabled (routers discard multicast). Note that

TCP.bind_addr and TCPPING.initial_hosts should be set, possibly via

system properties, e.g. -Djgroups.bind_addr=192.168.5.2 and

-Djgroups.tcpping.initial_hosts=192.168.5.2[7800]"

-->

<TCP start_port="1164" loopback="true" recv_buf_size="20000000"

send_buf_size="640000" discard_incompatible_packets="true"

max_bundle_size="64000" max_bundle_timeout="30"

use_incoming_packet_handler="true" enable_bundling="true"

use_send_queues="false" sock_conn_timeout="300"

skip_suspected_members="true" use_concurrent_stack="true"

thread_pool.enabled="true" thread_pool.min_threads="1"

thread_pool.max_threads="25" thread_pool.keep_alive_time="5000"

thread_pool.queue_enabled="false" thread_pool.queue_max_size="100"

thread_pool.rejection_policy="run" oob_thread_pool.enabled="true"

oob_thread_pool.min_threads="1" oob_thread_pool.max_threads="8"

oob_thread_pool.keep_alive_time="5000" oob_thread_pool.queue_enabled="false"

oob_thread_pool.queue_max_size="100" oob_thread_pool.rejection_policy="run" />

<TCPPING timeout="3000"

initial_hosts="${jgroups.tcpping.initial_hosts:localhost[7800],localhost[7801]}"

port_range="1" num_initial_members="3" />

<MPING bind_addr="${jgroups.bind_addr:127.0.0.1}" break_on_coord_rsp="true"

mcast_addr="${jgroups.udp.mcast_addr:228.6.7.8}" mcast_port="${jgroups.udp.mcast_port:46655}" ip_ttl="${jgroups.udp.ip_ttl:2}"

num_initial_members="3"/>

<FD_SOCK />

<VERIFY_SUSPECT timeout="1500" />

<pbcast.NAKACK max_xmit_size="90000" use_mcast_xmit="false"

gc_lag="0" retransmit_timeout="300,600,1200,2400,4800"

discard_delivered_msgs="false" />

<VIEW_SYNC avg_send_interval="10000"/>

<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"

max_bytes="400000" />

<pbcast.GMS print_local_addr="true" join_timeout="3000"

join_retry_timeout="2000" shun="false" view_bundling="true" />

<pbcast.STREAMING_STATE_TRANSFER />

<pbcast.FLUSH timeout="0" />

</config>

1. Re: pbcast.NAKACK dropped message after network reconnect

cbo_ May 27, 2011 10:15 AM (in response to ghostho)

It sounds like this is actually working as designed. You have a split-brain scenario here. If you think of Infinispan as Infinispan plus JGroups, the messages you are seeing may make more sense. When you disconnected the network, the cluster detected it and split into two separate partitions. JGroups was later able to heal the cluster (hence the MERGED message), but the Infinispan caches in your two JVMs remain isolated from each other.

This is by design. I believe the basic principle is that either JVM could have changed its cache while the cluster was split. Since it would not be clear whose changes are correct, the caches stay separate; deciding which JVM is right requires application knowledge. I believe there are only two ways to fix this situation. First, you can restart one of your JVMs so that it syncs with the other again (via state transfer). Or, you can reconnect to the cache from your application, which effectively resyncs it with the other JVM.
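To make the "application knowledge" point concrete, here is a minimal sketch in plain Java of one possible reconciliation policy: last write wins, per key. Everything in it is hypothetical (the SplitBrainSketch class, the Versioned type, the timestamps); it is not an Infinispan or JGroups API, and it assumes the application stored a write timestamp with every value and that the clocks on both machines are comparable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from Infinispan or JGroups.
final class SplitBrainSketch {

    /** A cache value tagged with the time of its last write (assumed to be recorded by the application). */
    static final class Versioned {
        final String value;
        final long timestamp;

        Versioned(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return value + "@" + timestamp;
        }
    }

    /**
     * Combines the contents of two partitions that diverged during the split.
     * For a key present on both sides, the entry with the later timestamp wins;
     * on a tie, the entry from partition a is kept.
     */
    static Map<String, Versioned> reconcile(Map<String, Versioned> a, Map<String, Versioned> b) {
        Map<String, Versioned> merged = new HashMap<>(a);
        for (Map.Entry<String, Versioned> e : b.entrySet()) {
            merged.merge(e.getKey(), e.getValue(),
                    (left, right) -> right.timestamp > left.timestamp ? right : left);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Versioned> partitionA = new HashMap<>();
        partitionA.put("user:1", new Versioned("written-on-A", 100L));

        Map<String, Versioned> partitionB = new HashMap<>();
        partitionB.put("user:1", new Versioned("written-on-B", 200L));
        partitionB.put("user:2", new Versioned("only-on-B", 150L));

        // The reconciled contents would then be written back into the healed cache.
        System.out.println(reconcile(partitionA, partitionB));
    }
}
```

Last write wins is only one option and silently discards the older update. An application that cannot tolerate that would need its own merge rule in place of the lambda.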
2. Re: pbcast.NAKACK dropped message after network reconnect

ghostho May 31, 2011 5:15 AM (in response to cbo_)

Hello,

How can I reconnect the cache? I get a new instance of the cache in my application. The problem is that when the network is reconnected, NAKACK has the wrong destination address: it is only 0.0.0.0 plus the port, which is not the right address. How can I fix this? Can I reconnect or resync manually?
3. Re: pbcast.NAKACK dropped message after network reconnect

mircea.markus Jun 6, 2011 11:08 AM (in response to ghostho)

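How can I fix this?
You can register a @Listener to be notified of view merges.
E.g.
   import org.infinispan.notifications.Listener;
   import org.infinispan.notifications.cachemanagerlistener.annotation.Merged;
   import org.infinispan.notifications.cachemanagerlistener.event.MergeEvent;

   @Listener
   public static class MergedViewListener {

      // Set once a merged view has been received; volatile so other threads see the change.
      public volatile boolean merged;

      // Called by Infinispan when JGroups installs a merged view after a partition heals.
      // "log" is assumed to be a logger defined in the enclosing class.
      @Merged
      public void mergedView(MergeEvent me) {
         log.infof("Merged view received: %s", me);
         merged = true;
      }
   }
and then register it with the cache manager:
      CacheManager cm = getCacheManager();
      cm.addListener(new MergedViewListener());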
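The listener above only reports that a merge happened; what to do next is up to the application. As a rough sketch of one possible policy, the plain-Java snippet below decides whether the local node should discard its state and resync: it keeps the state of the partition that contained the coordinator of the merged view (in JGroups, the first member of a view) and resyncs everyone else. The MergePolicySketch class and mustResync method are hypothetical, and addresses are modelled as plain strings rather than JGroups Address objects.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a merge policy expressed on plain strings, not on JGroups types.
final class MergePolicySketch {

    /**
     * Returns true if the local node was in a pre-merge subgroup that did not
     * contain the coordinator of the merged view, i.e. its state is treated as stale.
     */
    static boolean mustResync(String localAddress, String newCoordinator,
                              List<List<String>> subgroups) {
        for (List<String> subgroup : subgroups) {
            if (subgroup.contains(localAddress)) {
                return !subgroup.contains(newCoordinator);
            }
        }
        // Not found in any pre-merge subgroup: assume there is no state worth keeping.
        return true;
    }

    public static void main(String[] args) {
        // Values taken from the MergeView in the log above:
        // merged view [PC1-25543, PC1-41273], subgroups [PC1-25543] and [PC1-41273].
        List<List<String>> subgroups = Arrays.asList(
                Arrays.asList("PC1-25543"),
                Arrays.asList("PC1-41273"));

        System.out.println(mustResync("PC1-41273", "PC1-25543", subgroups)); // true
        System.out.println(mustResync("PC1-25543", "PC1-25543", subgroups)); // false
    }
}
```

A node for which this returns true would then take one of the recovery steps suggested earlier in the thread, such as restarting so that state transfer brings it back in sync.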