2 Replies Latest reply on Nov 7, 2006 7:51 AM by Bela Ban

    wrong coordinator causes join to fail

    Renaud Bruyeron Newbie

      I had a problem recently in production whereby one of the instances in the cluster failed and had to be terminated (via kill -9).
      This is part of a cluster of 4 servers, on which there are 12 cache instances (1 JVM per server, 3 caches per JVM) in REPL_ASYNC mode.

      After a node failure, we restarted one of the JVMs, and then restarted 2 of the remaining JVMs. To make things simple, we first restarted B, then A and D, but left C running.

      We noticed the following messages in the logs of A, B and D after the restart:
      06/11/2006 14:10:24 WARN [ClientGmsImpl.java:126] - join(A:32937) sent to B:32955 timed out, retrying

      B:32955 was the coordinator before B was killed with kill -9. It seems that C (the remaining member) incorrectly thinks that B:32955 is still the coordinator. Here's the protocol stack I am using:
      UDP(ip_mcast=true;ip_ttl=64;loopback=false;mcast_addr=${treeCache.mcastAddress};mcast_port=${treeCache.mcastPort};mcast_recv_buf_size=80000;mcast_send_buf_size=150000;ucast_recv_buf_size=80000;ucast_send_buf_size=150000;bind_addr=${treeCache.bind_addr}):\
      PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):\
      MERGE2(max_interval=20000;min_interval=10000):\
      FD_SOCK:\
      VERIFY_SUSPECT(down_thread=false;timeout=1500;up_thread=false):\
      pbcast.NAKACK(down_thread=false;gc_lag=50;retransmit_timeout=600,1200,2400,4800;up_thread=false):\
      pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):\
      UNICAST(down_thread=false;timeout=600,1200,2400):\
      FRAG(down_thread=false;frag_size=8192;up_thread=false):\
      pbcast.GMS(join_retry_timeout=2000;join_timeout=5000;print_local_addr=true;shun=true):\
      pbcast.STATE_TRANSFER(down_thread=true;up_thread=true)
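
      To illustrate what I suspect is happening, here is a minimal sketch. It is a simplified model, not JGroups code: the class and method names are hypothetical, and the two-member view is an assumption about what C might still be holding. It relies only on the convention that the coordinator is the first member of the current view, so as long as C never drops B:32955 from its view, joiners keep being pointed at the dead address.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of a group view; NOT the JGroups implementation.
public class StaleCoordinatorSketch {

    // By convention, the coordinator is the first member of the view.
    static String coordinatorOf(List<String> view) {
        return view.isEmpty() ? null : view.get(0);
    }

    // Builds the view that should be installed once failure detection
    // has confirmed that the suspected members are really gone.
    static List<String> excludeSuspected(List<String> view, List<String> suspected) {
        List<String> next = new ArrayList<>(view);
        next.removeAll(suspected);
        return next;
    }

    public static void main(String[] args) {
        // Assumed view still held by C after B was killed with kill -9.
        List<String> staleView = Arrays.asList("B:32955", "C");

        // A joiner asking C who the coordinator is gets the dead address.
        System.out.println("coordinator (stale view): " + coordinatorOf(staleView));

        // What should happen once B:32955 is suspected and verified dead.
        List<String> newView = excludeSuspected(staleView, Arrays.asList("B:32955"));
        System.out.println("coordinator (new view):   " + coordinatorOf(newView));
    }
}
```

      In this model the join only succeeds after the second step, which is the step that apparently never happened on C in production.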

      When I tried to reproduce this scenario on my dev system, failure detection worked and a new coordinator was successfully elected, so I think I may have hit an edge case.

      Any idea on what could be going on?