1 Reply Latest reply on Jan 8, 2007 8:19 AM by Manik Surtani

    shun=true and JVM freezes + replication queue

    Renaud Bruyeron Newbie

      We are running into issues with our JBossCache cluster running inside Tomcat 5.5.17 on JDK 1.5.0_08. We are using the 1.4.1 BETA that was released back in November.
      Here's the current TreeCache config, followed by the JGroups stack we are using:

       <bean id="treeCacheInstance" class="xxx.TreeCache">
         <property name="isolationLevel" value="REPEATABLE_READ" />
         <property name="cacheMode" value="REPL_ASYNC" />
         <property name="clusterName" value="${treeCache.clusterName}" />
         <property name="useReplQueue" value="false" />
         <property name="replQueueInterval" value="0" />
         <property name="replQueueMaxElements" value="0" />
         <property name="fetchInMemoryState" value="false" />
         <property name="initialStateRetrievalTimeout" value="20000" />
         <property name="syncReplTimeout" value="20000" />
         <property name="lockAcquisitionTimeout" value="5000" />
         <property name="useRegionBasedMarshalling" value="false" />
         <!-- <property name="useInterceptorMbeans" value="false"/> -->
         <property name="clusterProperties" value="${treeCache.clusterProperties}" />
         <property name="serviceName">
           <bean class="javax.management.ObjectName">
             <constructor-arg value="jboss.cache:service=${treeCache.clusterName},name=myapp"/>
           </bean>
         </property>
         <property name="evictionPolicyClass" value="org.jboss.cache.eviction.LRUPolicy"/>
         <property name="maxAgeSeconds" value="${treeCache.eviction.maxAgeSeconds}"/>
         <property name="maxNodes" value="${treeCache.eviction.maxNodes}"/>
         <property name="timeToLiveSeconds" value="${treeCache.eviction.timeToLiveSeconds}"/>
       </bean>
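      Note that the replication queue is disabled in this config. If we were to turn it on, we assume the relevant properties would look something like the sketch below (the interval and size values are placeholders we made up, not recommendations):

```
<property name="useReplQueue" value="true" />
<!-- assumed: flush queued replication messages every 100 ms,
     or as soon as 1000 modifications have queued up -->
<property name="replQueueInterval" value="100" />
<property name="replQueueMaxElements" value="1000" />
```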
      

      JGroups stack:
      treeCache.clusterProperties=UDP(ip_mcast=true;ip_ttl=64;loopback=false;mcast_addr=${treeCache.mcastAddress};mcast_port=${treeCache.mcastPort};mcast_recv_buf_size=80000;mcast_send_buf_size=150000;ucast_recv_buf_size=80000;ucast_send_buf_size=150000;bind_addr=${treeCache.bind_addr}):\
      PING(down_thread=false;num_initial_members=3;timeout=2000;up_thread=false):\
      MERGE2(max_interval=20000;min_interval=10000):\
      FD_SOCK(down_thread=false;up_thread=false):\
      FD(timeout=4000;max_tries=4;down_thread=false;up_thread=false;shun=false):\
      VERIFY_SUSPECT(down_thread=false;timeout=3000;up_thread=false):\
      pbcast.NAKACK(down_thread=false;gc_lag=50;retransmit_timeout=600,1200,2400,4800;up_thread=false):\
      pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):\
      UNICAST(down_thread=false;timeout=600,1200,2400):\
      FRAG(down_thread=false;frag_size=8192;up_thread=false):\
      pbcast.GMS(join_retry_timeout=3000;join_timeout=5000;print_local_addr=true;shun=false):\
      pbcast.STATE_TRANSFER(down_thread=true;up_thread=true)
      treeCache.eviction.maxNodes=5000
      treeCache.eviction.maxAgeSeconds=1000
      treeCache.eviction.timeToLiveSeconds=900
      


      We had an event today where one of the 4 JVMs in the cluster froze for about 30s (the freeze itself is possibly a separate problem we are investigating). Here's a sample of the logging from that event:

      # here is the view: 12 members (4 JVMs with 3 TreeCache instances each)
      APP2 05/01/2007 09:04:45 INFO [TreeCache.java:5431] - viewAccepted(): [10.163.195.24:33030|131] [10.163.195.24:33030, 10.163.195.24:33033, 10.163.195.24:33035, 10.163.195.21:32875, 10.163.195.21:32877, 10.163.195.21:32879, 10.163.195.23:33033, 10.163.195.23:33035, 10.163.195.23:33037, 10.163.195.22:33087, 10.163.195.22:33089, 10.163.195.22:33091]
      ....
      # at 12:05:22, the JVM of APP2 froze for about 30s
      ...
      # APP4 is the coordinator at this point
      APP4 05/01/2007 12:05:50 WARN [GMS.java:413] - failed to collect all ACKs (11) for view [10.163.195.24:33030|132] [10.163.195.24:33030, 10.163.195.24:33033, 10.163.195.24:33035, 10.163.195.21:32875, 10.163.195.21:32877, 10.163.195.21:32879, 10.163.195.23:33033, 10.163.195.23:33035, 10.163.195.23:33037, 10.163.195.22:33089, 10.163.195.22:33091] after 2000ms, missing ACKs from [10.163.195.22:33089, 10.163.195.22:33091] (received=[10.163.195.24:33030, 10.163.195.21:32879, 10.163.195.23:33033, 10.163.195.24:33033, 10.163.195.24:33035, 10.163.195.21:32877, 10.163.195.21:32875, 10.163.195.23:33035, 10.163.195.23:33037]), local_addr=10.163.195.24:33030
      APP2 05/01/2007 12:05:58 WARN [FD.java:250] - I was suspected by 10.163.195.23:33037; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
      APP2 05/01/2007 12:06:00 WARN [GMS.java:478] - I (10.163.195.22:33087) am not a member of view [10.163.195.24:33030|132] [10.163.195.24:33030, 10.163.195.24:33033, 10.163.195.24:33035, 10.163.195.21:32875, 10.163.195.21:32877, 10.163.195.21:32879, 10.163.195.23:33033, 10.163.195.23:33035, 10.163.195.23:33037, 10.163.195.22:33089, 10.163.195.22:33091]; discarding view
      ...
      APP{1,3-12} 2007-01-05 12:13:44,243 WARN [org.jgroups.protocols.pbcast.NAKACK] - <10.163.195.22:33089] discarded message from non-member 10.163.195.22:33087, my view is [10.163.195.24:33030|132] [10.163.195.24:33030, 10.163.195.24:33033, 10.163.195.24:33035, 10.163.195.21:32875, 10.163.195.21:32877, 10.163.195.21:32879, 10.163.195.23:33033, 10.163.195.23:33035, 10.163.195.23:33037, 10.163.195.22:33089, 10.163.195.22:33091]>

      We had to manually kill APP2 and restart it.
      After reading more documentation, we think that shun=true is needed in our use case: if I understand http://wiki.jboss.org/wiki/Wiki.jsp?page=Shunning correctly, this would have caused APP2 to shun itself from the group when it noticed that it was not part of the view, and then rejoin (since TreeCache configures the JChannel to auto-rejoin). Am I right?
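      To make this concrete, we believe the only change needed would be flipping the shun flags on FD and pbcast.GMS in the stack above (a sketch of just those two lines, everything else unchanged):

```
FD(timeout=4000;max_tries=4;down_thread=false;up_thread=false;shun=true):\
pbcast.GMS(join_retry_timeout=3000;join_timeout=5000;print_local_addr=true;shun=true):\
```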

      I have several more questions regarding the settings of our stack:
      * With shun=true, and assuming the "freezes" never exceed 30s, should I configure the timeout and retries on the FD protocol so that the worst-case detection window exceeds 30s (for example timeout=10000;max_tries=4, so that 40s > 30s)?
      * I noticed the useReplQueue option: could you describe the use cases where it might be useful? In the JBossCache distribution, the only example of its use is the large HTTP session replication cluster config.
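      To double-check the arithmetic behind the FD question, here is our own trivial back-of-envelope helper (not a JGroups API): FD only suspects a member after max_tries consecutive missed heartbeats, each waited on for timeout ms, so the worst-case detection window is roughly timeout * max_tries.

```java
// Back-of-envelope check (our own helper, NOT part of JGroups):
// FD suspects a member only after max_tries consecutive heartbeat
// timeouts, so the worst-case detection window is timeout * max_tries.
public class FdWindow {
    static long detectionWindowMs(long timeoutMs, int maxTries) {
        return timeoutMs * maxTries;
    }

    public static void main(String[] args) {
        // Current stack: timeout=4000, max_tries=4 -> 16000 ms,
        // i.e. a 30s freeze gets the member suspected (and shunned).
        System.out.println(detectionWindowMs(4000, 4));
        // Proposed: timeout=10000, max_tries=4 -> 40000 ms,
        // which would ride out a 30s freeze without suspecting.
        System.out.println(detectionWindowMs(10000, 4));
    }
}
```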