1 Reply Latest reply on Nov 18, 2005 7:43 AM by belaban

    JBossCache intermittent errors

      We are using JBossCache 1.2.4 in a highly concurrent IM application. We have four applications, each on a different machine, and none of them runs inside JBoss AS. Cache activity on these machines is quite high.

      The cache usually holds around 40,000 entries.

      The problem is that from time to time the cluster seems to lose synchronization, and we get many errors like the following:

      2005-11-18 11:21:12 WARN GMS: checkSelfInclusion() failed, 192.168.100.84:49338 is not a member of view [192.168.100.86:53254|80] [192.168.100.86:53254, 192.168.100.82:50501, 192.168.100.79:56495]; discarding view
      2005-11-18 11:21:12 WARN GMS: I (192.168.100.84:49338) am being shunned, will leave and rejoin group (prev_members are [192.168.100.86:53254 192.168.100.82:50501 192.168.100.84:49338 192.168.100.79:56495 ])
      2005-11-18 11:21:12 WARN NAKACK: 192.168.100.84:49338] discarded message from non-member 192.168.100.84:49338

      -------------------------------------------------------
      GMS: address is 192.168.100.84:53549
      -------------------------------------------------------
      2005-11-18 11:21:22 WARN STABLE: ResumeTask resumed message garbage collection - this should be done by a RESUME_STABLE event; check why this event was not received (or increase max_suspend_time for large state transfers)
      2005-11-18 11:22:37 ERROR NAKACK: (requester=192.168.100.84:53549, local_addr=192.168.100.84:53549) message with seqno=0 not found in sent_msgs ! sent_msgs=[11 - 151]

      On another machine:

      WARN UNICAST: [192.168.100.82:55352] seqno 63829 from 192.168.100.84:50296 is not tagged as the first message sent by 192.168.100.84:50296; however, the table for received messages from 192.168.100.84:50296 is still null ! We probably haven't received the first message from 192.168.100.84:50296 ! Discarding last pid: 14515;

      org.jboss.cache.ReplicationException: rsp=sender=192.168.100.84:53561, retval=null, received=false, suspected=true

      2005-11-18 11:46:27 org.jboss.cache.ReplicationException: rsp=sender=192.168.100.86:50307, retval=null, received=false, suspected=false
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:3505)
      at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:3526)
      at org.jboss.cache.interceptors.ReplicationInterceptor.handleReplicatedMethod(ReplicationInterceptor.java:122)
      at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:87)
      at org.jboss.cache.TreeCache.invokeMethod(TreeCache.java:4339)
      ...


      The TreeCache configuration is as follows:


      <?xml version="1.0" encoding="UTF-8"?>
      <server>

        <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar" />

        <mbean code="org.jboss.cache.TreeCache"
               name="jboss.cache:service=TreeCache">

          <depends>jboss:service=Naming</depends>
          <depends>jboss:service=TransactionManager</depends>

          <attribute name="IsolationLevel">READ_UNCOMMITTED</attribute>

          <attribute name="CacheMode">REPL_ASYNC</attribute>

          <attribute name="UseReplQueue">false</attribute>
          <attribute name="ClusterName">standalone-cache</attribute>

          <attribute name="ClusterConfig">
            <config>
              <UDP bind_addr="192.168.100.84" mcast_addr="228.1.2.3"
                   mcast_port="54444" ip_ttl="2" ip_mcast="true"
                   mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
                   ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
                   loopback="false" />
              <PING timeout="2000" num_initial_members="3"
                    up_thread="false" down_thread="false" />
              <MERGE2 min_interval="10000" max_interval="20000" />
              <FD shun="true" up_thread="true" down_thread="true" />
              <VERIFY_SUSPECT timeout="1500" up_thread="false"
                              down_thread="false" />
              <pbcast.NAKACK gc_lag="50" max_xmit_size="8192"
                             retransmit_timeout="600,1200,2400,4800" up_thread="false"
                             down_thread="false" />
              <UNICAST timeout="600,1200,2400" window_size="100"
                       min_threshold="10" down_thread="false" />
              <pbcast.STABLE desired_avg_gossip="20000"
                             up_thread="false" down_thread="false" />
              <FRAG frag_size="1024" down_thread="false"
                    up_thread="false" />
              <pbcast.GMS join_timeout="5000"
                          join_retry_timeout="2000" shun="true" print_local_addr="true" />
              <pbcast.STATE_TRANSFER up_thread="false"
                                     down_thread="false" />
            </config>
          </attribute>

          <attribute name="FetchStateOnStartup">true</attribute>

          <attribute name="InitialStateRetrievalTimeout">60000</attribute>
          <attribute name="SyncReplTimeout">2000</attribute>
          <attribute name="LockAcquisitionTimeout">10000</attribute>
          <attribute name="UseMarshalling">true</attribute>
          <attribute name="InactiveOnStartup">false</attribute>
          <attribute name="EvictionPolicyClass">
            org.jboss.cache.eviction.LRUPolicy
          </attribute>

          <attribute name="EvictionPolicyConfig">
            <config>
              <attribute name="wakeUpIntervalSeconds">10</attribute>
              <region name="/mxit/proxies">
                <attribute name="maxNodes">50</attribute>
                <attribute name="timeToLiveSeconds">20</attribute>
              </region>
              <region name="/mxit/clients">
                <attribute name="maxNodes">40000</attribute>
                <attribute name="timeToLiveSeconds">43200</attribute>
              </region>
              <region name="/_default_">
                <attribute name="maxNodes">500</attribute>
                <attribute name="timeToLiveSeconds">90</attribute>
              </region>
            </config>
          </attribute>
        </mbean>
      </server>
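
      One thing that stands out in the ClusterConfig above: the FD element sets neither timeout nor max_tries, so failure detection runs on whatever the JGroups defaults are. For comparison, a variant with both values spelled out might look like the sketch below. timeout and max_tries are standard FD properties, but the numbers are illustrative assumptions on my part, not tuned or tested values.

```xml
<!-- Sketch only: 10000 and 5 are illustrative assumptions, not tested recommendations -->
<FD timeout="10000" max_tries="5" shun="true"
    up_thread="true" down_thread="true" />
```

      As I understand it, a member would then be suspected only after roughly timeout x max_tries milliseconds without a heartbeat response, which trades slower detection of real failures for fewer false suspicions under heavy load.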
      


      The nodes usually recover on their own after some time, but until they do the entire system is unstable, which is not acceptable for a highly concurrent application.

      Any help would be appreciated!