1 Reply Latest reply on Oct 31, 2007 11:45 AM by manik

    Distributed cache crash after four days of operation

    xxxz

      Hi,

      We are using JBoss Cache in a cluster of two machines. The caches in the cluster propagate asynchronous invalidations.
      After four days in our production environment we observed strange behavior: one cache instance sent 8 MB of state and the other cache instance crashed. We suspect the second instance crashed because of the large amount of data the first one sent. What is weird is the first instance sending 8 MB of state at all. How is this possible when we only use invalidation?
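      For context, we use the cache through the standard standalone TreeCache API. A simplified sketch (not our actual production code; the Fqn, keys, and config file name are made up for illustration):

      import org.jboss.cache.PropertyConfigurator;
      import org.jboss.cache.TreeCache;

      public class CacheClient {
          public static void main(String[] args) throws Exception {
              // Standard standalone TreeCache setup: load the XML config
              // (like the one below), then start the cache service.
              TreeCache cache = new TreeCache();
              new PropertyConfigurator().configure(cache, "cache-service.xml");
              cache.startService();

              // With CacheMode=INVALIDATION_ASYNC, the data written here stays
              // in this node's memory; only an invalidation message for
              // /orders/42 goes out to the other node.
              cache.put("/orders/42", "status", "SHIPPED");
              Object status = cache.get("/orders/42", "status"); // local read

              cache.stopService();
          }
      }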

      The logs from the two machines:

      MACHINE A

      [10/29/07 13:26:39:745 CET] 00000029 TreeCache I org.jboss.cache.TreeCache viewAccepted viewAccepted(): [192.168.200.33:4045|5] [192.168.200.33:4045, 192.168.200.32:1211]
      [10/29/07 13:26:46:917 CET] 00000029 StateTransfer I org.jboss.cache.statetransfer.StateTransferGenerator_140 generateStateTransfer returning the state for tree rooted in /(8388608 bytes)
      


      MACHINE B
      [10/29/07 13:26:38:836 CET] 00000079 TreeCache I org.jboss.cache.TreeCache viewAccepted viewAccepted(): [192.168.200.33:4045|5] [192.168.200.33:4045, 192.168.200.32:1211]
      [10/29/07 13:26:39:945 CET] 00000075 JChannel I org.jgroups.JChannel$CloserThread run fetching the state (auto_getstate=true)
      [10/29/07 13:26:44:961 CET] 00000075 JChannel I org.jgroups.JChannel$CloserThread run state transfer failed
      [10/29/07 13:26:59:788 CET] 00000078 STATE_TRANSFE W org.jgroups.protocols.pbcast.STATE_TRANSFER handleViewChange discovered that the state provider (192.168.200.33:4045) crashed; will return null state to application
      [10/29/07 13:26:59:788 CET] 00000078 STATE_TRANSFE W org.jgroups.protocols.pbcast.STATE_TRANSFER handleStateRsp digest received from 192.168.200.32:1211 is null, skipping setting digest !
      [10/29/07 13:26:59:788 CET] 00000078 STATE_TRANSFE W org.jgroups.protocols.pbcast.STATE_TRANSFER handleStateRsp state received from 192.168.200.32:1211 is null, will return null state to application
      [10/29/07 13:26:59:788 CET] 00000078 TreeCache I org.jboss.cache.TreeCache viewAccepted viewAccepted(): [192.168.200.32:1211|6] [192.168.200.32:1211]
      


      Cache configuration:

      <server>
       <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar" />
       <mbean code="org.jboss.cache.TreeCache" name="jboss.cache:service=ISRTreeCache">
       <attribute name="TransactionManagerLookupClass">org.jboss.cache.GenericTransactionManagerLookup</attribute>
      
       <!-- depends>jboss:service=Naming</depends>
       <depends>jboss:service=TransactionManager</depends -->
      
       <!--
       Node locking scheme :
       PESSIMISTIC (default)
       OPTIMISTIC
       -->
       <attribute name="NodeLockingScheme">OPTIMISTIC</attribute>
       <!--
       Node locking isolation level :
       SERIALIZABLE
       REPEATABLE_READ (default)
       READ_COMMITTED
       READ_UNCOMMITTED
       NONE
       (ignored if NodeLockingScheme is OPTIMISTIC)
       -->
       <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
       <!-- Lock parent before doing node additions/removes -->
       <attribute name="LockParentForChildInsertRemove">true</attribute>
       <!-- Valid modes are LOCAL
       REPL_ASYNC
       REPL_SYNC
       INVALIDATION_ASYNC
       INVALIDATION_SYNC
       -->
       <attribute name="CacheMode">INVALIDATION_ASYNC</attribute>
       <!-- Name of cluster. Needs to be the same for all TreeCache nodes in a
       cluster, in order to find each other -->
       <attribute name="ClusterName">ISR</attribute>
       <!-- Whether each interceptor should have an mbean
       registered to capture and display its statistics. -->
       <attribute name="UseInterceptorMbeans">false</attribute>
      
       <attribute name="ClusterConfig">
       <config>
       <!-- UDP: if you have a multihomed machine,
       set the bind_addr attribute to the appropriate NIC IP address
       bind_addr="192.168.200.32"
       -->
       <!-- UDP: On Windows machines, because of the media sense feature
       being broken with multicast (even after disabling media sense)
       set the loopback attribute to true
       -->
       <UDP mcast_port="45454" mcast_addr="228.1.2.3" tos="16"
       ucast_recv_buf_size="20000000" ucast_send_buf_size="640000"
       mcast_recv_buf_size="25000000" mcast_send_buf_size="640000"
       loopback="true" discard_incompatible_packets="true"
       max_bundle_size="10000" max_bundle_timeout="30"
       use_incoming_packet_handler="true"
       use_outgoing_packet_handler="false" ip_ttl="2"
       enable_diagnostics="false" down_thread="false" up_thread="false"
       enable_bundling="true" />
       <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false" />
       <MERGE2 min_interval="10000" max_interval="20000" />
       <FD shun="true" up_thread="true" down_thread="true" />
       <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false" />
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192" up_thread="false"
       down_thread="false" />
       <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false" />
       <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false" />
       <FRAG frag_size="8192" down_thread="false" up_thread="false" />
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true" />
       <pbcast.STATE_TRANSFER up_thread="false" down_thread="false" />
       </config>
      
       </attribute>
      
       ...
       ...
      
       </mbean>
      </server>
      


      JBoss Cache version: 1.4.1.SP4
      JGroups version: 2.4.1

      Does anyone have any idea what is going on? Any help is appreciated.

      martin

        • 1. Re: Distributed cache crash after four days of operation
          manik

          By the look of things, it isn't the receiver that crashed but the machine generating state.

          That much state being around is entirely possible regardless of your cache mode. Invalidation only controls what crosses the wire on writes; each instance still accumulates its own copy of the data in memory, so just because you use invalidation doesn't mean that much state cannot exist in memory. That in-memory tree is what gets shipped when a joining member requests state.
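          As an aside (a suggestion, not a fix for the crash): if you don't want a joining node to pull that in-memory state at all, 1.4.x lets you switch off the initial state transfer. A minimal sketch, assuming the FetchInMemoryState attribute applies to your setup:

          // Sketch only: start the cache without pulling in-memory state from
          // the coordinator. Equivalent to setting
          // <attribute name="FetchInMemoryState">false</attribute> in the config.
          TreeCache cache = new TreeCache();
          new PropertyConfigurator().configure(cache, "cache-service.xml");
          cache.setFetchInMemoryState(false); // must be set before startService()
          cache.startService(); // joiner now starts empty; no 8 MB transfer

          Whether that is acceptable depends on whether you can live with a node starting cold with an empty cache.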

          Is this easily reproducible? How does machine A "crash"? Does the JVM exit?