1 Reply Latest reply on Oct 31, 2007 11:45 AM by manik

    Distributed cache crash after four days of operation

    xxxz

      Hi,

      We are using JBoss Cache in a cluster of two machines. The caches in the cluster propagate asynchronous invalidations.
      After four days in our production environment we observed strange behavior: one cache instance sent 8 MB of state and the other cache instance crashed. We suspect the second instance crashed because of the large amount of data the first one sent. What is weird is the first instance sending 8 MB of state at all. How is this possible when we only use invalidation?
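      For context, we use the cache through the standard standalone TreeCache API. A simplified sketch (not our actual production code; the Fqn, keys, and config file name are made up for illustration):

      import org.jboss.cache.PropertyConfigurator;
      import org.jboss.cache.TreeCache;

      public class CacheClient {
          public static void main(String[] args) throws Exception {
              // Standard standalone TreeCache setup: load the XML config
              // (like the one below), then start the cache service.
              TreeCache cache = new TreeCache();
              new PropertyConfigurator().configure(cache, "cache-service.xml");
              cache.startService();

              // With CacheMode=INVALIDATION_ASYNC, the data written here stays
              // in this node's memory; only an invalidation message for
              // /orders/42 goes out to the other node.
              cache.put("/orders/42", "status", "SHIPPED");
              Object status = cache.get("/orders/42", "status"); // local read

              cache.stopService();
          }
      }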

      The logs from the two machines:

      MACHINE A

      [10/29/07 13:26:39:745 CET] 00000029 TreeCache I org.jboss.cache.TreeCache viewAccepted viewAccepted(): [192.168.200.33:4045|5] [192.168.200.33:4045, 192.168.200.32:1211]
      [10/29/07 13:26:46:917 CET] 00000029 StateTransfer I org.jboss.cache.statetransfer.StateTransferGenerator_140 generateStateTransfer returning the state for tree rooted in /(8388608 bytes)
      


      MACHINE B
      [10/29/07 13:26:38:836 CET] 00000079 TreeCache I org.jboss.cache.TreeCache viewAccepted viewAccepted(): [192.168.200.33:4045|5] [192.168.200.33:4045, 192.168.200.32:1211]
      [10/29/07 13:26:39:945 CET] 00000075 JChannel I org.jgroups.JChannel$CloserThread run fetching the state (auto_getstate=true)
      [10/29/07 13:26:44:961 CET] 00000075 JChannel I org.jgroups.JChannel$CloserThread run state transfer failed
      [10/29/07 13:26:59:788 CET] 00000078 STATE_TRANSFE W org.jgroups.protocols.pbcast.STATE_TRANSFER handleViewChange discovered that the state provider (192.168.200.33:4045) crashed; will return null state to application
      [10/29/07 13:26:59:788 CET] 00000078 STATE_TRANSFE W org.jgroups.protocols.pbcast.STATE_TRANSFER handleStateRsp digest received from 192.168.200.32:1211 is null, skipping setting digest !
      [10/29/07 13:26:59:788 CET] 00000078 STATE_TRANSFE W org.jgroups.protocols.pbcast.STATE_TRANSFER handleStateRsp state received from 192.168.200.32:1211 is null, will return null state to application
      [10/29/07 13:26:59:788 CET] 00000078 TreeCache I org.jboss.cache.TreeCache viewAccepted viewAccepted(): [192.168.200.32:1211|6] [192.168.200.32:1211]
      


      Cache configuration:

      <server>
       <classpath codebase="./lib" archives="jboss-cache.jar, jgroups.jar" />
       <mbean code="org.jboss.cache.TreeCache" name="jboss.cache:service=ISRTreeCache">
       <attribute name="TransactionManagerLookupClass">org.jboss.cache.GenericTransactionManagerLookup</attribute>
      
       <!-- depends>jboss:service=Naming</depends>
       <depends>jboss:service=TransactionManager</depends -->
      
       <!--
       Node locking scheme :
       PESSIMISTIC (default)
       OPTIMISTIC
       -->
       <attribute name="NodeLockingScheme">OPTIMISTIC</attribute>
       <!--
       Node locking isolation level :
       SERIALIZABLE
       REPEATABLE_READ (default)
       READ_COMMITTED
       READ_UNCOMMITTED
       NONE
       (ignored if NodeLockingScheme is OPTIMISTIC)
       -->
       <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
       <!-- Lock parent before doing node additions/removes -->
       <attribute name="LockParentForChildInsertRemove">true</attribute>
       <!-- Valid modes are LOCAL
       REPL_ASYNC
       REPL_SYNC
       INVALIDATION_ASYNC
       INVALIDATION_SYNC
       -->
       <attribute name="CacheMode">INVALIDATION_ASYNC</attribute>
       <!-- Name of cluster. Needs to be the same for all TreeCache nodes in a
       cluster, in order to find each other -->
       <attribute name="ClusterName">ISR</attribute>
       <!-- Whether each interceptor should have an mbean
       registered to capture and display its statistics. -->
       <attribute name="UseInterceptorMbeans">false</attribute>
      
       <attribute name="ClusterConfig">
       <config>
       <!-- UDP: if you have a multihomed machine,
       set the bind_addr attribute to the appropriate NIC IP address
       bind_addr="192.168.200.32"
       -->
       <!-- UDP: On Windows machines, because of the media sense feature
       being broken with multicast (even after disabling media sense)
       set the loopback attribute to true
       -->
       <UDP mcast_port="45454" mcast_addr="228.1.2.3" tos="16"
       ucast_recv_buf_size="20000000" ucast_send_buf_size="640000"
       mcast_recv_buf_size="25000000" mcast_send_buf_size="640000"
       loopback="true" discard_incompatible_packets="true"
       max_bundle_size="10000" max_bundle_timeout="30"
       use_incoming_packet_handler="true"
       use_outgoing_packet_handler="false" ip_ttl="2"
       enable_diagnostics="false" down_thread="false" up_thread="false"
       enable_bundling="true" />
       <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false" />
       <MERGE2 min_interval="10000" max_interval="20000" />
       <FD shun="true" up_thread="true" down_thread="true" />
       <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false" />
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192" up_thread="false"
       down_thread="false" />
       <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false" />
       <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false" />
       <FRAG frag_size="8192" down_thread="false" up_thread="false" />
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true" />
       <pbcast.STATE_TRANSFER up_thread="false" down_thread="false" />
       </config>
      
       </attribute>
      
       ...
       ...
      
       </mbean>
      </server>
      


      JBoss Cache version: 1.4.1.SP4
      JGroups version: 2.4.1

      Does anyone have any idea what is going on? Any help is appreciated.

      martin

        • 1. Re: Distributed cache crash after four days of operation
          manik

          By the look of things, it isn't the receiver that crashed but the machine generating state.

          That much state being around is entirely possible regardless of your cache mode. Invalidation only controls what crosses the wire on writes; each instance still accumulates its own copy of the data in memory, so just because you use invalidation doesn't mean that much state cannot exist in memory. That in-memory tree is what gets shipped when a joining member requests state.
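          As an aside (a suggestion, not a fix for the crash): if you don't want a joining node to pull that in-memory state at all, 1.4.x lets you switch off the initial state transfer. A minimal sketch, assuming the FetchInMemoryState attribute applies to your setup:

          // Sketch only: start the cache without pulling in-memory state from
          // the coordinator. Equivalent to setting
          // <attribute name="FetchInMemoryState">false</attribute> in the config.
          TreeCache cache = new TreeCache();
          new PropertyConfigurator().configure(cache, "cache-service.xml");
          cache.setFetchInMemoryState(false); // must be set before startService()
          cache.startService(); // joiner now starts empty; no 8 MB transfer

          Whether that is acceptable depends on whether you can live with a node starting cold with an empty cache.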

          Is this easily reproducible? How does machine A "crash"? Does the JVM exit?