0 Replies Latest reply on Sep 25, 2007 7:09 PM by ranous

    Deadlock problem with JBossCache

    ranous

      I'm using JBossCache 1.4.1SP4 embedded into a server based application (ie, not using JBoss App Server). We recently saw a problem that lead to all updates to the cache to fail.

      We're running two servers with each server running JBossCache with replication and each server has their own cacheloader configured for a local cache loader using the BdbjeCacheLoader.

      Everything was running fine for some time with many updates when we got an exception when trying to commit a transaction:

       java.lang.RuntimeException:
       at org.jboss.cache.interceptors.TxInterceptor$LocalSynchronizationHandler.beforeCompletion(TxInterceptor.java:1173)
       at org.jboss.cache.interceptors.OrderedSynchronizationHandler.beforeCompletion(OrderedSynchronizationHandler.java:75)
       at org.objectweb.jotm.SubCoordinator.doBeforeCompletion(SubCoordinator.java:1487)
       at org.objectweb.jotm.SubCoordinator.commit_one_phase(SubCoordinator.java:416)
       at org.objectweb.jotm.TransactionImpl.commit(TransactionImpl.java:239)
       at org.objectweb.jotm.Current.commit(Current.java:488)
       at danger.service.tpns.SubscriptionManager.subscribe(SubscriptionManager.java:198)
       at danger.service.tpns.SubscriptionManager.subscribeDevice(SubscriptionManager.java:150)
       at danger.service.tpns.ThirdPartyCalls.subscribe(ThirdPartyCalls.java:90)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:585)
       at org.apache.xmlrpc.Invoker.execute(Unknown Source)
       at org.apache.xmlrpc.XmlRpcWorker.invokeHandler(Unknown Source)
       at org.apache.xmlrpc.XmlRpcWorker.execute(Unknown Source)
       at org.apache.xmlrpc.XmlRpcServer.execute(Unknown Source)
       at org.apache.xmlrpc.XmlRpcServer.execute(Unknown Source)
       at org.apache.xmlrpc.WebServer$Connection.run(Unknown Source)
       at org.apache.xmlrpc.WebServer$Runner.run(Unknown Source)
       at java.lang.Thread.run(Thread.java:595)
       Caused by: org.jboss.cache.ReplicationException: rsp=sender=10.253.229.164:44284, retval=null, received=false, suspected=false
       at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4436)
       at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4358)
       at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4469)
       at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:110)
       at org.jboss.cache.interceptors.BaseRpcInterceptor.replicateCall(BaseRpcInterceptor.java:88)
       at org.jboss.cache.interceptors.ReplicationInterceptor.runPreparePhase(ReplicationInterceptor.java:152)
       at org.jboss.cache.interceptors.ReplicationInterceptor.invoke(ReplicationInterceptor.java:69)
       at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
       at org.jboss.cache.interceptors.CacheStoreInterceptor.invoke(CacheStoreInterceptor.java:136)
       at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
       at org.jboss.cache.interceptors.TxInterceptor.runPreparePhase(TxInterceptor.java:873)
       at org.jboss.cache.interceptors.TxInterceptor$LocalSynchronizationHandler.beforeCompletion(TxInterceptor.java:1147)
       ... 20 more
       Caused by: org.jboss.cache.lock.TimeoutException: Response timed out: sender=10.253.229.164:44284, retval=null, received=false, suspected=false
       at org.jboss.cache.TreeCache.callRemoteMethods(TreeCache.java:4434)
       ... 31 more
      


      Since that exception, all attempts to update the cache in either server replica fails with the following:

      Cannot add subscription to cache
       java.lang.RuntimeException:
       at org.jboss.cache.interceptors.TxInterceptor$LocalSynchronizationHandler.beforeCompletion(TxInterceptor.java:1173)
       at org.jboss.cache.interceptors.OrderedSynchronizationHandler.beforeCompletion(OrderedSynchronizationHandler.java:75)
       at org.objectweb.jotm.SubCoordinator.doBeforeCompletion(SubCoordinator.java:1487)
       at org.objectweb.jotm.SubCoordinator.commit_one_phase(SubCoordinator.java:416)
       at org.objectweb.jotm.TransactionImpl.commit(TransactionImpl.java:239)
       at org.objectweb.jotm.Current.commit(Current.java:488)
       at danger.service.tpns.SubscriptionManager.subscribe(SubscriptionManager.java:198)
       at danger.service.tpns.SubscriptionManager.subscribeDevice(SubscriptionManager.java:150)
       at danger.service.tpns.ThirdPartyCalls.subscribe(ThirdPartyCalls.java:90)
       at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:585)
       at org.apache.xmlrpc.Invoker.execute(Unknown Source)
       at org.apache.xmlrpc.XmlRpcWorker.invokeHandler(Unknown Source)
       at org.apache.xmlrpc.XmlRpcWorker.execute(Unknown Source)
       at org.apache.xmlrpc.XmlRpcServer.execute(Unknown Source)
       at org.apache.xmlrpc.XmlRpcServer.execute(Unknown Source)
       at org.apache.xmlrpc.WebServer$Connection.run(Unknown Source)
       at org.apache.xmlrpc.WebServer$Runner.run(Unknown Source)
       at java.lang.Thread.run(Thread.java:595)
       Caused by: com.sleepycat.je.DeadlockException: (JE 3.2.13) Lock expired. Locker 716_Thread-7_Txn: waited for lock on database=tpns node=71 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1190284084053 endTime=1190284084555
       Owners: [<LockInfo locker="645_UpHandler (STATE_TRANSFER)_Txn" type="WRITE"/>]
       Waiters: []
      
       at com.sleepycat.je.txn.LockManager.lock(LockManager.java:266)
       at com.sleepycat.je.txn.Txn.lockInternal(Txn.java:296)
       at com.sleepycat.je.txn.Locker.lock(Locker.java:257)
       at com.sleepycat.je.dbi.CursorImpl.lockLNDeletedAllowed(CursorImpl.java:2349)
       at com.sleepycat.je.tree.Tree.insert(Tree.java:2575)
       at com.sleepycat.je.dbi.CursorImpl.putLN(CursorImpl.java:988)
       at com.sleepycat.je.dbi.CursorImpl.putNoOverwrite(CursorImpl.java:1039)
       at com.sleepycat.je.Cursor.putAllowPhantoms(Cursor.java:863)
       at com.sleepycat.je.Cursor.putNoNotify(Cursor.java:798)
       at com.sleepycat.je.Cursor.putInternal(Cursor.java:752)
       at com.sleepycat.je.Cursor.putNoOverwrite(Cursor.java:342)
       at org.jboss.cache.loader.bdbje.BdbjeCacheLoader.createParentNodes(BdbjeCacheLoader.java:555)
       at org.jboss.cache.loader.bdbje.BdbjeCacheLoader.doPut(BdbjeCacheLoader.java:392)
       at org.jboss.cache.loader.bdbje.BdbjeCacheLoader.doPut(BdbjeCacheLoader.java:519)
       at org.jboss.cache.loader.bdbje.BdbjeCacheLoader.performTransaction(BdbjeCacheLoader.java:764)
       at org.jboss.cache.loader.bdbje.BdbjeCacheLoader.prepare(BdbjeCacheLoader.java:714)
       at org.jboss.cache.loader.AsyncCacheLoader.prepare(AsyncCacheLoader.java:262)
       at org.jboss.cache.interceptors.CacheStoreInterceptor.prepareCacheLoader(CacheStoreInterceptor.java:258)
       at org.jboss.cache.interceptors.CacheStoreInterceptor.invoke(CacheStoreInterceptor.java:131)
       at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:68)
       at org.jboss.cache.interceptors.TxInterceptor.runPreparePhase(TxInterceptor.java:873)
       at org.jboss.cache.interceptors.TxInterceptor$LocalSynchronizationHandler.beforeCompletion(TxInterceptor.java:1147)
       ... 19 more
      

      The cache configuration I'm using is as follows

      <?xml version="1.0" encoding="UTF-8"?>
      <server>
       <classpath codebase="/danger/local/lib" archives="jboss-cache.jar,jgroups-all.jar" />
       <mbean code="org.jboss.cache.aop.PojoCache" name="jboss.cache:service=PojoCache">
      
       <depends>jboss:service=Naming</depends>
       <depends>jboss:service=TransactionManager</depends>
       <attribute name="TransactionManagerLookupClass">danger.service.cluster.JotmTransactionManagerLookup</attribute>
       <attribute name="IsolationLevel">REPEATABLE_READ</attribute>
       <attribute name="CacheMode">REPL_SYNC</attribute>
       <attribute name="UseReplQueue">false</attribute>
       <attribute name="ReplQueueInterval">0</attribute>
       <attribute name="ReplQueueMaxElements">0</attribute>
       <attribute name="ClusterName">tpns</attribute>
      
       <attribute name="ClusterConfig">
       <config>
       <UDP mcast_addr="228.8.8.7" mcast_port="45567" ip_ttl="64" ip_mcast="true"
       mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
       ucast_send_buf_size="150000" ucast_recv_buf_size="80000" loopback="false" />
       <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false" />
       <MERGE2 min_interval="10000" max_interval="20000" />
       <!-- <FD shun="true" up_thread="true" down_thread="true" />-->
       <FD_SOCK />
       <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false" />
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" max_xmit_size="8192"
       up_thread="false" down_thread="false" />
       <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false" />
       <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false" />
       <FRAG frag_size="8192" down_thread="false" up_thread="false" />
       <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true" />
       <pbcast.STATE_TRANSFER up_thread="true" down_thread="true" />
       </config>
       </attribute>
      
       <attribute name="FetchStateOnStartup">true</attribute>
       <attribute name="InitialStateRetrievalTimeout">5000</attribute>
       <attribute name="SyncReplTimeout">10000</attribute>
       <attribute name="LockAcquisitionTimeout">15000</attribute>
       <attribute name="EvictionPolicyClass">org.jboss.cache.aop.eviction.AopLRUPolicy</attribute>
       <attribute name="EvictionPolicyConfig">
       <config>
       <attribute name="wakeUpIntervalSeconds">5</attribute>
       <region name="/_default_">
       <attribute name="maxNodes">0</attribute>
       <attribute name="timeToLiveSeconds">0</attribute>
       </region>
       </config>
       </attribute>
      
       <attribute name="CacheLoaderConfiguration">
       <config>
       <passivation>false</passivation>
       <preload>/</preload>
       <shared>false</shared>
      
       <cacheloader>
       <class>org.jboss.cache.loader.bdbje.BdbjeCacheLoader</class>
       <properties>location=/danger/tpns</properties>
       <async>false</async>
       <fetchPersistentState>true</fetchPersistentState>
       <ignoreModifications>false</ignoreModifications>
       <purgeOnStartup>false</purgeOnStartup>
       </cacheloader>
       </config>
       </attribute>
       </mbean>
      </server>
      



      It seems to be the first failure which happen when attempting replication cause some lock to be obtained in the BDBJE database and never released. Restarting both server seems to have cleared things up, but I'm very nervous about it happening again. Anyone have any ideas as to what might be going on or how to fix it?