9 Replies Latest reply on Mar 28, 2006 5:48 PM by akardell

    JBossCache 1.3 Beta 2

    akardell

      We ran a load test against our application after swapping in the JBossCache 1.3 Beta 2 jars, using INVALIDATION_ASYNC as the CacheMode and READ_COMMITTED as the IsolationLevel. Approximately 5 minutes into the test, we start to receive many IdentityLock errors on various objects. Is there anything we can do to help troubleshoot this? Is this to be expected? I thought the invalidation scheme avoided locking altogether, but I may not understand it correctly.

      All of the errors look similar to the following:

      1052265 [PoolThread-47] ERROR org.jboss.cache.lock.IdentityLock - write lock for /com/abc/def/orm/Term/com.abc.def.orm.Term#6 could not be acquired after 0 ms. Locks: Read lock owners: {}
      Write lock owner: PoolThread-63
      (caller=PoolThread-47, lock info: write owner=PoolThread-63 (org.jboss.cache.lock.LockStrategyReadCommitted@b15853))

      Eventually, we start to get read-lock timeouts also, like the following:

      org.hibernate.cache.CacheException: org.jboss.cache.lock.TimeoutException: read lock for /org/hibernate/cache/UpdateTimestampsCache/[Accounts] could not be acquired by PoolThread-86 after 15000 ms. Locks: Read lock owners: {}
      Write lock owner: GlobalTransaction:<10.40.58.13:2603>:1
      , lock info: write owner=GlobalTransaction:<10.40.58.13:2603>:1 (org.jboss.cache.lock.LockStrategyReadCommitted@2d4ce3)

      Do we need new Hibernate jars to make use of the latest JBossCache jars?

      Any other thoughts / strategies to try? I'll help provide whatever information I can.

      Thanks,

      Aaron

        • 1. Re: JBossCache 1.3 Beta 2
          manik

          Hi there - do you still see this when using JBossCache 1.2.4.SP2 with REPL_ASYNC?

          And no, you don't need to upgrade your Hibernate jars as long as you're using Hibernate >= 3.0.2.
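
          For that comparison, the cache mode would be set as follows in the TreeCache mbean configuration (a sketch, assuming the 1.2.4.SP2 jars are swapped in and everything else in the configuration is left unchanged):

          ```xml
          <!-- Asynchronous replication instead of invalidation -->
          <attribute name="CacheMode">REPL_ASYNC</attribute>
          ```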



          • 2. Re: JBossCache 1.3 Beta 2
            akardell

            Under a substantial amount of load, it was not uncommon to get TimeoutExceptions and IdentityLock errors in 1.2.4.SP2 with REPL_ASYNC.

            • 3. Re: JBossCache 1.3 Beta 2
              manik

              Does this change if you set a really high timeout? Threads will block for longer (as expected), but I'd like to see whether this affects anything, since the log message reports a timeout after 0 ms.

              • 4. Re: JBossCache 1.3 Beta 2
                akardell

                Perhaps I'm missing a setting? None of my timeouts are set to 0, as seen below. I can re-run a test, but which timeouts should I increase?

                Thanks!

                Aaron

                <?xml version="1.0" encoding="UTF-8" ?>
                <server>
                
                 <!-- ==================================================================== -->
                 <!-- Defines TreeCache configuration -->
                 <!-- ==================================================================== -->
                 <mbean code="org.jboss.cache.TreeCache" name="jboss.cache:service=TreeCache">
                 <depends>jboss:service=Naming</depends>
                 <depends>jboss:service=TransactionManager</depends>
                
                
                 <!-- Configure the TransactionManager -->
                 <attribute name="TransactionManagerLookupClass">org.jboss.cache.DummyTransactionManagerLookup</attribute>
                
                 <!--
                 Node locking level : SERIALIZABLE
                 REPEATABLE_READ (default)
                 READ_COMMITTED
                 READ_UNCOMMITTED
                 NONE
                 -->
                 <attribute name="IsolationLevel">READ_COMMITTED</attribute>
                
                 <!-- Valid modes are LOCAL
                 REPL_ASYNC
                 REPL_SYNC
                 INVALIDATION_ASYNC (new in 1.3)
                 INVALIDATION_SYNC (new in 1.3)
                 -->
                 <attribute name="CacheMode">INVALIDATION_ASYNC</attribute>
                
                 <!-- Name of cluster. Needs to be the same for all nodes in the cluster, in order
                 to find each other -->
                 <attribute name="ClusterName">TreeCache-Cluster</attribute>
                
                 <attribute name="ClusterConfig">
                 <config>
                 <!-- UDP: if you have a multihomed machine,
                 set the bind_addr attribute to the appropriate NIC IP address
                 -->
                 <!-- UDP: On Windows machines, because of the media sense feature
                 being broken with multicast (even after disabling media sense)
                 set the loopback attribute to true
                 -->
                 <UDP mcast_addr="228.8.8.8" mcast_port="45567" ip_ttl="64" ip_mcast="true"
                 mcast_send_buf_size="150000" mcast_recv_buf_size="80000" ucast_send_buf_size="150000"
                 ucast_recv_buf_size="80000" loopback="true" bind_addr="0.0.0.0" />
                 <PING timeout="2000" num_initial_members="3" up_thread="false" down_thread="false" />
                 <MERGE2 min_interval="10000" max_interval="20000" />
                 <FD shun="true" up_thread="true" down_thread="true" />
                 <VERIFY_SUSPECT timeout="1500" up_thread="false" down_thread="false" />
                 <pbcast.NAKACK gc_lag="50" max_xmit_size="8192" retransmit_timeout="600,1200,2400,4800" up_thread="false"
                 down_thread="false" />
                 <UNICAST timeout="600,1200,2400" window_size="100" min_threshold="10" down_thread="false" />
                 <pbcast.STABLE desired_avg_gossip="20000" up_thread="false" down_thread="false" />
                 <FRAG frag_size="8192" down_thread="false" up_thread="false" />
                 <pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true" />
                 <pbcast.STATE_TRANSFER up_thread="false" down_thread="false" />
                 </config>
                 </attribute>
                
                 <!-- The max amount of time (in milliseconds) we wait until the
                 initial state (ie. the contents of the cache) are retrieved from
                 existing members in a clustered environment
                 -->
                 <attribute name="InitialStateRetrievalTimeout">5000</attribute>
                
                 <!-- Number of milliseconds to wait until all responses for a
                 synchronous call have been received.
                 -->
                 <attribute name="SyncReplTimeout">10000</attribute>
                
                 <!-- Max number of milliseconds to wait for a lock acquisition -->
                 <attribute name="LockAcquisitionTimeout">15000</attribute>
                
                 <!-- Name of the eviction policy class. -->
                 <attribute name="EvictionPolicyClass">org.jboss.cache.eviction.LRUPolicy</attribute>
                
                 <!-- Specific eviction policy configurations. This is LRU -->
                 <attribute name="EvictionPolicyConfig">
                 <config>
                 <attribute name="wakeUpIntervalSeconds">5</attribute>
                 <!-- Cache wide default -->
                 <region name="/_default_">
                 <attribute name="maxNodes">1000</attribute>
                 <attribute name="timeToLiveSeconds">3600</attribute>
                 </region>
                 </config>
                 </attribute>
                
                 </mbean>
                </server>
                



                • 5. Re: JBossCache 1.3 Beta 2
                  manik

                  The lock acquisition timeout (LockAcquisitionTimeout in your config).

                  • 6. Re: JBossCache 1.3 Beta 2
                    akardell

                    I tried increasing the timeout from 15000 to 150000. Similar results.
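
                    For reference, that change is a single attribute in the TreeCache mbean posted earlier in the thread. A sketch, with only the value differing from the posted config:

                    ```xml
                    <!-- Max number of milliseconds to wait for a lock acquisition
                         (raised from 15000 for this test run) -->
                    <attribute name="LockAcquisitionTimeout">150000</attribute>
                    ```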

                    However, I noticed that there's a new option with 1.3, in addition to the new INVALIDATION_ASYNC option...

                     <attribute name="NodeLockingScheme">OPTIMISTIC</attribute>
                    


                    Setting this caused all of the lock exceptions to go away!

                    I'm now getting OutOfMemory errors about 11 minutes into the test, but I still need to confirm the root cause. It may be unrelated to JBossCache -- I'm not sure yet.

                    Thanks for your help.


                    • 7. Re: JBossCache 1.3 Beta 2
                      manik

                      Optimistic locking will always bypass these locking issues, because the very concept of optimistic locking is that node data is copied, rather than locked, for each transaction. The OOMEs are probably due to the extra memory requirements of optimistic locking (additional memory space to copy node data, etc.)
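
                      As a rough sketch, the combination discussed in this thread would look like the following inside the TreeCache mbean. The attribute names and values are the ones quoted earlier in the thread; the comment about IsolationLevel is an assumption based on the JBossCache documentation and is worth verifying against the version in use.

                      ```xml
                      <!-- New in 1.3: node data is copied for each transaction
                           rather than locked -->
                      <attribute name="NodeLockingScheme">OPTIMISTIC</attribute>

                      <!-- Assumption: this setting is ignored when
                           NodeLockingScheme is OPTIMISTIC -->
                      <attribute name="IsolationLevel">READ_COMMITTED</attribute>

                      <attribute name="CacheMode">INVALIDATION_ASYNC</attribute>
                      ```

                      The trade-off is the one described above: no lock contention, at the cost of extra memory for the copied node data.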

                      • 8. Re: JBossCache 1.3 Beta 2
                        manik

                        Aaron, this problem seems to be specific to use with Hibernate. Which version of Hibernate is this tested against?

                        Also, what do you do in your load test? Do you start transactions on the same objects to induce concurrency?

                        Cheers,
                        Manik

                        • 9. Re: JBossCache 1.3 Beta 2
                          akardell

                          Hi Manik,

                          We are using Hibernate 3.0.5. The load test isn't specific to JBossCache -- it is a full load test of our application; we aren't going out of our way to create transactions on the same object to induce concurrency, but it probably happens as a 'side effect' of our test.

                          As best we can tell, JBossCache 1.3 is the source of the out-of-memory errors in some way, shape, or form -- if we swap in the 1.2 jars, we don't get them. I am now trying to use a profiler to help identify the source of the memory leak. Our objects are small enough that I don't think copying nodes would be enough to cause out-of-memory errors unless those nodes are being retained indefinitely.

                          I'll help out however I can here -- hopefully our use of a profiler will highlight the problem area(s).

                          Thanks,

                          Aaron