5 Replies Latest reply on Jan 11, 2010 5:13 AM by galder.zamarreno

    MVCC leading to deadlocks in synchronous replication mode

    vsevel

      Hi,
      I have an application where there is a high probability that 2 or more client calls will hit my server asking for a data that is not known from the cache running on this server. In that case, the server calculates the data (eg: it fetches from the db), puts it in the cache then returns it: simple standard use case.

      The application is deployed on a cluster of 2 two nodes. clients get load balanced, so there is a high probability that client calls, asking for the same data at the same time, do not all land onto the same server.

      Finally, I want to make sure that when the data is in the cache, it is in the other node as well. That is the reason why I decided to use synchronous replication.

      As far as locking is concerned, I first tried with the PESSIMISTIC mode. I ended up with lots of deadlocks where 2 clients would land on 2 differents nodes, get each a local lock, then finally deadlock each other at commit time. On a simple stress test case, I was able to make fail most (almost all) of the client calls.

      I switched to OPTIMISTIC, and the situation improved. in the same stress the failure ratio went from 95% with PESSIMISTIC to 5 to 10%, with no deadlocks at all (all calls returned in less than 200ms).

      Finally, I decided to give MVCC a try. to my surprise, I found that the behavior I was getting was almost identical to the one with PESSIMISTIC. I linked that to the fact that writers immediately take a lock locally, and will take a remote lock at commit time only.

      This leads me to 2 reflections:
      - Am I missing something obvious in either my configuration or my code? (I am attaching the jboss cache config and java code)
      - If this behavior is expected, and since OPTIMISTIC and PESSIMISTIC are supposed to be removed from the code base at some time, what is then the solution for users that cannot afford those deadlocks?

      Thanks,

      Here is the configuration:



      mvcc_treecache-cfg.xml:

      <?xml version="1.0" encoding="UTF-8"?>
      <jbosscache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:jboss:jbosscache-core:config:3.1">
       <locking nodeLockingScheme="MVCC" isolationLevel="REPEATABLE_READ" lockAcquisitionTimeout="10000" lockParentForChildInsertRemove="false" />
       <transaction transactionManagerLookupClass="org.jboss.cache.transaction.GenericTransactionManagerLookup" />
       <clustering mode="replication" clusterName="mvcc_domain_TreeCache_VSE">
       <sync replTimeout="10000" />
       <stateRetrieval fetchInMemoryState="false" timeout="10000" />
       <jgroupsConfig configFile="jbosscache_pessimistic/mvcc_treecache-cfg-jgroupsConfig.xml" />
       </clustering>
       <eviction wakeUpInterval="3000">
       <default algorithmClass="org.jboss.cache.eviction.LRUAlgorithm">
       <property name="maxNodes" value="5000" />
       <property name="timeToLive" value="-1" />
       </default>
       <region name="/myregion" algorithmClass="org.jboss.cache.eviction.LRUAlgorithm" eventQueueSize="">
       <property name="maxNodes" value="1000" />
       <property name="timeToLive" value="-1" />
       </region>
       <region name="/myregion/Users" algorithmClass="org.jboss.cache.eviction.LRUAlgorithm" eventQueueSize="">
       <property name="maxNodes" value="0" />
       <property name="timeToLive" value="172800000" />
       </region>
       </eviction>
      </jbosscache>
      


      mvcc_treecache-cfg-jgroupsConfig.xml:
      <config>
       <UDP mcast_addr="239.202.87.15" mcast_port="11001" ip_ttl="64"
       ip_mcast="true" mcast_send_buf_size="150000" mcast_recv_buf_size="80000"
       ucast_send_buf_size="150000" ucast_recv_buf_size="80000"
       bind_addr="${jbosscache_bind_addr}" loopback="true" />
       <PING timeout="10000" num_initial_members="3" />
       <MERGE2 min_interval="10000" max_interval="20000" />
       <FD_SOCK />
       <VERIFY_SUSPECT timeout="1500" />
       <pbcast.NAKACK gc_lag="50" retransmit_timeout="600,1200,2400,4800" />
       <UNICAST timeout="600,1200,2400,4800" />
       <pbcast.STABLE desired_avg_gossip="40000" />
       <FRAG frag_size="8192" />
       <pbcast.GMS join_timeout="5000" shun="true"
       print_local_addr="true" />
       <pbcast.STATE_TRANSFER />
      </config>
      


      Here is the stateless session bean impl:
      public class TestBean implements SessionBean {
      
       private static final Logger log = Logger.getLogger(TestBean.class);
      
       private static final long serialVersionUID = -1934030976428012964L;
      
       private static Map<String, Cache<String, String>> cacheMap = new HashMap<String, Cache<String, String>>();
      
       private SessionContext sessionContext;
      
       public static Object lookup(String jndi, String url) throws NamingException {
       Properties properties = new Properties();
       properties.put(INITIAL_CONTEXT_FACTORY, WLInitialContextFactory.class.getName());
       properties.put(PROVIDER_URL, url);
       properties.put(ALLOW_GLOBAL_RESOURCE_LOOKUP, "true");
       properties.put(ALLOW_EXTERNAL_APP_LOOKUP, "true");
       properties.put(RELAX_VERSION_LOOKUP, "true");
       return new InitialContext(properties).lookup(jndi);
       }
      
       static synchronized Cache<String, String> getCache(String cacheName) {
       try {
       Cache<String, String> cache = cacheMap.get(cacheName);
       if (cache != null) return cache;
       log.info("initializing cache " + cacheName);
       DefaultCacheFactory<String, String> factory = new DefaultCacheFactory<String, String>();
       cache = factory.createCache("jbosscache_pessimistic/" + cacheName + "_treecache-cfg.xml");
       cacheMap.put(cacheName, cache);
       log.info("initialized cache " + cacheName);
       return cache;
       } catch (Exception e) {
       log.error(cacheName + " init error", e);
       throw new RuntimeException(e.toString(), e);
       }
       }
      
       static synchronized void closeCaches() {
       for (Map.Entry<String, Cache<String, String>> entry : cacheMap.entrySet()) {
       String cacheName = entry.getKey();
       try {
       log.info("closing cache " + cacheName);
       Cache<String, String> cache = entry.getValue();
       cache.stop();
       cache.destroy();
       } catch (Exception e) {
       log.error("unable to close cache " + cacheName, e);
       }
       }
       }
      
       public String test(String cacheKey, String cacheValue, String cacheName, long waittime) throws InterruptedException {
       log.debug("attempting to get user " + cacheKey);
       Cache<String, String> cache = getCache(cacheName);
       Fqn<String> userFQN = Fqn.fromString("/myregion/Users/" + cacheKey);
      
       if (cacheValue == null) {
       log.debug("removing node " + userFQN);
       cache.removeNode(userFQN);
       log.debug("removed node " + userFQN);
       return null;
       }
      
       Node<String, String> user = cache.getNode(userFQN);
      
       if (user == null) {
       log.debug("missing cached " + cacheKey + " => caching " + cacheKey + "=" + cacheValue);
       user = cache.getRoot().addChild(userFQN);
       if(waittime > 0) Thread.sleep(waittime);
       user.put("value", cacheValue);
       return cacheValue;
       } else {
       String value = user.get("value");
       log.debug("return cached " + cacheKey + "=" + value + "; ignoring passed value: " + cacheValue);
       return value;
       }
       }
      


      And finally, the stress test client:
      public class TestSystem {
      
       private static final Logger log = Logger.getLogger(TestSystem.class);
      
       @Test
       public void testServer() throws Exception {
       TestRemote ejb = createService();
       log.info("calling test ejb");
       ejb.test("myuser3", "toto", "optimistic", 0);
       ejb.remove();
       log.info("done");
       }
      
       @Test
       public void mt() throws Exception {
      
       // final String cacheName = "pessimistic";
       final String cacheName = "optimistic";
       // final String cacheName = "mvcc";
       final int count = 10000;
       int threads = 2;
       final String user = "myuser";
       final boolean stopOnFirstError = false;
       final long waittime = 0;
      
       Assert.assertNull(createService().test(user, null, cacheName, 0));
       Assert.assertEquals("warm up", createService().test(user, "warm up", cacheName, 0));
       Assert.assertNull(createService().test(user, null, cacheName, 0));
      
       log.info("warm up ok");
       Thread.sleep(5000);
       log.info("starting test");
       Thread.sleep(1000);
      
       final List<String> success = Collections.synchronizedList(new ArrayList<String>());
       final List<String> errors = Collections.synchronizedList(new ArrayList<String>());
       final List<Long> responseTime = Collections.synchronizedList(new ArrayList<Long>());
      
       for (int i = 0; i < threads; i++) {
       final int id = i;
      
       new Thread() {
       @Override
       public void run() {
       for (int j = 0; j < count; j++) {
       try {
       if (stopOnFirstError && !errors.isEmpty()) return;
       long start = System.currentTimeMillis();
       String result = createService().test(user, "toto from " + id, cacheName, waittime);
       long time = System.currentTimeMillis() - start;
       responseTime.add(time);
       success.add(result);
       if (time > 1000) log.debug("thread " + id + " test " + j + ": executed in " + time);
       // log.debug("thread " + id + " test " + j + ": " + result);
       } catch (Exception e) {
       String s = TestSystem.toString(e);
       log.debug("thread " + id + " test " + j + ": " + s);
       errors.add(s);
       }
       }
       }
       }.start();
       }
      
       int size = 0;
      
       while ((size = success.size() + errors.size()) != count * threads && !(stopOnFirstError && !errors.isEmpty())) {
       try {
       createService().test(user, null, cacheName, 0);
       if (size % 100 == 0) log.debug("tests executed so far: " + size);
       Thread.sleep(200);
       } catch (Exception e) {
       log.error("unable to remove user: " + toString(e));
       }
       }
      
       Collections.sort(responseTime);
       Long maxTime = responseTime.get(responseTime.size() - 1);
       log.info("done: success=" + success.size() + "; error=" + errors.size() + "; max time=" + maxTime);
       }
      
       static String toString(Exception e) {
       Throwable ex = e;
       while (ex.getCause() != null && ex.getCause() != ex)
       ex = ex.getCause();
       return firstLine(ex.toString()) + "(original=" + firstLine(e.toString()) + ")";
       }
      
       static String firstLine(String s) {
       int index = s.indexOf('\n');
       return index == -1 ? s : s.substring(0, index - 1);
       }
      
       private TestRemote createService() throws NamingException, RemoteException, CreateException {
       // String url = "t3://localhost:7001,localhost:8001";
       String jndi = "com.lodh.bank.ejb.TestRemoteHome";
      
       Object home = TestBean.lookup(jndi, url);
       TestRemoteHome h = (TestRemoteHome) PortableRemoteObject.narrow(home, TestRemoteHome.class);
       return h.create();
       }
      }
      


        • 1. Re: MVCC leading to deadlocks in synchronous replication mod
          spennec

          I'm interested in an answer to this problem, too...

          Aren't JBossCache comitters monitoring this forum anymore? It's been almost two weeks since this problem has been posted, and still no answer...

          If it turns out the situation described by vsevel is actually a bug and not a misuse, then JBossCache has a serious problem...

          • 2. Re: MVCC leading to deadlocks in synchronous replication mod
            spennec

            Looks like the JBoss people are back on the forum :)

            Galder, could you take a look at this topic please? I'd love to hear your thoughts on it...

            Thanks!

            • 3. Re: MVCC leading to deadlocks in synchronous replication mod

              As I understand your scenario is not primary use case for JBossCache. People from Hibernate brought up this issue some time ago and as the result the method putForExternalRead was added:

              /**
               * Under special operating behavior, associates the value with the specified key for a node identified by the Fqn passed in.
               * <ul>
               * <li> Only goes through if the node specified does not exist; no-op otherwise.</i>
               * <li> Force asynchronous mode for replication to prevent any blocking.</li>
               * <li> invalidation does not take place. </li>
               * <li> 0ms lock timeout to prevent any blocking here either. If the lock is not acquired, this method is a no-op, and swallows the timeout exception.</li>
               * <li> Ongoing transactions are suspended before this call, so failures here will not affect any ongoing transactions.</li>
               * <li> Errors and exceptions are 'silent' - logged at a much lower level than normal, and this method does not throw exceptions</li>
               * </ul>
               * This method is for caching data that has an external representation in storage, where, concurrent modification and
               * transactions are not a consideration, and failure to put the data in the cache should be treated as a 'suboptimal outcome'
               * rather than a 'failing outcome'.
               * <p/>
               * An example of when this method is useful is when data is read from, for example, a legacy datastore, and is cached before
               * returning the data to the caller. Subsequent calls would prefer to get the data from the cache and if the data doesn't exist
               * in the cache, fetch again from the legacy datastore.
               * <p/>
               * See <a href="http://jira.jboss.com/jira/browse/JBCACHE-848">JBCACHE-848</a> for details around this feature.
               * <p/>
               *
               * @param fqn <b><i>absolute</i></b> {@link Fqn} to the {@link Node} to be accessed.
               * @param key key with which the specified value is to be associated.
               * @param value value to be associated with the specified key.
               * @throws IllegalStateException if {@link #getCacheStatus()} would not return {@link org.jboss.cache.CacheStatus#STARTED}.
               */
               void putForExternalRead(Fqn fqn, K key, V value);
              



              • 4. Re: MVCC leading to deadlocks in synchronous replication mod
                spennec

                We'll try this out, thanks!

                • 5. Re: MVCC leading to deadlocks in synchronous replication mode
                  galder.zamarreno
                  Deadlocks due to nodes modifying the same data at the same time can always happen as long as locks are acquired at commit time. In Infinispan, there's the option to do eager locking which avoids this type of issues. Eager locking will not be implemented in JBoss Cache since this is maintenance mode.