4 Replies Latest reply on Nov 15, 2004 3:02 AM by belaban

    Transactions/Locking Across VMs Issues

    jiwils

      I have been attempting to get transactional access to a node while inserting key/value pairs into the node's map. However, no matter what I do, I cannot get this to work.

      I have three VMs running on a multiprocessor box, each containing an instance of a cache, and all three reading the same cache configuration file. VMs 1 and 2 run client code that puts a key/value pair into the node's map. Both use the same key, so with transactions I would like one of the clients to successfully "put" the key/value pair, and the other VM to recognize that the value has already been "put". VMs 1 and 2 are started as close together as possible to ensure that transactional locking actually occurs.

      The code for VMs 1 and 2 is as follows:

       _cache.startService();

       // Sleep to ensure that cache start is finished (does no good).
       System.out.println("Sleeping...");
       Thread.sleep(10000);
       System.out.println("Done sleeping.");

       UserTransaction tx =
           new DummyUserTransaction(DummyTransactionManager.getInstance());

       try
       {
           tx.begin();

           if (! _cache.exists("aom", "key"))
           {
               System.out.println("Inside if...");

               Object last = _cache.put("aom", "key", _value);
               tx.commit();

               if (last != null)
               {
                   System.out.println("Last was not null...");
                   System.out.println(last);

                   // tx.rollback();
               }
               else
               {
                   System.out.println("Last was null...");
               }
           }
           else
           {
               System.out.println("already exists");
           }
       }
       catch (Throwable t)
       {
           t.printStackTrace(System.out);
       }
      

      The third VM is just a "listener", although it does not implement the TreeCacheListener interface: it simply prints out the key/value pairs of the node the other VMs are working with, once every 30 seconds, so I can determine which VM won the race I am creating.

      I start the listener VM first, and put an arbitrary key/value pair into the node that every VM is using in order to make sure the node already exists. Then I start the other two VMs as close to simultaneously as possible.

      All three VMs get timeout exceptions, and neither VM 1 nor VM 2 places anything in the map (one of them sometimes does, but the put is automatically rolled back because of the timeout exception). VM 3 sometimes appears to hang when it locks the cache to transfer state to VM 1 / VM 2, because only a single "locking" message appears for state transfer rather than one for each of the other VMs, but this does not happen every time.

      This is the timeout stack trace from the listener VM's log:
      00:07:27,584 ERROR [TreeCacheAop] method invocation failed
      org.jboss.cache.lock.TimeoutException: lock could not be acquired after 15000 ms. Lock map ownership Read lock owners: []
      Write lock owner: <devns02:34215>:3
      
       at org.jboss.cache.lock.IdentityLock.acquireWriteLock(IdentityLock.java:146)
       at org.jboss.cache.Node.acquireWriteLock(Node.java:422)
       at org.jboss.cache.Node.acquire(Node.java:388)
       at org.jboss.cache.TreeCache.findNode(TreeCache.java:3295)
       at org.jboss.cache.TreeCache._put(TreeCache.java:2341)
       at org.jboss.cache.aop.TreeCacheAop._put(TreeCacheAop.java:611)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:324)
       at org.jgroups.blocks.MethodCall.invoke(MethodCall.java:223)
       at org.jboss.cache.TreeCache.prepare(TreeCache.java:2736)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:324)
       at org.jgroups.blocks.MethodCall.invoke(MethodCall.java:223)
       at org.jboss.cache.interceptors.CallInterceptor.invoke(CallInterceptor.java:14)
       at org.jboss.cache.interceptors.Interceptor.invoke(Interceptor.java:39)
       at org.jboss.cache.interceptors.ReplicationInterceptor.replicate(ReplicationInterceptor.java:144)
       at org.jboss.cache.TreeCache._replicate(TreeCache.java:2674)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
       at java.lang.reflect.Method.invoke(Method.java:324)
       at org.jgroups.blocks.MethodCall.invoke(MethodCall.java:223)
       at org.jgroups.blocks.RpcDispatcher.handle(RpcDispatcher.java:220)
       at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:615)
       at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:512)
       at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:326)
       at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUp(MessageDispatcher.java:722)
       at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.access$300(MessageDispatcher.java:554)
       at org.jgroups.blocks.MessageDispatcher$1.run(MessageDispatcher.java:691)
       at java.lang.Thread.run(Thread.java:534)
      00:07:27,588 ERROR [RpcDispatcher] failed invoking method
      org.jboss.cache.lock.TimeoutException: lock could not be acquired after 30000 ms. Lock map ownership Read lock owners: []
      

      My question is: why is this happening? The code above takes far less time to execute than the configured 15-second lock timeout (I even tried a higher value), and the third VM should only cause an issue if it is in one of its "read" cycles (and I am making sure it is not).

      The cache configuration sets the isolation level to SERIALIZABLE and the replication method to REPL_SYNC.
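      In the XML, that corresponds roughly to the following attributes (a fragment only; attribute names as in the stock TreeCacheAop MBean descriptor, including the 15-second lock timeout mentioned above):

       <mbean code="org.jboss.cache.aop.TreeCacheAop"
              name="jboss.cache:service=TreeCacheAop">
           <attribute name="CacheMode">REPL_SYNC</attribute>
           <attribute name="IsolationLevel">SERIALIZABLE</attribute>
           <attribute name="LockAcquisitionTimeout">15000</attribute>
       </mbean>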

      Any help or other ideas would be gratefully accepted...

        • 1. Re: Transactions/Locking Across VMs Issues
          belaban

          #1: exists() doesn't acquire a lock; use get(). Here's the javadoc:
          /**
           * Checks whether a given node exists in the tree. Does not acquire any locks in doing so (result may be dirty read)
           * @param fqn The fully qualified name of the node
           * @return boolean Whether or not the node exists
           * @jmx.managed-operation
           */
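
          For example, do the check with get() inside the transaction instead (a sketch, using the same node and key as your code above), so the read actually acquires a lock that the subsequent put() can use:

           tx.begin();

           // Unlike exists(), get() acquires a read lock on the node inside
           // the TX, so the check and the put are covered by the same locks.
           if (_cache.get("aom", "key") == null)
           {
               _cache.put("aom", "key", _value);
               tx.commit();
           }
           else
           {
               tx.rollback();
           }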

          #2: you try to update the same data on 3 nodes concurrently. Because you use REPL_SYNC, TX commit runs a two-phase commit protocol, which needs to lock the data on all 3 nodes. However, because other TXs on those nodes already hold the locks, in most cases you will time out attempting to acquire a lock and then roll back your TX.
          This is a typical distributed deadlock. If one TX manages to acquire all locks on all 3 servers before timing out, because the other TXs rolled back, then that one TX will succeed.
          So your chance to succeed is limited to the small timeframe in which no other TX holds a lock. Of course, the more servers access the same data, the smaller that timeframe becomes.
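
          The shape of this can be reproduced in a single JVM with two threads and timed locks. It is only an analogy (java.util.concurrent locks standing in for the cache's node locks), but it shows why both TXs usually time out together:

           import java.util.concurrent.TimeUnit;
           import java.util.concurrent.locks.ReentrantLock;

           public class DeadlockAnalogy
           {
               // Each lock stands in for the copy of the node on one cache instance.
               static final ReentrantLock nodeOnVm1 = new ReentrantLock();
               static final ReentrantLock nodeOnVm2 = new ReentrantLock();

               public static void main(String[] args)
               {
                   // TX A locks its local copy first, then reaches for the other;
                   // TX B does the same in the opposite order, which is what
                   // happens when both VMs start the prepare phase at once.
                   new Thread(new Tx("TX A", nodeOnVm1, nodeOnVm2)).start();
                   new Thread(new Tx("TX B", nodeOnVm2, nodeOnVm1)).start();
               }

               static class Tx implements Runnable
               {
                   final String name;
                   final ReentrantLock local, remote;

                   Tx(String name, ReentrantLock local, ReentrantLock remote)
                   {
                       this.name = name;
                       this.local = local;
                       this.remote = remote;
                   }

                   public void run()
                   {
                       local.lock();
                       try
                       {
                           // Corresponds to prepare() replicating to the other VM.
                           if (remote.tryLock(15, TimeUnit.SECONDS))
                           {
                               try
                               {
                                   System.out.println(name + " committed");
                               }
                               finally
                               {
                                   remote.unlock();
                               }
                           }
                           else
                           {
                               // Corresponds to the TimeoutException and rollback.
                               System.out.println(name + " timed out, rolling back");
                           }
                       }
                       catch (InterruptedException ignored)
                       {
                       }
                       finally
                       {
                           local.unlock();
                       }
                   }
               }
           }

          Started simultaneously, both threads time out and "roll back"; only when one gets enough of a head start do the commits go through.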

          Bela

          • 2. Re: Transactions/Locking Across VMs Issues
            jiwils

             

            "bela" wrote:
            #1 exists() doesn't acquire a lock, use get().


            We tried the top-level if with both exists() and get() before I posted, to see if it made any difference, because we thought this might be the case. Thanks for (correctly) pointing back to the Javadoc.

            "bela" wrote:
            #2: you try to update the same data on 3 nodes concurrently. Because you use REPL_SYNC...This is a typical distributed deadlock.


            Thanks for the distributed deadlock explanation. I knew that there was something I was missing. For us, the locking is not so important (so we will probably not use transactions).

            What are the solutions for this type of problem? Random sleep times before retrying when a timeout occurs? A master node that hands out locks and blocks callers until a lock is free? It seems like there would be a fairly standard set of options for this kind of situation.

            • 3. Re: Transactions/Locking Across VMs Issues
              jiwils

              By refactoring the above code to retry a configurable number of times, with a randomized wait between retries, I was able to get the behavior I expected. Essentially, this code takes concurrent puts and serializes them, avoiding the distributed deadlock situation altogether.

              The refactored code:

               public void test()
                   throws Exception
               {
                   _cache.startService();

                   try
                   {
                       // Try to insert into the cache a maximum of 2 times.
                       int retryCount = 2;

                       for (int i = 0; i < retryCount; i++)
                       {
                           try
                           {
                               putInCache("key", _value);
                           }
                           catch (AlreadyBound ab)
                           {
                               throw ab;
                           }
                           catch (Exception e)
                           {
                               System.out.println();
                               System.out.println("===== Try #" + i + " =============================");
                               e.printStackTrace(System.out);
                               System.out.println("==========================================");
                               System.out.println();

                               // Wait at least 100 milliseconds before retrying.
                               Thread.sleep(_random.nextInt(4900) + 100);

                               continue;
                           }

                           break;
                       }
                   }
                   catch (Throwable t)
                   {
                       t.printStackTrace(System.out);
                       System.out.println();
                   }
                   finally
                   {
                       System.out.println();
                       System.out.println("Done.");
                       System.out.println();
                   }
               }

               private void putInCache(String key, String value)
                   throws Exception
               {
                   UserTransaction tx =
                       new DummyUserTransaction(DummyTransactionManager.getInstance());

                   tx.begin();

                   // Use the method parameters (previously the hard-coded "key" and _value).
                   Object previous = _cache.put("aom", key, value);

                   if (previous != null)
                   {
                       tx.rollback();

                       throw new AlreadyBound();
                   }
                   else
                   {
                       tx.commit();
                   }
               }
              

              Anyone else have other ideas on how to handle distributed deadlock situations?
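
              One variation I am considering (a sketch only, untested): grow the wait with each failed attempt, i.e. exponential backoff plus jitter, so that with more than two VMs the retries spread out faster. This would replace the retry loop in test() above:

               // Sketch only: the retry window doubles on every failed attempt
               // (100, 200, 400, ... ms) and the actual wait is a random point
               // inside it. retryCount, _random, _value, putInCache, and
               // AlreadyBound are the same as in the code above.
               long baseWaitMillis = 100;

               for (int i = 0; i < retryCount; i++)
               {
                   try
                   {
                       putInCache("key", _value);
                       break;
                   }
                   catch (AlreadyBound ab)
                   {
                       throw ab;
                   }
                   catch (Exception e)
                   {
                       long window = baseWaitMillis << i;
                       Thread.sleep(window + _random.nextInt((int) window));
                   }
               }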

              • 4. Re: Transactions/Locking Across VMs Issues
                belaban

                 JBoss itself (the appserver) uses random sleeps and retries (up to 5 times). Unless you have distributed deadlock detection, this is the only way to recover from deadlocks.
                 How do you avoid them in the first place? Work on separate data sets on different nodes, e.g. node1 works on data 1-10, node2 on 11-20, node3 on 21-30, etc. Sometimes this is easy to implement; in HTTP session replication, for example, you can enable session stickiness.
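
                 For example (a sketch; ownsKey and _myNodeIndex are made up here, assuming each VM knows its own index and the cluster size):

                  // Hypothetical helper, not part of JBossCache: deterministically
                  // assign each key to exactly one VM, so no two VMs ever compete
                  // for the same lock.
                  private boolean ownsKey(Object key, int myNodeIndex, int clusterSize)
                  {
                      // hashCode() can be negative; mask the sign bit before the modulo.
                      int bucket = (key.hashCode() & 0x7fffffff) % clusterSize;
                      return bucket == myNodeIndex;
                  }

                  // With 3 VMs, only the VM whose index matches ever writes this key.
                  if (ownsKey("key", _myNodeIndex, 3))
                  {
                      putInCache("key", _value);
                  }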

                Bela