7 Replies Latest reply on Aug 12, 2011 1:04 PM by Joe Planisky

    Lock/Replication timeouts on 3 node EC2

    Joe Planisky Newbie

      I'm trying to run a replicated cache on 3 nodes (and eventually more) using Infinispan 5.0.0.FINAL in the Amazon EC2 cloud, and I'm running into intermittent TimeoutExceptions.  Most of the time I get an "Unable to acquire lock after [10 seconds]...", but sometimes it's a "Replication timeout..." (see the attached log file excerpt for details).

       

      I've narrowed things down to a simple demo program (see attached file), the essence of which is this:

          EmbeddedCacheManager mgr = new DefaultCacheManager("testconfig.xml");
          Cache<String, String> cache = mgr.getCache("TestCache");
          try {
              // Seed the shared key; failures are ignored in this demo
              cache.put("c", "start");
          } catch (Exception x) {}
          while (true) {
              System.out.println("*************");
              System.out.println("Before update: " + cache.get("c"));
              // myIp is this node's IP address (set elsewhere in the attached program)
              String d = myIp + " " + new Date().toString();
              try {
                  cache.put("c", d);
              } catch (Exception x) {}
              System.out.println(" After update: " + cache.get("c"));
              System.out.println("*************");
              Thread.sleep(1000);
          }
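
      For anyone trying to reproduce this, here is a rough sketch of the same update step that reports a failed put and how long it took, instead of swallowing the exception. It is not the attached program: the class name UpdateLoopSketch and the helper timedPut are made up for illustration, and a plain ConcurrentHashMap stands in for mgr.getCache("TestCache") so the sketch runs without a cluster (Infinispan's Cache extends ConcurrentMap, so the same helper should accept a real cache). I'm assuming the timeouts surface as runtime exceptions.

```java
import java.util.Date;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class UpdateLoopSketch {

    // Performs one put. Returns true on success; on failure, prints the
    // exception and the elapsed time, then returns false.
    static boolean timedPut(ConcurrentMap<String, String> cache, String key, String value) {
        long start = System.nanoTime();
        try {
            cache.put(key, value);
            return true;
        } catch (RuntimeException x) {
            // Assumption: in the clustered case a lock or replication
            // timeout would be caught here.
            long ms = (System.nanoTime() - start) / 1000000L;
            System.err.println("put failed after " + ms + " ms: " + x);
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for mgr.getCache("TestCache") so the sketch is runnable alone.
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<String, String>();
        // Hypothetical node label; the original program uses the node's IP.
        String nodeId = args.length > 0 ? args[0] : "node";

        timedPut(cache, "c", "start");
        for (int i = 0; i < 3; i++) {
            System.out.println("Before update: " + cache.get("c"));
            timedPut(cache, "c", nodeId + " " + new Date());
            System.out.println(" After update: " + cache.get("c"));
            Thread.sleep(1000);
        }
    }
}
```

      Comparing the elapsed time of a failed put against the 10-second lockAcquisitionTimeout should help tell a lock timeout from a replication timeout.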
      

       

      I start this program on the 1st node and wait until it's up and running.  Then I start the 2nd node.  When the 2nd node is starting, both the 1st and 2nd nodes seem to freeze for about 15 seconds, but eventually they resume running and I see the expected console outputs on both.  When I start the 3rd node, again I see all nodes freeze for about 15 seconds, but usually everything recovers and I see the expected outputs on all 3 nodes.

       

      However, after a variable amount of time (a few seconds to a minute or more), the nodes freeze again. After about 10 seconds, I see the TimeoutExceptions on 2 of the nodes, while the 3rd one just continues where it paused.

       

      In my JGroups configuration, I'm using TCP transport and S3_PING membership discovery.  (I've also used FILE_PING discovery with the same results, so I don't think it's an S3 issue.)

       

      The significant portion of my Infinispan configuration is:

      <namedCache name="TestCache">
        <deadlockDetection enabled="true"/>
        <unsafe unreliableReturnValues="false" />
        <locking concurrencyLevel="1000" useLockStriping="false" lockAcquisitionTimeout="10000" />
        <clustering mode="replication">
          <sync />
          <stateRetrieval fetchInMemoryState="true"/>
        </clustering>
      </namedCache>
      

       

       

      I've attached my complete Infinispan and JGroups configuration files.

       

      I'm using:

      • Infinispan 5.0.0.FINAL
      • Ubuntu 10.04 (kernel 2.6.32-308-ec2)
      • Java 1.6.0_20

       

      Do I have a configuration problem?  Am I not using Infinispan correctly? Any hints on how to fix or work around this issue?

       

      --

      Joe