3 Replies Latest reply on Jan 12, 2011 7:52 AM by manik

    Issues when adding a new node into a running cluster

    changgeng

      I use the attached unit test case to simulate the issues we are facing.

      The environment contains 5 or more nodes and 2 named caches, A and B. The cluster is configured in distribution mode with the number of owners set to 3.

      Eager locking and lockOnSingleNode are turned on. The test steps are:

      1) start 5 different cache managers to simulate 5 nodes

      2) populate the two caches with 2000 keys each

      3) start one hundred worker threads. Each worker thread randomly selects a node and executes the following semi-pseudo code in a loop:

        transactionManager.begin();

        // randomly select a key
        int key = random.nextInt(keySize);

        // eager locking: acquire both locks explicitly before reading
        cacheA.getAdvancedCache().lock("cacheA" + key);
        cacheB.getAdvancedCache().lock("cacheB" + key);

        // increment both counters (assuming Integer values)
        cacheA.put("cacheA" + key, cacheA.get("cacheA" + key) + 1);
        cacheB.put("cacheB" + key, cacheB.get("cacheB" + key) + 1);

        transactionManager.commit();

      4) After the worker threads are started, all workers run smoothly, but when another cache manager is started and joins the cluster, many lock timeout errors occur.
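      The worker loop from steps 3 and 4 can be modeled in plain Java to make the expected invariant concrete: every iteration takes the lock on "cacheA" + key and then "cacheB" + key (always in that order), gives up after a timeout, and releases whatever it acquired in a finally block, so once all workers stop no key should remain locked. This is only a single-JVM sketch under stated assumptions: ReentrantLock.tryLock with a 100 ms timeout stands in for Infinispan's eager, cluster-wide lock acquisition and its lock timeout, plain ConcurrentHashMaps stand in for caches A and B, and LockTable, WorkerLoopModel and runWorkers are hypothetical names, not Infinispan APIs.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical per-key lock table; a local stand-in, NOT Infinispan's lock manager.
class LockTable {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(String key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    // Try to lock a key, giving up after the timeout (models a lock acquisition timeout).
    boolean tryLock(String key, long timeoutMs) throws InterruptedException {
        return lockFor(key).tryLock(timeoutMs, TimeUnit.MILLISECONDS);
    }

    void unlock(String key) {
        lockFor(key).unlock();
    }

    // True if any key is still held by some thread.
    boolean anyLocked() {
        return locks.values().stream().anyMatch(ReentrantLock::isLocked);
    }
}

class WorkerLoopModel {
    static final int KEY_SIZE = 2000;

    final Map<String, Integer> cacheA = new ConcurrentHashMap<>();
    final Map<String, Integer> cacheB = new ConcurrentHashMap<>();
    final LockTable lockTable = new LockTable();
    final AtomicInteger committed = new AtomicInteger();
    final AtomicInteger timedOut = new AtomicInteger();

    WorkerLoopModel() {
        // Step 2: populate both "caches" with KEY_SIZE keys each.
        for (int i = 0; i < KEY_SIZE; i++) {
            cacheA.put("cacheA" + i, 0);
            cacheB.put("cacheB" + i, 0);
        }
    }

    // One loop iteration: lock A then B, update both, release both in finally.
    void runOnce(Random random) throws InterruptedException {
        int key = random.nextInt(KEY_SIZE);
        String keyA = "cacheA" + key;
        String keyB = "cacheB" + key;
        if (!lockTable.tryLock(keyA, 100)) { timedOut.incrementAndGet(); return; }
        try {
            if (!lockTable.tryLock(keyB, 100)) { timedOut.incrementAndGet(); return; }
            try {
                cacheA.put(keyA, cacheA.get(keyA) + 1);
                cacheB.put(keyB, cacheB.get(keyB) + 1);
                committed.incrementAndGet();
            } finally {
                lockTable.unlock(keyB); // released even if the update throws
            }
        } finally {
            lockTable.unlock(keyA);
        }
    }

    // Runs `threads` workers, each performing `iterations` loop iterations, and waits for them.
    void runWorkers(int threads, int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final long seed = t;
            pool.submit(() -> {
                Random random = new Random(seed);
                for (int i = 0; i < iterations; i++) {
                    try { runOnce(random); }
                    catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) throw new IllegalStateException("workers did not finish");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    static int sum(Map<String, Integer> m) {
        return m.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        WorkerLoopModel model = new WorkerLoopModel();
        model.runWorkers(100, 200);
        System.out.println("committed=" + model.committed.get()
                + " timedOut=" + model.timedOut.get()
                + " stillLocked=" + model.lockTable.anyLocked());
    }
}
```

In this model every iteration either commits or times out, the totals in both maps equal the number of commits, and anyLocked() is false after the workers finish. The report above suggests that, on the real cluster, the last property stops holding once a new node joins.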


      In the real system, the newly added node does not work at all, and the existing nodes cannot process some transactions correctly because some keys are locked and the locks never seem to be released.
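      One way to tell a genuinely stuck lock apart from ordinary contention is to stop all workers first and then probe the key several times: a contended lock eventually becomes free, a leaked one never does. The sketch below models this with java.util.concurrent's ReentrantLock only. It does not use Infinispan's lock manager, and StuckLockProbe, looksStuck and leakFrom are hypothetical names chosen for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical probe for locks that stay held after all workers have stopped.
class StuckLockProbe {
    // Returns true if the lock could not be acquired in any of `attempts` timed tries.
    static boolean looksStuck(ReentrantLock lock, int attempts, long timeoutMs) {
        try {
            for (int i = 0; i < attempts; i++) {
                if (lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
                    lock.unlock(); // the probe must not hold the lock itself
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while probing", e);
        }
    }

    // Simulates a "transaction" thread that takes the lock and exits without releasing it.
    // A ReentrantLock stays held even after its owner thread has terminated.
    static void leakFrom(ReentrantLock lock) {
        Thread t = new Thread(lock::lock);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ReentrantLock healthy = new ReentrantLock();
        ReentrantLock leaked = new ReentrantLock();
        leakFrom(leaked);
        System.out.println("healthy stuck? " + looksStuck(healthy, 3, 50));
        System.out.println("leaked stuck? " + looksStuck(leaked, 3, 50));
    }
}
```

The probe is only meaningful once the workers have stopped: while they are still running, a failed tryLock cannot distinguish contention from a leak.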


      I'm testing with 4.2.0.Final.

      Please take a look at the test case and try running it.