5 Replies Latest reply on Sep 4, 2010 4:07 AM by timfox

    Shared store and file locks

    jmesnil

      I am testing shared store with real file locks
      First case:
      - node0 server is started
        => create live.lock
        => lock it
        => node 0 is live
      - node1 server is started
        => wait to lock live.lock
        => node1 is backup waiting to failover
      Now what happens when node0 is stopped?
      Currently, it deletes the file and unlocks it.
      Since node0 no longer holds a lock on it, node1 in turn locks it and becomes "live".
      But the file is no longer there!
      When node0 is restarted (still with a live server configuration), it does not check whether the file exists
      but always recreates it and locks it
      => node0 is "live" at the same time as node1!
      To fix this, I'll change the algorithm:
      Shared Live Activation
      * in SharedStoreLiveActivation.run(), if there is *already* a live.lock file, it will wait until the file no longer exists
      => this way, a live node will not start if a backup has failed over and become live
      * in SharedStoreLiveActivation.close(), we still delete the file and unlock
      Shared Backup Activation
      * in SharedStoreBackupAction.run(), we wait to lock the live.lock file. When the method returns, we check if the file still
      exists. If it's not the case, we lock again to recreate the file (that's ok, we already hold the lock)
      => this way, a backup node which has failed over will have the lock and the file will still exist
      * in SharedStoreBackupAction.close(), if we unlock from live.lock (i.e. the node became live), we also make sure to delete the file.

      => whether we stop a live node or a backup node which has failed over, there must be no live.lock file once the server is stopped
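
A minimal Java sketch of the live-side behaviour proposed above, assuming plain java.nio file locks rather than HornetQ's own lock-file wrapper. The class LiveLockSketch and its methods are hypothetical names used only for illustration; they are not part of HornetQ.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Hypothetical illustration only, not HornetQ code.
final class LiveLockSketch implements AutoCloseable {
    private final File lockFile;
    private RandomAccessFile raf;
    private FileLock lock;

    LiveLockSketch(File journalDir) {
        this.lockFile = new File(journalDir, "live.lock");
    }

    /** Waits until no live.lock exists, then creates the file and locks it. */
    void startLive(long pollMillis) throws IOException, InterruptedException {
        // An existing live.lock means some other node (possibly a backup
        // that has failed over) is currently live, so we must not start.
        while (lockFile.exists()) {
            Thread.sleep(pollMillis);
        }
        raf = new RandomAccessFile(lockFile, "rw"); // creates the file if missing
        lock = raf.getChannel().lock();             // blocks until the exclusive lock is held
    }

    boolean isLive() {
        return lock != null && lock.isValid();
    }

    /** Clean shutdown: delete the file first, then release the lock. */
    @Override
    public void close() throws IOException {
        if (lock != null) {
            lockFile.delete();
            lock.release();
            raf.close();
            lock = null;
        }
    }
}
```

The sketch leaves a small window between the exists() check and the file creation in which two nodes could race, and it assumes a platform where a file can be deleted while it is still open and locked.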

        • 1. Re: Shared store and file locks
          timfox

          Jeff, you shouldn't need to change anything here, I had it working before handover.

           

          When node 0 closes, it deletes the lock file, see:

           

          if (liveLock != null)
          {
             // We need to delete the file too, otherwise the backup will failover when we shutdown or if the backup is
             // started before the live
             File liveFile = new File(configuration.getJournalDirectory(), "live.lock");

             liveFile.delete();

             liveLock.unlock();
          }

           

          Since the file has been deleted, any backup waiting to lock the file will not succeed and won't become live. As the comment says, this prevents any backup from becoming live when the live node is shut down cleanly.

          • 2. Re: Shared store and file locks
            timfox

            Jeff Mesnil wrote:


            Shared Backup Activation

            * in SharedStoreBackupAction.run(), we wait to lock the live.lock file. When the method returns, we check if the file still

            exists. If it's not the case, we lock again to recreate the file (that's ok, we already hold the lock)

            This code already exists:

             

            while (true)
            {
               File liveLockFile = new File(configuration.getJournalDirectory(), "live.lock");

               while (!liveLockFile.exists())
               {
                  log.info("Waiting for server live lock file. Live server is not started");

                  Thread.sleep(2000);
               }

               liveLock = createLockFile("live.lock", configuration.getJournalDirectory());

               log.info("Live server is up - waiting for failover");

               liveLock.lock();

               // We need to test if the file exists again, since the live might have shutdown
               if (!liveLockFile.exists())
               {
                  liveLock.unlock();

                  continue;
               }

               log.info("Obtained live lock");

               // Announce presence of live node to cluster

               break;
            }

             

            See the comment "// We need to test if the file exists again, since the live might have shutdown"
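
The loop above can be tried outside HornetQ with a stand-in lock. The following is a rough sketch, assuming a java.util.concurrent.Semaphore plays the role of the file lock (similar in spirit to the fake locks used with invm servers); BackupLoopSketch and awaitFailover are hypothetical names, not HornetQ APIs.

```java
import java.io.File;
import java.util.concurrent.Semaphore;

// Hypothetical illustration only, not HornetQ code.
final class BackupLoopSketch {
    /**
     * Blocks until this node may become live. On return the caller holds
     * fakeLock and the live.lock file still exists (crash of the live node).
     */
    static void awaitFailover(File liveLockFile, Semaphore fakeLock, long pollMillis)
            throws InterruptedException {
        while (true) {
            // No file: the live node has not started, or it shut down cleanly.
            while (!liveLockFile.exists()) {
                Thread.sleep(pollMillis);
            }
            fakeLock.acquire(); // stands in for liveLock.lock()

            // Re-check: a clean shutdown deletes the file before unlocking,
            // so obtaining the lock alone does not mean the live node crashed.
            if (!liveLockFile.exists()) {
                fakeLock.release();
                continue;
            }
            return; // lock obtained and file present: fail over
        }
    }
}
```

If the lock is released while the file is left in place (a crash), awaitFailover returns; if the file is deleted before the release (a clean shutdown), it goes back to waiting for the file.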

            • 3. Re: Shared store and file locks
              jmesnil

              Tim Fox wrote:

               

              Jeff Mesnil wrote:


              Shared Backup Activation

              * in SharedStoreBackupAction.run(), we wait to lock the live.lock file. When the method returns, we check if the file still

              exists. If it's not the case, we lock again to recreate the file (that's ok, we already hold the lock)

              This code already exists!

              The code was commented out; I uncommented it.

               

              I have the failover tests all passing using remote HornetQ servers (except one which is doing dirty things with the replication endpoint...)

               

              I also fixed the tests when run with invm servers + fakelock: when the server simulates a crash, I recreate the live.lock file before clearing the fake locks.

              This way, the backup server can properly fail over.

              • 4. Re: Shared store and file locks
                jmesnil

                I also have an issue with the FileLock implementation on the Mac OS X JVM.

                When I interrupt the backupActivationThread, the call to interrupt() never returns and the call to lock.lock() is not interrupted.

                 

                I checked on Linux using the Sun^H^H^HOracle(TM) VM (1.6.0_14): I got the expected FileLockInterruptionException and the code works.

                Andy had a NullPointerException. Andy, which JVM did you use? OpenJDK?

                As long as we get an exception, that's fine, the code will unblock.

                But on Mac, it never returns and it is impossible to stop a backup server waiting to failover.
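
One possible way around a blocking lock call that cannot be interrupted is to poll a non-blocking attempt and sleep in between, since Thread.sleep() does respond to interrupt(). This is only a sketch of that idea, not something proposed or tested in this thread; PollingLockSketch and pollUntilAcquired are hypothetical names.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration only, not HornetQ code.
final class PollingLockSketch {
    /**
     * Repeatedly calls tryAcquire until it returns a non-null lock.
     * The waiting happens in Thread.sleep(), so interrupting the thread
     * ends the wait with an InterruptedException.
     */
    static <T> T pollUntilAcquired(Callable<T> tryAcquire, long pollMillis) throws Exception {
        while (true) {
            T lock = tryAcquire.call();
            if (lock != null) {
                return lock;
            }
            Thread.sleep(pollMillis);
        }
    }
}
```

With a real java.nio.channels.FileChannel this could be called as `PollingLockSketch.pollUntilAcquired(() -> channel.tryLock(), 2000)`. Note that tryLock() returns null only when another process holds the lock; if the same JVM already holds an overlapping lock it throws OverlappingFileLockException instead.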

                • 5. Re: Shared store and file locks
                  timfox

                  That definitely sounds like a bug in the Mac JVM. Can you do a bug search to see if it's a known issue?

                   

                  In the meantime I'd concentrate on getting it to run on Linux; in reality very few people will be using OS X on the server.