1 Reply Latest reply on May 31, 2010 8:55 AM by michael.wengle

    Broker using shared file system: NFS problems

    lufe

      In our pilot project, we have been trying to set up a master/slave configuration that works reliably and is bulletproof.

       

      For this, we use the shared file system approach, and to implement the file locking mechanism, we use NFS. The goal is for the slave to kick in as soon as possible when the master "dies".

       

      This works, but not always. It works in the following situations:

       

      - JVM crashes - the lock is removed on the NFS file and the slave kicks in;

      - Normal shutdown of the master process - the lock is also removed

       

      When it doesn't work:

       

      - remove the network cable from the computer running the master. Since the NFS client doesn't have a chance to release the lock in this scenario, the lock stays up forever, without the NFS server being aware that the client is "dead". The result is that the slave never kicks in.
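To make the failure mode concrete, here is a minimal sketch of the lock-file scheme described above. It is not the broker's actual code: it assumes a POSIX system and Python's `fcntl.flock`, and the helper names `try_acquire` and `wait_until_master` are made up for this example. The master holds an exclusive lock on a shared file, and the slave polls until it can take that lock.

```python
import fcntl
import os
import time


def try_acquire(lock_path):
    """Try to take an exclusive, non-blocking lock on lock_path.

    Returns the open file descriptor on success (keep it open to hold
    the lock), or None if another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def wait_until_master(lock_path, poll_seconds=1.0, max_attempts=None):
    """Slave loop: poll until the lock is free, then become master.

    Returns the lock fd, or None if max_attempts polls all failed.
    With max_attempts=None the loop polls forever.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        fd = try_acquire(lock_path)
        if fd is not None:
            return fd
        attempts += 1
        time.sleep(poll_seconds)
    return None
```

In the cable-pull scenario, nothing ever tells the lock holder's side to release the lock, so a loop like `wait_until_master` on the slave keeps failing for as long as the server still considers the lock held.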

       

      The configuration used is:

       

      - Solaris system with NFS v4

      - 2 Windows NFS clients with access to the NFS directory where the lock file is placed.

       

      I've asked around, and the suggestions I collected were:

       

      - use GFS. Problem: GFS is available only on Linux, and the solution should be generic (at least for any Unix server)

      - recompile the Solaris server to change an obscure TCP/IP parameter which would - possibly - change the timeout after which the NFS server realizes that the client is dead. Problem: we don't have access to change kernel parameters on our clients' computers, and we don't know the side effects of doing that.

       

      Neither of these is what we need. Has anyone run into the same problem and found a solution to it?

        • 1. Re: Broker using shared file system: NFS problems
          michael.wengle

          Lufe and I suspect that the problem is with the Windows NFS client. The client only supports NFS 3, while the server is NFS 4. According to the URL below, it should work with NFS 4, but we haven't tested it yet.

           

           

          http://blogs.netapp.com/eislers_nfs_blog/2008/07/part-i-since-nf.html

           

          The NFSv4 client is required to renew its lease within a fixed, known time before the term of the lease expires. If the client fails to renew in time, the locks it holds are subject to revocation. So let's say the client crashes and never restarts. Then the locks it had will automatically expire, allowing other clients to acquire the locks.
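The lease behaviour described in that paragraph can be illustrated with a toy model. This is only a sketch of the idea, not an NFSv4 implementation: the class `LeaseLockServer`, its methods, and the 90-second lease used in the usage note are all assumptions made for this example, and time is passed in explicitly so the behaviour is easy to follow.

```python
class LeaseLockServer:
    """Toy model of lease-based locking (not a real NFSv4 server)."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.holder = None        # client currently holding the lock
        self.last_renewal = None  # time of the holder's last renewal

    def _expire_if_stale(self, now):
        # Drop the lock if the holder has not renewed within the lease.
        if self.holder is not None and now - self.last_renewal > self.lease_seconds:
            self.holder = None
            self.last_renewal = None

    def renew(self, client, now):
        """Renew the lease; fails if the client no longer holds the lock."""
        self._expire_if_stale(now)
        if self.holder == client:
            self.last_renewal = now
            return True
        return False

    def try_lock(self, client, now):
        """Grant the lock if it is free (or already held by this client)."""
        self._expire_if_stale(now)
        if self.holder is None:
            self.holder = client
            self.last_renewal = now
            return True
        return self.holder == client
```

With a lease of 90 seconds, a master that locks at t=0 and last renews at t=60 keeps the lock at t=120, but a slave calling `try_lock` at t=151 succeeds because the lease has lapsed. That is the behaviour we would hope for when the master's network cable is pulled.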

           

          When an NFSv4 server restarts, as with NLM, there is a grace period for reclaim. But there are no NSM-like notifications to the client, nor are any needed. Because the client has to renew its lease, it will eventually find out about the server restart on a renew attempt. The grace period is at least as long as the period of the lease, allowing any client that had locks sufficient time to find out that it has lost its lease and its locks. Thus it is not possible for two clients to think they have the same lock on the same file.
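The grace-period rule can be sketched the same way. Again this is a simplified model under stated assumptions, not real server code: `RestartedServer` and its `lock` method are hypothetical names, the grace period is set exactly equal to the lease period, and the model trusts a client that says it is reclaiming (a real server would need to verify that).

```python
class RestartedServer:
    """Toy model of a lock server that has just restarted."""

    def __init__(self, boot_time, lease_seconds):
        # Grace period must be at least one lease period; here it is exactly one.
        self.grace_ends = boot_time + lease_seconds
        self.locks = {}  # path -> client holding the lock

    def lock(self, client, path, now, reclaim=False):
        in_grace = now < self.grace_ends
        if in_grace and not reclaim:
            return False  # new locks are refused until the grace period ends
        if not in_grace and reclaim:
            return False  # too late to reclaim a pre-restart lock
        if path in self.locks and self.locks[path] != client:
            return False  # someone else already holds it
        self.locks[path] = client
        return True
```

Because new lock requests are refused during the grace period, a slave cannot grab the lock before the previous holder has had a full lease period to notice the restart and reclaim it.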