6 Replies Latest reply on Jan 22, 2008 10:14 AM by Brian Stansberry

    Data gravitation cleanup process and stale nodes

    Manik Surtani Master

      This is related to JBCACHE-1258.

      Currently there is an issue where stale structural nodes are not cleaned up after a gravitation event. There are two cases to consider; let's start with the first, and easier, of the two:

      1. Data owner has been shut down or has crashed.

      Let's consider a cluster of 3 servers, with the following state:

      Server A:
       /a/b/c
      
      Server B:
       /_BUDDY_BACKUP_/Server_A/a/b/c
      
      Server C:
      


      Now server A shuts down (or crashes).

      Server B:
       /_BUDDY_BACKUP_/Server_A/a/b/c
      
      Server C:
      


      And a request for /a/b/c comes in to Server C. Server C then broadcasts a gravitation request, followed by a gravitation cleanup call. What we are now left with is:

      Server B:
       /_BUDDY_BACKUP_/Server_A/a/b
       /_BUDDY_BACKUP_/Server_C/a/b/c
      
      Server C:
       /a/b/c
      



      The reason for this is that the Fqn requested is /a/b/c, and hence the Fqn gravitated is /a/b/c and, similarly, the gravitation cleanup call broadcast is for /a/b/c. So Server B removes the backup state for /a/b/c, but the now-empty /a/b (and its parent /a) remains under /_BUDDY_BACKUP_/Server_A, consuming memory unnecessarily.

      I suggest that parent nodes are removed as well during a cleanup call, provided they have no other children. So we wind our way back up the tree and remove all empty parents up to /_BUDDY_BACKUP_/Server_A. And finally, if /_BUDDY_BACKUP_/Server_A is also empty, and Server A is no longer in the cluster, the buddy backup region should be removed as well.

      Does anyone see this causing problems?
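
      To make the proposal above concrete, here is a minimal sketch of the "wind back up the tree" cleanup. It uses a toy in-memory tree, not the JBoss Cache API; ToyNode, BackupCleanupSketch, cleanup() and the ownerStillInCluster flag are all names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy tree node for illustration only; this is NOT the JBoss Cache Node API. */
class ToyNode {
    final Map<String, ToyNode> children = new TreeMap<>();
}

/** Hypothetical model of one server's tree, including its buddy backup regions. */
class BackupCleanupSketch {
    final ToyNode root = new ToyNode();

    /** Creates every node along the given path. */
    void put(String... path) {
        ToyNode n = root;
        for (String name : path) {
            n = n.children.computeIfAbsent(name, k -> new ToyNode());
        }
    }

    boolean exists(String... path) {
        ToyNode n = root;
        for (String name : path) {
            n = n.children.get(name);
            if (n == null) return false;
        }
        return true;
    }

    /**
     * Removes backupRoot + fqn, then removes each parent left without children,
     * stopping at backupRoot. The backup root itself is removed only if it is
     * empty and the data owner has left the cluster.
     */
    void cleanup(List<String> backupRoot, List<String> fqn, boolean ownerStillInCluster) {
        if (backupRoot.isEmpty() || fqn.isEmpty()) return;
        List<String> full = new ArrayList<>(backupRoot);
        full.addAll(fqn);

        // chain.get(i) is the parent of the node named full.get(i)
        List<ToyNode> chain = new ArrayList<>();
        chain.add(root);
        ToyNode n = root;
        for (String name : full) {
            n = n.children.get(name);
            if (n == null) return; // nothing to clean up
            chain.add(n);
        }

        // Remove the gravitated node itself (and whatever is below it).
        int i = full.size() - 1;
        chain.get(i).children.remove(full.get(i));
        i--;

        // Walk upwards, removing parents that are now empty, but stay below backupRoot.
        while (i >= backupRoot.size() && chain.get(i + 1).children.isEmpty()) {
            chain.get(i).children.remove(full.get(i));
            i--;
        }

        // Drop the whole backup region only if we reached it, it is empty, and the owner is gone.
        if (i == backupRoot.size() - 1 && !ownerStillInCluster
                && chain.get(i + 1).children.isEmpty()) {
            chain.get(i).children.remove(full.get(i));
        }
    }

    public static void main(String[] args) {
        BackupCleanupSketch serverB = new BackupCleanupSketch();
        serverB.put("_BUDDY_BACKUP_", "Server_A", "a", "b", "c");
        serverB.cleanup(Arrays.asList("_BUDDY_BACKUP_", "Server_A"),
                        Arrays.asList("a", "b", "c"), false);
        System.out.println(serverB.exists("_BUDDY_BACKUP_", "Server_A"));
    }
}
```

      In this sketch a parent with any remaining child stops the walk, so sibling data under /a is never touched, and the Server_A region survives for as long as Server A is still a cluster member.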

      2. Data owner is still alive.

      The same initial state:

      Server A:
       /a/b/c
      
      Server B:
       /_BUDDY_BACKUP_/Server_A/a/b/c
      
      Server C:
      


      Server A is still alive, but Server C asks for /a/b/c. After gravitation and cleanup we are left with:

      Server A:
       /a/b
       /_BUDDY_BACKUP_/Server_C/a/b/c
      
      Server B:
       /_BUDDY_BACKUP_/Server_A/a/b
       /_BUDDY_BACKUP_/Server_C/a/b/c
      
      Server C:
       /a/b/c
      


      Now, going through parents during a gravitation cleanup and removing empty nodes in Server B's backup region may make sense, but can this be applied to A's main tree as well? This is where things get tricky, since application logic may depend on the existence of /a or /a/b, even if they are empty. It doesn't matter on Server B, since this is a backup region and the application on B has no direct access to it.

      Thoughts? Just make it a configurable property? Say, gravitationCleanupPolicy, which could be set to LEAVE_PARENTS, CLEAN_EMPTY_PARENTS_ON_BACKUP or CLEAN_ALL_EMPTY_PARENTS, with the last one being the default? I'm concerned we're already en route to configuration hell, and am keen to limit configuration parameters to the absolute minimum. If we have a sensible enough default, I'd rather not make this configurable.
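
      For discussion, the three policy values floated above could be modelled roughly as below. This is only a sketch: the enum, its prunesEmptyParents() helper and defaultPolicy() are hypothetical and do not exist in JBoss Cache; only the three constant names come from this post.

```java
/** Hypothetical enum; the constant names are the ones suggested in this post. */
enum GravitationCleanupPolicy {
    LEAVE_PARENTS,
    CLEAN_EMPTY_PARENTS_ON_BACKUP,
    CLEAN_ALL_EMPTY_PARENTS;

    /** The default proposed in the post. */
    static GravitationCleanupPolicy defaultPolicy() {
        return CLEAN_ALL_EMPTY_PARENTS;
    }

    /**
     * Decides whether empty parents should be pruned after a cleanup call.
     * inBackupRegion is true for nodes under a buddy backup region and
     * false for nodes in the server's main tree.
     */
    boolean prunesEmptyParents(boolean inBackupRegion) {
        switch (this) {
            case LEAVE_PARENTS:
                return false;
            case CLEAN_EMPTY_PARENTS_ON_BACKUP:
                return inBackupRegion;
            default:
                return true; // CLEAN_ALL_EMPTY_PARENTS
        }
    }
}

class PolicyDemo {
    public static void main(String[] args) {
        for (GravitationCleanupPolicy p : GravitationCleanupPolicy.values()) {
            System.out.println(p + " backup=" + p.prunesEmptyParents(true)
                    + " main=" + p.prunesEmptyParents(false));
        }
    }
}
```

      The only place the three values differ is the main tree on the data owner (Server A in case 2), which is exactly where application logic might rely on empty /a or /a/b nodes.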