6 Replies Latest reply on Jan 22, 2008 10:14 AM by brian.stansberry

    Data gravitation cleanup process and stale nodes

    manik

      This is related to JBCACHE-1258.

      Currently there is an issue where stale structural nodes are not cleaned up after a gravitation event. There are really two cases to consider - let's start with the first and easier of the two:

      1. Data owner has been shut down or has crashed.

      Let's consider a cluster of 3 servers, with the following state:

      Server A:
       /a/b/c
      
      Server B:
       /_BUDDY_BACKUP_/Server_A/a/b/c
      
      Server C:
      


      Now Server A shuts down (or crashes).

      Server B:
       /_BUDDY_BACKUP_/Server_A/a/b/c
      
      Server C:
      


      And a request for /a/b/c comes in to Server C. Server C then broadcasts a gravitation request, followed by a gravitation cleanup call. What we are now left with is:

      Server B:
       /_BUDDY_BACKUP_/Server_A/a/b
       /_BUDDY_BACKUP_/Server_C/a/b/c
      
      Server C:
       /a/b/c
      



      The reason for this is that the Fqn requested is /a/b/c; hence the Fqn gravitated is /a/b/c and, similarly, the gravitation cleanup call broadcast is for /a/b/c. So Server B removes the backup state for /a/b/c, but /a/b still remains, which consumes unnecessary memory.

      I suggest that parent nodes be removed as well during a cleanup call, provided they have no other children. So we wind our way back up the tree and remove all empty parents up to /_BUDDY_BACKUP_/Server_A. And finally, if /_BUDDY_BACKUP_/Server_A is itself empty and Server A is no longer in the cluster, the buddy backup region should be removed as well.
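
      Here is a rough sketch of what that cleanup might look like. It uses the public Cache/Node/Fqn API rather than the actual gravitation cleanup internals, and the class name and backupRoot parameter are purely illustrative:

       // Illustrative sketch only - not the actual gravitation cleanup code.
       // backupRoot (e.g. /_BUDDY_BACKUP_/Server_A) is assumed to be supplied by the caller.
       import org.jboss.cache.Cache;
       import org.jboss.cache.Fqn;
       import org.jboss.cache.Node;
       
       public class BackupCleanupSketch
       {
          private final Cache cache;
          private final Fqn backupRoot;
       
          public BackupCleanupSketch(Cache cache, Fqn backupRoot)
          {
             this.cache = cache;
             this.backupRoot = backupRoot;
          }
       
          // Remove the gravitated Fqn, then walk back up the tree pruning any
          // parent that is now empty, stopping at the backup region root.
          public void cleanup(Fqn gravitated)
          {
             cache.removeNode(gravitated);
       
             Fqn current = gravitated.getParent();
             while (current.size() > backupRoot.size())
             {
                Node node = cache.getNode(current);
                if (node == null || !node.getChildren().isEmpty() || !node.getData().isEmpty())
                   break; // parent still in use - stop pruning
                cache.removeNode(current);
                current = current.getParent();
             }
       
             // Finally, if the whole backup region is now empty and Server A has
             // left the cluster, the region itself can go too (membership check omitted).
             Node region = cache.getNode(backupRoot);
             if (region != null && region.getChildren().isEmpty())
                cache.removeNode(backupRoot);
          }
       }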

      Does anyone see this causing problems?

      2. Data owner is still alive.

      The same initial state:

      Server A:
       /a/b/c
      
      Server B:
       /_BUDDY_BACKUP_/Server_A/a/b/c
      
      Server C:
      


      Server A is still alive, but Server C asks for /a/b/c:

      Server A:
       /a/b
       /_BUDDY_BACKUP_/Server_C/a/b/c
      
      Server B:
       /_BUDDY_BACKUP_/Server_A/a/b
       /_BUDDY_BACKUP_/Server_C/a/b/c
      
      Server C:
       /a/b/c
      


      Now, walking up through parents during a gravitation cleanup and removing empty nodes in Server B's backup region may make sense, but can this be applied to A's main tree as well? This is where things get tricky, since application logic may depend on the existence of /a or /a/b, even if they are empty. It doesn't matter on Server B, since that is a backup region and the application on B has no direct access to it.

      Thoughts? Should we just make it a configurable property - say gravitationCleanupPolicy, which could be set to LEAVE_PARENTS, CLEAN_EMPTY_PARENTS_ON_BACKUP or CLEAN_ALL_EMPTY_PARENTS, with the last one as the default? I'm concerned we're already en route to configuration hell, and I'm keen to limit configuration parameters to the absolute minimum. If we have a sensible enough default, I'd rather not make this configurable.
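
      Just to make the options concrete, the hypothetical policy might look something like this - none of it exists today, and as above I'd rather not expose it at all:

       // Hypothetical only - sketching the policy values discussed above.
       public enum GravitationCleanupPolicy
       {
          LEAVE_PARENTS,                  // remove only the gravitated Fqn itself
          CLEAN_EMPTY_PARENTS_ON_BACKUP,  // also prune empty parents, but only under /_BUDDY_BACKUP_/
          CLEAN_ALL_EMPTY_PARENTS         // prune empty parents in the backup region and the owner's main tree
       }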



        • 1. Re: Data gravitation cleanup process and stale nodes
          manik

          Any thoughts on this?

          • 2. Re: Data gravitation cleanup process and stale nodes
            brian.stansberry

            The only vague thought that comes to mind is a concern about data versions with optimistic locking. That is, /a/b has a particular data version - say a special DataVersion impl like the one we use in Hibernate caching for structural nodes, where we never report the node as being out-of-date. The removal destroys that version on the backup node. Then the data owner writes /a/b/d. The special DataVersion impl doesn't replicate to the buddy for /a or /a/b, since the owner didn't think it had written to those nodes, so the buddy just inserts a default DataVersion. Now the trees are inconsistent.
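
            Roughly the sort of thing I mean by the special impl - the actual class in the Hibernate integration may differ, this is just to illustrate:

             // Sketch only - a structural-node DataVersion that never reports
             // itself as newer, so the node is never treated as out-of-date.
             import org.jboss.cache.optimistic.DataVersion;
             
             public class StructuralNodeVersion implements DataVersion
             {
                public boolean newerThan(DataVersion other)
                {
                   return false;
                }
             }

            Once the backup has pruned /a/b and later recreates it with a default DataVersion, the version on the buddy no longer matches the special version the owner still holds for that node.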

            I guess, in general, I think making changes to the backup tree while the data owner is still alive is a bad idea.

            • 3. Re: Data gravitation cleanup process and stale nodes
              manik

              Agreed re: making changes when the data owner is still alive. We should assume that session stickiness will be available and gravitation will never occur unless the data owner dies.

              Which leads back to the first case: doing such a cleanup after the data owner dies.

              • 4. Re: Data gravitation cleanup process and stale nodes
                brian.stansberry

                That seems OK, since a member excluded from the view shouldn't be allowed to reappear and begin replicating writes to his old backup trees.

                I'm with you on not making this configurable. :) Seems like either it's something that works well and we just do it, or there's a problem with the idea and we shouldn't do it at all. The only (end-user) reason I could see for not doing it would be if the presence of an empty node /a/b had some meaning to the application; by removing the node itself we change that meaning. But that seems pretty wacky.

                • 5. Re: Data gravitation cleanup process and stale nodes
                  manik

                  Y'know, I'm leaning towards pushing this out for 2.2 or something. It won't be an API change or anything.

                  • 6. Re: Data gravitation cleanup process and stale nodes
                    brian.stansberry

                    +1. As a JBC user, this isn't a big deal to me. I'd much rather have 2.1 be as stable as possible. Plus, more time to think about any downsides or reasons to make it configurable.