
    Why a continuous hash?

    imbng

      I'm curious to know: why was a continuous hash chosen for DIST?

      It would seem to me that a consistent hash would be far easier to implement and would provide much more control, allowing like data to be grouped together and partitions to be managed.

      Of course it wouldn't be dynamic, but that is a double-edged sword and not always desirable.

      Anyhow, just curious to know your design decisions.

        • 1. Re: Why a continuous hash?
          imbng

          Not sure where my brain was when I was reading the docs, but they plainly say you're using a consistent hash.

          What may have confused me was the section on rehashing. What's the need for a rehash if you're using a consistent hash?

          My guess is there is no way to define the number of shards/partitions. Nodes simply come and go, so the total number is always changing, and the nodes themselves are what the keys hash to.

          Other implementations I've played with all allow you to define the number of shards/partitions up front and then scatter those across all the running containers. The keys hash to the shards/partitions.

          Using the node as the unit of partitioning seems complex, as you would need to rehash and move data around. You would also have to move the shards/partitions in the other case, but rules can be written for what moves where and when, since the shards/partitions are decoupled from the hashing.
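
          To make that concrete, here's a rough sketch of the fixed shard/partition style I mean. It's made-up Java with hypothetical names (PartitionedRouter, TOTAL_PARTITIONS), not code from any of those implementations: keys always hash to a fixed partition, and only the partition-to-node assignment changes when membership does.

              import java.util.*;

              class PartitionedRouter {
                  static final int TOTAL_PARTITIONS = 64;  // fixed up front in config

                  // Which node currently owns each partition.
                  private final Map<Integer, String> partitionOwner =
                      new HashMap<Integer, String>();

                  // On a membership change only partitions move between nodes;
                  // rules could decide what moves where and when. The keys
                  // themselves never rehash.
                  void rebalance(List<String> members) {
                      for (int p = 0; p < TOTAL_PARTITIONS; p++) {
                          partitionOwner.put(p, members.get(p % members.size()));
                      }
                  }

                  // The key-to-partition mapping is stable regardless of membership.
                  String ownerOf(Object key) {
                      int partition = Math.abs(key.hashCode() % TOTAL_PARTITIONS);
                      return partitionOwner.get(partition);
                  }
              }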

          I may have the terminology wrong here since I'm just starting to look at this, and everyone seems to use different terms.

          • 2. Re: Why a continuous hash?
            manik

            You're pretty much spot-on. The need for rehashing is due to nodes joining/leaving the cluster.

            I have considered the shard/partition approach (or virtual nodes as I called them), but that would require some form of global metadata.
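
            To illustrate the rehashing point, here's a toy consistent hash wheel (a sketch, not the actual DIST implementation): when a node joins or leaves, ownership of the arc next to it shifts, and the entries in that arc have to be moved to their new owner. That movement is the rehash.

                import java.util.*;

                class ConsistentHashRing {
                    private final SortedMap<Integer, String> ring =
                        new TreeMap<Integer, String>();

                    // Adding or removing a node only changes ownership of the
                    // arc between it and its neighbour, but the entries in
                    // that arc still have to be physically moved: the rehash.
                    void addNode(String node)    { ring.put(positionOf(node), node); }
                    void removeNode(String node) { ring.remove(positionOf(node)); }

                    // Walk clockwise to the first node at or after the key's
                    // position, wrapping to the start of the ring if needed.
                    String ownerOf(Object key) {
                        if (ring.isEmpty()) throw new IllegalStateException("no nodes");
                        SortedMap<Integer, String> tail = ring.tailMap(positionOf(key));
                        return tail.isEmpty() ? ring.get(ring.firstKey())
                                              : tail.get(tail.firstKey());
                    }

                    // hashCode is a poor ring hash in practice; fine for a sketch.
                    private int positionOf(Object o) {
                        return o.hashCode() & Integer.MAX_VALUE;
                    }
                }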

            • 3. Re: Why a continuous hash?
              imbng

              Yes, there would be a need for global state, but there is already some of that in the current implementation, is there not?

              You have to know how many nodes are in the cluster (to hash correctly) and where all of them are (to route requests).

              Isn't that basically the same metadata you'd need if going with virtual nodes?

              • 4. Re: Why a continuous hash?
                manik

                No, for virtual nodes you'd need some added metadata, including what each vnode hashes to in a given hash space, as well as which vnodes map to real nodes. The latter would be prone to change whenever there is a cluster reorganisation event (nodes joining or leaving), as vnodes could be reassigned to different actual nodes.
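
                Roughly, the two pieces look like this (hypothetical names, just to show the shape of the metadata; none of these types exist in the codebase):

                    import java.util.*;

                    class VNodeMetadata {
                        // 1. Where each vnode sits in the hash space. Fixed at
                        //    configuration time, so it could even be derived
                        //    locally rather than shipped around.
                        private final SortedMap<Integer, Integer> positionToVnode =
                            new TreeMap<Integer, Integer>();

                        // 2. Which real node owns each vnode. Changes on every
                        //    join/leave, so all nodes must agree on it; this is
                        //    the global metadata that has to be maintained.
                        private final Map<Integer, String> vnodeToNode =
                            new HashMap<Integer, String>();

                        // Key -> position -> vnode (clockwise) -> owning node.
                        String ownerOf(Object key) {
                            int h = key.hashCode() & Integer.MAX_VALUE;
                            SortedMap<Integer, Integer> tail = positionToVnode.tailMap(h);
                            int vnode = tail.isEmpty()
                                ? positionToVnode.get(positionToVnode.firstKey())
                                : tail.get(tail.firstKey());
                            return vnodeToNode.get(vnode);
                        }
                    }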

                • 5. Re: Why a continuous hash?
                  imbng

                  True, you'd need the initial configuration setting that specifies how many total vnodes there are, and that would need to be global. From that one setting you can easily calculate which vnode a key hashes to using a modulo hash (key_hashcode % total_vnode_count).

                  As for the second bit of state, don't you already have that today, or at least most of it? I've looked into JGroups, which I believe you're using to some degree, so that state/membership info is already tracked and available. You may not have the mapping of vnodes to real nodes (depending on how they're registered), but the infrastructure and semantics are there to support it, I'd guess.
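
                  As a rough guess at how that could look (the vnode assignment below is invented; only the JGroups membership callback and types are real API), every node could recompute the same vnode-to-node assignment deterministically from each view it receives:

                      import java.util.List;

                      import org.jgroups.Address;
                      import org.jgroups.ReceiverAdapter;
                      import org.jgroups.View;

                      public class VNodeAssigner extends ReceiverAdapter {
                          static final int TOTAL_VNODES = 256;  // the one global setting
                          private final Address[] vnodeOwner = new Address[TOTAL_VNODES];

                          // JGroups invokes this on every membership change,
                          // handing us the newly agreed-upon view of the cluster.
                          @Override
                          public void viewAccepted(View view) {
                              List<Address> members = view.getMembers();
                              for (int v = 0; v < TOTAL_VNODES; v++) {
                                  vnodeOwner[v] = members.get(v % members.size());
                              }
                          }

                          // The modulo hash from above: key -> vnode -> current owner.
                          public Address ownerOf(Object key) {
                              int vnode = (key.hashCode() & Integer.MAX_VALUE) % TOTAL_VNODES;
                              return vnodeOwner[vnode];
                          }
                      }

                  Register it with channel.setReceiver(...) before connecting, and since every node computes the assignment from the same view, they'd all agree without extra coordination.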

                  Anyhow, interesting discussion and I'm glad to see this product in the portfolio.