12 Replies Latest reply on Jul 2, 2009 9:59 AM by Andy Taylor

    Good article on consistent hashing

    Tim Fox Master

      http://weblogs.java.net/blog/tomwhite/archive/2007/11/consistent_hash.html

      Although I am still not convinced we can use a consistent hashing approach, since I'm not sure we can allow groups to migrate from one node to another.

        • 2. Re: Good article on consistent hashing
          Andy Taylor Master

          I think we agree that using a consistent hashing approach will not work.

          So I think some sort of consensus algorithm is the way to go.

          I've looked at the Paxos algorithm, which basically allows proposers to propose values, building a list of proposed values, with each node then selecting the same proposed value. Since we always want to propose the local node if possible, I've come up with the following consensus algorithm which I think will work.

          When a message contains a group id, the following happens in BindingsImpl before routing to a queue.

          If there is only a local queue then the message is delivered to the local queue and bound. This basically means we don't have a clustered queue.

          If there are remote queues and we haven't yet bound to any queue, then the proposing node will propose its own local queue and send a propose message to every other node. Each node will return an accept message to the proposer, keeping a record of the group id and the outcome. This is the simplest scenario, where only one node receives messages with a particular group id and becomes bound to it.

          If one of the other nodes has previously received a proposal for that group id then it will decline the proposal. Basically this means that more than one node is vying to obtain the group binding. In the simplest case, one of the vying nodes receives a majority of accepts and becomes bound to the group id.

          If one of the other nodes has already bound to that group id, or it is routing messages to another node bound to that group id, then it will also decline. The decline will also tell the proposer the correct node to use, and the proposer will then give up and route to that node.

          Once the proposing node has received enough accepts (a majority of the other nodes, i.e. number of other nodes / 2 + 1) then it will assign itself as the node for that group id and send an accepted message to every other node.

          If a node receives a majority of declines then it will either receive a decline containing the node to forward to, or an accepted message with the same information.
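          As a rough sketch, the proposer-side bookkeeping above might look like this (all class and method names here are made up for illustration, not the actual JBM code):

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of the proposer-side bookkeeping described above.
// All names here are illustrative - none of this is the real JBM code.
public class GroupProposer {

    private final int otherNodes;                        // nodes besides this one
    private final Map<String, Integer> accepts = new HashMap<>();

    GroupProposer(int otherNodes) { this.otherNodes = otherNodes; }

    // A majority of the other nodes: number of other nodes / 2 + 1
    int majority() { return otherNodes / 2 + 1; }

    // Called when another node accepts our proposal for a group id.
    // Returns true once we have enough accepts to bind the group locally
    // and broadcast an "accepted" message.
    boolean onAccept(String groupId) {
        int count = accepts.merge(groupId, 1, Integer::sum);
        return count >= majority();
    }

    public static void main(String[] args) {
        GroupProposer p = new GroupProposer(4);    // five-node cluster
        System.out.println(p.majority());          // 3
        System.out.println(p.onAccept("g1"));      // false - 1 accept
        System.out.println(p.onAccept("g1"));      // false - 2 accepts
        System.out.println(p.onAccept("g1"));      // true  - majority reached
    }
}
```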

          There is a possibility that if two or more nodes send a proposal at the same time and each declines the other, a stalemate may occur. Consider the following:

          N1 sends proposal to N2
          N1 sends proposal to N3
          N2 sends proposal to N1
          N2 sends proposal to N3
          N3 sends accept to N1
          N3 sends decline to N2
          N2 sends decline to N1
          N1 sends decline to N2

          At this point N2 knows it doesn't have the consensus; however, it has already sent a decline to N1. Since N1 only has 50% of the accepts it still doesn't have a majority. What we do here is allow N2 to change its mind: it knows it has already declined N1, so at the point where it has received enough declines it sends an accept message to N1. N1 then has the consensus. So this would be:

          N1 sends proposal to N2
          N1 sends proposal to N3
          N2 sends proposal to N1
          N2 sends proposal to N3
          N3 sends accept to N1
          N3 sends decline to N2
          N2 sends decline to N1
          N1 sends decline to N2
          N2 has 2 declines so sends an accept to N1
          N1 binds to the group and sends an accepted message
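          The "change its mind" rule above amounts to a vying node counting the declines its own proposal receives and conceding once it knows it has lost. A minimal sketch (illustrative names only, not JBM code):

```java
// Sketch of the "change its mind" rule: a vying node counts the declines
// its own proposal receives, and once a majority of the other nodes have
// declined it, it concedes and sends an accept to the rival it previously
// declined. Illustrative names only - not the real JBM code.
public class VyingNode {

    private final int otherNodes;   // nodes besides this one
    private int declines;

    VyingNode(int otherNodes) { this.otherNodes = otherNodes; }

    // Called when another node declines our proposal. Returns true when
    // this node should concede and send an accept to its rival.
    boolean onDecline() {
        declines++;
        return declines >= otherNodes / 2 + 1;  // majority of the others
    }

    public static void main(String[] args) {
        VyingNode n2 = new VyingNode(2);        // three-node cluster
        System.out.println(n2.onDecline());     // false - 1 decline
        System.out.println(n2.onDecline());     // true  - concede, accept N1
    }
}
```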

          This however only works when there is an odd number of nodes. If there were four nodes it would be possible for N1 and N2 to end up with the same number of accepts and declines as each other. If this happens then we fall back to a simple deterministic algorithm to decide. So this would be:

          N1 sends proposal to N2
          N1 sends proposal to N3
          N1 sends proposal to N4
          N2 sends proposal to N1
          N2 sends proposal to N3
          N2 sends proposal to N4
          N3 sends accept to N1
          N3 sends decline to N2
          N4 sends accept to N2
          N4 sends decline to N1
          N2 sends decline to N1
          N1 sends decline to N2
          N1 and N2 both have 1 accept, so they hash the node ids and route to the node with the smallest (or greatest) value, or some similarly simple algorithm
          N1 binds to the group and sends an accepted message
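          The tie-break could be as simple as the following sketch. It relies on String.hashCode(), whose formula is fixed by the Java spec, so every node computes the same winner with no extra messages (names are illustrative):

```java
// Illustrative tie-break for an even accept/decline split: both vying
// nodes hash their node ids and the smallest hash wins. Deterministic,
// so every node computes the same winner with no further messages.
public class TieBreak {

    static String winner(String nodeId1, String nodeId2) {
        // String.hashCode() is specified by the JLS (31-based polynomial),
        // so all JVMs in the cluster agree on the result
        int h1 = nodeId1.hashCode();
        int h2 = nodeId2.hashCode();
        return h1 <= h2 ? nodeId1 : nodeId2;
    }

    public static void main(String[] args) {
        // both orderings pick the same node
        System.out.println(winner("N1", "N2"));  // N1
        System.out.println(winner("N2", "N1"));  // N1
    }
}
```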


          If a new node comes online then it does not need to know anything until it receives a message with the same group id; at that point it will propose itself, but receive one or more declines informing it of the node to use.

          If a node disappears, the sending node will at that point propose itself as the new node and a new consensus is reached.


          There is also an issue where there may be a window where a node is only known to a subset of the other nodes. This could happen when a node first comes up and hasn't yet notified all the nodes in the cluster. In this scenario we may have to force nodes to only use proposals from nodes known to every other node in the cluster. This is something I will need to look at, but it is an edge case so I will leave it until later.

          thoughts?

          • 3. Re: Good article on consistent hashing
            Tim Fox Master

            Thanks Andy. A few initial questions:

            When you talk about queues on each node I guess you really mean *consumers*.

            E.g. "If there are remote queues and we haven't yet bound to any queue...", should be "If there are remote consumers and we haven't yet bound to any consumer..."

            Since we won't send messages to any node which just has a queue but no consumers. (Grouping works on the consumer level not the queue level)

            Secondly, your algorithm looks kind of similar to one of the Paxos variants. What's the reasoning why none of the Paxos variants are sufficient for our use case?

            • 4. Re: Good article on consistent hashing
              Andy Taylor Master

               


              When you talk about queues on each node I guess you really mean *consumers*.

              E.g. "If there are remote queues and we haven't yet bound to any queue...", should be "If there are remote consumers and we haven't yet bound to any consumer..."


              From what I understand of how clustering works, we initially route the message in BindingsImpl and only know about remote queues, not what consumers are on them, only that there are some. Once the message is routed to that queue, normal grouping will take over and pin it to a particular consumer. If I have misunderstood how this works can you explain?

              Secondly, your algorithm looks kind of similar to one of the Paxos variants. What's the reasoning why none of the Paxos variants are sufficient for our use case?


              Which variant do you think is similar?

              • 5. Re: Good article on consistent hashing
                Tim Fox Master

                 

                "ataylor" wrote:

                When you talk about queues on each node I guess you really mean *consumers*.

                E.g. "If there are remote queues and we haven't yet bound to any queue...", should be "If there are remote consumers and we haven't yet bound to any consumer..."


                from what i understand on how clustering works we initially route the message in BindingsImpl and only know about remote queues not what consumers are on them



                That's not the case. Each node knows about every consumer on every other node, including its selector. By default JBM will never route messages to a node which doesn't have matching consumers.

                You can turn that behaviour off by setting routeWhenNoConsumers = true in the cluster config (see docs).




                which variant do you think is similar?


                I don't know, but I know Paxos has been studied over many years by many computer scientists, so I'd be sceptical whether our requirements are satisfied by a completely different algorithm. Most likely your algorithm is already one of the variants or your algorithm doesn't satisfy one of our requirements fully.

                • 6. Re: Good article on consistent hashing
                  Andy Taylor Master

                   

                  That's not the case. Each node knows about every consumer on every other node, including its selector. By default JBM will never route messages to a node which doesn't have matching consumers.


                  What I mean is that the remote consumer isn't chosen by the local node. The message is just forwarded to the remote queue, which makes the decision.

                  • 7. Re: Good article on consistent hashing
                    Tim Fox Master

                    Another good article.

                    At this point I'm thinking we need Multi-Paxos, which is basically the simplest variant, since we have redundancy built in with our backup servers.

                    • 9. Re: Good article on consistent hashing
                      Tim Fox Master

                      Now, if you really want to be confused, try the original paper http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf ;)

                      • 10. Re: Good article on consistent hashing
                        Andy Taylor Master

                        Here's an explanation of the Paxos algorithm and how it fits our scenario.

                        Basic Paxos is a protocol, and we use an instance of this protocol to come to a consensus about which node a group id should be bound to. This means we would create an instance of the protocol for each group id we need a consensus on.

                        The protocol has proposers and acceptors and is broken up into two phases, 'prepare and promise' and 'accept and accepted'. In our case each node will act as both proposer and acceptor.

                        1st Phase

                        When a message arrives with a group id set, the node receiving the message becomes a proposer.

                        The proposer will send a 'Prepare(N)' message to all the acceptors, carrying the queue it proposes to use for the group id. The N value is a sequence number for that proposer that is unique across the cluster, i.e. each proposer must choose sequence numbers from its own disjoint set. This is used when deciding which proposer should take preference. If an acceptor receives a proposal with a sequence number lower than one it has already received then it declines that proposal; if it is the highest, then the acceptor returns a 'promise' that it will not accept any proposals with a lower sequence number. The 'promise' also contains any previously chosen proposals (queues). The proposer with the highest sequence number we refer to as the leader. A promise will contain the sequence number promised, plus all other sequence numbers promised until that point.
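                        A rough sketch of the phase-1 acceptor side, for a single group-id instance (names are illustrative, not taken from JBM):

```java
// Sketch of a Paxos phase-1 acceptor for a single group-id instance,
// as described above. Names are illustrative, not taken from JBM.
public class PaxosAcceptor {

    // the promise returned to a proposer: whether we promised, the highest
    // sequence number seen, and any previously accepted (seq, queue) pair
    record Promise(boolean promised, long highestSeq,
                   long acceptedSeq, String acceptedQueue) {}

    private long highestPromised = -1;  // highest sequence number promised
    private long acceptedSeq = -1;      // sequence of any value already accepted
    private String acceptedQueue;       // previously chosen queue, if any

    // Prepare(N): promise only if N is higher than anything promised before.
    // A decline carries the current highest sequence number so the proposer
    // can retry with a higher one; a promise carries back any previously
    // accepted value so the new leader must honour it.
    synchronized Promise prepare(long seq) {
        if (seq <= highestPromised) {
            return new Promise(false, highestPromised, acceptedSeq, acceptedQueue);
        }
        highestPromised = seq;
        return new Promise(true, seq, acceptedSeq, acceptedQueue);
    }

    public static void main(String[] args) {
        PaxosAcceptor a = new PaxosAcceptor();
        System.out.println(a.prepare(1).promised()); // true
        System.out.println(a.prepare(1).promised()); // false - not higher
        System.out.println(a.prepare(2).promised()); // true
    }
}
```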


                        At this point we move into phase 2, where the value is committed, i.e. chosen. But before I do, you may have spotted an issue: if the leader suddenly fails then we are stuck waiting for a value to be chosen. Because of this it is possible for a proposer, after failing, to re-propose with a higher sequence number (it knows the current highest sequence number, as it was returned via the previous declined proposal). This however can lead to a situation where a consensus may have duelling proposers, i.e. p1 proposes with seq1, p2 with seq2, p1 with seq3 and so on. Typically you make one proposer back off before retrying, to give another proposer time to commit.

                        2nd Phase

                        At this point any proposer that has received a majority of promises back from the acceptors can choose a value and commit it by sending an accept message. Note that because proposals may be received by different acceptors in different orders, more than one proposer may reach this point.

                        The accept contains the proposer's sequence number and the value chosen. The acceptor will accept if the value chosen is the same as received in the initial proposal and the sequence number is the highest agreed to, and send an 'accepted' message to the proposer.

                        If a proposer receives enough accepted messages for a commit then it deems the value chosen. If it doesn't, then the proposer can start again.
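                        The phase-2 exchange might be sketched as follows; the phase-1 state is repeated so the class is self-contained (illustrative names only, not the JBM API):

```java
// Sketch of the phase-2 accept/accepted exchange described above. The
// phase-1 state is repeated here so the class is self-contained; names
// are illustrative, not the JBM API.
public class PaxosAccept {

    private long highestPromised = -1;
    private String acceptedQueue;

    // phase 1: promise only if seq is higher than any previous promise
    synchronized boolean prepare(long seq) {
        if (seq <= highestPromised) return false;
        highestPromised = seq;
        return true;
    }

    // phase 2: Accept(N, queue) succeeds only if N is still the highest
    // sequence number this acceptor has promised; on success the acceptor
    // records the value and replies "accepted" to the proposer
    synchronized boolean accept(long seq, String queue) {
        if (seq < highestPromised) return false; // promised a newer proposer
        acceptedQueue = queue;
        return true;
    }

    public static void main(String[] args) {
        PaxosAccept a = new PaxosAccept();
        System.out.println(a.prepare(1));               // true
        System.out.println(a.accept(1, "queue-N1"));    // true  - committed
        System.out.println(a.prepare(2));               // true  - newer proposer
        System.out.println(a.accept(1, "queue-N1"));    // false - superseded
    }
}
```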

                        There are also 'learners', which are basically just notified of any values chosen. A client would normally create a proposer but receive the value as a learner. We may not need to do this; alternatively we could just receive chosen values back when a proposal fails.

                        Multi Paxos

                        This is a variant of Paxos where the same instance machinery is used for all consensus decisions; an extra parameter is sent round containing the instance of the protocol (the group id we are deciding on). This means you only need to send a prepare message on the first round. Once a leader is chosen the prepare stage is not needed, so you cut down on messages sent. For this to work the leader has to remain stable, which typically in our scenario wouldn't happen, i.e. we want the local node receiving the message always to propose its local queue. However, if we merge the roles of acceptor and proposer, the leader will be able to commit on behalf of the initial proposer: a proposer proposes a new instance of the protocol for a new group id and sends it to all acceptors; the leader receives this and calls accept on its behalf.


                        Using a single node

                        Obviously implementing the above is quite complex. A simpler solution that Tim suggested is to assign a specific node as the leader. Basically any other node will always ask the leader to make the decision as to which value to choose; the leader holds all the values in a map for future reference. We could do this via some sort of configuration on the server.
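                        The single-node approach boils down to one designated leader holding a group-id-to-queue map, roughly like this (names are illustrative, not the JBM API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the single-node approach described above: one designated
// leader decides which queue each group id binds to and remembers the
// decision in a map. All names here are illustrative, not the JBM API.
public class GroupLeader {

    private final Map<String, String> bindings = new ConcurrentHashMap<>();

    // A node proposes its local queue; the first proposal for a group id
    // wins, and later proposers just learn the existing binding.
    String decide(String groupId, String proposedQueue) {
        return bindings.computeIfAbsent(groupId, id -> proposedQueue);
    }

    public static void main(String[] args) {
        GroupLeader leader = new GroupLeader();
        System.out.println(leader.decide("g1", "queue-N1")); // queue-N1
        System.out.println(leader.decide("g1", "queue-N2")); // queue-N1 - already bound
        System.out.println(leader.decide("g2", "queue-N2")); // queue-N2
    }
}
```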

                        Thoughts? Obviously implementing the latter will be much quicker, but less elegant.

                        PS if it hurts your head, try reading the spec :)

                        • 11. Re: Good article on consistent hashing
                          Tim Fox Master

                          Nice analysis.

                          I concur that implementing Paxos will be non trivial.

                          As an initial implementation how about we just use a special node as the store as we discussed previously?

                          • 12. Re: Good article on consistent hashing
                            Andy Taylor Master

                             

                            As an initial implementation how about we just use a special node as the store as we discussed previously?

                            +1, I'm on the case