4 Replies Latest reply on Aug 11, 2014 10:39 AM by clebert.suconic

    Message Accumulation due to slow consumers and a proposed solution

    gaohoward

      == The issue

       

      In a cluster, when a local queue is empty, its consumers are in a starvation state. If there are messages on other, remote queues, those messages won't get redistributed as long as those queues still have local consumers, even though those consumers appear slow and messages are accumulating.

       

      Suppose we have a queue deployed in a 4-node cluster (nodeA, B, C and D) and a producer is sending messages to nodeA at a rate of 40 messages/sec.

       

      With equal load balancing across the cluster, each node gets 10 messages per second from the producer. Suppose the consumers on each node and their rates are as follows:

       

      NodeA - 2 consumers:

      consumer1A : 1 message/sec

      consumer2A : 1 message/sec

      NodeB - 1 consumer:

      consumer1B : 4 messages/sec

      NodeC - 2 consumers:

      consumer1C : > 5 messages/sec

      consumer2C : > 5 messages/sec

      NodeD - 1 consumer:

      consumer1D : > 10 messages/sec

       

      When everything is up and running, the queues on nodeC and nodeD will often be empty, because their consumers drain messages faster than the 10 messages/sec each node receives.

       

      NodeA receives 10 messages/sec but its two consumers drain only 2 messages/sec, so messages accumulate on its queue at 8 messages/sec. NodeB's single consumer drains 4 messages/sec, so its queue grows at 6 messages/sec.

       

      If the producer stops after 30 seconds, the messages left on each node are:

       

      nodeA : 8 * 30 = 240 messages

      nodeB : 6 * 30 = 180 messages

      nodeC and nodeD : 0 messages

       

      To consume all those messages, nodeA will take 240/2 = 120 seconds and nodeB will take 180/4 = 45 seconds. NodeC and nodeD will sit idle during this whole time.
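      The backlog and drain-time arithmetic above can be reproduced with a short Java sketch. It assumes equal load balancing (40 / 4 = 10 messages/sec per node) and, because the post only gives lower bounds for nodeC and nodeD ("> 5" and "> 10"), it assumes 12 messages/sec of total consumer capacity for each of them; any value of at least 10 gives the same result, an empty queue.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Reproduces the backlog arithmetic of the 4-node scenario.
// Assumption: nodeC and nodeD get 12 msg/sec of total consumer capacity; the post only
// says "> 5" per consumer on nodeC and "> 10" on nodeD.
class BacklogEstimate {

    /** Messages left on a node after the producer has run for durationSec seconds. */
    static double backlog(double inRate, double drainRate, double durationSec) {
        // A queue grows only when messages arrive faster than its consumers drain them.
        return Math.max(0.0, inRate - drainRate) * durationSec;
    }

    /** Seconds the node's own consumers need to empty that backlog once the producer stops. */
    static double drainTime(double backlog, double drainRate) {
        return backlog / drainRate;
    }

    public static void main(String[] args) {
        double perNode = 40.0 / 4; // equal load balancing: 10 msg/sec per node

        // Total consumer rate per node, in msg/sec.
        Map<String, Double> drain = new LinkedHashMap<>();
        drain.put("nodeA", 1.0 + 1.0);
        drain.put("nodeB", 4.0);
        drain.put("nodeC", 6.0 + 6.0); // assumed value for two "> 5" consumers
        drain.put("nodeD", 12.0);      // assumed value for one "> 10" consumer

        for (Map.Entry<String, Double> e : drain.entrySet()) {
            double left = backlog(perNode, e.getValue(), 30);
            System.out.printf("%s: %.0f messages left, drained locally in %.0f s%n",
                    e.getKey(), left, drainTime(left, e.getValue()));
        }
    }
}
```

      The point the numbers make is that the drain time depends only on each node's local consumers, so the spare capacity on nodeC and nodeD never helps.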

       

      == The proposed solution

       

      When a node has been idle for a certain time (e.g. 2 seconds, configurable), meaning its queue has stayed empty while its consumers wait, it sends a “STARVATION” notification to the other nodes in the cluster. In the above case, nodeC and nodeD will send it.
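      A minimal sketch of the sending side, assuming the node can observe its queue's message count and consumer count. Everything here is hypothetical: the class name, the methods, and the "STARVATION:nodeId" message format are illustrative and not an existing broker API. Time is passed in explicitly so the idle threshold can be checked deterministically.

```java
import java.util.function.Consumer;

// Hypothetical sketch: class and method names are illustrative, not an existing broker API.
// A node counts as starving once its queue has stayed empty, while it still has consumers,
// for at least idleThresholdMs (e.g. 2000 ms, configurable).
class StarvationDetector {
    private final String nodeId;
    private final long idleThresholdMs;
    private final Consumer<String> notifier; // stands in for sending the notification to the cluster
    private long emptySinceMs = -1;          // -1: the queue is not currently empty
    private boolean notified = false;        // notify at most once per idle period

    StarvationDetector(String nodeId, long idleThresholdMs, Consumer<String> notifier) {
        this.nodeId = nodeId;
        this.idleThresholdMs = idleThresholdMs;
        this.notifier = notifier;
    }

    /** Call whenever the queue's message count or consumer count changes. */
    void onQueueState(long nowMs, long messageCount, int consumerCount) {
        if (messageCount == 0 && consumerCount > 0) {
            if (emptySinceMs < 0) {
                emptySinceMs = nowMs; // start of a new idle period
            }
        } else {
            emptySinceMs = -1;        // messages arrived (or no consumers): not starving
            notified = false;
        }
    }

    /** Call periodically, e.g. from a scheduled task; returns true if a notification was sent. */
    boolean check(long nowMs) {
        if (emptySinceMs >= 0 && !notified && nowMs - emptySinceMs >= idleThresholdMs) {
            notifier.accept("STARVATION:" + nodeId);
            notified = true;
            return true;
        }
        return false;
    }
}
```

      Sending only once per idle period keeps a starving node from flooding the cluster while it stays empty; a new notification is possible only after messages have arrived and the queue has drained again.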

       

      Nodes in the cluster receiving this notification will trigger a 'message redistribution' as long as they have messages in their queues. In the above case:

       

      NodeA receives the notifications and, because it has messages queued, triggers a 'message redistribution'; so does nodeB.
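      On the receiving side, the rule is: remember who is starving, and trigger redistribution only if the local queue actually holds messages. A hypothetical Java sketch (names are illustrative; the LongSupplier stands in for however the node reads its local queue depth):

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.LongSupplier;

// Hypothetical sketch of the receiving side; names are illustrative only.
class StarvationListener {
    private final LongSupplier localMessageCount;                    // reads the local queue depth
    private final Set<String> starvingNodes = new LinkedHashSet<>(); // senders, in arrival order

    StarvationListener(LongSupplier localMessageCount) {
        this.localMessageCount = localMessageCount;
    }

    /** Records the sender and returns true if this node should trigger redistribution now. */
    boolean onStarvation(String fromNode) {
        starvingNodes.add(fromNode);
        // Only a node that actually has messages queued triggers redistribution.
        return localMessageCount.getAsLong() > 0;
    }

    /** The nodes this node may redistribute to, e.g. {nodeC, nodeD} on nodeA. */
    Set<String> starvingNodes() {
        return Collections.unmodifiableSet(starvingNodes);
    }
}
```

      A node whose own queue is empty still records the sender, so the list is already in place if messages later start to pile up.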

       

      == Redistribution Details

       

      Unlike the existing redistribution (controlled by redistribution-delay), this kind of redistribution targets only the nodes that sent the 'STARVATION' notification.

       

      In the above said scenario,

       

      nodeA gets notifications from nodeC and nodeD and keeps a list of them: {nodeC, nodeD}.

      nodeB also gets notifications from nodeC and nodeD and keeps the same list: {nodeC, nodeD}.

       

      So messages from nodeA and nodeB will be redistributed between nodeC and nodeD.
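      One simple way to spread the chosen messages over the starving nodes only is round-robin, sketched below in Java. The round-robin choice and the names are assumptions for illustration; the proposal only says that the targets are the nodes that sent 'STARVATION', not every node in the cluster.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: assign messages picked for redistribution to the starving nodes
// only, round-robin. Nodes that never sent STARVATION are not candidates.
class StarvationRedistributor {

    static Map<String, List<Long>> assign(List<Long> messageIds, List<String> starvingNodes) {
        Map<String, List<Long>> plan = new LinkedHashMap<>();
        for (String node : starvingNodes) {
            plan.put(node, new ArrayList<>());
        }
        if (starvingNodes.isEmpty()) {
            return plan; // nobody is starving: nothing to move
        }
        int next = 0;
        for (Long id : messageIds) {
            plan.get(starvingNodes.get(next)).add(id);
            next = (next + 1) % starvingNodes.size();
        }
        return plan;
    }
}
```

      For nodeA with the list {nodeC, nodeD}, consecutive messages alternate between the two idle nodes.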

       

      Not all messages need to be redistributed, as the queue still has consumers (even if they're slow).

       

      The number of messages to redistribute can be decided by a fixed ratio of the queued messages (e.g. 50%).

      (The actual amount may be less, because grouped messages won't get redistributed.)
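      The fixed-ratio selection, including the grouping exception, could look like this hypothetical Java sketch. Two choices in it are assumptions rather than part of the proposal: the quota is rounded down, and messages are taken from the head of the queue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names are illustrative. Chooses which queued messages to move.
class RedistributionSelector {

    static final class QueuedMessage {
        final long id;
        final String groupId; // null when the message is not part of a group

        QueuedMessage(long id, String groupId) {
            this.id = id;
            this.groupId = groupId;
        }
    }

    /**
     * Picks at most floor(ratio * queued.size()) messages, scanning from the head of the
     * queue (an assumed choice) and skipping grouped messages, which are not redistributed.
     */
    static List<QueuedMessage> select(List<QueuedMessage> queued, double ratio) {
        int quota = (int) Math.floor(queued.size() * ratio);
        List<QueuedMessage> chosen = new ArrayList<>();
        for (QueuedMessage m : queued) {
            if (chosen.size() >= quota) {
                break;
            }
            if (m.groupId != null) {
                continue; // grouped: stays on this node, so the real amount can fall below the quota
            }
            chosen.add(m);
        }
        return chosen;
    }
}
```

      With nodeA's 240 messages and a ratio of 0.5, up to 120 messages would move, fewer if some of them belong to groups.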

       

      With this redistribution, the messages can be evened out among the nodes within a reasonable time.

        • 1. Re: Message Accumulation due to slow consumers and a proposed solution
          clebert.suconic

          That would be a new feature, but yes, it sounds reasonable.

           

          This would be similar to the slow-consumer work Justin is doing; however, this would apply to all the consumers of a queue rather than to just one.

           

           

          I'm still wondering whether we really need this. If we have a slow consumer, the user could just kill it. Or maybe we could change Justin's work by adding another policy: Redistribute. He would then need to apply his calculations to the entire set of consumers: if all consumers are slow, then redistribute.
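          The "if all consumers are slow, then redistribute" rule can be sketched as a check over the whole set of consumers rather than one consumer. This is a hypothetical Java illustration: the Policy enum, the threshold parameter and its msg/sec unit are assumptions, not an existing configuration type.

```java
import java.util.List;

// Hypothetical sketch: the enum is illustrative, not an existing configuration type.
// KILL stands for the per-consumer reaction; REDISTRIBUTE judges the whole consumer set.
class SlowConsumerCheck {
    enum Policy { KILL, REDISTRIBUTE }

    /**
     * Returns true when the REDISTRIBUTE policy should fire: the queue has a backlog
     * and every one of its consumers is below thresholdRate (msg/sec, assumed unit).
     */
    static boolean shouldRedistribute(Policy policy, List<Double> consumerRates,
                                      double thresholdRate, long queuedMessages) {
        if (policy != Policy.REDISTRIBUTE || consumerRates.isEmpty() || queuedMessages == 0) {
            return false;
        }
        for (double rate : consumerRates) {
            if (rate >= thresholdRate) {
                return false; // at least one consumer keeps up: leave the messages here
            }
        }
        return true;
    }
}
```

          Note that this rule only fires when every consumer is under the threshold, so it would not catch a queue whose consumers are individually "fast" but collectively slower than the incoming rate.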

           

           

          We would need to have a policy to stop the redistributor in this case.

          • 2. Re: Message Accumulation due to slow consumers and a proposed solution
            gaohoward

            Yes, it's similar to Justin's work, but his focus is on performance impact, which may not solve the message accumulation problem. For example, even if all consumers are considered fast, as long as their rates differ there is a chance that messages build up on relatively slow queues if the overall producer rate is high enough.

             

            In that case none of the consumers get killed, and messages keep building up on some queues.

             

            Howard

            • 3. Re: Message Accumulation due to slow consumers and a proposed solution
              gaohoward

              And I'll do some research and think about whether it's possible to build this on Justin's work. Anyway, Justin's work is a piece of its own, and we could do this as a simple add-on. That's my current thought.

               

              Thanks

              • 4. Re: Message Accumulation due to slow consumers and a proposed solution
                clebert.suconic

                If consumers are not slow and are still consuming messages, that's a different scenario; it's not starvation.

                 

                There are two things about this task:

                 

                 

                - Load-balancing rate: right now we do equal distribution. We could base it on the weight of the nodes instead (JBM does that, I think).

                - Message redistribution based on slow consumers: that's another thing.
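                The first point, weighting load balancing by node instead of distributing equally, could look like the Java sketch below. It uses smooth weighted round-robin (the scheme nginx uses for weighted upstreams), a swapped-in technique and not a description of what JBM does; the class name and the idea of using each node's total consumer rate as its weight are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of weighted load balancing via smooth weighted round-robin.
// NOT JBM's implementation; weights are hypothetical, e.g. each node's total consumer rate.
class WeightedBalancer {
    private final Map<String, Integer> weights;
    private final Map<String, Integer> current = new LinkedHashMap<>();
    private final int totalWeight;

    WeightedBalancer(Map<String, Integer> weights) {
        if (weights.isEmpty()) {
            throw new IllegalArgumentException("need at least one node");
        }
        this.weights = new LinkedHashMap<>(weights);
        int sum = 0;
        for (Map.Entry<String, Integer> e : this.weights.entrySet()) {
            if (e.getValue() <= 0) {
                throw new IllegalArgumentException("weights must be positive");
            }
            current.put(e.getKey(), 0);
            sum += e.getValue();
        }
        this.totalWeight = sum;
    }

    /** Picks the node that should receive the next message. */
    String next() {
        String best = null;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            // Every node earns its weight; the node with the highest running total wins.
            int c = current.get(e.getKey()) + e.getValue();
            current.put(e.getKey(), c);
            if (best == null || c > current.get(best)) {
                best = e.getKey();
            }
        }
        // The winner pays back the total weight, which interleaves picks smoothly.
        current.put(best, current.get(best) - totalWeight);
        return best;
    }
}
```

                Over any full cycle of totalWeight picks, each node is chosen exactly as many times as its weight, so with weights 2, 4, 12 and 12 nodeA would receive 2 of every 30 messages instead of a quarter of them.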

                 

                So far we have been talking about redistribution.