Message Accumulation due to slow consumers and a proposed solution
gaohoward Aug 6, 2014 10:38 AM
== The issue
In a cluster, when a local queue is empty, its consumers starve. If there are messages on remote queues, those messages won't be redistributed as long as the remote queues have local consumers, even if those consumers are slow and messages are accumulating.
Suppose we have a queue deployed in a 4-node cluster (nodeA, nodeB, nodeC and nodeD) and a producer is sending messages to nodeA at a rate of 40 messages / sec.
With the messages load-balanced evenly across the cluster, each node gets 10 messages from the producer per second. Suppose the consumers on each node and their rates are as follows:
NodeA - 2 consumers:
consumer1A : 1 message / sec
consumer2A : 1 message / sec
NodeB - 1 consumer:
consumer1B : 4 messages / sec
NodeC - 2 consumers:
consumer1C : > 5 messages / sec
consumer2C : > 5 messages / sec
NodeD - 1 consumer:
consumer1D : > 10 messages / sec
When all are up and running, the queues at nodeC and nodeD will often be empty because their consumers are fast.
The queue at nodeA will accumulate messages at 8 messages/sec (10 in, 2 out). The queue at nodeB will accumulate messages at 6 messages/sec (10 in, 4 out).
If the producer stops after 30 seconds, the backlog at each node is:
nodeA : 8 * 30 = 240 messages
nodeB : 6 * 30 = 180 messages
nodeC and nodeD : 0 messages
To consume all those messages, nodeA will take 240/2 = 120 seconds and nodeB will take 180/4 = 45 seconds. NodeC and nodeD will be idle during this whole time.
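The arithmetic above can be checked with a small script. This is only a sketch of the example numbers, not broker code: the function names are made up for illustration, and since the rates for nodeC and nodeD are only given as lower bounds, the values used for them are assumed to be just enough to keep up.

```python
# Per-node arrival rate: 40 messages/sec spread evenly over 4 nodes.
ARRIVAL_PER_NODE = 40 / 4

# Total consumer rate per node (messages/sec). The nodeC and nodeD values
# are assumptions: the example only says "> 5" per consumer and "> 10".
consume_rate = {"nodeA": 1 + 1, "nodeB": 4, "nodeC": 5 + 5, "nodeD": 10}


def backlog_after(seconds, arrival, rate):
    """Messages left in a queue after `seconds` of steady arrival."""
    return max(0, (arrival - rate) * seconds)


def drain_time(backlog, rate):
    """Seconds for the local consumers alone to empty the backlog."""
    return backlog / rate


for node, rate in consume_rate.items():
    backlog = backlog_after(30, ARRIVAL_PER_NODE, rate)
    print(node, backlog, drain_time(backlog, rate))
```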
== The proposed solution:
When a node has been idle for a certain time (e.g. 2 seconds, configurable), it sends a 'STARVATION' notification message. In the above case nodeC and nodeD will send it.
Nodes in the cluster that receive this notification will trigger a 'message redistribution' as long as they have messages in their queues. In the above case:
NodeA receives the notifications and, since its queue is not empty, triggers a 'message redistribution'; so does nodeB.
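To make the idea a bit more concrete, here is a minimal sketch of the two sides: the idle check on the starving node and the decision on the receiving node. The class and function names are hypothetical, not existing broker APIs. It also assumes "idle" means an empty queue with at least one consumer attached, and that the notification is sent once per idle period; neither detail is fixed by the proposal.

```python
import time


class StarvationDetector:
    """Tracks how long a local queue has been empty while consumers wait."""

    def __init__(self, idle_threshold_sec=2.0, clock=time.monotonic):
        self.idle_threshold_sec = idle_threshold_sec  # configurable idle time
        self.clock = clock
        self.idle_since = None
        self.notified = False

    def on_queue_state(self, message_count, consumer_count):
        """Call periodically; True means a STARVATION notification is due."""
        starving = message_count == 0 and consumer_count > 0
        if not starving:
            # Queue has work (or nobody is waiting): reset the idle period.
            self.idle_since = None
            self.notified = False
            return False
        now = self.clock()
        if self.idle_since is None:
            self.idle_since = now
        idle_for = now - self.idle_since
        if not self.notified and idle_for >= self.idle_threshold_sec:
            self.notified = True  # assumption: notify once per idle period
            return True
        return False


def should_redistribute(local_message_count):
    """Receiver side: react only if there is something to move."""
    return local_message_count > 0
```

Passing the clock in makes the timing easy to test without sleeping.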
== Redistribution Details
Unlike the existing redistribution (redistribution-delay), this kind of redistribution targets only the nodes that sent the 'STARVATION' notification.
In the scenario above:
nodeA gets notifications from nodeC and nodeD. It keeps a list of them: {nodeC, nodeD}.
nodeB also gets notifications from nodeC and nodeD, and keeps the same list: {nodeC, nodeD}.
So the messages will be redistributed from nodeA and nodeB among nodeC and nodeD.
Not all messages need to be redistributed, as the queue still has consumers (even if they're slow).
We can decide how many of the total messages will be redistributed by a fixed ratio (e.g. 50%).
(The actual amount may be less, because some messages won't be redistributed because of grouping.)
With the redistribution, the messages can be evened out among the nodes within a reasonable time.
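As a rough illustration of the fixed ratio, the sketch below splits a share of a node's backlog among the nodes in its starvation list. The function name is hypothetical, the even split between starving nodes is an assumption (the proposal only says the messages go to those nodes), and message grouping is ignored here, so real numbers could be lower.

```python
def plan_redistribution(backlog, starving_nodes, ratio=0.5):
    """Split `ratio` of the backlog as evenly as possible among starving nodes."""
    if not starving_nodes:
        return {}
    to_move = int(backlog * ratio)
    share, extra = divmod(to_move, len(starving_nodes))
    # The first `extra` nodes get one more message so the total adds up.
    return {
        node: share + (1 if i < extra else 0)
        for i, node in enumerate(starving_nodes)
    }


starving = ["nodeC", "nodeD"]
plan_a = plan_redistribution(240, starving)  # nodeA's backlog
plan_b = plan_redistribution(180, starving)  # nodeB's backlog
print(plan_a, plan_b)
```

With the example backlogs and a 50% ratio, nodeA would keep 120 messages and nodeB 90, with the rest handed to nodeC and nodeD.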