15 Replies. Latest reply on Apr 23, 2013 12:12 PM by clebert.suconic

A slow consumer can make consumers on different machines slow as well

steven.hulshof Feb 9, 2013 1:09 PM

Hey guys,

We're using HornetQ as our JMS server and we see latency issues in our system when a consumer is crashing. By crashing I mean, for example, that the host machine is running out of memory.

The problem we have is that if a consumer is crashing the latency of messages for other consumers increases dramatically. One slow consumer has a big impact on the entire system.

We can best reproduce this in our load test environment, whose setup I will explain here.

The primary goal of the load test is to measure the latency of messages, i.e. the time it takes for a message to go from the producer via HornetQ to the consumer.

In the environment there is one HornetQ server running. There are 5 other machines on which we run the producers and consumers.

Distributed over those 5 machines are 100 producers. Each producer has its own connection to HornetQ. There is exactly the same number of consumers on those machines.

There are 100 topics to publish messages on.

All 20 producers on machine 1 publish messages randomly on topics 1 through 20.

All 20 producers on machine 2 publish messages randomly on topics 21 through 40.

etc

Similarly, the 100 consumers are distributed over those 5 machines. Each consumer also has its own connection to HornetQ. Each consumer subscribes exclusively to one topic.

A consumer on machine 1 subscribes to one topic from the range 1 through 20, and no two consumers subscribe to the same one.

A consumer on machine 2 subscribes to one topic from the range 21 through 40.

etc etc etc

As you can see, consumers on a machine only subscribe to messages published from that same machine.

Combined, all producers publish 150,000 messages per second. We keep this running for 5 minutes and then look at the average latency.

With the current configuration parameters of HornetQ that is around 1.2ms.
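The average-latency figure described above can be computed along these lines. This is only a minimal sketch: `LatencyStats` and its methods are hypothetical names, not part of the load test or of HornetQ, and it assumes both timestamps come from the same clock (which fits the setup above, where each consumer only receives messages published from its own machine).

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical helper: accumulates per-message latencies and reports the mean. */
final class LatencyStats {
    private final LongAdder totalNanos = new LongAdder();
    private final LongAdder count = new LongAdder();

    /** Both timestamps are in nanoseconds and must come from the same clock. */
    void record(long sentAtNanos, long receivedAtNanos) {
        totalNanos.add(receivedAtNanos - sentAtNanos);
        count.increment();
    }

    /** Mean latency in milliseconds, or 0.0 if nothing was recorded. */
    double averageMillis() {
        long n = count.sum();
        return n == 0 ? 0.0 : totalNanos.sum() / (double) n / 1_000_000.0;
    }
}
```

Note that `System.nanoTime()` values are only comparable within a single JVM. If producer and consumer run in separate processes on the same host, a wall-clock timestamp carried in the message would be needed instead.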

Where everything goes south is the point where I let a custom consumer subscribe to all of the topics. I want to simulate that it is running on a crashing machine, so I also add breakpoints on ChannelImpl.send(Packet,boolean,boolean) and on ChannelImpl.handlePacket(Packet). These breakpoints log the evaluated expression Thread.sleep(1000l) and do not pause the process. In effect there is a one-second delay every time one of these methods is called.

Unsurprisingly, the latency of the messages received on this client skyrockets. In the end the messages expire in the HornetQ server.

What we also see is that the latency of messages sent to the client subscribing to the same topic skyrockets as well, from about 1 millisecond to 7 seconds and more.

If I do the same, but do not add the delays in the ChannelImpl, all other consumers are fine.

What we want to prevent is that this one slow consumer makes other consumers slow as well.

Do you guys have any input on this? Maybe important as well, do you actually consider this a valid use case (one slow consumer should not take down others as well)?

HornetQ is configured like this:

nio netty connector

Default thread pools

no paging

no persistence

no transactions

block on send is set to false

messages are pre-acknowledged

message expiry time is 15 seconds

message expiry scan of 7.5 seconds

connection time out is 7.5 seconds

client failure check period 1.875 seconds

confirmation window size is 1Mb

all messages are non-durable

We think that one of the problems is lock contention within HornetQ. I've attached a stack dump in which you can see that all I/O worker threads are blocked by the thread that is cleaning up expired messages. This means that for the duration of the lock, producers cannot push messages to other queues either.

While investigating a completely different problem, we spotted that ServerMessageImpl.incrementRefCount and ServerMessageImpl.decrementRefCount were heavily contended.

The methods on which the other threads are blocked do not need a lock at all. The consumerList in QueueImpl can be replaced with a CopyOnWriteArrayList, thereby removing the need for a synchronisation lock on hasMatchingConsumer. Similarly, the counts in ServerMessageImpl can be replaced with AtomicIntegers, which removes the need for so many locks on ServerMessageImpl.
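The two replacements proposed above could look roughly like this. It is a sketch with hypothetical class names (`LockFreeConsumerList`, `RefCount`), not the actual QueueImpl or ServerMessageImpl code, and it assumes that nothing else has to change atomically together with the list or the counter.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

/** Hypothetical stand-in for a queue's consumer bookkeeping; not HornetQ code. */
final class LockFreeConsumerList<C> {
    private final List<C> consumers = new CopyOnWriteArrayList<>();

    void add(C consumer) { consumers.add(consumer); }

    void remove(C consumer) { consumers.remove(consumer); }

    /** Iterates over a snapshot of the list, so readers never wait on a lock. */
    boolean hasMatchingConsumer(Predicate<? super C> matches) {
        for (C c : consumers) {
            if (matches.test(c)) {
                return true;
            }
        }
        return false;
    }
}

/** Hypothetical stand-in for a message reference counter. */
final class RefCount {
    private final AtomicInteger refs = new AtomicInteger();

    int increment() { return refs.incrementAndGet(); }

    int decrement() { return refs.decrementAndGet(); }
}
```

A CopyOnWriteArrayList copies the backing array on every write, so it suits a consumer list that is read far more often than it changes. Whether that holds for a given workload is something to measure rather than assume.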

Do you guys have any input on this as well?

hornetq.txt.zip 4.7 KB

1. Re: A slow consumer can make consumers on different machines slow as well

clebert.suconic Feb 9, 2013 1:18 PM (in response to steven.hulshof)

did you try direct-deliver = false on Netty?

did you try NIO (on netty, I'm not talking about the journal... NIO on netty)

With direct deliver the system will try to perform the delivery inside the queue. That's of course only good when you can guarantee the quality of your consumers. If you can't, you must disable direct deliver on the acceptor.
2. Re: A slow consumer can make consumers on different machines slow as well

steven.hulshof Feb 9, 2013 3:47 PM (in response to clebert.suconic)

We're currently running on NIO and the problem persists.

I did disable directDeliver, but then the latency figures became unacceptable in normal flow. I never continued down this path to investigate whether it solved the problem.
I'm going through the HornetQ code now and I doubt it will actually solve the problem.
Disabling direct deliver will eventually start a separate thread, which will call QueueImpl.deliver. In this method the message is sent with the QueueImpl.handle(MessageReference,Consumer) method. This is a blocking call, no matter whether you use NIO or OIO. This method is synchronized, and so is its caller, QueueImpl.deliver. So while QueueImpl.deliver is busy sending, the entire QueueImpl instance is locked and cannot be used by another thread. We know the machine we're sending the message to is struggling, so sending the message will take a lot of time. This means the QueueImpl will be locked for a relatively long time.
And while this thread is busy sending the message to the failing consumer, no other thread can ask the QueueImpl whether an incoming message is applicable for this queue, and at the same time the other consumers in the queue's consumerList are waiting for the message to be delivered to them. All those threads have to wait until QueueImpl.deliver has either delivered the message or NettyConnection.write(HornetQBuffer,boolean,boolean) times out on its future after 10(!) seconds.

A nice new feature would be to make the actual delivery of the message asynchronous.
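One way to picture asynchronous delivery is to give every consumer its own single-threaded lane, so that a blocked consumer only delays its own messages. The sketch below is purely illustrative: `AsyncDispatcher` is a hypothetical class, not a HornetQ API, and its unbounded per-consumer backlog is an assumption that a real broker would have to replace with limits or expiry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

/** Hypothetical dispatcher: one single-threaded executor per consumer. */
final class AsyncDispatcher<M> implements AutoCloseable {
    private final Map<String, ExecutorService> lanes = new ConcurrentHashMap<>();
    private final Map<String, Consumer<M>> handlers = new ConcurrentHashMap<>();

    void register(String consumerId, Consumer<M> handler) {
        // Create the lane first so publish() never sees a handler without one.
        lanes.put(consumerId, Executors.newSingleThreadExecutor());
        handlers.put(consumerId, handler);
    }

    /** Returns immediately; each consumer drains its own lane in order. */
    void publish(M message) {
        handlers.forEach((id, handler) ->
                lanes.get(id).execute(() -> handler.accept(message)));
    }

    /** Interrupts running handlers and discards any queued deliveries. */
    @Override
    public void close() {
        lanes.values().forEach(ExecutorService::shutdownNow);
    }
}
```

With this shape, a consumer that sleeps for seconds inside its handler does not hold up `publish` or the other consumers. The cost is one thread per consumer and a backlog that grows for as long as the slow consumer stays slow.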

About your remark on the quality of consumers: one can never safeguard against hardware failure of a single client. Applied to this case, your comment means that everyone should always disable direct-deliver, because you can never guarantee the quality of your consumers.

btw: in the NettyConnection class an InterruptedException is caught at line 230, but the thread's interrupted status is not restored. That is a bug, see http://www.ibm.com/developerworks/java/library/j-jtp05236/index.html
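The usual idiom for code that catches InterruptedException without rethrowing it is to set the interrupt flag again. The snippet below is a generic illustration with a hypothetical helper (`InterruptAware.pause`); it is not the NettyConnection code.

```java
/** Hypothetical blocking helper; illustrates the idiom only. */
final class InterruptAware {
    /** Sleeps up to millis; returns false if interrupted, leaving the flag set. */
    static boolean pause(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // Thread.sleep clears the interrupted status when it throws,
            // so set it again for callers further up the stack.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Callers, for example an executor's worker loop, can then still observe the interrupt through `Thread.currentThread().isInterrupted()` and shut down cleanly.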
3. Re: A slow consumer can make consumers on different machines slow as well

clebert.suconic Feb 10, 2013 9:13 PM (in response to steven.hulshof)

AFAIK an interrupt on a thread in these cases comes from either the Netty executor or our executor, and as far as I remember the thread will be destroyed if interrupted (it won't be reused). I recently changed the behaviour to treat interrupts, and from what I saw these threads were going to die. I will review whether that's not the case. (Or if you're telling me that's not the case, this definitely needs fixing.)

Also: I will work on this case tomorrow (my monday). Thanks for bringing it up.
4. Re: A slow consumer can make consumers on different machines slow as well

clebert.suconic Feb 11, 2013 4:13 PM (in response to clebert.suconic)

Thanks for the post. It shed a lot of light on what you are going through now.

We first need to clear the testsuite on your PR. I'm looking more closely at your proposed changes now.

Would you be open to joining IRC? That would be very productive, I think.

And regarding the InterruptedException: I'm pretty sure the thread would be dying after an interrupt. I just throw an exception and let the caller decide what to do. As far as I know the Netty executor will not put the thread back in the pool of threads.
5. Re: A slow consumer can make consumers on different machines slow as well

clebert.suconic Feb 11, 2013 5:43 PM (in response to clebert.suconic)

I'm having some difficulty getting your branch here. Can you rebase on master and squash your commits? I'm seeing conflicts between your own commits, so I'm confused there.
6. Re: A slow consumer can make consumers on different machines slow as well

clebert.suconic Feb 14, 2013 7:46 PM (in response to clebert.suconic)

I just submitted a PR with these changes. I'm still working on some other cases and I will write a test for it soon.

I have opened a JIRA: https://issues.jboss.org/browse/HORNETQ-1137

And this is the PR: https://github.com/hornetq/hornetq/pull/860
7. Re: A slow consumer can make consumers on different machines slow as well

steven.hulshof Feb 15, 2013 3:34 AM (in response to clebert.suconic)

Great. I'll be following the Jira and PR.
8. Re: A slow consumer can make consumers on different machines slow as well

clebert.suconic Feb 28, 2013 5:04 PM (in response to steven.hulshof)

Can you take a look at the latest changes? It will cope with your case really well now.

Even non-direct deliver should see an improvement in its latency.
9. Re: A slow consumer can make consumers on different machines slow as well

steven.hulshof Mar 1, 2013 12:18 PM (in response to steven.hulshof)

I had a brief look today, but it was too much to go through quickly. I'll have a look next week.
Thanks for the work! You made some people here very happy by the knowledge that you are working on these kinds of problems.
10. Re: A slow consumer can make consumers on different machines slow as well

steven.hulshof Mar 10, 2013 5:30 PM (in response to clebert.suconic)

I was finally able to have a look.
Nice that you were able to add a test for this. Functionally everything looks fine, but we won't be able to test it until 2.3.0 is released. Do you have any idea when that will be? Is 2.3 backwards compatible with 2.2?
The only remark I have is that the design looks very complicated now. The Queue takes over from the ServerConsumer the responsibility for deciding when to publish messages. Is there no way to keep the old Consumer interface (so take the proceedDeliver message out again) and make the deliver method unsynchronized for the deliver part? I understand that shared variables must be visible across threads, but is the lock that synchronized enforces really necessary? I can't fully judge this, but it appears to me that in the original code the scope of the synchronized block could have been limited to the code that uses the messageReferences variable. The other member variable being used is the pos integer, but that can be converted into an AtomicInteger.
11. Re: A slow consumer can make consumers on different machines slow as well

clebert.suconic Mar 12, 2013 1:12 AM (in response to steven.hulshof)

The distribution has to be done within a lock, but not the actual delivery. That's why I split it into two separate methods.

I honestly couldn't make it better, and this is not just about that counter. It has to be atomic with a lot of the things happening in the distribution, and I'm not sure how to do that without the lock on those things.
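The split described here, choosing the target under the lock and sending outside it, can be illustrated with a small sketch. `TwoPhaseQueue` and `deliverOne` are hypothetical names and the round-robin policy is an assumption; this is not the code from the PR.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical round-robin queue; names do not correspond to HornetQ classes. */
final class TwoPhaseQueue<M> {
    private final Queue<M> messages = new ArrayDeque<>();
    private final List<Consumer<M>> consumers = new ArrayList<>();
    private int pos; // guarded by "this"

    synchronized void addConsumer(Consumer<M> consumer) { consumers.add(consumer); }

    synchronized void add(M message) { messages.add(message); }

    /** Delivers one message if possible; returns false if nothing was delivered. */
    boolean deliverOne() {
        M message;
        Consumer<M> target;
        synchronized (this) {
            // Phase 1: distribution. Short, and atomic with the queue state.
            if (messages.isEmpty() || consumers.isEmpty()) {
                return false;
            }
            message = messages.poll();
            target = consumers.get(pos);
            pos = (pos + 1) % consumers.size();
        }
        // Phase 2: delivery. Potentially slow, so no lock is held here.
        target.accept(message);
        return true;
    }
}
```

Because the second phase runs without the lock, a slow `accept` no longer prevents other threads from adding messages or picking the next target. If several threads call `deliverOne` concurrently, messages for the same consumer may arrive out of order, so a real implementation would need additional per-consumer ordering.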
12. Re: A slow consumer can make consumers on different machines slow as well

clebert.suconic Mar 12, 2013 1:13 AM (in response to clebert.suconic)

2.3 is compatible with 2.2 on the client... However I'm working now on bringing this into 2.2 branches. I wanted to avoid it.. but I will have to do it for some issues we had.
13. Re: A slow consumer can make consumers on different machines slow as well

steven.hulshof Mar 12, 2013 4:57 AM (in response to clebert.suconic)

Okay, I thought as much because otherwise you probably wouldn't have split it up this way.

So we first upgrade our clients to the newest, yet to be released, 2.2 version of HornetQ before we can migrate the server to 2.3 without issues, correct?
14. Re: A slow consumer can make consumers on different machines slow as well

clebert.suconic Mar 12, 2013 1:13 PM (in response to steven.hulshof)

There are no changes on the client runtime for this.... you could get the current 2.2 and run against 2.3...

We are moving these changes to the server on 2.2... there are a few customers (from the EAP side) that won't wait for 2.3.