We're using HornetQ as our JMS server and we see have issues with latency in our system when a consumer is crashing. With crashing I mean for example the host machine is going out of memory.
The problem we have is that if a consumer is crashing the latency of messages for other consumers increases dramatically. One slow consumer has a big impact on the entire system.
We can best repeat this in our load test environment of which I will explain the setup here.
The primary goal of the load test is to measure the latency of messages, i.e. the time it takes for a message to go from the producer via HornetQ to the consumer.
In the environment there is one HornetQ server running. There are 5 other machines on which we run the producers and consumers.
Distributed over those 5 machine are 100 producers. Each producer has it's own connection to HornetQ. There is an exactly same number of consumers on those machines.
There are 100 topics to publish messages on.
All 20 producers on machine 1 will publish messages randomly on topics 1 till 20.
All 20 producers on machine 2 will publish messages random on topics 21 till 40.
Similarly distributed over those 5 machine are the 100 consumers. Each consumer also has it's own connection with HornetQ. Each consumer will exclusively subscribe one topic.
A consumer on machine 1 will subscribe on one topic from the range 1 till 20, but no two consumers can subscribe on the same one.
A consumer on machine 2 will subscribe on one topic from the range 21 till 40.
etc etc etc
As you can see consumers on a machine will only subscribe on messages being published on the same machine.
Combined all producers will publish 150,000 messages per second. This will keep this running for 5 minutes and then we look at the average latency.
With the current configuration parameters of HornetQ that is around 1.2ms.
Where everything goes south is the point where I let a custom consumer subscribe on all of the topics. I want to simulate it is running on a crashing machine, so I also add a breakpoint on ChannelImpl.send(Packet,boolean,boolean) and on ChannelImpl.handlePacket(Packet). These breakpoint will log the evaluated expression of Thread.sleep(1000l) and not pause the process. So in effect there is a one second delay every time one of these methods is called.
To no surprise the latency of the messages received on this client skyrockets. In the end the messages end up being expired in HornetQ server.
What we also see that the latency of messages send to the client subscribing on the same topic also skyrockets to from +-1 millisecond to 7 seconds and more.
If I do the same, but do not add the delays in the ChannelImpl, all other consumers are fine.
What we want to prevent is that this one slow consumer makes other consumers slow as well.
Do you guys have any input on this? Maybe important as well, do you actually consider this a valid use case (one slow consumer should not take down others as well)?
Hornetq is configured like this:
nio netty connector
Default thread pools
block on send is put to false
message are pre-acknowledged
message expiry time is 15 seconds
message expiry scan of 7.5 seconds
connection time out is 7.5 seconds
client failure check period 1.875 seconds
confirmation window size is 1Mb
all messages are non-durable
We think that one of the problems is the lock contention within HornetQ. I've attached a stack dump where you can see that all IO Workers threads are blocked by the thread which is cleaning up expired messages. This means that for the duration of the lock, producers cannot push messages anymore to other queue as well.
For a completely different problem we spotted that ServerMessageImpl.incrementRefCount and ServerMessageImpl.decrementRefCount where heavily contented.
These methods on which the other threads are blocked do not need a lock at all. The consumerList in QueueImpl can be replaced with a copyonwritearraylist, thereby removing the need for a synchronisation lock on hasMatchingConsumer. Similarly for the ServerMessageImpl the counts can be replaced with AtomicIntegers, thereby also removing the need to put so many locks on ServerMessageImpl.
Do you guys have any input on this as well?
hornetq.txt.zip 4.7 KB