I have attached a simple, Maven-buildable example that can be modified slightly to demonstrate the different issues I have been experiencing. I will start with the most important issue, which the example demonstrates by default.
To explain the example code and what it does:
- Start two embedded servers in a cluster. The cluster is configured not to route messages to nodes that have no consumers and to redistribute with a delay of 0; the other, less important settings can be found in the XML configuration files. The cluster nodes discover each other over UDP.
- Connect a client to each cluster node in turn and create an example queue, disconnecting after each queue is created.
- Connect a single producer to the first node and begin sending messages, grouped, to the example queue.
- Concurrently, after a small delay, a consumer thread is created with fresh server discovery, session, etc. instances. It connects to one of the two nodes, receives a fixed number of messages, then disconnects. This is repeated, with each new thread connecting to the opposite server from the previous one and pulling in the same number of messages, until all expected messages have been received.
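To make the consumer rotation in the last step concrete, here is a minimal sketch of the schedule it implies. This is not the code from the attached example: the class name `ConsumerRotation`, the method `plan`, and the idea of numbering messages from 0 are all assumptions made for illustration, and the batch size and message count are arbitrary values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: plans which node each short-lived consumer connects to. */
final class ConsumerRotation {

    /** One consumer run: the node it connects to and the message range it should receive. */
    static final class Step {
        final int node;      // 0 or 1
        final int firstSeq;  // inclusive
        final int lastSeq;   // inclusive

        Step(int node, int firstSeq, int lastSeq) {
            this.node = node;
            this.firstSeq = firstSeq;
            this.lastSeq = lastSeq;
        }

        @Override
        public String toString() {
            return "node " + node + ": messages " + firstSeq + ".." + lastSeq;
        }
    }

    /**
     * The first consumer connects to the node opposite the producer's node,
     * and every following consumer flips to the other node.
     */
    static List<Step> plan(int producerNode, int totalMessages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Step> steps = new ArrayList<>();
        int node = 1 - producerNode;
        for (int first = 0; first < totalMessages; first += batchSize) {
            int last = Math.min(first + batchSize, totalMessages) - 1;
            steps.add(new Step(node, first, last));
            node = 1 - node; // the next consumer uses the other node
        }
        return steps;
    }

    public static void main(String[] args) {
        // Example values only: producer on node 0, 50 messages, 10 per consumer.
        for (Step step : plan(0, 50, 10)) {
            System.out.println(step);
        }
    }
}
```

With grouping enabled, the behavior described in this post corresponds to the first two steps of such a plan completing and the third step never receiving anything.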
What is expected is that all messages will be received, in the order they were published. It is expected that message redistribution will occur each time a consumer is connected.
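As a sketch of how the "in the order they were published" expectation could be checked, assume the producer stamps each message with a sequence number starting at 0 and the consumers record the numbers they receive. Both of those are assumptions for illustration, and `OrderCheck` is a hypothetical helper, not part of the attached example or of HornetQ.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: verifies that received sequence numbers are 0, 1, 2, ... with none missing. */
final class OrderCheck {

    /**
     * Returns the index of the first message that is out of order, unexpected, or missing,
     * or -1 if exactly expectedTotal messages arrived in publishing order.
     */
    static int firstViolation(List<Integer> received, int expectedTotal) {
        for (int i = 0; i < received.size(); i++) {
            if (i >= expectedTotal || received.get(i) != i) {
                return i;
            }
        }
        // Everything received so far was in order; check that nothing is missing.
        return received.size() == expectedTotal ? -1 : received.size();
    }

    public static void main(String[] args) {
        System.out.println(firstViolation(Arrays.asList(0, 1, 2, 3), 4)); // all in order
        System.out.println(firstViolation(Arrays.asList(0, 2, 1, 3), 4)); // scrambled
        System.out.println(firstViolation(Arrays.asList(0, 1), 4));       // consumer starved
    }
}
```

A check like this distinguishes the two symptoms described below: a stalled consumer shows up as a missing tail, while scrambled delivery shows up as an early mismatch.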
What actually happens is that the first consumer connects to the server opposite the one the first messages were delivered to, receives its messages in order, then disconnects. The second consumer connects to the original server, still receives its messages in order, and disconnects. The third consumer then connects to the alternate server again and waits for the next block of messages, but never receives anything.
What appears to be happening is that message redistribution occurs for the first consumer (since it connects to the server opposite the one the messages were originally delivered to), then again for the second consumer, redistributing back to the original server, but never again.
NOW HERE IS WHERE IT GETS INTERESTING
If I comment out the line that adds the message grouping header in the publishing thread, message redistribution continues to occur until all of the messages have been received by the alternating consumer threads. However, the message order is entirely scrambled by the end of the test.
So I have two questions. First, and this one is definitely a problem: why does message redistribution stop after two consumer relocations when message grouping is on, and only when it is on? Second: is message order in a cluster with redistribution guaranteed when there is only ever one producer and one consumer for a given address at any given moment?
The first problem is almost definitely a bug.
By the way, I'm testing this against 2.3.0.Beta1, but it also happens in the latest 2.2.x. The difference there is that message order is not always preserved for grouped messages during redistribution (this is hard to reproduce because it appears to be a race condition that rarely occurs on my machine, but it does indeed happen in 2.2.21).
The second problem may or may not be expected behavior. I can work around it if it's not. I would rather not have to, but, it's not a dealbreaker for me. The first problem is, though.
You see, I'm trying to build an elastic message-oriented system, and I need to be able to add and remove cluster nodes permanently. This isn't a failover issue. I just need to adjust capacity over time. To make this as easy as possible, I would relocate all the clients on a node scheduled for removal in order to trigger message redistribution. I would use the management tools to watch for the queues to empty. Once they are all empty, I would proceed with removing that cluster node. I can easily control the client connections (they will all be InVM, and cluster nodes will be connected with the usual netty connector), and almost everything works just how I need it.
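The "watch for the queues to empty" step could look roughly like the sketch below. It is only an outline under assumptions: `DrainWatcher` and `awaitEmpty` are hypothetical names, and each queue's message count is supplied through a plain `LongSupplier`. How that count is actually obtained from the management tools is deliberately left out.

```java
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical helper: waits until every queue on a node scheduled for removal reports zero messages. */
final class DrainWatcher {

    /**
     * Polls the supplied message counts until all are zero or the timeout expires.
     * Returns true if all queues drained in time, false on timeout or interruption.
     */
    static boolean awaitEmpty(Map<String, LongSupplier> queueCounts, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            boolean allEmpty = true;
            for (LongSupplier count : queueCounts.values()) {
                if (count.getAsLong() > 0) {
                    allEmpty = false;
                    break;
                }
            }
            if (allEmpty) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out with messages still on the node
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

If redistribution stalls the way it does in the grouped case described above, a call like this would simply time out, which is exactly why the first issue blocks this plan.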
It's only the resolution of these last two issues that I need to know about before I can go ahead with the project using HornetQ.
So I appreciate any help I can get here, even if it's just a confirmation/denial of the expectations on the second issue, and a guess as to when the first issue will be fixed (or perhaps some definitive explanation of what I am somehow doing wrong that stops redistribution after two relocations).
TwiceOnlyDistExample.zip 23.5 KB