2 Replies Latest reply on Oct 19, 2012 7:11 AM by ataylor

Message redistribution halts after messages redistributed twice

ztj Oct 17, 2012 2:28 PM

I have attached a simple (maven compilable) example that can be modified slightly to demonstrate different issues I have been experiencing. I will address first the most important issue which should be demonstrated by default.

To explain the example code and what it does:

Start two embedded servers in a cluster configured not to route messages to nodes without consumers and to redistribute with 0 delay; other, less important details can be found in the XML configuration files. The cluster self-discovers over UDP.
Connect a client separately to each cluster node and create an example queue, disconnecting after each performance of this task.
Connect a single producer to the first node and begin sending messages, grouped, to the example queue.
Simultaneously, and after a small delay, a process begins where a consumer thread is created with fresh server discovery/session/etc. instances, makes a connection to one of the two nodes, receives a fixed number of messages then disconnects. This process is repeated where the next thread created connects to the opposite server as the last and pulls in the same number of messages until all messages expected are received.

What is expected is that all messages will be received, in the order they were published. It is expected that message redistribution will occur each time a consumer is connected.
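The ordering expectation can be checked mechanically if every message carries an increasing sequence number. The following is only a minimal sketch under that assumption; the attached example may verify order differently, and `OrderCheck` and `firstOutOfOrder` are hypothetical names, not part of HornetQ.

```java
import java.util.List;

final class OrderCheck {
    /**
     * Returns the index of the first received sequence number that is not
     * strictly greater than its predecessor, or -1 if the list is in order.
     * Assumes the producer stamped messages with increasing integers.
     */
    static int firstOutOfOrder(List<Integer> received) {
        for (int i = 1; i < received.size(); i++) {
            if (received.get(i) <= received.get(i - 1)) {
                return i;
            }
        }
        return -1;
    }
}
```

Each consumer thread would append what it receives to one shared list, and the check would run once all expected messages have arrived.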

What actually happens is that the first consumer connects to the server opposite the one the messages were first delivered to, receives messages in order, then disconnects. The second consumer connects to the original server, still receives messages in order, and disconnects. The third consumer then connects to the alternate server again and waits for the next block of messages, but never receives anything.

What appears to be happening is that message redistribution occurs for the first consumer (since it connects to the server opposite the one the original messages are being delivered to), then again for the second consumer, redistributing back to the original server, but never again.

NOW HERE IS WHERE IT GETS INTERESTING

If I comment out the line that adds the message grouping header in the publishing thread, message redistribution continues to occur until all of the messages have been received by the alternating consumer threads. However, the message order is entirely scrambled by the end of the test.

So I have one problem that is definitely a problem: why does message redistribution stop after two consumer relocations with message grouping on, and only with message grouping on? And a second question: is message order in a cluster with redistribution guaranteed when there is only ever one single producer and one single consumer for a given address at any given moment?

The first problem is almost definitely a bug.

And by the way, I'm testing this against 2.3.0.Beta1, but it also happens in the latest 2.2.x, except that there message order is not always preserved for grouped messages during redistribution (it's hard to reproduce because it appears to be a race condition that rarely happens on my machine, but it does indeed happen in 2.2.21).

The second problem may or may not be expected behavior. I can work around it if it's not. I would rather not have to, but, it's not a dealbreaker for me. The first problem is, though.

You see, I'm trying to build an elastic message-oriented system and I need to be able to add and remove cluster nodes permanently. This isn't a failover issue. I just need to adjust capacity over time. In order to make this as easy as possible, I would relocate all the clients on a node scheduled for removal to trigger message redistribution. I would use the management tools to watch for the queues to empty. Once all are empty, I would go ahead and remove that cluster node. I can easily control the client connections (they will all be InVM, and cluster nodes will be connected with the usual netty connector) and almost everything works just how I need it.
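The "watch for the queues to empty" step could be a simple polling loop. This is a rough sketch only: `DrainWait` and `awaitEmpty` are hypothetical names, and the `messageCount` supplier is an assumption standing in for whatever the management tools report for the queues on the node being removed.

```java
import java.util.function.LongSupplier;

final class DrainWait {
    /**
     * Polls messageCount until it reports 0 or the timeout elapses.
     * Returns true if the queue was seen empty, false on timeout or interrupt.
     * How the count is read from the broker is deliberately left out.
     */
    static boolean awaitEmpty(LongSupplier messageCount, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (messageCount.getAsLong() == 0) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A `false` result would mean the node still holds messages, which is exactly the state the stalled redistribution described above produces, so the node should not be removed yet.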

It's only the resolution of these last two issues that I need to know about before I can go ahead with the project using HornetQ.

So I appreciate any help I can get here, even if it's just a confirmation/denial of the expectations on the second issue, and a guess as to when the first issue will be fixed (or perhaps some definitive explanation of what I am somehow doing wrong that stops redistribution after two relocations).

Thanks!

TwiceOnlyDistExample.zip 23.5 KB

1. Re: Message redistribution halts after messages redistributed twice

ataylor Oct 18, 2012 6:22 AM (in response to ztj)

I'm looking at why delivery stops now. However, regarding message order, it is *not* guaranteed in any scenario when redistribution occurs. This is because messages sent by a single producer can end up on any server, so in your test this could happen:

1. You send 5000 messages to server0, which are redistributed to server1 and sit there.
2. A consumer connects to server1, consumes 250 messages, and disconnects.
3. A consumer connects to server0.
4. server1 starts to redistribute and the consumer receives.
5. The consumer disconnects and server1 stops redistributing messages.

At this point you will have messages on both servers, so order is no longer guaranteed. The fact that it happens only when you don't use message groups is probably coincidental; maybe with groups there is enough time for all the messages to be redistributed.
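The steps above can be played through with two in-memory queues to see where the gap appears. This is a toy model, not HornetQ code: `RedistributionOrderSketch` is a hypothetical name, the figure of 400 messages moved before the second consumer leaves is an arbitrary assumption, and redistributed messages are assumed to be appended behind whatever the target server already holds.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class RedistributionOrderSketch {
    // Move up to n messages from the head of one queue to the tail of another.
    static void redistribute(Deque<Integer> from, Deque<Integer> to, int n) {
        for (int i = 0; i < n && !from.isEmpty(); i++) {
            to.addLast(from.pollFirst());
        }
    }

    // Consume up to n messages from the head of a queue.
    static List<Integer> consume(Deque<Integer> queue, int n) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < n && !queue.isEmpty(); i++) {
            out.add(queue.pollFirst());
        }
        return out;
    }

    static List<Integer> run() {
        Deque<Integer> server0 = new ArrayDeque<>();
        Deque<Integer> server1 = new ArrayDeque<>();
        for (int i = 0; i < 5000; i++) {
            server0.addLast(i);
        }
        List<Integer> received = new ArrayList<>();

        // Step 1: no consumer on server0, so everything moves to server1.
        redistribute(server0, server1, 5000);
        // Step 2: a consumer on server1 takes 250 messages and disconnects.
        received.addAll(consume(server1, 250));
        // Steps 3-5: a consumer connects to server0; assume 400 messages
        // arrive there before it has taken its 250 and disconnected.
        redistribute(server1, server0, 400);
        received.addAll(consume(server0, 250));
        // Next consumer connects to server1: its local messages are at the
        // head, and the 150 stranded on server0 arrive behind them.
        redistribute(server0, server1, 150);
        received.addAll(consume(server1, 250));
        return received;
    }
}
```

In this model the third consumer starts at message 650 while messages 500 to 649 are still queued behind everything else on server1, so the stream seen by the application is no longer in publish order.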

One other point: when you use message groups in a cluster you should try to keep the number of producers/consumers stable. There's a section on this in the docs; this is to avoid the scenario that you see. I would probably have redistribution switched off and have at least one consumer per server.

I will spend some time investigating today to see if there are any issues.
2. Re: Message redistribution halts after messages redistributed twice

ataylor Oct 19, 2012 7:11 AM (in response to ataylor)

I've found a bug, which I have raised as https://issues.jboss.org/browse/HORNETQ-1061

However, I would try to configure your application not to get into a state where this occurs.