I have thoroughly read the docs, FAQ, and related discussions (such as https://community.jboss.org/message/568278, https://community.jboss.org/message/551502), but I am not sure whether what we are seeing is a bug or whether I am missing something. I would appreciate your comments.
Setup (simplification of a more complex application):
System P: bunch of producers, sending messages to a queue Q1. Every message sent by a given producer belongs to the same messageGroup, i.e. relationship between producers and groups is 1-1 (this is a simplification over the real application).
System C1: bunch of MDB consumers, reading from queue Q1.
System C2: bunch of MDB consumers, reading from queue Q1.
C1 and C2 are clustered.
Standalone HornetQ: Queue Q1
Each messageGroup is associated with an event stream for a given user-application pair, and those events must be processed in order. Groups are long lived.
1) Producers happily sending messages, being consumed in C1 and C2. C1 cleanly stops. Its groups are quickly reassigned to consumers in C2. No message is lost or out of order.
2) Producers happily sending messages, being consumed in C1 and C2. C1 abruptly disappears (simulated with kill -9). The server waits for connection-ttl to expire, then the groups previously consumed in C1 are assigned to consumers in C2. In this case, some groups sometimes receive out-of-order messages.
Schematic example explaining what we see in case 2:
Producer P1 in system P sends messages belonging to group G1: ..., M1-G1, M2-G1, M3-G1, M4-G1, M5-G1, M6-G1, M7-G1, M8-G1, ...
Those messages were being consumed by a consumer in C1 when C1 abruptly disappeared. When group G1 is reassigned to a consumer in C2, this is what that consumer (sometimes) receives:
M500-G1, M501-G1,..., M700-G1, M498-G1, M499-G1, M701-G1, M702-G1, ...
In this example, messages M498 and M499 arrive out of order, after a while.
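For anyone who wants to reproduce this, here is a minimal sketch of how late arrivals like these could be detected on the consumer side. It assumes each producer stamps its messages with a per-group sequence number; the class name GroupOrderChecker and that sequence number are hypothetical and not part of our real application. In an MDB, the group id would typically come from the JMSXGroupID string property.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: flags messages that arrive after a later message of the same group. */
class GroupOrderChecker {
    // Highest sequence number seen so far, per message group.
    private final Map<String, Long> highestSeen = new HashMap<>();

    /**
     * Records one delivery. Returns true if the message is in order,
     * false if an equal or higher sequence number was already seen for that group.
     */
    synchronized boolean record(String groupId, long sequence) {
        Long previous = highestSeen.get(groupId);
        if (previous != null && sequence <= previous) {
            return false; // late (or duplicate) delivery
        }
        highestSeen.put(groupId, sequence);
        return true;
    }

    /** Returns the sequence numbers that arrived late for one group, in arrival order. */
    static List<Long> lateArrivals(String groupId, long[] arrivals) {
        GroupOrderChecker checker = new GroupOrderChecker();
        List<Long> late = new ArrayList<>();
        for (long sequence : arrivals) {
            if (!checker.record(groupId, sequence)) {
                late.add(sequence);
            }
        }
        return late;
    }

    public static void main(String[] args) {
        // Shortened version of the arrival order from the example above.
        long[] arrivals = {500, 501, 700, 498, 499, 701, 702};
        System.out.println(lateArrivals("G1", arrivals)); // prints [498, 499]
    }
}
```

The checker only remembers the highest sequence number per group, so it reports a message as late without trying to say how late it was.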
My understanding is that case (2) should work the same as case (1) (of course only once the server detects the failure of C1, when connection-ttl expires). The messages not ack'ed by the dead client should be put back at the head of the queue and delivered according to messageGroup rules, so I would expect the messages in group G1 to be delivered in order to a consumer in C2.
But it seems some messages are held back for some reason and delivered only after a while.
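To make that expectation concrete, here is a toy model of what I assume should happen. It is only a sketch of my understanding, not HornetQ's actual implementation; RedeliveryModel and its methods are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of the expected behaviour; not how HornetQ is actually implemented. */
class RedeliveryModel {
    private final Deque<Integer> queue = new ArrayDeque<>();  // messages waiting on the server
    private final List<Integer> unacked = new ArrayList<>();  // delivered to C1, not yet ack'ed

    /** A producer appends a message to the tail of the queue. */
    void send(int sequence) {
        queue.addLast(sequence);
    }

    /** Delivers up to n messages to the consumer in C1 without acking them. */
    void deliverWithoutAck(int n) {
        for (int i = 0; i < n && !queue.isEmpty(); i++) {
            unacked.add(queue.pollFirst());
        }
    }

    /** connection-ttl expired: un-acked messages return to the head, oldest first. */
    void consumerDied() {
        for (int i = unacked.size() - 1; i >= 0; i--) {
            queue.addFirst(unacked.get(i));
        }
        unacked.clear();
    }

    /** Everything the surviving consumer in C2 would receive, in delivery order. */
    List<Integer> drain() {
        List<Integer> delivered = new ArrayList<>(queue);
        queue.clear();
        return delivered;
    }

    public static void main(String[] args) {
        RedeliveryModel model = new RedeliveryModel();
        for (int sequence = 498; sequence <= 503; sequence++) {
            model.send(sequence);
        }
        model.deliverWithoutAck(2); // C1 holds 498 and 499, then is killed
        model.send(504);            // producer keeps sending meanwhile
        model.consumerDied();
        System.out.println(model.drain()); // prints [498, 499, 500, 501, 502, 503, 504]
    }
}
```

In this model M498 and M499 come back before anything newer is delivered, which is the opposite of what we observe in case 2.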
We have tested with different HornetQ releases and played with some settings in the HornetQ server (thread pool sizes, connection-ttl-override, transaction-timeout, ...) and in the MDBs (useLocalTx, CMT/bean controlled transactions, ...), but have not been able to fix it.
The number of out-of-order messages is smaller when useLocalTx is activated in the MDBs, so it seems related to transactions. It is also smaller when thread-pool-max-size and scheduled-thread-pool-max-size are increased. We still see the same behaviour with connection-ttl-override = 15000, transaction-timeout = 10000, transaction-timeout-scan-period = 500.
I can provide test programs. This does not happen in every test, but it is reproducible. I am happy to share configurations, etc., but before overloading you with info, I would first like to check whether my understanding is correct.
Question: Is this expected? Is there any setting that might help?
Thanks for your expertise