Adding a few details:
- the missing half of the messages accumulates on the cluster core queue (e.g. address="sf.my-explicit-cluster.1fa2774...")
- the cluster core queue believes it has one consumer (ConsumerCount=1). That consumer was the killed node - the 'real' consumer count is zero.
So it seems that the remaining node distributes messages to its (existing, healthy) consumer but also to the dead cluster node.
Does this mean:
- that a HornetQ cluster won't handle a kill -9 (or a brutal crash) of a node, the proper approach being live/backup pairs?
- OR that there is something awry in my config or in the code?
HornetQ 2.3.1 in EAP 6.1 behaves the same way.
Any thoughts on how this could be fixed?
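The symptom described in this post can be reproduced with a toy model. This is only a sketch, not HornetQ code: it assumes plain round-robin between the local consumer and the remote binding, and the names `distribute`, `local-consumer` and `dead-node-sf-queue` are made up for illustration.

```python
from itertools import cycle


def distribute(num_messages, targets):
    """Round-robin num_messages over targets and count what each one receives."""
    counts = {name: 0 for name in targets}
    next_target = cycle(targets)
    for _ in range(num_messages):
        counts[next(next_target)] += 1
    return counts


# After kill -9 the remote binding is still registered (ConsumerCount=1),
# so it keeps receiving its share, but nothing drains its queue.
stale = distribute(1000, ["local-consumer", "dead-node-sf-queue"])
print(stale)

# If the stale binding were removed, the local consumer would get everything.
clean = distribute(1000, ["local-consumer"])
print(clean)
```

In this model the messages counted under `dead-node-sf-queue` are the "missing half": they are accepted by the live node but never reach a consumer while the stale binding stays registered.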
The bridge will not know that the node has crashed until the TTL is reached and a ping fails; until then it will keep forwarding messages to the queue.
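To make the timing in this reply concrete, here is a rough sketch of TTL-based failure detection. It is a simplified model, not the HornetQ implementation: it assumes the peer is checked at fixed intervals and declared dead at the first check on or after `last data + TTL`. The function name, parameter names and numbers are illustrative assumptions.

```python
import math


def first_failed_check(last_data_at_ms, ttl_ms, check_period_ms):
    """Time of the first periodic check at which the peer is considered dead.

    Assumption: checks run at multiples of check_period_ms, and the peer
    is declared dead once ttl_ms has passed since the last data was received.
    """
    deadline = last_data_at_ms + ttl_ms
    return math.ceil(deadline / check_period_ms) * check_period_ms


# Illustrative values only: last data at 10 s, TTL 60 s, check every 30 s.
detected_at = first_failed_check(10_000, 60_000, 30_000)
print(detected_at - 10_000, "ms during which messages are still forwarded")
```

With these example numbers the crash goes unnoticed for 80 seconds, and every message routed to the dead node in that window ends up queued for it.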
Is it configurable? If it is hardcoded, what is the TTL value?
Sorry, I am not yet familiar enough with HornetQ.
I waited up to 40 minutes after killing a node, and the other node did not start processing the queued messages.
It only starts when the "crashed" node comes back up.
The messages will be on the crashed server. If you want them, you will need to provide a backup server.
Nope, I mean something else: I do not care about the unprocessed messages left on the crashed server, but about new messages that arrive at the live server after the crash. They are successfully received by the live server, but only half of them are processed by the consumer on the live server. The other half seems to be stored "for the crashed server" and is not processed on the live node.
When I start the crashed server again, the messages left unprocessed (stored?) are processed on the server which HAS NOT BEEN DOWN, not on the one I just started. After that everything works fine.
BUT! If I try another scenario, stopping a server gracefully (instead of emulating a crash), all new incoming messages are properly processed on the live one; I think it does a proper handover for the stopped consumers.
To me the "kill -9" scenario seems to be a critical bug, and it is found in EAP 6.1 Final, which is a commercial release (hornetq-2.3.1-redhat version)! I do not yet have a Red Hat subscription to file an issue or support request, so I am just trying to figure out whether I did something wrong.
EDIT: found old bug, seems to be the same issue: https://issues.jboss.org/browse/JBPAPP-5799
Like I said before, this is not a bug; they will be sitting on the live server. If you want HA then you need backup servers. The reason a graceful shutdown works is probably just timing, and the messages are redistributed. Again, not a bug; use active/passive servers to recover messages.
Sorry, I don't get the point.
Maybe it's just a misunderstanding of the HornetQ architecture.
1. I don't need to "recover" messages, I just want to successfully process new incoming messages on a single live server.
2. I have 2 active/active JBoss EAP servers running in a cluster and see that messages sent to a JMS queue are successfully processed by consumers on both nodes, but when 1 node fails, half of the _new_ messages do not reach the consumer on the running server.
3. What if a server has a HW failure and can't be replaced in a reasonable time? Only half of the new messages that come to the live cluster node are processed :-( When a system handles several thousand messages a second, that leads to complete system unavailability until I recover the failed node or restart the running one.
You just configure the cluster connection to do what you need. I would expect clustered nodes to come back live; otherwise messages that are meant to be sent to that node would be on that node.
We may change the default. I agree, but that's what we have now. We can never please every user. There are users who would complain if we changed the behaviour. We have had other cases where changing default configs pleased some and bothered others.
So I would suggest you just make the change on the cluster connection, and you will have what you need.
You suggested making changes to the cluster connection. I made changes like this:
But even after I see the message about the cluster node going down and wait for the connection-ttl time, messages for the failed node are still not being delivered to the live node.
I thought the problem might be with the core bridge, which, according to the documentation (http://docs.jboss.org/hornetq/2.3.0.Final/docs/user-manual/html_single/#d0e10337), is created transparently with default settings. If I am not mistaken, reconnect-attempts is -1 by default, so I tried to change it, but I did not find out how to do it. I tried to explicitly create a bridge in the messaging subsystem of JBoss EAP 6.1, but I think it is not used by my queue, because the failover behaviour did not change.
I would be grateful if you could point me to the place where I should set these connection parameters, so that messages for the failed node are redirected to the live node.
Any news here?
Messages that have already been sent to a node will stay on that node until you restart it. You could either provide a backup or restart that node.
Also: you just need to set up reconnect attempts. I'm not sure what the issue is there? This works as expected as far as I know. If you were having issues, everybody would, especially since you are using 2.3.0.
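As a rough illustration of why the reconnect attempts setting matters here, the sketch below estimates how long a bridge keeps retrying before it gives up. It is a simplified model and not HornetQ code: the function is hypothetical, the parameters are loosely modelled on the reconnect-attempts and retry interval settings discussed in this thread, and it is an assumption (taken from the reply above, not verified) that messages stop being routed to the dead node once the bridge gives up.

```python
def time_until_give_up_ms(reconnect_attempts, retry_interval_ms,
                          multiplier=1.0, max_retry_interval_ms=2000):
    """Total time spent retrying before giving up, or None if it never gives up.

    Assumption: -1 means "retry forever"; each retry waits the current
    interval (capped at max_retry_interval_ms), then the interval is
    multiplied by `multiplier`.
    """
    if reconnect_attempts < 0:
        return None  # retries forever, so the stale binding is never dropped
    total = 0.0
    interval = float(retry_interval_ms)
    for _ in range(reconnect_attempts):
        total += min(interval, max_retry_interval_ms)
        interval *= multiplier
    return total


print(time_until_give_up_ms(-1, 500))            # never gives up
print(time_until_give_up_ms(3, 500, 2.0, 1500))  # finite retry budget
```

Under these assumptions, a value of -1 matches the behaviour reported earlier in the thread (nothing changes even after 40 minutes), while a small finite value bounds how long new messages can keep piling up for the dead node.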