26 Replies Latest reply on Nov 12, 2010 12:33 PM by clebert.suconic

    core client reconnect behavior

    greggler

      I've read all the posts that I can find on the subject of reconnect but can't find the answer to this.

       

      My application is using only the core API and no JMS.  The fringe case that is puzzling me is where a client process fails in an unexpected fashion and is unable to cleanly close the connection.  Let's assume the root cause is a physical power failure or something else that can't be prevented with safe coding.  The result I see from watching notification events on the server is that a new consumer is created in addition to the previous consumer associated with the failed process.  The new consumer is not receiving messages as expected.

       

      Are there any posts or docs I can read to learn more about this subject (specific to the core API)?

       

      Does anyone have advice on how the new client process can identify itself to the server such that it gains control of the pre-existing consumer?  (I don't see where a client identifier can be set using the core API.)

       

      Assuming there is no clean answer, is the amount of time the old consumer exists before being declared dead related to the ConnectionTTL set via sessionFactory.setConnectionTTL(millis)?

       

      thanks,

      Greg

       

      Adding example test code and output.

        • 1. Re: core client reconnect behavior
          clebert.suconic

          Yes.. the connection will stay alive depending on the ConnectionTTL setting. Notice that the checkPeriod has to be < ConnectionTTL. This is described in the user's manual.
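
          For example, a minimal sketch assuming the 2.1-style core API, where these two setters live on the ClientSessionFactory (in later releases they move to the ServerLocator):

          ClientSessionFactory sf = HornetQClient.createClientSessionFactory(
                new TransportConfiguration(NettyConnectorFactory.class.getName()));

          sf.setConnectionTTL(60000);              // server declares the connection dead after 60s without a ping
          sf.setClientFailureCheckPeriod(30000);   // client ping interval; must be < ConnectionTTL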

          • 2. Re: core client reconnect behavior
            greggler

            I do understand about decreasing ConnectionTTL.  However, this merely reduces the number of seconds during which a newly connected and perfectly functional client has no ability to receive messages. That helps in some small way, but it's not much of a clean solution.  Is there no way to prevent the duplicate consumers from forming in the first place?

            • 3. Re: core client reconnect behavior
              clebert.suconic

              >> "Is there no way to prevent the duplicate consumers from forming in the first place?"

               

              We only deliver unacked messages after a failure is detected.

               

              You should probably play with Transactions and ack both your data and the receive at the same time. (XA maybe)?
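
              Something along these lines, for what it's worth (untested sketch; "my.queue" and processMessage() are placeholders, and sf is your ClientSessionFactory):

              // non-XA session; acks are buffered and only applied on commit
              ClientSession session = sf.createSession(false, false, false);
              session.start();
              ClientConsumer consumer = session.createConsumer("my.queue");

              ClientMessage msg = consumer.receive(5000);
              if (msg != null)
              {
                 processMessage(msg);   // hypothetical application work
                 msg.acknowledge();
                 session.commit();      // the ack only becomes permanent here; after a failure the message is redelivered
              }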

               

               

              Also: when you create a session, if you're doing core API, you can set the ACK flush size (the ackBatchSize).  The JMS equivalent of AUTO_ACKNOWLEDGE sets the flush size to 0, while DUPS_OK allows a bigger window.

               

              So, you will never duplicate data.

               

               

              You can also look at pre-ack... if you can afford eventually losing messages in case of failure.

              • 4. Re: core client reconnect behavior
                greggler

                > We only deliver unacked messages after a failure is detected.

                It makes sense that un-acked messages taken from a queue by a ClientConsumerA are only delivered to another ClientConsumerB after ClientConsumerA fails (for example, by reaching its TTL).

                 

                > You should probably play with Transactions and ack both your data and the receive at the same time. (XA maybe)?

                Not sure how this can help, and it sounds excessively complicated for the situation.

                 

                > you can set the flush ACK size.

                I don't see a flush ack size, but I do see ClientSessionFactory.setAckBatchSize; is that what you are talking about?  I'm not clear on how this affects the issue (can't find any docs on this).

                 

                > You can also look at pre-ack... if you can afford eventually losing messages in case of failure.

                Losing messages while the client is down is not the biggest problem; losing them while it is up is a big issue.  Pre-ack seems like a bad idea because it would cause ClientConsumerA to ack the messages on the server side even after the actual client has died.  That would ensure the messages are permanently consumed in a way that can never be seen by the real ClientConsumerB.

                 

                I'm still left with the idea that a given client cannot fail and recover in any interval < the TTL.  A truly reasonable value for the TTL in my case would be 60 seconds.  Suppose the process can restart in about 10 seconds.  That means there is a 10-second period with the old ClientConsumerA taking all messages and then a 50-second period where two ClientConsumers exist on the server at the same time, with messages round-robined to each consumer.  The result is that less than half of the messages are consumed by the truly active ClientConsumerB, and the rest are destined either to be permanently lost inside ClientConsumerA or at best to sit there for 50 seconds until it reaches its TTL, and only then will ClientConsumerB receive them, delayed and ultimately out of order.

                 

                Perhaps it is an infrequent case for any given client process to fail without cleaning up and want to restart quickly.  However, it is not unheard of, and the side effects with HornetQ seem excessively severe.  The goal is to have the new client process pick up cleanly with the next message in the queue not consumed by the previous client process and proceed to receive every message after that.

                • 5. Re: core client reconnect behavior
                  clebert.suconic

                  I'm not really sure what you're complaining about here. You can tweak TTL and check time to the values you need.

                   

                   

                  There's no way to perceive a failure over TCP/IP without a ping-pong approach. Just configure it to what you need and it will work. That's a pretty standard practice in any communication system.

                   

                   

                  Regarding the ack-batch-size: I was referring to this method: http://hornetq.sourceforge.net/docs/hornetq-2.1.2.Final/api/org/hornetq/api/core/client/ClientSessionFactory.html#createSession(boolean,%20boolean,%20int)

                  • 6. Re: core client reconnect behavior
                    clebert.suconic

                    createSession(boolean autoCommitSends, boolean autoCommitAcks, int ackBatchSize)
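
                    i.e. something like the following, where an ackBatchSize of 0 flushes every acknowledgement to the server immediately (sf being the ClientSessionFactory):

                    // autoCommitSends = true, autoCommitAcks = true, ackBatchSize = 0
                    ClientSession session = sf.createSession(true, true, 0);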

                    • 7. Re: core client reconnect behavior
                      clebert.suconic
                      > I'm still left with the idea that a given client can not fail and recover in any interval < the TTL. A truly reasonable value for the TTL in my case would be 60 seconds. Supposing the process can restart in about 10 seconds. That means there is a 10 second period with the old ClientConsumerA taking all messages and then a 50 second period where two ClientConsumers exist on the server at the same time round robin taking messages to each consumer. The result is that less than half of the messages are consumed by the truly active ClientConsumerB and the rest are destined either be permanently lost inside ClientConsumerA or at best sit there for 50 seconds until it reaches its TTL and only then ClientConsumerB will receive them delayed and ultimately out of order.

                       

                      The messages will not be re-delivered until the failure has been identified by the server.

                       

                      That's a pretty standard behaviour you will find in *any* messaging system.

                      • 8. Re: core client reconnect behavior
                        clebert.suconic

                        "will  receive them delayed and ultimately out of order."

                         

                         

                        Take a look at Message Groups in the user's manual. Message Groups will guarantee that all messages in the group are sent to a single consumer.
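
                        On the producer side that is roughly (untested sketch; the group id value is just a placeholder):

                        ClientMessage msg = session.createMessage(false);                            // non-durable message
                        msg.putStringProperty(Message.HDR_GROUP_ID, new SimpleString("group-1"));   // everything with this group id goes to a single consumer
                        msg.getBodyBuffer().writeString("some payload");
                        producer.send(msg);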

                        • 9. Re: core client reconnect behavior
                          clebert.suconic

                          Also, since you're using core, look at queueQuery on ClientSession.

                           

                          That method will tell you how many consumers you have on a queue.  You may have your client wait some time in case of a reconnection.

                           

                          (This is actually what we use to guarantee there's only one consumer on a Topic Subscription, as required by the JMS spec. You could do something similar for your queue by using this method.)

                          • 10. Re: core client reconnect behavior
                            greggler

                            I think I'm not properly describing the effect I am seeing.  Please see the attached test code and output log, which reproduce the issue as I see it.

                            (I placed the attachments on the top post, not down here on this reply.)

                             

                            Summary of the test:

                            • This test is being run against a stand-alone HornetQ server with persistence turned off.  Also, the Netty port has been changed to 9000.
                            • TestHornetClient.java is designed to either send or receive messages based on a command-line argument.
                            • TestHornetUncleanFailure.java is designed to demonstrate a test case of a sending process posting messages to a consumer process that fails without a clean shutdown and then restarts.
                            • test.log contains messages from all 3 processes.

                             

                            The nature of the problem is as follows:

                            • The receiver process is killed at 11:30:58.
                            • A new receiver starts at 11:31:18 and proceeds to receive every other message until 11:32:29.
                            • At 11:32:29 (71 seconds after the new client process started), it receives in one lump all of the messages sent while no consumer was running, plus the alternating messages that were missed after the receiver restarted.

                             

                            To sum it up, after a simulated 20-second failure and then a restart, the new healthy client process experiences non-intuitive message delivery, with delays and ultimately out-of-order messages, for an additional 71 seconds until behavior returns to "normal".  These issues do not apply just to messages sent while the client was down; they affect messages sent after the new client was started.

                             

                            I hesitate to shorten this 71-second period by adjusting the client TTL because I feel that the TTL is already at an appropriate value. Besides, that doesn't make the problem go away; it only reduces the amount of time it takes effect.

                            • 11. Re: core client reconnect behavior
                              clebert.suconic

                              As I said earlier, you can either:

                              - play with Message Groups so no other client will receive the messages.

                              - disable buffering... look at the slow consumers example.

                              - check for active clients using the queue query.

                              - accept that killing your client should be an exception, and adjust the TTL and check period to an acceptable value.

                               

                               

                              You have all the possibilities at hand. Nothing that we need to change here.. just adjust the behaviour you want.

                              • 12. Re: core client reconnect behavior
                                greggler

                                Ok, looking at the slow consumer client example, I have set the ConsumerWindowSize to zero, and that has almost entirely resolved the issue.
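
                                For reference, the change amounts to a single call on the session factory (assuming the 2.1-style setter):

                                sf.setConsumerWindowSize(0);   // no client-side buffering; the server only sends a message when the consumer asks for one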

                                Using that setting, the test code experiences just one single message (the very next one that wasn't processed by the killed client) being processed out of order 71 seconds late.  I can find a way to ignore that message.

                                 

                                Honestly, the docs under flow control about slow consumers and the client consumer window size all speak of this setting changing the buffering of messages on the client side.  I would not have imagined that altering this setting could affect what happens on the server side when no client is connected.  I think that compared to various other messaging servers, HornetQ creates more dynamic entities on the server which represent the dynamic state of a client, and the resulting complexity causes a learning curve.

                                 

                                Anyhow, thanks for the tip!

                                • 13. Re: core client reconnect behavior
                                  clebert.suconic

                                  Why don't you just avoid initializing the client until the server has identified the failure?

                                   

                                  Something like this?

                                   

                                  while (true)
                                  {
                                       // queueName is the SimpleString name of your queue (placeholder here)
                                       ClientSession.QueueQuery queueResult = session.queueQuery(queueName);

                                       if (queueResult.getConsumerCount() > 0)
                                       {
                                            Thread.sleep(1000);   // old consumer still registered; wait for the server to expire it
                                            continue;
                                       }

                                       break;   // no active consumers left, safe to create ours
                                  }

                                   

                                   

                                   

                                  Setting consumerWindowSize to zero will have a performance impact... as you will always be doing a round trip. (That may not be an issue depending on your perf requirements.)

                                  • 14. Re: core client reconnect behavior
                                    clebert.suconic
                                    > I think that compared to various other messaging servers HornetQ is doing more creation of dynamic entities on the server

                                     

                                    Well... first this is very well documented.. 

                                     

                                    Second... buffering at the client and failure handling are common features in every messaging system I've seen out there (yes.. I have played with other messaging systems before... I guess that's why I'm in this industry). I don't know how you can claim it to be any different.

                                     

                                    Anyway.. glad you fixed your problems now.
