26 Replies · Latest reply on Nov 12, 2010 12:33 PM by clebert.suconic
      • 15. Re: core client reconnect behavior
        greggler

        I realize that a window size of zero has a performance impact, but I'm going to have to evaluate the exact results to see how much that matters.  The system that has me evaluating HornetQ has been in production using the older JBossMQ messaging for many years.  It is a critical system deployed across a complex wide area network, so the evaluation is being done carefully and thoroughly to cover every possible situation.

         

        One important type of client is located on a truly remote server and has to be hardened for high reliability with no external intervention, regardless of what happens on the box.  That means immediate client restart upon failure, whether that is caused by a crash in the Java process or even a hard power cycle of the box.

         

        I understand that a valid alternate strategy is for the client to start but remain inactive until there are zero consumers on the queue.  However, in the trivial process-restart case, that would add approximately 71 seconds of additional outage in message services, which needs to be avoided.
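
A minimal sketch of the zero-window configuration discussed above, using the HornetQ 2.x core client API (method and class names are taken from that API and should be verified against your version; this fragment also needs a running server and the core-client jar, so it is illustrative only):

```java
import org.hornetq.api.core.TransportConfiguration;
import org.hornetq.api.core.client.ClientSessionFactory;
import org.hornetq.api.core.client.HornetQClient;
import org.hornetq.api.core.client.ServerLocator;

// Sketch only: assumes a HornetQ server reachable via the Netty connector.
ServerLocator locator = HornetQClient.createServerLocatorWithoutHA(
        new TransportConfiguration(
                "org.hornetq.core.remoting.impl.netty.NettyConnectorFactory"));

// consumer-window-size = 0: no client-side buffering; each message is
// pulled from the server individually, trading throughput for ordering.
locator.setConsumerWindowSize(0);

ClientSessionFactory sf = locator.createSessionFactory();
// ... create sessions and consumers as usual
```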

        • 16. Re: core client reconnect behavior
          greggler

          I am familiar with the patterns of buffering and failure handling, but those are on the client side, and there is still something uncommon taking place here on the server side.  We seem to be discussing a fine point about a server-side internal object that is indirectly created as a result of a client session.  What was unexpected for me was the notion that this internal server-side object could affect message delivery in any way after the client process that caused its creation ceased to exist and a new one was started.

          • 17. Re: core client reconnect behavior
            clebert.suconic

            There's nothing uncommon... we will just redeliver any message that is unacked... if you expect your client to exit without closing its connection.. well.. good luck. That should be an exception in your system.. not the rule.

             

            If you want strict ordering.. you can't have two clients connected. It's that simple.

             

             

            There's nothing special happening at the server here.

             

             

            Anyway.. you understand your options now. Good luck.

            • 18. Re: core client reconnect behavior
              clebert.suconic
              "What was unexpected for me was the notion that the internal server-side object in question had the ability to affect message delivery in any way after the client process which caused its creation ceased to exist and a new one was started."

               

               

              There's nothing special happening at the server.. period!

               

              All that's happening is that messages are being buffered at the client. The server won't redeliver those buffered messages until it knows the client has failed.

               

               

              You can just use smaller TTLs and you would have the same behaviour you had on JBoss MQ, for instance.

               

               

              Also: if you want strict order.. you can't have any other client receiving messages until the first one is gone. That's a math rule.. not much I can do here.

               

              If you want to reinvent the wheel... good luck.. you know all the options you have available.

              • 19. Re: core client reconnect behavior
                clebert.suconic

                And you can also disable buffering.. but we will still redeliver the message your client failed to ack (the current message at the point the client failed).

                • 20. Re: core client reconnect behavior
                  clebert.suconic

                  One last thing:

                   

                  There's a new API in place on trunk where you can add metadata to your session on the core API.. and you can query management for active connections matching that metadata. You could use that so the new client can kill any live connections before it goes live.

                   

                  The issue here is that on a kill the kernel won't close the socket connection, just like pulling the plug on the cable. So the server needs to rely on ping/pongs (TTL and checkPeriod respectively) to identify dead clients.

                   

                  I have played with several other providers and this has been a standard behaviour I have seen.
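
A minimal sketch of the metadata idea from this post, using the `addMetaData` call from the trunk core API mentioned above (the key and value here are made-up examples, and the surrounding code is assumed, so treat this as illustrative only):

```java
import org.hornetq.api.core.client.ClientSession;
import org.hornetq.api.core.client.ClientSessionFactory;

// Sketch only: 'sf' is an already-created ClientSessionFactory.
// Tag the session with metadata so a replacement client can locate
// the stale connection through the management API and close it
// before going live.
ClientSession session = sf.createSession();
session.addMetaData("instance-id", "remote-box-1"); // example key/value
```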

                  • 21. Re: core client reconnect behavior
                    greggler

                    I still don't think we are speaking the same language, but the test case I've provided should communicate the situation.  I do not understand the theory of how messages transmitted to the server after a given client process has been destroyed at the OS level can be considered by the server to be buffered by that particular client.  If this were a matter of delivering messages transmitted prior to the client process being terminated, and subsequently buffered in that client but not yet consumed, then I would understand.  However, we are talking about brand-new messages initiated over a period of 71 seconds after the first client has terminated.

                    • 22. Re: core client reconnect behavior
                      clebert.suconic

                      The server doesn't know the client has died until the TTL has been completed.

                       

                      The system will think you have two valid connected clients, hence the new client will start to pull messages out of the server.

                       

                      As soon as the server perceives the failure of the previous client.. all the messages in delivering state will be redelivered. At that point your new client has already received messages... hence you will get the old ones out of order.

                       

                       

                       

                      So, in the case of a kill (which should be an exception, not the rule), your new client should either wait before reconnecting, by querying the number of consumers, or you could also use management operations and the new metadata to have the new client close the other connections.
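
The "query the number of consumers" option can be sketched with the core API's queue query (the queue name is a made-up example and `session` is an assumed existing `ClientSession`, so this is illustrative only):

```java
import org.hornetq.api.core.SimpleString;
import org.hornetq.api.core.client.ClientSession;

// Sketch only: poll until the server has dropped the dead consumer
// (i.e. after connection-ttl has expired) before starting to consume,
// so redelivered messages cannot interleave with new ones.
SimpleString queueName = new SimpleString("example.queue"); // example name
ClientSession.QueueQuery query = session.queueQuery(queueName);
while (query.getConsumerCount() > 0) {
    Thread.sleep(500); // wait for the old consumer's TTL to expire
    query = session.queueQuery(queueName);
}
// safe to create the consumer now
```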

                       

                       

                      I will take a look at your test tomorrow to see if I'm misunderstanding anything you're saying here.

                      • 23. Re: core client reconnect behavior
                        clebert.suconic
                        "you could also use management operations and the new metadata to have the client to close the other connections."

                         

                        This is actually a direct answer to your very first question: "how the new client process can identify itself to the server such that it gains control of the pre-existing consumer?"

                         

                        (post edited this after a copy&paste error)

                        • 24. Re: core client reconnect behavior
                          timfox

                          Clebert Suconic wrote:

                           

                          The server doesn't know the client has died until the TTL has been completed.

                           

                          The system will think you have two valid connected clients, hence the new client will start to pull messages out of the server.

                           

                          As soon as the server perceives the failure of the previous client.. all the messages in delivering state will be redelivered. At that point your new client has already received messages... hence you will get the old ones out of order.

                           


                          +1

                           

                          This has also been discussed in detail in several other threads previously. Perhaps you should write a FAQ on it?

                           

                          As soon as the HornetQ server knows the client is dead any unacked messages for that client will be returned to the queue. There is no delay between the server knowing the client is dead and this occurring.

                           

                          How does the server know a client is dead? That's where ping/pong and connection ttl comes in (see user manual).

                           

                          It's a common misconception that the server will receive an exception if your client suddenly disappears, e.g. the network cable between the client and server is pulled out, the router blows up, or the client machine disappears in a puff of smoke.

                           

                          In the general case *this will not occur*. This is nothing to do with HornetQ; this is the way the TCP protocol works. TCP is a "reliable" protocol, i.e. it will attempt to cope with lost packets by retransmitting them. If the client disappears, the server side of the TCP connection just thinks it has lost packets and will wait (for a considerable timeout), assuming it's a transitory problem and the client will retransmit them.

                           

                          Therefore, in the general case it's not possible to "immediately" detect a client dying. Every messaging system (and many other systems) uses a ping-pong approach (or a variation of it) to check the liveness of connections. I.e. you send a ping and wait for a pong, and if you don't get a response back within time X then you assume the connection is dead and you can close it.

                           

                          The time X in HornetQ is called connection-ttl.

                           

                          So you should expect a pause between the client dying and the server knowing the client has died. There is not much we can do here to speed this up (nor can any other messaging system, for that matter), other than tweaking the values of connection-ttl and client-failure-check-period.
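
The ping/pong scheme described above can be illustrated in a few lines of plain Java. No HornetQ classes are involved; all names here are illustrative, and the timestamps are passed in explicitly to keep the example deterministic:

```java
public class LivenessChecker {
    private final long ttlMillis;  // analogous to connection-ttl
    private long lastPongMillis;   // last time we heard from the client

    public LivenessChecker(long ttlMillis, long now) {
        this.ttlMillis = ttlMillis;
        this.lastPongMillis = now;
    }

    // Called whenever a ping response (or any data) arrives from the client.
    public void pong(long now) {
        lastPongMillis = now;
    }

    // The server considers the client dead only once the TTL has elapsed
    // with no pong -- there is no immediate notification on a hard kill.
    public boolean isDead(long now) {
        return now - lastPongMillis > ttlMillis;
    }

    public static void main(String[] args) {
        LivenessChecker c = new LivenessChecker(5000, 0);
        c.pong(1000);                        // client still responding at t=1s
        System.out.println(c.isDead(3000));  // false: only 2s of silence
        System.out.println(c.isDead(7000));  // true: 6s of silence > 5s TTL
    }
}
```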

                          • 25. Re: core client reconnect behavior
                            greggler

                            First, I'd just like to say that I do appreciate your work and the fantastic help you are both providing in this discussion.  I am continuing to evaluate HornetQ along with other choices, and I'm sure the ultimate selection will involve quite a few factors, one being the responsiveness here on the forum.  For me, this type of forum and community is what has always made JBoss and Red Hat such an effective product choice.

                             

                            I agree that this is a fringe case, but it is just one of many that I must test while upgrading a vital production messaging system.  For my particular effort, success versus failure in selecting a new engine is measured less in terms of absolute throughput and more in terms of guaranteed predictability over a set of cases relevant to our products.

                             

                            I think we are communicating well, and the language you use above to describe the situation is very much in line with the impression I had built.  The docs describe client-failure-check-period as relating to the client detecting that the server is gone; does it also apply to the server detecting the client is gone?  So far, setting the client window to zero almost eliminates the problem, except for one lone out-of-order message that I can handle with application-level code.

                             

                            Perhaps the connection parameters can be tuned further to deliver the behavior we need while restoring the performance benefits of client-side buffering, but it is close enough right now that I can continue the eval.  However, I must weigh this against other engines, and it is worth noting that we have not observed a similar situation with the older JBossMQ.  Also, I put together an identical ActiveMQ test this morning that does not display the effect, even with their client prefetch turned on.

                            • 26. Re: core client reconnect behavior
                              clebert.suconic

                              "display the effect even with their client prefetch turned on."

                               

                               

                              You're probably using low check periods there. As we have been saying all along.. you can tune the values to what you need.

                               

                              If you keep the TTL at 70 seconds... the server will wait 70 seconds before the client is considered down. If you use something more like 5 seconds.. the dead client will be identified much faster... faster than the time you would need to restart the client.
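
For reference, with the HornetQ 2.x core client the timeouts discussed in this thread would be tuned roughly like this (a sketch only: `locator` is an assumed existing `ServerLocator`, and the setter names should be verified against your HornetQ version):

```java
import org.hornetq.api.core.client.ServerLocator;

// Sketch only: 'locator' is an already-created ServerLocator.
// Shorter TTL and check period so a hard-killed client is detected
// in seconds rather than the ~70s observed in this thread.
locator.setConnectionTTL(5000);             // dead client detected after ~5s of silence
locator.setClientFailureCheckPeriod(1000);  // client pings the server every 1s
```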
