I recently began setting a TTL (time to live) on outgoing messages (I was using 20000 milliseconds) and mostly it worked fine. However, I noticed that if there was clock skew between the node running the HornetQ publisher and the node running the subscriber (specifically, if the subscriber's clock was more than 20 seconds ahead of the publisher's), messages didn't get delivered. If I synced the clocks, the messages came right through. This only happened when the subscriber's clock was ahead of the publisher/JMS clock. (Not sure if it's important, but I was sending from Linux to Windows.)
My hypothesis - I haven't looked at the code - is that the publisher is setting the "stale time" for the message as an absolute timestamp (even though the TTL value itself is relative). This seems wrong to me. Is it intentional?
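A minimal sketch of the behavior I suspect (plain Java, no JMS; the method and class names here are mine, not HornetQ's): the publisher stamps an absolute expiration of now + TTL using its own clock, and the consumer then compares that timestamp against *its* clock, so any skew larger than the TTL makes every message look dead on arrival:

```java
public class SkewDemo {
    // Publisher stamps an absolute expiration using its own clock.
    static long expiration(long publisherNowMs, long ttlMs) {
        return publisherNowMs + ttlMs;
    }

    // Consumer checks expiry against its own (possibly skewed) clock.
    static boolean isExpired(long expirationMs, long consumerNowMs) {
        return consumerNowMs > expirationMs;
    }

    public static void main(String[] args) {
        long ttl = 20_000;          // 20-second TTL, as in my setup
        long publisherNow = 0;      // publisher clock (absolute epoch irrelevant)
        long exp = expiration(publisherNow, ttl);

        // Clocks in sync: a message retrieved 2 s after publish is still live.
        System.out.println(isExpired(exp, publisherNow + 2_000));            // false

        // Subscriber clock 2 minutes ahead: the same message looks expired.
        long skew = 120_000;
        System.out.println(isExpired(exp, publisherNow + 2_000 + skew));     // true
    }
}
```

The table below walks through the same arithmetic with concrete wall-clock times.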
||Step||HQ Node Time||Msg Time To Live||Msg Stale Time||Sub Node Time||
|Message is published to HQ node|10:46:00|20 seconds (20000 ms)|10:46:20|10:48:00|
|Message available on HQ server (padded for clarity)|10:46:01|20 sec|10:46:20|10:48:01|
|Message retrieved by subscriber node|10:46:02|20 sec|10:46:20|10:48:02|
In this example, the message is never delivered to the client on the subscriber node, because that client sees the time as 10:48:02 at the moment of retrieval, and the "Msg Stale Time" of 10:46:20 has already passed.
It seems to me that getting this absolutely correct, from the moment the publish call is made to the exact moment each subscriber's receive call is made, would require a complex scheme of interrogating all the clocks in the system and accounting for the deltas. However, if HQ were to interpret TimeToLive as how long the message is allowed to reside on the HQ server (as opposed to in the pre-fetch and post-send buffers local to the sender and receiver nodes), that would be relatively easy to do reliably, and it would still accomplish the desired effect: removing old undelivered messages that were clogging up the server (instead of PAGING or BLOCKING them, or using up tons of memory). There would be some complexity for clustered deployments, but it seems there may already be a mechanism for synchronizing clocks across multiple servers?
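The alternative I'm suggesting can be sketched the same way (again plain Java with hypothetical names, not actual HornetQ code): the broker records the arrival time on its own clock and checks expiry against that same clock, so publisher/subscriber skew never enters into it:

```java
public class BrokerSideTtl {
    // Broker records when the message arrived, on the broker's own clock,
    // and derives a discard deadline from the relative TTL.
    static long discardDeadline(long brokerArrivalMs, long ttlMs) {
        return brokerArrivalMs + ttlMs;
    }

    // Expiry is checked by the broker against the same clock that stamped
    // the arrival, so no cross-node clock comparison is ever made.
    static boolean shouldDiscard(long deadlineMs, long brokerNowMs) {
        return brokerNowMs > deadlineMs;
    }

    public static void main(String[] args) {
        long ttl = 20_000;                        // 20-second TTL
        long arrived = 1_000;                     // broker clock at enqueue
        long deadline = discardDeadline(arrived, ttl);

        // 5 s after arrival: still deliverable, regardless of client clocks.
        System.out.println(shouldDiscard(deadline, arrived + 5_000));   // false

        // 25 s after arrival: the broker discards it, freeing server memory.
        System.out.println(shouldDiscard(deadline, arrived + 25_000));  // true
    }
}
```

Under this interpretation, a message sitting undelivered on the server for longer than its TTL is dropped by the server itself, which is exactly the "stop clogging up the server" behavior I was after.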