2 Replies Latest reply on Dec 10, 2005 10:52 AM by timfox

Channel may lose persistent message (the black hole problem)

ovidiu.feodorov Nov 29, 2005 5:25 AM

The way ChannelSupport.handleNoTx() is implemented right now is potentially dangerous: reliable messages are first removed from the persistent storage and they are hold in memory for a brief period of time. This is incorrect, should a crash occur, the message is lost.

1. Re: Channel may lose persistent message (the black hole prob

timfox Nov 29, 2005 7:28 AM (in response to ovidiu.feodorov)

"ovidiu.feodorov@jboss.com" wrote:
The way ChannelSupport.handleNoTx() is implemented right now is potentially dangerous: reliable messages are first removed from the persistent storage and they are hold in memory for a brief period of time. This is incorrect, should a crash occur, the message is lost.

[A similar problem also occurs in ChannelSupport.cancel(...).]

The problem is related to the fact that the channel state is stored in two different places; the message ref list and the delivery map.

When a message either goes from being undelivered to delivered, or delivered to cancelled this results in us removing from one and adding to the other. Causing a period of time where the message is not represented in either the delivery or message ref collections.

Making sure we add the ref to one collection before removing from the other does not solve our problem, then if it crashes we end up with the ref in one of the collections twice.

I can see two solutions here.

1) Before the move from one collection to the other we add (and persist) a flag on the ref to the state in the db. If it crashes, on recover we pick up the flag and we know where to put the ref.

This is a big pain in the butt and is going to have performance implications.

2) Don't store deliveries separately from message refs.
Instead represent the channel state as a the linked list of message refs. Delivery can then be represented as a flag on the state.
Moving a ref to the delivering state is then simply a matter of flipping a flag and can be made atomic. A lot simpler.
We can keep the Delivery concept, this is just a matter of where we store the state.
Actions
2. Re: Channel may lose persistent message (the black hole prob

timfox Dec 10, 2005 10:52 AM (in response to ovidiu.feodorov)

Another solution which is probably the easiest to implement with our current architecture goes as follows:

Currently the "atomic" operations on the channel State are essentially as follows:

add(Delivery d);
add(MessageReference r);
remove(Delivery d);
remove(MesageReference r);

For the case of a RecoverableState, this map one-to-one to a JDBC transaction in the persistence manager. Hence we get the black hole problem.

Instead, if we think about the operations that really should be atomic, they're actually the following:

1)
add(Delivery d) - this happens when a new message arrives on the queue and is immediately delivered

2)
redeliver(MessageReference ref) - this happens when a message that's already in the channel state is successfully delivered.
It comprises a remove(MessageReference r) and an add(Delivery d)

3)
cancel(Delivery d) - this happens when a delivery is cancelled - it comprises a remove(Delivery d) and an add(MessageReference ref)

4)
acknowledge(Delivery d) - this happens when a delivery is acked, it comprises a remove(Delivery del)

If we imagine the above as our "atomic" operations on the state, and make sure each one is performed in a single JDBC transaction (not to be confused with our JMS transactions), then our problem is solved.

We will of course have to amend PersistenceManager in order to do the above operations in a single jdbc tx.

This should be fairly straightforward to accomplish.

Further down the line we can optimise this further and batch multiple of the above "atomic" operations into a single JDBC transaction - this would be a good optimisation for JMS transactions (don't confuse with JDBC transactions) which have multiple actions.

And going even further we can also utilise DB batch updates where the db supports that.

I suggest changing the state interface as described as above (and the persistence manger), to solve our current problems and we can do the futher optimisations later on as separate JIRA tasks.

Tim
Actions

Go to original post