10 Replies Latest reply on Sep 21, 2006 11:23 AM by vblagojevic

state transfer failure handling

vblagojevic Sep 13, 2006 3:27 PM

Hey,

Complete state transfer, as you might be aware, consists of transferring transient (in-memory) state, associated state (pojo reference map) and persistent state (state fetched from the cacheloader). All three states are part of the stream passed between state provider and state recipient.

The current implementation does not follow an all-or-nothing approach when it comes to potential failures. Let me elaborate. At the state recipient we get a stream that contains all three states, demarcated by special node markers, and possibly some error markers if state generation at the state provider threw exception(s). If we encounter error markers while reading the stream, we clear the appropriate part of the cache at the integration Fqn. Thus, if we encounter error markers while reading transient state, we clear the transient state at the recipient's cache node. Similarly, if we encounter error markers while reading persistent state, we clear the recipient's underlying cacheloader.
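A minimal sketch of the marker handling described above. This is an illustration, not the HEAD code: the class name, the marker value and the "key=value" entry format are all assumptions made up for the example.

```java
import java.util.List;
import java.util.Map;

// Illustrative model only: the marker value and the "key=value" entry format
// are assumptions, not the actual JBoss Cache stream format.
class MarkerSectionSketch {
    static final String ERROR_MARKER = "<<ERROR>>";

    /**
     * Integrates one section of the stream (transient, associated or persistent)
     * into the matching target. If an error marker is encountered, the target is
     * cleared at the integration point and false is returned.
     */
    static boolean integrateSection(List<String> section, Map<String, String> target) {
        for (String entry : section) {
            if (ERROR_MARKER.equals(entry)) {
                target.clear();   // drop this part of the state on the recipient
                return false;
            }
            String[] kv = entry.split("=", 2);   // assumes well-formed "key=value"
            target.put(kv[0], kv[1]);
        }
        return true;
    }
}
```

Each section is handled on its own here, so a failed transient section says nothing about whether the persistent section is read afterwards.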

As things stand right now, we try to read persistent state from the stream even when the transient or associated part of the stream contained error marker(s) and was thus not integrated into the recipient's cache.

Should we keep the current approach and, if not, what are your arguments against it?

1. Re: state transfer failure handling

brian.stansberry Sep 13, 2006 3:47 PM (in response to vblagojevic)

FYI, please note that the "current" above means the current HEAD code, which works somewhat differently from the currently released code (1.4.0.SP1).

In 1.4.0.SP1:

If the state provider has a failure marshalling transient, associated or persistent state, it returns null for the entire state transfer. The recipient interprets the null as a failure.

On the recipient side, if there is a problem integrating the transient state, associated state is *not* integrated. Don't remember why I did it that way; probably because associated is useless without the transient state. However, if there's a problem integrating either transient or associated state, integration of persistent state is still allowed to proceed. That behavior of allowing persistent to proceed after a failure with transient goes back a long way.
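The 1.4.0.SP1 recipient-side rules described above could be summarized roughly as follows. This is a sketch, not the released code: the class, method and parameter names are hypothetical, and each integration step is reduced to a callback that returns true on success.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical summary of the recipient-side ordering; not the 1.4.0.SP1 source.
class Sp1RecipientSketch {
    /** Returns the names of the parts that were attempted, in order. */
    static List<String> integrate(byte[] state,
                                  BooleanSupplier transientStep,
                                  BooleanSupplier associatedStep,
                                  BooleanSupplier persistentStep) {
        List<String> attempted = new ArrayList<>();
        if (state == null) {
            return attempted;   // null from the provider means the whole transfer failed
        }
        attempted.add("transient");
        boolean transientOk = transientStep.getAsBoolean();
        if (transientOk) {
            // associated state is only integrated when transient state went in
            attempted.add("associated");
            associatedStep.getAsBoolean();
        }
        // persistent state proceeds regardless of the in-memory outcome
        attempted.add("persistent");
        persistentStep.getAsBoolean();
        return attempted;
    }
}
```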
2. Re: state transfer failure handling

manik Sep 13, 2006 8:07 PM (in response to vblagojevic)

"bstansberry@jboss.com" wrote:
That behavior of allowing persistent to proceed after a failure with transient goes back a long way.

Why is this though? I can see how this can be useful in some cases, but in others (e.g., if passivation is in use) its usefulness may be limited.

First impressions and all so far, but I'd lean towards an atomic approach here.
3. Re: state transfer failure handling

brian.stansberry Sep 13, 2006 11:20 PM (in response to vblagojevic)

My gut instinct is for an atomic approach as well.

The only use case where continuing after partial failure is conceptually valid is when the persistent state contains a complete representation of the state and the transient state is just there to provide a hot cache. That requires two things:

1) There has to be a persistent state transfer or a shared cache loader.
2) No passivation.

Then, if you're going to fall back and rely on persistent state, you have to be sure you can remove any in-memory state that may have been integrated before the failure. If associated state fails, you have to clear transient as well.
4. Re: state transfer failure handling

belaban Sep 14, 2006 2:56 AM (in response to vblagojevic)

Do we lock the root node for the *persistent* state transfer, like in the case of the in-memory state?
Would it maybe be better to lock on the CacheLoader in question rather than on the root node? Could that lead to inconsistencies?
Usually, the persistent state is *much bigger* than the in-memory state, so locking the entire tree for the persistent state transfer might take a long time...
5. Re: state transfer failure handling

manik Sep 14, 2006 8:32 AM (in response to vblagojevic)

Good point.

In the cache loader/store interceptors, I no longer sync on the cache loader but on an object associated with the fqn in question, which improves concurrency in the loader/store interceptors (see BaseCacheLoaderInterceptor).

But this means that locking on the cache loader during ST will not prevent the interceptors from writing to the loader. What you'd need to do is to get a loader lock on a per-fqn basis using the lock map in the BaseCacheLoaderInterceptor. This lock map will probably have to be refactored out of this Interceptor for this though, perhaps into the CacheLoaderManager.
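To make the per-fqn locking idea concrete, here is a rough sketch of what a lock map factored out of the interceptor might look like. It is not the BaseCacheLoaderInterceptor implementation: the class and method names are hypothetical, fqns are reduced to plain strings, and it uses current Java APIs for brevity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical per-fqn lock map; names and types are assumptions for illustration.
class FqnLockMapSketch {
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Returns the single lock associated with this fqn, creating it on first use. */
    ReentrantLock lockFor(String fqn) {
        return locks.computeIfAbsent(fqn, k -> new ReentrantLock());
    }

    /** Runs the action while holding the lock for this fqn only. */
    <T> T withLock(String fqn, Supplier<T> action) {
        ReentrantLock lock = lockFor(fqn);
        lock.lock();
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }
}
```

If both the interceptors and state transfer went through one shared map like this, a loader write and a state transfer touching the same fqn would serialize, while work on unrelated fqns would not block.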
6. Re: state transfer failure handling

vblagojevic Sep 14, 2006 11:48 AM (in response to vblagojevic)

So let's bundle reading/writing transient and associated state as one atomic step, since they deal exclusively with the in-memory tree, and reading/writing persistent state as another atomic step. Let's call them step 1 and step 2.

Is the algorithm you have in mind something along the lines of:

Reading:

obtain lock on transient node at integration point
    do step 1
    if any part of step 1 fails
        clear transient tree at integration point and clear pojo map
release lock on transient node at integration point

obtain lock on CacheLoaderManager at node integration point
    if (step 1 failed and it is a specific case Brian mentioned) or (step 1 success)
        do step 2
        if step 2 fails clear cacheloader at node integration point
release lock on CacheLoaderManager at node integration point

Writing:

obtain lock on transient node
    do step 1
release lock on transient node

obtain lock on CacheLoaderManager at node generation point
    if (step 1 failed and it is a specific case Brian mentioned) or (step 1 success)
        do step 2
release lock on CacheLoaderManager at node generation point
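The reading side of the outline above might translate into Java roughly like this. It is only a sketch: the class, method and parameter names are hypothetical, the two steps and the two clear operations are passed in as callbacks, and the "specific case Brian mentioned" is reduced to a boolean computed from the conditions he listed.

```java
import java.util.concurrent.locks.Lock;
import java.util.function.BooleanSupplier;

// Sketch of the proposed reading side. All names are hypothetical.
class TwoStepReadSketch {
    /** Brian's special case: persistent state is complete and passivation is off. */
    static boolean fallbackAllowed(boolean persistentTransferOrSharedLoader, boolean passivation) {
        return persistentTransferOrSharedLoader && !passivation;
    }

    /** Returns {step1Ok, step2Ok}. */
    static boolean[] read(Lock nodeLock, Lock loaderLock, boolean allowFallback,
                          BooleanSupplier step1, Runnable clearInMemory,
                          BooleanSupplier step2, Runnable clearLoader) {
        boolean step1Ok;
        nodeLock.lock();
        try {
            step1Ok = step1.getAsBoolean();   // transient + associated state
            if (!step1Ok) {
                clearInMemory.run();          // clear transient tree and pojo map
            }
        } finally {
            nodeLock.unlock();
        }

        boolean step2Ok = false;
        loaderLock.lock();
        try {
            if (step1Ok || allowFallback) {
                step2Ok = step2.getAsBoolean();   // persistent state
                if (!step2Ok) {
                    clearLoader.run();            // clear cacheloader at integration point
                }
            }
        } finally {
            loaderLock.unlock();
        }
        return new boolean[] {step1Ok, step2Ok};
    }
}
```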

Any corrections or suggestions?
7. Re: state transfer failure handling

vblagojevic Sep 18, 2006 11:08 AM (in response to vblagojevic)

Guys,

Can we have some consensus on this one? How shall we proceed?

Vladimir
8. Re: state transfer failure handling

brian.stansberry Sep 18, 2006 2:56 PM (in response to vblagojevic)

My feeling on this is still about a -0; i.e. I think it's better to just do it atomically, but I don't feel strongly about it. Doing it non-atomically requires careful coding that's going to be easy for later maintainers to break.

Also, the use case I described where continuing after failure is valid is based on the fact that the persistent data is "gold" while the in-memory data is just a speed optimization. That means that if the persistent transfer fails, the transferred in-memory data needs to be removed.
9. Re: state transfer failure handling

manik Sep 21, 2006 6:18 AM (in response to vblagojevic)

I vote a +1 on atomicity. I wouldn't continue after failure, even if we have a complete persistent state and an incomplete transient one. Just easier to deal with.
10. Re: state transfer failure handling

vblagojevic Sep 21, 2006 11:23 AM (in response to vblagojevic)

OK, atomic it is. If some special cases come up that need to be addressed, we'll adjust it. Until then we keep it simple.
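For reference, the all-or-nothing rule agreed on here can be sketched in a few lines. This is an illustration under assumptions, not the eventual implementation: the names are hypothetical, each part of the state is a callback that returns true on success, and a single callback stands in for clearing everything integrated so far.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical all-or-nothing integration; names are made up for the example.
class AtomicIntegrationSketch {
    /** Runs the steps in order; on the first failure, clears everything and stops. */
    static boolean integrate(List<BooleanSupplier> steps, Runnable clearAll) {
        for (BooleanSupplier step : steps) {
            boolean ok;
            try {
                ok = step.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false;   // an exception counts as a failed step
            }
            if (!ok) {
                clearAll.run();   // no partial state is left behind
                return false;     // later steps are never attempted
            }
        }
        return true;
    }
}
```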