10 Replies Latest reply on Sep 21, 2006 11:23 AM by vblagojevic

    state transfer failure handling

    vblagojevic

      Hey,

      Complete state transfer, as you might be aware, consists of transferring transient (in-memory) state, associated state (the pojo reference map) and persistent state (state fetched from the cacheloader). All three states are part of the stream passed from the state provider to the state recipient.

      The current implementation does not follow an all-or-nothing approach when it comes to potential failures. Let me elaborate. At the state recipient we get a stream that contains all three states, demarcated by special node markers, and possibly some error markers if state generation at the state provider threw exception(s). If we encounter error markers during stream reading, we clear the appropriate part of the cache at the integration Fqn. Thus, if we encounter error markers while reading transient state, we clear the transient state under the recipient's cache node. Similarly, if we encounter error markers while reading persistent state, we clear the recipient's underlying cacheloader.

      As things stand right now, we try to read persistent state from the stream even when the transient or associated part of the stream contained error marker(s) and was therefore not integrated into the recipient's cache.
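The per-section behaviour described above can be sketched roughly as follows. This is a simplified model, not the HEAD code: `LenientStateReader`, `Section` and `Part` are hypothetical names, a `null` node list stands in for an error marker, and a plain map stands in for the cache and cacheloader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the recipient side; none of these names are JBoss Cache API.
final class LenientStateReader {
    enum Section { TRANSIENT, ASSOCIATED, PERSISTENT }

    /** One demarcated section of the stream: either its nodes or an error marker. */
    static final class Part {
        final Section section;
        final List<String> nodes; // null stands in for an error marker

        Part(Section section, List<String> nodes) {
            this.section = section;
            this.nodes = nodes;
        }
    }

    /** Integrates each section on its own: a failed section is cleared, later ones are still read. */
    static Map<Section, List<String>> integrate(List<Part> stream) {
        Map<Section, List<String>> cache = new EnumMap<>(Section.class);
        for (Part part : stream) {
            if (part.nodes == null) {
                cache.put(part.section, new ArrayList<String>()); // clear only this part
            } else {
                cache.put(part.section, new ArrayList<String>(part.nodes));
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        // Transient state carries an error marker, yet persistent state is still integrated.
        Map<Section, List<String>> cache = integrate(Arrays.asList(
                new Part(Section.TRANSIENT, null),
                new Part(Section.ASSOCIATED, Arrays.asList("pojo-ref")),
                new Part(Section.PERSISTENT, Arrays.asList("/a", "/a/b"))));
        System.out.println(cache);
    }
}
```

The point of the sketch is only that a failure in one section does not stop the later sections from being read and integrated.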

      Should we keep the current approach and, if not, what are your arguments against it?

        • 1. Re: state transfer failure handling
          brian.stansberry

          FYI, please note that the "current" above means the current HEAD code, which works somewhat differently from the currently released code (1.4.0.SP1).

          In 1.4.0.SP1:

          If the state provider has a failure marshalling transient, associated or persistent state, it returns null for the entire state transfer. The recipient interprets the null as a failure.

          On the recipient side, if there is a problem integrating the transient state, associated state is *not* integrated. Don't remember why I did it that way; probably because associated is useless without the transient state. However, if there's a problem integrating either transient or associated state, integration of persistent state is still allowed to proceed. That behavior of allowing persistent to proceed after a failure with transient goes back a long way.
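As a rough illustration of the 1.4.0.SP1 rules just described (not the released code), the decision logic might be modelled like this. `Sp1RecipientRules` and its boolean flags are hypothetical; a `null` state stands in for the provider returning null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the 1.4.0.SP1 recipient rules; not actual JBoss Cache code.
final class Sp1RecipientRules {
    /**
     * Returns the names of the parts that end up integrated.
     * The three flags say whether integrating each part would succeed.
     */
    static List<String> integrate(Object state, boolean transientOk,
                                  boolean associatedOk, boolean persistentOk) {
        List<String> integrated = new ArrayList<String>();
        if (state == null) {
            return integrated; // provider failed: the whole transfer is a failure
        }
        if (transientOk) {
            integrated.add("transient");
            if (associatedOk) {
                integrated.add("associated"); // only attempted after transient succeeded
            }
        }
        if (persistentOk) {
            integrated.add("persistent"); // attempted regardless of earlier failures
        }
        return integrated;
    }

    public static void main(String[] args) {
        // Transient integration fails: associated is skipped, persistent still proceeds.
        System.out.println(integrate(new Object(), false, true, true));
    }
}
```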

          • 2. Re: state transfer failure handling
            manik

             

            "bstansberry@jboss.com" wrote:
            That behavior of allowing persistent to proceed after a failure with transient goes back a long way.


            Why is this, though? I can see how this can be useful in some cases, but in others (e.g., if passivation is in use) its usefulness may be limited.

            These are only first impressions so far, but I'd lean towards an atomic approach here.

            • 3. Re: state transfer failure handling
              brian.stansberry

              My gut instinct is for an atomic approach as well.

              The only use case where continuing after partial failure is conceptually valid is when the persistent state contains a complete representation of the state; transient state is just there to provide a hot cache. That requires two things:

              1) There has to be a persistent state transfer or a shared cache loader.
              2) No passivation.

              Then, if you're going to fall back and rely on persistent state, you have to be sure you can remove any in-memory state that may have been integrated before the failure. If associated state fails, you have to clear transient as well.
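The two preconditions above amount to a simple predicate. A minimal sketch, assuming the three facts are available as booleans (the class and parameter names are made up for illustration and are not JBoss Cache configuration API):

```java
// Hypothetical helper; the flags would have to come from the cache's configuration.
final class FallbackPolicy {
    /**
     * True when persistent state can be treated as the complete state, so that
     * continuing after an in-memory failure is conceptually valid.
     */
    static boolean persistentStateIsComplete(boolean persistentStateTransferred,
                                             boolean sharedCacheLoader,
                                             boolean passivation) {
        return (persistentStateTransferred || sharedCacheLoader) && !passivation;
    }

    public static void main(String[] args) {
        System.out.println(persistentStateIsComplete(true, false, false)); // true
        System.out.println(persistentStateIsComplete(true, false, true));  // false: passivation in use
    }
}
```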

              • 4. Re: state transfer failure handling
                belaban

                Do we lock the root node for the *persistent* state transfer, like in the case of the in-memory state?
                Would it maybe be better to lock on the CacheLoader in question rather than on the root node? Possible inconsistencies?
                Usually, the persistent state is *much bigger* than the in-memory state, so locking the entire tree for the persistent state transfer might take a long time...

                • 5. Re: state transfer failure handling
                  manik

                  Good point.

                  In the cache loader/store interceptors, I no longer sync on the cache loader but on an object attributed to the fqn in question (this improves concurrency in the loader/store interceptors; see BaseCacheLoaderInterceptor).

                  But this means that locking on the cache loader during ST will not prevent the interceptors from writing to the loader. What you'd need to do is to get a loader lock on a per-fqn basis using the lock map in the BaseCacheLoaderInterceptor. To make that possible, this lock map will probably have to be refactored out of the interceptor, perhaps into the CacheLoaderManager.
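To make the per-fqn locking idea concrete, here is a generic sketch of a lock map that both the interceptors and state transfer could share. It is not the lock map in BaseCacheLoaderInterceptor; `FqnLockMap` is a hypothetical class, and fqns are represented as plain strings.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical shared lock map keyed by fqn; not the actual JBoss Cache implementation.
final class FqnLockMap {
    private final ConcurrentMap<String, Lock> locks = new ConcurrentHashMap<>();

    /** Returns the single lock associated with the given fqn, creating it on first use. */
    Lock lockFor(String fqn) {
        return locks.computeIfAbsent(fqn, key -> new ReentrantLock());
    }

    /** Runs the action while holding the lock for the given fqn. */
    void withLock(String fqn, Runnable action) {
        Lock lock = lockFor(fqn);
        lock.lock();
        try {
            action.run();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        FqnLockMap loaderLocks = new FqnLockMap();
        // An interceptor write and a state transfer on the same fqn would contend on the same lock.
        loaderLocks.withLock("/a/b", () -> System.out.println("writing /a/b to the loader"));
    }
}
```

Because writers to different fqns take different locks, this keeps the concurrency benefit mentioned above while still giving state transfer something to lock on.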

                  • 6. Re: state transfer failure handling
                    vblagojevic

                    So let's bundle reading/writing transient and associated state as one atomic step, since they deal exclusively with the in-memory tree, and reading/writing persistent state as another atomic step. Let's call them step 1 and step 2.


                    Is the algorithm you have in mind something along the lines of:

                    Reading:

                    obtain lock on transient node at integration point
                    do step 1
                    if any part of step 1 fails
                        clear transient tree at integration point and clear pojo map
                    release lock on transient node at integration point

                    obtain lock on CacheLoaderManager at node integration point
                    if (step 1 failed and it is the specific case Brian mentioned) or (step 1 succeeded)
                        do step 2
                        if step 2 fails
                            clear cacheloader at node integration point
                    release lock on CacheLoaderManager at node integration point


                    Writing:

                    obtain lock on transient node
                    do step 1
                    release lock on transient node

                    obtain lock on CacheLoaderManager at node generation point
                    if (step 1 failed and it is the specific case Brian mentioned) or (step 1 succeeded)
                        do step 2
                    release lock on CacheLoaderManager at node generation point

                    Any corrections or suggestions?
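For reference, the reading side of the proposal above could look roughly like this in Java. Everything here is a stand-in: `TwoStepIntegration`, `Step` and `Outcome` are hypothetical, two plain `ReentrantLock`s take the place of the node lock and the loader lock, and the steps and clean-up actions are passed in as callbacks.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the proposed two-step read; not JBoss Cache code.
final class TwoStepIntegration {
    /** A unit of work that signals failure by throwing. */
    interface Step { void run() throws Exception; }

    enum Outcome { COMPLETE, PERSISTENT_ONLY, FAILED }

    private final Lock memoryLock = new ReentrantLock(); // stands in for the node lock at the integration point
    private final Lock loaderLock = new ReentrantLock(); // stands in for the loader lock at the integration point

    /** Runs a step under the lock; on failure runs the clean-up before the lock is released. */
    private static boolean guarded(Lock lock, Step step, Runnable cleanup) {
        lock.lock();
        try {
            step.run();
            return true;
        } catch (Exception e) {
            cleanup.run();
            return false;
        } finally {
            lock.unlock();
        }
    }

    Outcome read(Step inMemory, Runnable clearMemory,
                 Step persistent, Runnable clearLoader,
                 boolean persistentIsComplete) {
        boolean step1Ok = guarded(memoryLock, inMemory, clearMemory);
        if (!step1Ok && !persistentIsComplete) {
            return Outcome.FAILED; // step 2 is skipped outside the special case
        }
        boolean step2Ok = guarded(loaderLock, persistent, clearLoader);
        if (!step2Ok) {
            return Outcome.FAILED; // as in the pseudocode, in-memory state is not cleared here
        }
        return step1Ok ? Outcome.COMPLETE : Outcome.PERSISTENT_ONLY;
    }

    public static void main(String[] args) {
        Outcome outcome = new TwoStepIntegration().read(
                () -> { throw new Exception("error marker in transient state"); },
                () -> System.out.println("clearing in-memory state"),
                () -> System.out.println("integrating persistent state"),
                () -> System.out.println("clearing cacheloader"),
                true);
        System.out.println(outcome);
    }
}
```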

                    • 7. Re: state transfer failure handling
                      vblagojevic

                      Guys,

                      Can we have some consensus on this one? How shall we proceed?

                      Vladimir

                      • 8. Re: state transfer failure handling
                        brian.stansberry

                        My feeling on this is still about a -0; i.e. I think it's better to just do it atomically but I don't feel strongly about it. Doing it non-atomically requires careful coding that's going to be easy for later maintainers to break.

                        Also, the use case I described where continuing after failure is valid is based on the fact that the persistent data is "gold" while the in-memory data is just a speed optimization. That means that if the persistent transfer fails, the transferred in-memory data needs to be removed.

                        • 9. Re: state transfer failure handling
                          manik

                          I vote a +1 on atomicity. I wouldn't continue after failure, even if we have a complete persistent state and an incomplete transient one. Just easier to deal with.

                          • 10. Re: state transfer failure handling
                            vblagojevic

                            OK, atomic it is. If some special cases come up that need to be addressed, we'll adjust it. Until then, we keep it simple.
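One way to picture the atomic approach agreed on here is the sketch below: integrate the parts in order and, on the first failure, clear every part. This is only an illustration under assumed names (`AtomicIntegration` and `Part` are hypothetical), not the implementation that followed from this thread.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical all-or-nothing integration; not actual JBoss Cache code.
final class AtomicIntegration {
    /** One part of the transferred state (transient, associated or persistent). */
    interface Part {
        void integrate() throws Exception;
        void clear();
    }

    /** Integrates all parts, or clears all of them if any single part fails. */
    static boolean integrateAll(List<Part> parts) {
        try {
            for (Part part : parts) {
                part.integrate();
            }
            return true;
        } catch (Exception e) {
            for (Part part : parts) {
                part.clear(); // roll back everything, including parts that succeeded
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Part ok = new Part() {
            public void integrate() { System.out.println("integrated"); }
            public void clear() { System.out.println("cleared"); }
        };
        Part failing = new Part() {
            public void integrate() throws Exception { throw new Exception("error marker"); }
            public void clear() { System.out.println("cleared"); }
        };
        System.out.println(integrateAll(Arrays.asList(ok, failing, ok)));
    }
}
```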