
    Locking tree before setting initial state

    brian.stansberry

      When an initial state transfer is received from another node, locks are acquired on the recipient's existing tree before the state is applied. I'm not sure what the benefit of this is.

      Acquiring the locks ensures that any existing locks have been released before the state transfer is applied. But where would those locks have come from?

      1) A locally initiated tx. But there shouldn't be any of those; the application shouldn't expose the cache to callers until startService() has returned. (Note that this assumes startService() can't return until a state transfer has been received and applied. That rule is not currently enforced, but I think it should be, now that we can break locks for a state transfer.)

      2) A prepare() call from a remotely initiated tx. But those locks will never clear before the _setState() call times out, as any commit() message is going to be stuck behind the _setState in the JGroups channel.

      In any case, what's the point? Any data in the tree related to a preexisting tx will be overwritten anyway.

        • 1. Re: Locking tree before setting initial state
          belaban

          It could come from messages received on the JGroups channel. However, if there are no locks, we should be able to acquire the root lock very quickly, right?

          • 2. Re: Locking tree before setting initial state
            brian.stansberry

            My question about where the locks come from was meant to be rhetorical :)

            I think the locks come from prepare() messages sent by other nodes between the time the new node's channel is connected and the time it receives the state transfer from the coordinator. This causes applying the state transfer to fail, as the commit() call needed to release the locks is stuck behind the _setState() call.

            A good medium-term fix is the priority queue idea, so the commit() can get through. But I wanted to think the situation through carefully to see if there is anything simple we can do for 1.3 (we've recently seen a couple of support cases related to this). Namely, just force release any lock on the root node, acquire a write lock, discard any existing child nodes or data, and integrate the state transfer.
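            To make that concrete, here is a minimal sketch of the "force release and overwrite" idea. It is illustrative only -- a plain ReentrantReadWriteLock stands in for the real TreeCache node locks, so the "force release" step is only indicated in a comment; this is not the actual TreeCache code:

            import java.util.Map;
            import java.util.concurrent.ConcurrentHashMap;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.locks.ReentrantReadWriteLock;

            // Hypothetical sketch only: a toy root guarded by a ReentrantReadWriteLock,
            // not the real TreeCache Node/IdentityLock classes.
            public class StateIntegrator {

                private final ReentrantReadWriteLock rootLock = new ReentrantReadWriteLock();
                private final Map<String, Object> rootData = new ConcurrentHashMap<>();

                /**
                 * "Force release, then overwrite": clear whatever holds the root,
                 * take an exclusive write lock, discard existing content, and
                 * install the transferred state.
                 */
                public void integrateState(Map<String, Object> transferredState) throws InterruptedException {
                    // 1) The real cache would force-release locks held by stale gtx's here.
                    //    A plain ReentrantReadWriteLock can only wait, so this sketch just
                    //    bounds the wait with a timeout.
                    if (!rootLock.writeLock().tryLock(5, TimeUnit.SECONDS)) {
                        throw new IllegalStateException("Could not get root write lock for state transfer");
                    }
                    try {
                        rootData.clear();                  // 2) discard existing child nodes / data
                        rootData.putAll(transferredState); // 3) integrate the transferred state
                    } finally {
                        rootLock.writeLock().unlock();
                    }
                }
            }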


            So, to document my mental ramblings.....

            I believe there are 4 possible conditions related to such a prepare() call:

            1) READ_COMMITTED or stronger semantics are in force. The prepare() call was initiated by a 3rd cluster node (not the new one, not the coordinator). The coordinator hadn't received the prepare when it locked its tree to create the state transfer (otherwise locks on the coordinator would have interfered with the state transfer). Thus the transferred state will not reflect the prepare, and the message digest vector from the coordinator will not reflect the prepare either.

            So, what happens if the recipient ignores the existence of the prepare() by releasing any locks and overwriting the tree with the data from the state transfer? When the recipient gets a commit()/rollback() message from the node that sent the prepare, the channel will recognize its state does not reflect the prepare message, and will ask that the prepare be retransmitted. Thus the prepare will not be lost. However, this breaks transactional integrity, since if there is any problem applying the retransmitted prepare, the initiating node will never learn about it.

            Also, what happens when the prepare is retransmitted? The 1st transmission already caused changes in the cache state in things like the tx_table, lock_table, etc. The state transfer doesn't wipe those clean. Will invoking the same prepare again cause problems? If so, in this case it would be useful if the recipient cache just ignored replicate() calls until it gets its state transfer.


            2) READ_COMMITTED or stronger semantics are in force. The prepare() call was initiated by a 3rd cluster node. The coordinator had received the prepare as well as a subsequent commit() or rollback() before it locked its tree to create the state transfer. So, no problem locking the tree -- the tx is complete on the coordinator. The transferred state will reflect both the prepare and the commit, and the message digest vector from the coordinator will also reflect both. But for some reason the state transfer recipient hadn't received the commit(). Thus it still holds the locks.

            Again, what happens if the recipient ignores the existence of the prepare() by releasing any locks and overwriting the tree with the data from the state transfer? If the recipient ever gets the missing commit() message from the node that sent the prepare, the channel will recognize its state already reflects the commit message, and will not pass the message up.

            But now, since the commit() was never received, eventually the local tx associated with the prepare will time out and roll back. That rollback will incorrectly reverse modifications to the cache. Here again, it would be useful if the recipient cache just ignored replicate() calls until it gets its state transfer.


            3) READ_UNCOMMITTED or weaker semantics are in force. The prepare() call was either initiated by the coordinator or a 3rd node, but no final commit had been processed by the coordinator. The coordinator would not have had any problems doing the state transfer because READ_UNCOMMITTED would have allowed it to read the tree. (This is a potential problem!!) In this case the transferred state will reflect the prepare but not the commit, and the message digest vector from the coordinator will also reflect just the prepare.

            If the recipient releases the locks and applies the state transfer, it will be applying uncommitted state. If a subsequent commit() message comes through, no problem. But if a rollback() comes through? Here having the modification entries from the original prepare() sitting in the tx_table would be helpful, as they can be used to execute the rollback().


            4) READ_UNCOMMITTED or weaker semantics are in force. The prepare() call was initiated by a 3rd cluster node. The coordinator hadn't received the prepare when it locked its tree to create the state transfer. Thus the transferred state will not reflect the prepare, and the message digest vector from the coordinator will not reflect the prepare either.

            If the recipient releases the locks and applies the state transfer, it will be discarding uncommitted state. If a subsequent commit()/rollback() message comes through, the channel will recognize its state does not reflect the prepare message, and will ask that the prepare be retransmitted. Thus the prepare will not be lost. However, this breaks transactional integrity, since if there is any problem applying the retransmitted prepare, the initiating node will never learn about it.

            Also, same as in case #1, what happens when the prepare is retransmitted? Will invoking the same prepare again cause problems? If so, it would be useful if the recipient cache just ignored replicate() calls until it gets its state transfer.


            In 3 of the 4 above cases, having the recipient cache ignore replication messages until it gets a _setState would be useful. For case #3, I'm starting to think that if it's READ_UNCOMMITTED the coordinator should have to acquire a WRITE lock when preparing state. This will ensure uncommitted state isn't transferred. The write lock won't disturb reads from other threads on the coordinator, as it's READ_UNCOMMITTED.
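            As a rough illustration of the "ignore replication until state is set" idea (a sketch with made-up names -- ReplicationGuard is not the actual TreeCache/RpcDispatcher code -- relying on the digest/retransmission behavior described above to recover anything dropped):

            import java.util.List;
            import java.util.concurrent.CopyOnWriteArrayList;
            import java.util.concurrent.atomic.AtomicBoolean;

            // Hypothetical sketch: drop incoming replicate() calls until _setState() has run.
            public class ReplicationGuard {

                private final AtomicBoolean stateInstalled = new AtomicBoolean(false);
                private final List<String> droppedCalls = new CopyOnWriteArrayList<>();

                /** Called from _setState() once the transferred state has been applied. */
                public void markStateInstalled() {
                    stateInstalled.set(true);
                }

                /** Entry point for remote replicate() calls coming up from the channel. */
                public void onReplicate(String methodCall) {
                    if (!stateInstalled.get()) {
                        // Ignore (just record) calls arriving before the state transfer; the
                        // channel's digest should cause missed prepares to be retransmitted,
                        // though, as noted above, a failure then won't reach the initiator.
                        droppedCalls.add(methodCall);
                        return;
                    }
                    apply(methodCall);
                }

                private void apply(String methodCall) {
                    // ... invoke the prepare()/commit()/rollback() against the local tree ...
                }
            }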

            But, re: doing something about this in 1.3, it doesn't sound so simple -- code-wise maybe easy, but thought-wise to see all the angles, not so simple.


            Pointing out any factual or logical errors in the above would be most appreciated :-)

            • 3. Re: Locking tree before setting initial state
              manik

              Hmm, ignoring replicate() calls until state is set seems to be the most logical and clear approach. In the end, state is in flux until setState() completes, and such nodes should not participate in replicate() calls anyway.

              Unfortunately ignoring replicate() calls could result in inconsistent state after a while if transactions prepare() and commit() while the state is being applied and replicate calls are being ignored (assuming there is some network latency in transferring the state).

              What about queueing replicate() calls until setState() completes? This may result in some tx's timing out if setState() takes a while, but at least then integrity across a cluster is maintained?

              • 4. Re: Locking tree before setting initial state
                brian.stansberry

                 

                "manik.surtani@jboss.com" wrote:
                Unfortunately ignoring replicate() calls could result in inconsistent state after a while if transactions prepare() and commit() while the state is being applied and replicate calls are being ignored (assuming there is some network latency in transferring the state).


                AFAIK, the JGroups state transfer protocol should take care of this. (Bela, please correct me if I'm wrong.) As part of the state transfer process, the recipient is given a digest of all the messages that went into creating the state. If some calls were discarded while the state transfer was being prepared/replicated, those messages shouldn't be part of the digest. So, the next time the recipient gets a message from one of the nodes that sent a discarded message, it will ask for the missing messages. But, that corrective action may not occur for a while if the other node doesn't send another message. Plus, when the messages are retransmitted, this is at the JGroups channel level, not in the RpcDispatcher. So if the prepare() fails (e.g. locking issue), the cache that initiated the gtx would not know.

                "manik.surtani@jboss.com" wrote:
                What about queueing replicate() calls until setState() completes? This may result in some tx's timing out if setState() takes a while, but at least then integrity across a cluster is maintained?


                In the partial state transfer case, we do something like this. PST occurs after the channel is started, and thus can't take advantage of the JGroups state transfer protocol and its message digest. So we queue up any replication traffic that comes in and process it after we get the PST.

                Note though that queuing the replicate() calls doesn't mean blocking and waiting for the state transfer. There is only one thread coming up from JGroups; we can't block that thread or the whole channel jams up. So, we just take the prepare() calls and throw them in a list. When the PST comes through from JGroups, we then apply the calls in the list. But here too we violate the integrity of the transaction, since the node that sent a prepare() doesn't know we just queued it; it thinks it went through successfully.
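                Very roughly, that queue-and-drain mechanism looks like the sketch below (hypothetical names, not the actual PST code; note the caveat that the sender of a queued prepare() already believes it succeeded):

                import java.util.ArrayList;
                import java.util.List;

                // Hypothetical sketch of "park the prepare() calls, replay them after the PST".
                public class PartialStateTransferQueue {

                    private final List<Runnable> pending = new ArrayList<>();
                    private boolean stateReceived = false;

                    /** Invoked from the single JGroups up-thread; must never block it. */
                    public synchronized void onPrepare(Runnable prepareCall) {
                        if (!stateReceived) {
                            pending.add(prepareCall);   // park it and return immediately
                            return;
                        }
                        prepareCall.run();              // normal path once state is in place
                    }

                    /** Invoked after the partial state transfer has been applied. */
                    public synchronized void onStateApplied() {
                        stateReceived = true;
                        for (Runnable call : pending) {
                            call.run();                 // replay queued prepares in arrival order
                        }
                        pending.clear();
                    }
                }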


                I think the problem with transactional integrity exists any time there is a possibility of executing a prepare() not as part of an RPC call. This can happen either if JGroups retransmits the prepare after the state transfer, or because we queued up the prepare() to execute later. The only true solution I see for this is something like the FLUSH protocol to quiet the cluster while state transfer is going on. This would ensure there are no such prepare() calls (at the cost of a frozen application for a period of time).

                • 5. Re: Locking tree before setting initial state

                  Just a side note.

                  This is all focused on pessimistic locking semantics. And right now I can't think of anything but FLUSH to ensure data/transactional integrity.

                  And what happens when we use optimistic locking instead? We may not have the deadlock issue, but how can we make sure the state transfer always wins (since it takes much longer to complete)?

                  • 6. Re: Locking tree before setting initial state
                    brian.stansberry

                    Good question; have to think it through a bit.

                    First, a key point is that the only type of lock actually acquired on nodes is "pessimistic". Optimistic locking just defers the acquisition of the pessimistic lock by creating the workspace.

                    The state transfer code ignores optimistic locking. On the node that prepares state, read locks are acquired on all regular nodes in the tree. If there are any optimistic tx's going on, the writes are happening on workspace nodes that aren't going to be transferred. If one of those optimistic tx's tried to commit, it would have to wait for the state transfer to complete before acquiring its lock. So, there should be no problem on that side.

                    On the state transfer recipient side, there shouldn't be any optimistic tx's, correct? The cache that is starting shouldn't be exposed yet to external callers, and only external callers create an optimistic tx. Any tx that is associated with a remote prepare() call is using pessimistic semantics, correct? The process of acquiring locks and applying changes from the workspace nodes to the regular nodes happened on the remote server; what's replicated are the changes.

                    • 7. Re: Locking tree before setting initial state
                      belaban

                      #1 The problem with ignoring prepare() is that the leader of the 2PC will think everything went fine, when in fact it didn't.

                      #2 The problem with queueing is that it will lead to incorrect state transfer: we cannot replay messages that are part of the state, that's the whole point of digests.

                      I think the prepare() calls should simply return an exception so that transactions will fail during state transfer.

                      Once I have implemented the FLUSH protocol (stop-the-world model) in JGroups, this point will be moot, because we will *not* receive any messages during state transfer.
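                      For illustration, the interim "fail the prepare()" idea could look roughly like this (a sketch with hypothetical names, not the actual JBossCache interceptor code):

                      import java.util.concurrent.atomic.AtomicBoolean;

                      // Hypothetical sketch: reject remote prepare() calls while the state
                      // transfer is still in progress, so the 2PC leader sees a failure and
                      // rolls the tx back cleanly everywhere.
                      public class PrepareRejector {

                          private final AtomicBoolean stateTransferInProgress = new AtomicBoolean(true);

                          /** Called once the transferred state has been applied. */
                          public void stateTransferCompleted() {
                              stateTransferInProgress.set(false);
                          }

                          /** Called for each remote prepare(); throwing fails the 2PC. */
                          public void onRemotePrepare(Object globalTransaction) {
                              if (stateTransferInProgress.get()) {
                                  throw new IllegalStateException(
                                          "State transfer in progress; rejecting prepare for " + globalTransaction);
                              }
                              // ... acquire locks and register the modifications as usual ...
                          }
                      }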

                      • 8. Re: Locking tree before setting initial state
                        brian.stansberry

                        Bela,

                        Offline you mentioned that you thought the state transfer sender should be acquiring a WRITE lock in all cases, not just READ_UNCOMMITTED. With JBCACHE-10, that can result in rolling back transactions that only read the cache. I'm not clear what that gains us, as READ_COMMITTED or stronger semantics will prevent acquiring a read lock if there is a concurrent write.

                        • 9. Re: Locking tree before setting initial state
                          belaban

                          Okay, you're right, so we're sticking with the RL then? Or acquire a WL *only* for the READ_UNCOMMITTED case?

                          • 10. Re: Locking tree before setting initial state
                            brian.stansberry

                            To document it in detail for later readers:

                            On the state SENDER (e.g. the coordinator), we are only doing reads. But the goal is to ensure only committed state is transferred, since the state recipient may not have the info needed to apply any subsequent tx rollback. So:

                            - RL if isolation level is READ_COMMITTED or stronger. A RL is sufficient to block if there are concurrent writes.
                            - WL if isolation level is READ_UNCOMMITTED to ensure there aren't any concurrent uncommitted writes.
                            - RL if isolation level is NONE since here there is nothing we can do.

                            On the state RECIPIENT (i.e. the new node) we are doing a write, so we acquire a WL.
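                            For later readers, the same rule expressed as a sketch (IsolationLevel and LockType here are illustrative enums, not the actual JBossCache types):

                            // Hypothetical sketch of the lock-selection rule above.
                            public class StateTransferLockPolicy {

                                public enum IsolationLevel { NONE, READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE }
                                public enum LockType { READ, WRITE }

                                /** Lock the state SENDER acquires on its tree while reading the state. */
                                public static LockType senderLock(IsolationLevel level) {
                                    switch (level) {
                                        case READ_UNCOMMITTED:
                                            return LockType.WRITE; // block concurrent uncommitted writes
                                        case NONE:
                                            return LockType.READ;  // nothing stronger helps here
                                        default:
                                            return LockType.READ;  // RL already blocks behind concurrent writers
                                    }
                                }

                                /** Lock the state RECIPIENT acquires before overwriting its tree. */
                                public static LockType recipientLock(IsolationLevel level) {
                                    return LockType.WRITE;         // we are modifying the tree
                                }
                            }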

                            • 11. Re: Locking tree before setting initial state
                              brian.stansberry

                              Following the previous post last March, a rather extensive discussion took place on private e-mail. 6 months later I'm going to recreate the highlights of that discussion here so the whole thread will be available in one place as we discuss these issues again next week:

                              "bela@jboss.com" wrote:
                              I thought about this, and I think the best thing is to *not* break the locks ! Let's look at the use cases.

                              #1 PREPARE was received and now state transfer request is received (state provider side)

                              We cannot break the locks acquired by PREPARE because we returned OK from PREPARE, so the coordinator assumes the TX will be successful (assuming for now everyone returned OK). Returning OK from PREPARE means we will be able to apply the changes, so we cannot break the locks acquired by PREPARE ! Otherwise, the result might be that everyone applied a TX, except the state provider, so we end up with inconsistent state...

                              Looks like we have to wait for the PREPARE locks to be released. This shouldn't take long though


                              #2 PREPARE is received after state transfer request has been received (again, on state provider side)

                              We cannot queue the PREPAREs, for similar reasons as discussed in #1: returning OK from a PREPARE means we successfully acquired the locks, but if we queue, we don't know whether we will be able to acquire the locks later !

                              We can throw an exception so we return FAIL from PREPARE, which means that during state transfer, nobody will be able to commit transactions across the cluster !

                              #3 PREPARE and FLUSH

                              Once we have FLUSH, we do the following:

                              a) On state transfer, we run the FLUSH protocol
                              b) It flushes pending multicasts out of the system so that all members have received the same messages after a FLUSH
                              c) It blocks members from sending new messages; this is done via the block() callback in MembershipListener
                              d) The block() needs to do the following:
                              --- Stop sending new messages (in ReplicationInterceptor)
                              --- Complete existing PREPARE rounds, e.g. return from block() only when a 2PC has been committed or rolled back
                              e) This means that when the state provider receives a state transfer request, the entire tree will be *lock-free*, because all 2PCs have completed. The only time this isn't true is when the 2PC leader crashed so we have dangling locks. However, Brian fixed this recently.

                              So what do we need to do once we have FLUSH and the block() callback is implemented correctly ?
                              We simply revert back to the original state transfer code, which acquires all locks on the state provider side (with small timeouts), and we don't need to break any locks !


                              So the strategy is to throw exceptions to PREPARE in JBossCache 1.3.0 (#2b), and in 1.4.0 we use solution #3 (depends on JGroups 2.4 with FLUSH).
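                              As a rough illustration of what step (d) above asks block() to do (a sketch only; it does not implement the real org.jgroups.MembershipListener interface or the ReplicationInterceptor):

                              import java.util.concurrent.TimeUnit;
                              import java.util.concurrent.atomic.AtomicInteger;

                              // Hypothetical sketch: stop new replication, then wait for in-flight 2PCs.
                              public class BlockingReplicationGate {

                                  private volatile boolean blocked = false;
                                  private final AtomicInteger inFlight2PCs = new AtomicInteger();

                                  /** The replication interceptor would call this before sending a new PREPARE. */
                                  public boolean tryBeginPrepare() {
                                      inFlight2PCs.incrementAndGet();
                                      if (blocked) {
                                          inFlight2PCs.decrementAndGet();
                                          return false;            // no new messages while the FLUSH is running
                                      }
                                      return true;
                                  }

                                  /** Called when a 2PC round finishes (commit or rollback). */
                                  public void end2PC() {
                                      inFlight2PCs.decrementAndGet();
                                  }

                                  /** What a FLUSH-driven block() callback would do. */
                                  public void block() throws InterruptedException {
                                      blocked = true;              // (1) stop sending new messages
                                      long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
                                      while (inFlight2PCs.get() > 0 && System.nanoTime() < deadline) {
                                          Thread.sleep(10);        // (2) let existing PREPARE rounds complete
                                      }
                                      // On return, the state provider's tree should be lock-free.
                                  }

                                  /** Called when the FLUSH / state transfer is done. */
                                  public void unblock() {
                                      blocked = false;
                                  }
                              }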



                              • 12. Re: Locking tree before setting initial state
                                brian.stansberry

                                Continuing the recreation of the e-mail conversation...

                                "bstansberry@jboss.com" wrote:

                                Some comments on your analysis:

                                #1A. The way it is coded now is that if a lock is held by a non-local gtx that is STATUS_PREPARED, the lock is just broken; the tx is not affected. This is because, as you noted, once a non-local tx is STATUS_PREPARED it is too late to notify the initiator. So, the lock is just broken. This basically means the state provider is sending uncommitted state. If the gtx later commits, no harm, but if the initiator of the gtx later issues a rollback(), the state transfer recipient will have invalid state. Doing it this way is less likely to lead to inconsistent state than rolling back the tx on the state provider, but it is still not good.

                                #1B. The PREPARE locks won't ever be released (at least without adding something like a "priority queue" to JBC -- JBCACHE-326). The COMMIT message will be stuck in the JGroups channel behind the blocked getState() call.

                                #2. Won't any PREPARE received on the state provider side after the state transfer request just be waiting in the channel for the getState() to complete? However, these kinds of PREPARE calls are presumably being handled on the state transfer *recipient* side as well, and can thus lead to inconsistent state. So, having the *recipient* return FAIL from PREPARE is needed.

                                #3A5 -- Breaking locks for crashed members is triggered by receipt of a view change. But what if the view change is stuck in the channel behind the blocked getState() call? Again, we may need a "priority queue" concept to deal with this. This is tricky though, because here the "priority queue" is in JGroups, and basically means letting view change messages jump ahead of others. I certainly wouldn't think that could be the default behavior in JGroups, and even if it was a configurable option I imagine it could cause problems even in the JBC use case.

                                Actually, once FLUSH is in place, if the getState() call discovers a lock held by a remote gtx, the existence of the lock is actually proof of a dead (or at least very unhealthy) member, isn't it?

                                #3B1. Even with FLUSH, it is possible the getState() call won't be able to acquire a lock. This would be because a tx initiated on the state provider itself is holding the lock and has not yet started the 2PC phase. This kind of tx can safely be rolled back though. Other kinds of long lasting locally-initiated locks are possible too, though unlikely. E.g. a simple non-transactional put where the calling thread got hung up in a badly written TreeCacheListener. So, the lock-breaking code is still partly useful.


                                The killer issue is #1B. Until we do FLUSH or JBCACHE-326, the possibility exists that getState() will get stuck. Seems like in that situation we have a choice -- let the getState() call fail, or break the locks and accept the risk of inconsistent state. If we let the getState() call fail, then we have the choice of what to do on the recipient side -- abort the startup of the cache, or let the cache start without transferred state. Up to 1.3.0, JBC has been written to allow getState() to fail and let the recipient cache start without the transferred state.

                                My current (not very strong) inclination, Bela, is to agree with your daughter when she wrote last week that "The proper and *only* way IMO is to wait until FLUSH has been implemented in JGroups 2.4".
                                This is a problem that has existed in JBC all along. I don't see a good fix for it in 1.3. Adding a lot of changed behaviors that cause rolled back transactions but which don't really ensure a valid state transfer seems like making trouble. Maybe better to just leave the behavior from 1.2.4 and wait for the real fix in 1.4.

                                One thing I definitely think is that if we are going to allow the getState() call to fail, we shouldn't have the recipient return FAIL to prepare() calls while it waits for the state. That's just a nightmare -- rolled back transactions all over the place, and then the new cache ends up without a state transfer anyway :(



                                • 13. Re: Locking tree before setting initial state
                                  brian.stansberry

                                  Continuing the recreation of the e-mail conversation...

                                  "bela@jboss.com" wrote:


                                  #2. Won't any PREPARE received on the state provider side after the state transfer request just be waiting in the channel for the getState() to complete? However, these kinds of PREPARE calls are presumably being handled on the state transfer *recipient* side as well, and can thus lead to inconsistent state. So, having the *recipient* return FAIL from PREPARE is needed.


                                  True. So, if the state transfer takes longer than the PREPARE timeout, the initiator will return with a timeout exception and roll back the TX anyway. So we don't need to do anything here. If the state transfer takes less time, the PREPAREs will succeed *after* the state transfer has completed.

                                  #3A5 -- Breaking locks for crashed members is triggered by receipt of a view change. But what if the view change is stuck in the channel behind the blocked getState() call? Again, we may need a "priority queue" concept to deal with this. This is tricky though, because here the "priority queue" is in JGroups, and basically means letting view change messages jump ahead of others. I certainly wouldn't think that could be the default behavior in JGroups, and even if it was a configurable option I imagine it could cause problems even in the JBC use case.


                                  I don't feel very good about view changes jumping ahead of regular messages; this could badly screw up the QoS provided by JGroups. For example: Virtual Synchrony mandates that all messages sent in view V1 are *delivered* in V1 *before* V2 is installed (this is done through the FLUSH protocol), so having V2 jump ahead of regular messages violates this property.

                                  Actually, once FLUSH is in place, if the getState() call discovers a lock held by a remote gtx, the existence of the lock is actually proof of a dead (or at least very unhealthy) member, isn't it?


                                  Yes, absolutely! The FLUSH will probably have to have a timeout associated with it; all locks still held after this phase can be broken.

                                  #3B1. Even with FLUSH, it is possible the getState() call won't be able to acquire a lock. This would be because a tx initiated on the state provider itself is holding the lock and has not yet started the 2PC phase. This kind of tx can safely be rolled back though. Other kinds of long lasting locally-initiated locks are possible too, though unlikely. E.g. a simple non-transactional put where the calling thread got hung up in a badly written TreeCacheListener. So, the lock-breaking code is still partly useful.


                                  We *MUST* absolutely change this behavior and throw an exception if the state cannot be retrieved!!!


                                  This is a problem that has existed in JBC all along. I don't see a good fix for it in 1.3. Adding a lot of changed behaviors that cause rolled back transactions but which don't really ensure a valid state transfer seems like making trouble. Maybe better to just leave the behavior from 1.2.4 and wait for the real fix in 1.4.


                                  Yes, I agree. With the change, of course, that we fail if the state cannot be retrieved. Maybe we should also increase the state transfer timeout in the XML default files.



                                  • 14. Re: Locking tree before setting initial state
                                    brian.stansberry

                                    The end result of all of the above was the opening of two JIRAs (now resolved) to ensure that a failed state transfer leads to a failure to start the service:

                                    http://jira.jboss.com/jira/browse/JBCACHE-507
                                    http://jira.jboss.com/jira/browse/JBAS-2950

                                    The JIRA for breaking locks for state transfer (http://jira.jboss.com/jira/browse/JBCACHE-315) was also reopened and pushed out.
