25 Replies. Latest reply on Oct 24, 2007 12:13 PM by manik
      • 15. Re: Locking tree before setting initial state
        brian.stansberry

        Thought a bit about the idea of locking / on the state recipient before it calls channel.getState(). The idea is to ensure that the received state can be integrated without concern about failing to acquire the required read lock. Problems I see:

        1) Race with the state provider, where the provider issues a prepare while the recipient issues getState(). Result is a deadlock until either the state transfer or the prepare() times out, or the tx that issued the prepare is rolled back on the state provider.

        Actually, it's even worse. The prepare() call will be waiting to lock / and blocking the JGroups up-handler thread on the state recipient. That will prevent receipt of any rollback() message or any subsequently transferred state, so the state transfer will be unable to go through until the prepare() fails with a lock timeout.

        2) Race with a 3rd node in the cluster where the prepare() gets to the state provider before the getState() call. Again deadlock until the prepare() times out.

        • 16. Re: Locking tree before setting initial state
          brian.stansberry

          Vladimir,

          FYI, the stuff I wrote before for breaking locks, rolling back transactions for state transfer can be found in o.j.c.lock.LockUtil.forceAcquireLock(). This would have been called from StateTransferManager.acquireLocksForStateTransfer() line 401, but it was disabled per the previous discussions in this thread.

          This was written not for a FLUSH/block use case, but rather to be called as part of a getState() call. See http://jira.jboss.com/jira/browse/JBCACHE-315. So, it can't be used directly for FLUSH, but elements may be useful. It's kind of ugly code though; needed another round or two of polishing when we decided to shelve it. And lots of testing. There is an o.j.c.statetransfer.ForcedStateTransferTest that's meant for this; it should fail now due to code being disabled and is ignored by cruisecontrol.

          Issues with LockUtil.forceAcquireLock():

          1) It combines a) breaking locks and b) waiting for tx's to complete, then attempting to roll back those that don't, into one operation. Those most likely should be somewhat separated.
          2) It deals with transactions in all stages, including ACTIVE. We only want to deal with ACTIVE transactions on the node that is preparing the state transfer. There's no reason to touch an active transaction on some 3rd node, as the locks it's holding will not affect anything.

          A tricky thing there is: during block(), how does a node know it's going to be preparing a state transfer (and therefore needs to deal with ACTIVE transactions)? It's not as simple as checking whether the node is the coordinator, because a partial state transfer call can go to any node.

          Perhaps an algorithm would be:

          1) In block() on all nodes
          a) set a flag or something that will block any ACTIVE transactions from proceeding (i.e. entering prepare() phase of 2PC)
          b) wait for complete / rollback any transactions that are beyond ACTIVE

          2) In getState() on node that's providing state, execute something pretty much like what I already wrote. That will deal with
          a) any ACTIVE transactions on that node
          b) any leftover locks held by sick nodes that didn't properly clean up after themselves in block()

          3) What's the call to "unblock"? How does the cache know it's safe to resume normal operations (e.g. revert the flag set in 1a)?
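          The "flag" in step 1a could be as simple as a read/write gate that transactions must pass through before entering prepare(). A minimal sketch, assuming such a gate; the class and method names here are hypothetical, not actual JBoss Cache API:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the "flag" from step 1a: a gate that ACTIVE
// transactions must pass before entering the prepare() phase of 2PC.
// While block() holds the gate closed, prepare() attempts wait; step 3's
// open question is when it is safe to call open() again.
public class PrepareGate {
    // Many transactions may hold the read side concurrently;
    // block() takes the write side to shut the gate.
    private final ReentrantReadWriteLock gate = new ReentrantReadWriteLock();

    // Called from block(): no new prepare() may start after this returns.
    public void close() {
        gate.writeLock().lock();
    }

    // Called when it is safe to resume normal operations (step 3).
    public void open() {
        gate.writeLock().unlock();
    }

    // Called by a transaction just before it enters prepare().
    public void beforePrepare() {
        gate.readLock().lock();
    }

    // Called when the transaction's 2PC completes (commit or rollback).
    public void afterCompletion() {
        gate.readLock().unlock();
    }

    public static void main(String[] args) throws InterruptedException {
        PrepareGate g = new PrepareGate();
        g.beforePrepare();          // a tx enters prepare while gate is open
        g.afterCompletion();
        g.close();                  // block(): gate shut
        Thread t = new Thread(() -> {
            g.beforePrepare();      // waits here until open() is called
            g.afterCompletion();
        });
        t.start();
        g.open();                   // step 3: resume normal operations
        t.join();
        System.out.println("gate reopened, blocked tx proceeded");
    }
}
```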

          • 17. Re: Locking tree before setting initial state
            vblagojevic

            Brian,

            There is no way, at the time the block event is propagated up from FLUSH to the Channel, that we know which node is the state provider. The only parameter we could possibly send up the stack with the BLOCK event is the identity of the node that invoked FLUSH. In the case of state transfer that node is the state requester.

            So, as you suggest, we'll have to do some additional work once getState() reaches the state provider.

            The call for unblock, as Bela explained today privately, is:

            Pull model (Channel.receive()):
            Channel.blockOk()


            Push model (Receiver.block()):
            Returning from block() is an ack, so Channel.blockOk() does *not* need to be called !


            • 18. Re: Locking tree before setting initial state
              brian.stansberry

              OK, so there's no "unblock" call. That seems a minor flaw in FLUSH, as it doesn't give the application an opportunity to make any kind of state transition upon completion of the whole FLUSH cycle. The only knowledge the app has that the FLUSH is finished is that suddenly messages start flowing up/down the channel again.

              No matter; that's an issue for another day, and I don't think an issue for the current problem. Refining what I proposed last time: I'll use the pull model, although TreeCache implements MembershipListener, so I think it would get the block() call either way.

              1) Receive BlockEvent on all nodes
              a) set a flag or something that will block any ACTIVE transactions from proceeding (i.e. entering prepare() phase of 2PC)
              b) wait for complete / rollback any transactions that are beyond ACTIVE
              c) call Channel.blockOk().
              d) reverse the flag set in a) so transactions can now enter prepare(). But the channel will prevent the prepare() call from going out on the cluster

              2) In getState() on node that's providing state, execute something pretty much like what I already wrote to deal with ACTIVE transactions on that node or any leftover locks held by sick nodes.

              3) No need for a separate "unblock" step. The flag is already reversed in 1d; when FLUSH is done, the blocked prepare() calls will start to flow.
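              The BlockEvent handling in step 1 might be sketched roughly as follows. Everything here is illustrative, not actual JGroups/JBC code; blockOk() stands in for Channel.blockOk(), and the flag corresponds to the latch keeping local transactions out of prepare():

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Rough simulation of steps 1a-1d on one node. The names are
// stand-ins; waitForInFlightTransactions() is a placeholder for
// polling the TransactionTable for txs that are beyond ACTIVE.
public class BlockEventHandler {
    static final AtomicBoolean prepareBlocked = new AtomicBoolean(false);

    // Stand-in for Channel.blockOk(); a real impl would ack FLUSH here.
    static void blockOk() {
        System.out.println("blockOk sent");
    }

    static void onBlockEvent() {
        prepareBlocked.set(true);        // 1a: stop txs entering prepare()
        waitForInFlightTransactions();   // 1b: let beyond-ACTIVE txs finish
        blockOk();                       // 1c: ack the flush
        prepareBlocked.set(false);       // 1d: channel now blocks prepares
    }

    static void waitForInFlightTransactions() {
        // placeholder: real code would wait on / roll back in-flight txs
    }

    public static void main(String[] args) {
        onBlockEvent();
        System.out.println("prepareBlocked=" + prepareBlocked.get());
    }
}
```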

              • 19. Re: Locking tree before setting initial state
                vblagojevic

                OK, I see. You believe that an event/callback at the JChannel level indicating that FLUSH completed is useful. I think so too. Let's see what Bela says.

                I understand your algorithm for locking better now.

                • 20. Re: Locking tree before setting initial state
                  belaban

                  Hmm, not so good, because that would mean I have to add another callback, blockDone(), in the MembershipListener...
                  What do we do about compatibility?

                  • 21. Re: Locking tree before setting initial state
                    brian.stansberry

                    After I thought through the algorithm some more, I don't think we need it for the immediate issue. So adding that could be more of a future thing, maybe 3.0 or something.

                    • 22. Re: Locking tree before setting initial state
                      vblagojevic


                      Following up on the discussion we had on jbosscache-dev. In summary, Brian found a fundamental problem in FLUSH that needed to be resolved before attacking JBCACHE-315. The problem is described below. FLUSH was retrofitted in JGroups 2.4 final to include the solution Brian describes in the last paragraph.


                      Brian said on jbosscache-dev:

                      We have a problem in that the FLUSH protocol makes the decision to shut off the ability to pass messages down the channel independently at each node. The protocol doesn't include anything at the JGroups level to readily support coordination between nodes as to when to shut off down messages. But, JBC needs coordination since it needs to make RPC calls around the cluster (e.g. commit()) as part of how it handles FLUSH.

                      Basically, when the FLUSH protocol on a node receives a message telling it to START_FLUSH, it calls block() on the JBC instance. JBC does what it needs to do, then returns from block(). Following the return from
                      block() the FLUSH protocol in that channel then begins blocking any further down() messages.

                      Problem is as follows. 2 node REPL_SYNC cluster, A B where A is just starting up and thus initiates a FLUSH:

                      1) JBC on B has tx in progress, just starting the 2PC. Sends out the prepare().
                      2) A sends out a START_FLUSH message.
                      3) A gets START_FLUSH, calls block() on JBC.
                      4) JBC on A is new, doesn't have much going on, very quickly returns from block(). A will no longer pass *down* any messages below FLUSH.
                      5) A gets the prepare() (no problem, FLUSH doesn't block up messages, just down messages.)
                      6) A executes the prepare(), but can't send the response to B because FLUSH is blocking the channel.
                      7) B gets the START_FLUSH, calls block() on JBC.
                      8) JBC on B doesn't immediately return from block(), as it is giving the prepare() some time to complete (to avoid unnecessary tx rollback). But the prepare() won't complete because A's channel is blocking the RPC response!! Eventually JBC B's block() impl will have to roll back the tx.

                      Basically you have a race condition between calls to block() and
                      prepare() calls, and can have different winners on different nodes.

                      A solution we discussed, rejected and then came back to this evening (please read FLUSH.txt to understand the change we're discussing):

                      Channel does not block down messages when block() returns. Rather it just sends out a FLUSH_OK message (see FLUSH.txt). It shouldn't initiate any new cluster activity (e.g. a prepare()) after sending FLUSH_OK, but it can respond to RPC calls. When it gets a FLUSH_OK from all the other members, it then blocks down messages and multicasts a FLUSH_COMPLETED to the cluster.
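                      The revised handshake amounts to a small per-member state machine: send FLUSH_OK after block() returns, and only block down-messages once FLUSH_OK has arrived from every other member. A hedged sketch of that rule; the class and method names are mine, not from FLUSH.txt:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the revised FLUSH handshake: after block() returns we only
// send FLUSH_OK; down-messages are blocked only once FLUSH_OK has been
// received from all other members, at which point FLUSH_COMPLETED is
// multicast. Names are illustrative.
public class FlushCoordinator {
    private final Set<String> pendingOks;
    private boolean downBlocked = false;

    FlushCoordinator(Set<String> otherMembers) {
        this.pendingOks = new HashSet<>(otherMembers);
    }

    // Called when FLUSH_OK is received from a member. Returns true when
    // this node should now block down-messages and multicast
    // FLUSH_COMPLETED to the cluster.
    boolean onFlushOk(String member) {
        pendingOks.remove(member);
        if (pendingOks.isEmpty() && !downBlocked) {
            downBlocked = true;   // only now block down messages
            return true;          // caller multicasts FLUSH_COMPLETED
        }
        return false;
    }

    boolean isDownBlocked() {
        return downBlocked;
    }

    public static void main(String[] args) {
        FlushCoordinator fc = new FlushCoordinator(Set.of("A", "B"));
        System.out.println(fc.onFlushOk("A"));  // still waiting for B
        System.out.println(fc.onFlushOk("B"));  // all OKs in: block down
        System.out.println(fc.isDownBlocked());
    }
}
```

Until onFlushOk() fires for the last member, the node can still respond to RPC calls such as the prepare() response in step 6, which is exactly what breaks the race described above.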

                      • 23. Re: Locking tree before setting initial state
                        vblagojevic

                        Here is a transcript of conversation that I had with Brian regarding the algorithm details of JBCACHE-315:

                        Replying privately, but I think we should take this to jbcache-dev or the forum. This is complex.

                        Vladimir Blagojevic wrote:
                        > Hey Brian,
                        >
                        > I thought a bit more about the locking algorithm and I would like to
                        > bounce it off you. If you recall we agreed on our phone call that we
                        > have to go through the steps of:
                        >
                        > a) set a flag or something that will block any ACTIVE transactions
                        > from proceeding (i.e. entering prepare() phase of 2PC)
                        >
                        > We then revised this by saying that in fact "any ACTIVE transactions"
                        > should rather be "any ACTIVE locally initiated transactions". We also
                        > agreed how we can do this by using a latch in TxInterceptor.
                        >
                        >
                        > b) wait for completion of any transactions that are beyond ACTIVE.
                        >
                        >
                        > We thought that this was a great idea but soon realized that
                        > transactions can still deadlock. You said:"For example, locally
                        > initiated transaction is holding a lock on some node and you have
                        > remote prepare that comes in. Remote won't be able to acquire lock. At
                        > some point we have to deal with that. Whoever sent that prepare call
                        > isn't going to proceed - sender will block on that synchronous call.
                        > So on remote node prepare is not going to be progressing."

                        To be even more specific:

                        On the remote node (i.e. the one we're working on) the JG up-handler thread will be blocked while the prepare() call waits to acquire a lock. That thread will block until there is a lock timeout. This will occur whether we are using REPL_ASYNC or REPL_SYNC. One effect of this is that no other JG messages will be received until there is a timeout. Note also that *I think* that having another thread roll back the tx associated with the prepare call will not cause the JG up-handler thread to unblock!!

                        If REPL_SYNC, on the node that originated the GTX, the client thread that committed the tx will be blocking waiting for a response to the prepare() call.

                        >
                        >
                        > I have another proposal. If we already have to introduce a latch why
                        > not introduce it in "better" location. So the proposal is to introduce
                        > our latch in InvocationContextInterceptor rather than in
                        > TxInterceptor.
                        > InvocationContextInterceptor is always first interceptor in the chain.
                        > By introducing a latch here we can inspect a call and determine its
                        > origin and transactional status and block transactions prior to them
                        > grabbing any locks.

                        Can this be done in the TxInterceptor? I.e. isn't it always before any LockInterceptor? I would think it would be. I expect Manik would put up a fuss about doing tx-related stuff outside TxInterceptor; the whole reason it was added in 1.3.0 was to encapsulate stuff that was previously spread around other interceptors.

                        > If a transaction
                        > is originating locally and has not been registered in
                        > TransactionTable (has not yet performed any operation), block it on
                        > the latch before it has a chance to acquire any locks.

                        +1. No reason to let a tx that hasn't acquired any locks go through and cause trouble.

                        > Then we look at the table and roll back any local transactions that
                        > have not yet gone to prepare, i.e. transactions that we missed with
                        > our latch. If any rolled-back transaction retries it will be caught by
                        > our latch :) All other transactions we let go through. Start a timer
                        > and give it enough time for beyond-prepare transactions to finish.
                        >
                        > So in pseudocode, algorithm executed on each node:
                        >
                        > receive block call
                        > flip a latch in InvocationContextInterceptor and block any subsequent
                        > local transactions
                        > rollback local transactions if not yet in prepare phase and start
                        > timer T (allow some time for beyond prepare transactions to finish)
                        > if lock still exists at integration node after T expires rollback our
                        > local transaction
                        > flip latch back and allow transactions to proceed
                        > return from block ok
                        >
                        >
                        > flush blocks all down threads (thus no prepare will go through
                        > although local transactions will proceed on each node)
                        >
                        > Proceed with algorithm on state provider:
                        >
                        > receive getState
                        > grab a lock on integration point using LockUtil.breakLock variant,
                        > possibly rolling back some local transactions
                        > read state and do state transfer with state receiver
                        >
                        > when state transfer is done prepare messages will hit the cluster and
                        > state will be consistent
                        > no matter what happens with all global transactions
                        >
                        >

                        The concern I have with this is that we give up one of the key goals -- not rolling back a tx if it's not hurting anything. Here we assume that an ACTIVE locally originated tx is going to cause a problem by blocking a remote prepare() call, so we roll back the tx. Actually, the odds of a remote prepare() call being blocked are pretty low.

                        How about this:

                        1) receive block call
                        2) flip a latch in TxInterceptor (I'm assuming it will work putting it here instead of InvocationContextInterceptor). This latch is used at 2 or 3 different control points to block any threads that:
                        a) are not associated with a GTX (i.e. non-transactional reads/writes from the local server)
                        b) are associated with a GTX, but not yet in TransactionTable (your idea above)
                        c) are associated with a locally originated GTX and are about to enter the beforeCompletion phase (i.e. the original idea of preventing the tx from proceeding to making a prepare() call.)
                        3) Loop through the GTXs in the TransactionTable. Create a little object for each GTX and throw it in a map. The object is a state machine that uses the JTA transaction status, the elapsed time, and whether the tx is locally or remotely originated to govern its state transitions. Keep looping through the TransactionTable, creating more of these objects if more GTXs appear; for each GTX, update the object with the current transaction status, then read the object's state. The object's state tells you whether you need to roll back the tx, etc.
                        4) If the state machine is for a *remotely initiated* GTX that's in ACTIVE status, after some elapsed time its state will tell you that it's likely held up by a lock conflict with a locally originated tx. At that point we have a choice:
                        a) roll back all locally originated tx's that are ACTIVE or PREPARED. Con: indiscriminately breaks transactions. Con: if the tx has already entered beforeCompletion() we don't know whether it's in the prepare() call or later. We can only roll it back during beforeCompletion(); otherwise we introduce a heuristic.
                        b) roll back the remotely originated tx. Pro: doesn't indiscriminately break transactions. Con: I *think* this rollback won't unblock the JG up-handler thread.
                        5) We'd need to work out all the state transitions; i.e. what conditions lead to tx rollback.
                        6) flip latch back and allow transactions to proceed
                        7) return from block ok
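                        The per-GTX state machine in steps 3-4 might classify transactions roughly like this. The status constants mirror javax.transaction.Status values (inlined to keep the sketch self-contained), but the class, its names, and the decision rule are purely illustrative, not JBC code:

```java
// Sketch of the per-GTX state machine from steps 3-4. A watcher is
// created per GTX and repeatedly updated with the current JTA status;
// its answer says whether anything needs rolling back.
public class GtxWatcher {
    // Inlined javax.transaction.Status values: STATUS_ACTIVE=0, STATUS_PREPARED=2
    static final int ACTIVE = 0, PREPARED = 2;

    enum Decision { LEAVE_ALONE, ROLL_BACK_SUSPECT_LOCAL_TXS }

    final boolean locallyOriginated;
    final long startedMillis;

    GtxWatcher(boolean locallyOriginated, long startedMillis) {
        this.locallyOriginated = locallyOriginated;
        this.startedMillis = startedMillis;
    }

    // Step 4: a remotely initiated GTX stuck in ACTIVE past the grace
    // period is probably blocked on a lock held by a local tx, so we
    // pick option 4a and roll back the suspect local transactions.
    Decision update(int jtaStatus, long nowMillis, long graceMillis) {
        if (!locallyOriginated && jtaStatus == ACTIVE
                && nowMillis - startedMillis > graceMillis) {
            return Decision.ROLL_BACK_SUSPECT_LOCAL_TXS;
        }
        return Decision.LEAVE_ALONE;
    }

    public static void main(String[] args) {
        GtxWatcher remote = new GtxWatcher(false, 0);
        // within the grace period: leave it alone
        System.out.println(remote.update(ACTIVE, 100, 5_000));
        // stuck past the grace period: likely a lock conflict
        System.out.println(remote.update(ACTIVE, 10_000, 5_000));
    }
}
```

Option 4b (rolling back the remote GTX instead) would be a different transition here; as noted above, the doubt is whether that rollback actually unblocks the JG up-handler thread.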

                        >
                        >
                        >
                        > So in summary, the goal of the first part of the algorithm is to allow
                        > transactions beyond prepare to finish and prevent any local
                        > transactions from hitting the cluster and becoming global. That leaves
                        > us dealing only with local transactions at the state provider in the
                        > second part of the algorithm. In the second part we deal just with the
                        > state provider: we grab the lock at the integration point, possibly
                        > roll back any local transactions there, do the state transfer and let
                        > the prepares hit the cluster, thus preserving state consistency and
                        > disturbing the least number of global transactions.
                        >
                        >
                        > It seems like if we had that blockDone JGroups callback then things
                        > would be nicer. Algorithm executed on each node:
                        >
                        >
                        > receive block call
                        > flip a latch in InvocationContextInterceptor and block any subsequent
                        > local transactions
                        > rollback local transactions if not yet in prepare phase and start
                        > timer T
                        > if lock still exists at integration node after T expires rollback our
                        > local transaction
                        > return from block ok
                        >
                        >
                        > flush blocks all down threads (thus no prepare will go through
                        > although local transactions will proceed on each node)
                        >
                        > Proceed with algorithm on state receiver and provider:
                        >
                        > do state transfer
                        >
                        > Proceed with algorithm executed on each node:
                        >
                        > flip latch back and allow transactions to proceed
                        >
                        >

                        Here's a question for you about FLUSH: when a service returns from block() or sends blockOk(), does the channel immediately block? Is there coordination across the cluster?

                        My concern:

                        Node A doesn't have much going on and quickly returns from block(), so its channel is blocked.

                        Node B takes a little longer; has some txs the completion of which requires sending messages to A. Those messages don't get through due to A being blocked.

                        • 24. Re: Locking tree before setting initial state
                          vblagojevic

                          There is a rather big conceptual FLUSH change in JGroups 2.6: we now always allow unicast messages to pass down through the FLUSH.down() latch. Bela and I realized that the virtual synchrony property is relevant only to multicast messages and that FLUSH should not be concerned with unicasts at all.

                          This potentially has a big impact on JBCACHE-315. Let's reopen the discussion.

                          • 25. Re: Locking tree before setting initial state
                            manik

                            This can be a problem. With buddy replication, most messages *are* unicasts, to specific members in a buddy group.
