6 Replies Latest reply on Nov 8, 2006 9:50 AM by vblagojevic

    JBC partial state transfer and Region.activate()

    vblagojevic

      Hey guys,

      I spent the previous two days investigating the new JGroups callback-based partial state transfer implemented in JBC HEAD. Previous JBC releases relied on an RPC-based mechanism to implement partial state transfer. We are trying to move partial state transfer to the JGroups API, mainly to harness the benefits of the FLUSH protocol from JGroups. I have a working version that is elegant and does not require a lot of coding. However, I have discovered that the concurrent activation test does not always pass.

      The concurrent activation test starts N cache nodes, and each node concurrently activates a certain region. Underneath, for each activate request, the plumbing does a partial getState with flush. For this to work reliably I have to implement a feature in FLUSH [1] that supports concurrent flushing. Bela and I talked about this problem on our conference call and he gave me some ideas on how to do this, à la concurrent JGroups merge.


      This task seems like a priority for JBC 2.0. What should I do in terms of the JBC release plan? I can have partial state transfer implemented soon, but as I mentioned above, concurrent activations will not work reliably until [1] is resolved. Bela has agreed that we will include this feature/fix in a JGroups 2.4 service pack, but the delivery date for this SP is still undetermined. My estimate is that coding and testing [1], along with testing partial state transfer and concurrent activation in JBC, will take around 3 weeks.

      Regards,
      Vladimir

      [1] http://jira.jboss.com/jira/browse/JGRP-332

        • 1. Re: JBC partial state transfer and Region.activate()
          brian.stansberry

          How bad is the concurrency problem? Is it a race condition, where if two services start a FLUSH at nearly the same time, there's an issue? Or is it worse, i.e. service A on node 1 starts a flush/state transfer, 5 secs later service B on node 2 starts a flush while the A state transfer is still in progress, and something fails? Assume use of the mux here.

          I'm trying to get a sense of the scope of the problem so we can decide priorities relative to getting the AS 5 beta done. For the AS 5 beta we need partial state transfer working; deploying web apps requires it. So the question is, do we restore the old RPC-based partial state transfer that got stripped out of 2.0 somewhere along the line, or do we go with the FLUSH-based one that has a known problem? It's a beta, so I don't think having a known problem is the end of the world. But if the problem is so bad that it's going to occur very frequently, we need to consider restoring the RPC approach for now.

          • 2. Re: JBC partial state transfer and Region.activate()
            vblagojevic

            It is a race condition. The problem will manifest if we have many cluster nodes (4+) and each of these nodes does concurrent activation, i.e. partial state transfer. I'll work more on this over the weekend and will update you with details.

            OTOH, it would not be hard to restore the RPC approach. We had all the tests working until very recently (Oct 12, 2006). It would take no more than 2-3 days of coding and testing to restore it.

            • 3. Re: JBC partial state transfer and Region.activate()
              brian.stansberry

              Thanks. 2-3 days effort is quite a bit considering we'll turn around and throw it away a few weeks later, so if we can avoid it that's definitely preferable.

              Here's a scenario / potential test:

              Create a cache with say, 15 regions. Put some data in each of the regions.

              2 threads. Each creates a cache then goes into a loop where it activates the 15 regions, with a 1 sec pause between activations.

              Start one of the threads, wait 10 secs then start the other. See if both threads complete successfully.

              That simulates a 10 sec staggered start of 2 servers in a cluster, with each server then deploying webapps. If that test can pass more than 90% of the time, I think it's fine for the initial beta.
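
              A minimal sketch of how a harness for the staggered test above could look. It is not JBC code: RegionActivator is a hypothetical stand-in for whatever creates a cache and activates a region on one node, and the class name, method name and region names are made up for illustration. Delays are in milliseconds so the 1 sec pause and 10 sec stagger can be scaled down while experimenting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class StaggeredActivationSketch {

    /** Hypothetical stand-in for "activate this region on one cache instance". */
    interface RegionActivator {
        void activate(String regionName) throws Exception;
    }

    /**
     * Starts one worker per node, staggerMillis apart. Each worker activates
     * regionCount regions in order, pausing pauseMillis between activations.
     * Returns how many workers finished without throwing.
     */
    static int run(List<RegionActivator> nodes, int regionCount,
                   long pauseMillis, long staggerMillis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nodes.size());
        List<Future<Void>> results = new ArrayList<>();
        try {
            for (int i = 0; i < nodes.size(); i++) {
                if (i > 0) {
                    Thread.sleep(staggerMillis); // simulate the staggered server start
                }
                final RegionActivator node = nodes.get(i);
                Callable<Void> worker = () -> {
                    for (int r = 0; r < regionCount; r++) {
                        node.activate("/region" + r); // region names are illustrative
                        if (r < regionCount - 1) {
                            Thread.sleep(pauseMillis); // pause between activations
                        }
                    }
                    return null;
                };
                results.add(pool.submit(worker));
            }
            int succeeded = 0;
            for (Future<Void> f : results) {
                try {
                    f.get();
                    succeeded++;
                } catch (ExecutionException e) {
                    // this worker's activation loop failed; count it as a failure
                }
            }
            return succeeded;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        List<RegionActivator> nodes = new ArrayList<>();
        nodes.add(name -> { /* a real test would activate the region on cache 1 */ });
        nodes.add(name -> { /* a real test would activate the region on cache 2 */ });
        // 15 regions, with the 1 sec pause and 10 sec stagger scaled down to ms
        int ok = run(nodes, 15, 1, 10);
        System.out.println("workers completed: " + ok + " of " + nodes.size());
    }
}
```

              Running the harness repeatedly and counting how often every worker completes gives the pass rate to compare against the 90% bar. Setting staggerMillis to 0 approximates the fully concurrent case instead of the staggered one.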

              • 4. Re: JBC partial state transfer and Region.activate()
                vblagojevic

                Thanks Brian. I will write a unit test as you suggest, something similar to the current concurrent test.

                • 5. Re: JBC partial state transfer and Region.activate()
                  manik

                  Regarding restoring the RPC based mechanism, we'd also need to roll back the queueing code in the marshalling Regions which I removed.

                  I'm with Brian in that I'd prefer not to roll back to the RPC mechanism.

                  • 6. Re: JBC partial state transfer and Region.activate()
                    vblagojevic

                    I wrote the test that Brian suggested and it is passing. I am in the final stage of integrating this code. I am verifying the effect of having FLUSH in replSyncService.xml and replAsyncService.xml on other tests.