How bad is the concurrency problem? Is it a race condition, where if two services start a FLUSH at nearly the same time, there's an issue? Or is it worse, i.e. service A on node 1 starts a flush/state transfer, 5 secs later service B on node 2 starts a flush while the A state transfer is still in progress, and something fails? Assume use of the mux here.
I'm trying to get a sense of the scope of the problem so we can decide priorities relative to getting the AS 5 beta done. For the AS 5 beta we need partial state transfer working; deploying web apps requires it. So the question is, do we restore the old RPC-based partial state transfer that got stripped out of 2.0 somewhere along the line, or do we go with the FLUSH-based one that has a known problem? It's a beta, so I don't think having a known problem is the end of the world. But if the problem is so bad that it's going to occur very frequently, we need to consider restoring the RPC approach for now.
It is a race condition. The problem will manifest if we have many cluster nodes (4+) and each of these nodes does concurrent activation, i.e. partial state transfer. I'll work more on this over the weekend and will update you with details.
OTOH, it would not be hard to restore the RPC approach. We had all the tests working until very recently (Oct 12, 2006). It would take no more than 2-3 days of coding and testing to restore it.
Thanks. 2-3 days' effort is quite a bit considering we'll turn around and throw it away a few weeks later, so if we can avoid it that's definitely preferable.
Here's a scenario / potential test:
Create a cache with say, 15 regions. Put some data in each of the regions.
2 threads. Each creates a cache then goes into a loop where it activates the 15 regions, with a 1 sec pause between activations.
Start one of the threads, wait 10 secs then start the other. See if both threads complete successfully.
That simulates a 10 sec staggered start of 2 servers in a cluster, with each server then deploying webapps. If that test can pass more than 90% of the time, I think it's fine for the initial beta.
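A rough harness for the scenario above might look like the following. It is only a sketch under assumptions: `Node` is a made-up placeholder for "a thread's cache plus its region activation call", not the real cache API, and the class name, region names and the timeout parameter are invented for illustration. The real test would plug actual cache creation and region activation in behind `Node`. Delays are parameters so the 1 sec / 10 sec timings can be scaled down for a quick run.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class StaggeredActivationSketch {

    /**
     * Hypothetical stand-in for one server's cache. A real implementation
     * would create the cache and trigger partial state transfer per region.
     */
    interface Node {
        void activateRegion(String regionName) throws Exception;
    }

    /** Activates regionCount regions in order, pausing pauseMs after each one. */
    static Callable<Integer> activationLoop(Node node, int regionCount, long pauseMs) {
        return () -> {
            int activated = 0;
            for (int i = 0; i < regionCount; i++) {
                node.activateRegion("/region" + i); // assumed naming scheme
                activated++;
                Thread.sleep(pauseMs);
            }
            return activated;
        };
    }

    /**
     * Starts the first loop, waits staggerMs, starts the second loop, and
     * returns true only if both loops activate every region within timeoutMs.
     */
    static boolean run(Node first, Node second, int regionCount,
                       long pauseMs, long staggerMs, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Integer> a = pool.submit(activationLoop(first, regionCount, pauseMs));
            Thread.sleep(staggerMs);
            Future<Integer> b = pool.submit(activationLoop(second, regionCount, pauseMs));
            return a.get(timeoutMs, TimeUnit.MILLISECONDS) == regionCount
                && b.get(timeoutMs, TimeUnit.MILLISECONDS) == regionCount;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException | TimeoutException e) {
            // An activation threw, or a loop hung past the timeout.
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    /** Repeats the scenario and returns the fraction of runs that passed. */
    static double passRate(int runs, Node first, Node second, int regionCount,
                           long pauseMs, long staggerMs, long timeoutMs) {
        int passed = 0;
        for (int i = 0; i < runs; i++) {
            if (run(first, second, regionCount, pauseMs, staggerMs, timeoutMs)) {
                passed++;
            }
        }
        return (double) passed / runs;
    }

    public static void main(String[] args) {
        // No-op node so the sketch runs standalone; timings scaled down 1:100
        // from the 1 sec pause and 10 sec stagger described above.
        Node noop = regionName -> { };
        double rate = passRate(5, noop, noop, 15, 10L, 100L, 5_000L);
        System.out.println("pass rate: " + rate);
    }
}
```

With real nodes behind `Node`, comparing `passRate(...)` against 0.9 would give the "more than 90% of the time" check. The timeout is there so that a hung state transfer counts as a failure instead of blocking the test forever.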
Thanks Brian. I will make a unit test as you suggest - something similar to the current concurrent test.
Regarding restoring the RPC based mechanism, we'd also need to roll back the queueing code in the marshalling Regions which I removed.
I'm with Brian in that I'd prefer not to roll back to the RPC mechanism.
I wrote the test that Brian suggested and it is passing. I am in the final stages of integrating this code. I am verifying the effect of having FLUSH in replSyncService.xml and replAsyncService.xml on other tests.