We are moving internal thread discussion to the forum so everyone can contribute and discuss.
While clearing up remaining unit test failures and getting ready for 2.0 GA release we noticed transient StateTransferConcurrencyTest#testConcurrentUseSync failure. This discovery has implications beyond this test and concerns concurrent region activation (state transfer) and high invocation load. testConcurrentUseSync does bunch of synchronous cache invocations on five cache instances while one of the caches activates regions on itself.
Lets say we have five cache instances A,B,C,D, and E. A is the instances that does region.activate() for regions under which B,C,D, and E do synchronous put invocations. These invocations and region.activate() are concurrent. Here is what happens. At the moment FLUSH is started at A (in order to do state transfer triggered by region.activate) one of the members (B,C,D, and/or E) invokes cache.put. Lets say D. This cache put invocation, say CP, coming from D goes down D's stack before channel gets blocked at D. CP arrives at A and goes up the stack at A. At that moment FLUSH proceeded and is already blocking down on all channels. Invocation response for CP never returns from A since channels are blocked. After careful observation we have concluded that until CP returns from A (response blocked down due to flush) any subsequent mcast messages from D will not arrive at FLUSH protocol level at A. This is not a JGroups bug but a valid property related to FIFO message delivery. This unfortunately includes STOP_FLUSH_OK messages and thus FLUSH cannot complete gracefully at A until CP returns. Finally, we run into timeouts, test starts barfing TimeoutExceptions and fails.
We have to augment FLUSH slightly. Have a look at the semantics of FLUSH in the link below. Currently we have one semaphore B (depicted in diagram as "wait on FLUSH.down()") that is activated once each channel receives all FLUSH_OK messages from all channels. We will introduce another semaphore A that will block down *only* non JGroups threads i.e application threads. Here are the details:
Semaphore A: When each channel gets START_FLUSH message do not allow user/application threads to call channel.down(). Upon switching on semaphore A, JGroups thread that percolated up START_FLUSH, travels up to application level and carries BLOCK even/callback. This JGroups thread then can do any necessary cleanup or whatever and can even send messages back down the stack because semaphore B has not been activated yet.
Semaphore B: keeps current semantics. When each channel finishes first round of FLUSH (FLUSH_OK) do not allow any threads to call channel.down()
So how does this work for the problem above? FLUSH_OK round is strictly guaranteed to be sent after any applications messages so all synchronous calls have to unwind and return because FLUSH_OK will flush them. In the above problem description CP is guaranteed to unwind. Solution we had before relied on a good will of application that it will not send any more messages after it receives BLOCK event. If application disobeyed like JBC did we have a race condition of application message and FLUSH_OK. If application message was sent after FLUSH_OK we get the problem from above.
As a side note, we could have solved this in JBC but having JBC follow FLUSH BLOCK semantics - however that solution looked prohibitively expensive performance wise.
Currently we have only semaphore B. We originally had B in the place of A but have since moved it to B so we can have a BLOCK mechanism notification that allows solution of JBCACHE-315. This proposal with two semaphores will still leave room to solve JBCACHE-315 because BLOCK event travels up the stack on JGroups thread when START_FLUSH is received. So any work has to be done on that thread and channel can potentially send messages down the stack before semaphore B kicks in.
Brian had concerns about implications of semaphore A on a solution for JBCACHE-315 and rightly so. JBCACHE-315 solution involves the following:
1) JGroups thread that percolated up START_FLUSH, travels up
to application level and invokes block() callback.
2) JBC block() impl involves
a) setting some flag/latch to prevent new tx's accessing the cache and thus causing unreplicated state changes. (We should deal with non-tx write calls as well.)
b) Monitoring existing tx's, giving them a chance to complete.
c) Rolling back tx's that don't complete.
2b and 2c involve letting application threads send messages (PREPARE/COMMIT/ROLLBACK RPCs). If JGroups is going to be blocking those messages on semaphore A, we're stuck.
So we have to also think of a solution for JBCACHE-315 concurrently so to speak. Maybe the algorithm for JBCACHE-315 can be simplified somehow. What if we simply turn the latch on to prevent new txs accessing cache and send rollback message on JGroups thread for all txs that are in progress? The cost is more rollbacks but we get strict consistency and simplicity.
Lets hear your suggestions!