We have an issue regarding state transfer within our system. We basically assume that, given two clustered caches A and B, cache A can be used and accessed safely while cache B is starting, without any risk of data loss once cache B has successfully started and is being accessed. This, however, does not seem to be the case all the time.
You will find a test case here, with source and configuration files: http://www.cubeia.com/misc/statetransfer/src.zip
The test stresses the state transfer. The test class is documented, but in short, it sets up:
- Two caches, but starts only the first.
- Fifty objects with a "counter", mapped by id within cache one.
- Fifty threads associated with one object each (by id) and cache one.
The test then goes like this:
- The threads are started. Each thread accesses its associated object, checks that the counter is correct (i.e. that it can be incremented by one without losing intermediate states), increments the counter and repeats.
- Cache two is started.
- Half of the threads are re-associated with cache two instead of cache one. Their execution is not halted while this happens.
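To make the check concrete, here is a minimal sketch of what each worker does. It is not the code from the zip file: a plain Map stands in for a JBoss Cache instance, and the names SequenceCheckSketch, Outcome, run and cacheAt are made up for illustration. It only shows how a sequence error differs from missing state.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntFunction;

// Hypothetical stand-in for the worker logic; a Map replaces the real cache API.
class SequenceCheckSketch {

    enum Outcome { OK, SEQUENCE_ERROR, MISSING_STATE }

    /**
     * Runs `rounds` read-check-increment cycles on the counter stored under `id`.
     * `cacheAt` picks the store used in a given round, imitating the
     * re-association of a running thread from cache one to cache two.
     */
    static Outcome run(int id, int rounds, IntFunction<Map<Integer, Integer>> cacheAt) {
        int expected = 0;
        for (int round = 0; round < rounds; round++) {
            Map<Integer, Integer> cache = cacheAt.apply(round);
            Integer current = cache.get(id);
            if (current == null) {
                return Outcome.MISSING_STATE;   // object never reached this cache
            }
            if (current != expected) {
                return Outcome.SEQUENCE_ERROR;  // intermediate increments were lost
            }
            cache.put(id, current + 1);
            expected = current + 1;
        }
        return Outcome.OK;
    }

    // Creates a store holding a single counter that starts at zero.
    static Map<Integer, Integer> storeWithCounter(int id) {
        Map<Integer, Integer> store = new ConcurrentHashMap<>();
        store.put(id, 0);
        return store;
    }

    public static void main(String[] args) {
        // One shared map models perfect replication between the two caches.
        Map<Integer, Integer> shared = storeWithCounter(7);
        System.out.println(run(7, 10, round -> shared));

        // An empty second map models state that was never transferred.
        Map<Integer, Integer> one = storeWithCounter(7);
        Map<Integer, Integer> empty = new ConcurrentHashMap<>();
        System.out.println(run(7, 10, round -> round < 5 ? one : empty));

        // A second map stuck at the initial value models lost updates.
        Map<Integer, Integer> first = storeWithCounter(7);
        Map<Integer, Integer> stale = storeWithCounter(7);
        System.out.println(run(7, 10, round -> round < 5 ? first : stale));
    }
}
```

In the real test the two stores are separate cache instances kept in sync by replication and state transfer, so the second and third cases correspond to the "missing state" and "sequence error" failures described below.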
The test fails on either JBoss Cache exceptions or 1) sequence errors (i.e. lost intermediate states); or 2) missing state (i.e. an attempt to access an object in cache two which has not been replicated at all). We have tested the following setups in different permutations: REPL_SYNC/REPL_ASYNC, user transaction/no user transaction, and buddy replication enabled or disabled. So far our results look somewhat like this (and they do match what we're seeing in our main system):
SYNCH + TRANS + NO BUDDY
Fails to replicate to cache two with a replication exception caused by a suspected member exception.
ASYNCH + TRANS + NO BUDDY
Sequence errors. Is this expected?
SYNCH + NO TRANS + NO BUDDY
ASYNCH + NO TRANS + NO BUDDY
Sequence errors. Is this expected? There are also "cache not in started state" errors on shutdown.
SYNCH + TRANS + BUDDY
Fails with a timeout exception, followed by a load of subsequent exceptions.
ASYNCH + NO TRANS + BUDDY
Sequence errors. Is this expected? Object not found (!). Also, depending on whether "loopback" is set in the JGroups stack or not, you get slightly different behaviour. With loopback=true you get timeout exceptions on buddy backup nodes. With loopback=false you get "cache not in started state" errors on shutdown.
ASYNCH + TRANS + BUDDY
(see above, adding a user transaction does not change the behaviour)
We're primarily interested in getting REPL_ASYNC to work with user transactions and buddy replication (however, we're aware that asynchronous replication might not work in this scenario so synchronous would be ok). So, a few initial questions:
1) Is our understanding of the cache correct (see first paragraph)? If not, what prerequisites are needed to safely start a new node in a cluster?
2) If our understanding is somewhat ok, is the test correct? Obviously I may very well have screwed up somewhere in the code :-)
If 1 && 2: Then the test seems to point out, at least, unexpected behaviour.
Also, I'm aware that these might be taxing and time-consuming questions to answer, or indeed even to verify, even if there's no issue involved. So please indicate if there's any support agreement or service which would enable us to proceed faster, better or at all :-) You can reach me by PM or by emailing me at (my name as written below, first name, middle initial and surname in lower case without spaces but separated by dots)@cubeia.com.
Lars J. Nilsson