Cache#remove() not behaving correctly under high load - while primary owner node is stopped
c_lohmn Mar 15, 2015 12:31 PMIn our application we had the problem that under high load a Cache#remove(key) invocation returned the old value although the same key was already removed on another node a few milliseconds before.
This occurred while another node was shutdown - that node happened to be the coordinator and also the primary owner of the key here.
The cache is a distributed, DB-persisted, transactional (locking mode Pessimistic) cache with numOwners=1.
This behaviour caused trouble because our application logic at that place relies very much on the fact that only one Cache#remove(key) invocation returns the old value here.
(We still use ISPN 6.0.2.)
I could reproduce such an issue with the attached unit test (tested on ISPN 7.1.2).
Scenario:
- MultipleCacheManagersTest with 3 nodes
- Cache is distributed, persisted, transactional (locking mode Pessimistic) with numOwners 1 or 2 (see below; getTestCacheNumOwners() in the test)
- a number of test keys get inserted; only keys with the coordinator as primary owner get used for the test
- while the keys get removed one-by-one and concurrently by 2 threads...
- ... the coordinator node gets stopped
At the end, the return values of the remove() invocations get checked.
remove() invocations that threw an error (SuspectException) are ignored.
I get the following problems:
- the remove(key) invocations on both nodes returned null and the entry was still there in the end
=> I would have expected an exception here as remove(key) essentially failed - the remove(key) invocations on both nodes returned null but the entry was actually removed
=> wrong return value of first invocation - the remove(key) invocations on both nodes returned both the old value
=> wrong return value of second invocation
These issues occur not always but quite often with the test.
When testing with numOwners=2, I only got the last of the above issues - in the test with numOwners=1 that occurred only rarely.
Could someone have a look here and see what the underlying problem might be?
Otherwise I'd create a ticket.
Cheers,
Carsten
PS:
Here is some log output:
###### numOwners = 1 ######
Issue 1: remove() of key 'k4' did not succeed, but we got no exception (entry still there; value: 'v')
2015-03-15 15:03:22,285 INFO (RemovalThread-main-NodeB-11765) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeB-11765: locality of key 'k4': NOT_LOCAL
2015-03-15 15:03:22,285 INFO (RemovalThread-main-NodeC-45760) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeC-45760: locality of key 'k4': NOT_LOCAL
2015-03-15 15:03:22,287 WARN (main) [TransactionTable] ISPN000100: Stopping, but there are 0 local transactions and 1 remote transactions that did not finish in time.
2015-03-15 15:03:22,288 WARN (main) [PreferAvailabilityStrategy] ISPN000312: Cache ___defaultcache lost data because of graceful leaver main-NodeA-49168
2015-03-15 15:03:22,289 INFO (RemovalThread-main-NodeB-11765) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: false
2015-03-15 15:03:22,290 INFO (RemovalThread-main-NodeC-45760) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: false
Issue 2: remove() did actually remove the key 'k4', but the return value was null
2015-03-15 15:04:37,493 INFO (RemovalThread-main-NodeB-47460) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeB-47460: locality of key 'k4': NOT_LOCAL
2015-03-15 15:04:37,493 INFO (RemovalThread-main-NodeC-16252) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeC-16252: locality of key 'k4': NOT_LOCAL
2015-03-15 15:04:37,495 WARN (main) [TransactionTable] ISPN000100: Stopping, but there are 0 local transactions and 1 remote transactions that did not finish in time.
2015-03-15 15:04:37,496 WARN (main) [PreferAvailabilityStrategy] ISPN000312: Cache ___defaultcache lost data because of graceful leaver main-NodeA-15814
2015-03-15 15:04:37,497 INFO (RemovalThread-main-NodeB-47460) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: false
2015-03-15 15:04:37,499 INFO (main) [JGroupsTransport] ISPN000080: Disconnecting JGroups channel ISPN
2015-03-15 15:04:37,503 INFO (RemovalThread-main-NodeC-16252) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: false
Issue 3: remove() invocations returned the old value for key 'k4' (occurred rarely)
2015-03-15 15:12:07,370 INFO (RemovalThread-main-NodeC-54601) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeC-54601: locality of key 'k4': NOT_LOCAL
2015-03-15 15:12:07,371 INFO (RemovalThread-main-NodeB-26306) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeB-26306: locality of key 'k4': NOT_LOCAL
2015-03-15 15:12:07,372 WARN (main) [TransactionTable] ISPN000100: Stopping, but there are 0 local transactions and 1 remote transactions that did not finish in time.
2015-03-15 15:12:07,372 WARN (main) [PreferAvailabilityStrategy] ISPN000312: Cache ___defaultcache lost data because of graceful leaver main-NodeA-30343
2015-03-15 15:12:07,377 INFO (main) [JGroupsTransport] ISPN000080: Disconnecting JGroups channel ISPN
2015-03-15 15:12:07,385 INFO (RemovalThread-main-NodeB-26306) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: true
2015-03-15 15:12:07,386 INFO (RemovalThread-main-NodeC-54601) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: true
###### numOwners = 2 ######
Issue 3: 2 remove() invocations returned the old value for key 'k4' (occurred more often than in numOwners=1 case)
2015-03-15 15:21:29,772 INFO (RemovalThread-main-NodeC-23833) [RemoveEntryWhilePrimaryOwnerStopsTest] main-NodeC-23833: locality of key 'k4': NOT_LOCAL
2015-03-15 15:21:29,772 INFO (RemovalThread-main-NodeB-35168) [RemoveEntryWhilePrimaryOwnerStopsTest] main-NodeB-35168: locality of key 'k4': LOCAL
2015-03-15 15:21:29,773 INFO (main) [RemoveEntryWhilePrimaryOwnerStopsTest] ######### Shutting down chosen node main-NodeA-27798... ###########
2015-03-15 15:21:29,780 INFO (RemovalThread-main-NodeC-23833) [RemoveEntryWhilePrimaryOwnerStopsTest] Remove() of key 'k4'; non-null value returned: true
2015-03-15 15:21:29,781 INFO (RemovalThread-main-NodeB-35168) [RemoveEntryWhilePrimaryOwnerStopsTest] Remove() of key 'k4'; non-null value returned: true