2 Replies Latest reply on Mar 17, 2015 9:40 AM by c_lohmn

    Cache#remove() not behaving correctly under high load - while primary owner node is stopped

    c_lohmn

      In our application we had the problem that under high load a Cache#remove(key) invocation returned the old value although the same key was already removed on another node a few milliseconds before.

      This occurred while another node was shutdown - that node happened to be the coordinator and also the primary owner of the key here.

      The cache is a distributed, DB-persisted, transactional (locking mode Pessimistic) cache with numOwners=1.

       

      This behaviour caused trouble because our application logic at that place relies very much on the fact that only one Cache#remove(key) invocation returns the old value here.

      (We still use ISPN 6.0.2.)

       

       

      I could reproduce such an issue with the attached unit test (tested on ISPN 7.1.2).

      Scenario:

      • MultipleCacheManagersTest with 3 nodes
      • Cache is distributed, persisted, transactional (locking mode Pessimistic) with numOwners 1 or 2 (see below; getTestCacheNumOwners() in the test)
      • a number of test keys get inserted; only keys with the coordinator as primary owner get used for the test
      • while the keys get removed one-by-one and concurrently by 2 threads...
      • ... the coordinator node gets stopped

       

      At the end, the return values of the remove() invocations get checked.

      remove() invocations that threw an error (SuspectException) are ignored.

       

      I get the following problems:

      1. the remove(key) invocations on both nodes returned null and the entry was still there in the end
        => I would have expected an exception here as remove(key) essentially failed
      2. the remove(key) invocations on both nodes returned null but the entry was actually removed
        => wrong return value of first invocation
      3. the remove(key) invocations on both nodes returned both the old value
        => wrong return value of second invocation

       

      These issues occur not always but quite often with the test.

      When testing with numOwners=2, I only got the last of the above issues - in the test with numOwners=1 that occurred only rarely.

       

      Could someone have a look here and see what the underlying problem might be?

      Otherwise I'd create a ticket.

       

      Cheers,

      Carsten

       

       

       

      PS:

      Here is some log output:

       

      ###### numOwners = 1 ######

      Issue 1: remove() of key 'k4' did not succeed, but we got no exception (entry still there; value: 'v')

      2015-03-15 15:03:22,285 INFO  (RemovalThread-main-NodeB-11765) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeB-11765: locality of key 'k4': NOT_LOCAL

      2015-03-15 15:03:22,285 INFO  (RemovalThread-main-NodeC-45760) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeC-45760: locality of key 'k4': NOT_LOCAL

      2015-03-15 15:03:22,287 WARN  (main) [TransactionTable] ISPN000100: Stopping, but there are 0 local transactions and 1 remote transactions that did not finish in time.

      2015-03-15 15:03:22,288 WARN  (main) [PreferAvailabilityStrategy] ISPN000312: Cache ___defaultcache lost data because of graceful leaver main-NodeA-49168

      2015-03-15 15:03:22,289 INFO  (RemovalThread-main-NodeB-11765) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: false

      2015-03-15 15:03:22,290 INFO  (RemovalThread-main-NodeC-45760) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: false

       

       

      Issue 2: remove() did actually remove the key 'k4', but the return value was null

      2015-03-15 15:04:37,493 INFO  (RemovalThread-main-NodeB-47460) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeB-47460: locality of key 'k4': NOT_LOCAL

      2015-03-15 15:04:37,493 INFO  (RemovalThread-main-NodeC-16252) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeC-16252: locality of key 'k4': NOT_LOCAL

      2015-03-15 15:04:37,495 WARN  (main) [TransactionTable] ISPN000100: Stopping, but there are 0 local transactions and 1 remote transactions that did not finish in time.

      2015-03-15 15:04:37,496 WARN  (main) [PreferAvailabilityStrategy] ISPN000312: Cache ___defaultcache lost data because of graceful leaver main-NodeA-15814

      2015-03-15 15:04:37,497 INFO  (RemovalThread-main-NodeB-47460) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: false

      2015-03-15 15:04:37,499 INFO  (main) [JGroupsTransport] ISPN000080: Disconnecting JGroups channel ISPN

      2015-03-15 15:04:37,503 INFO  (RemovalThread-main-NodeC-16252) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: false

       

      Issue 3: remove() invocations returned the old value for key 'k4'     (occurred rarely)

      2015-03-15 15:12:07,370 INFO  (RemovalThread-main-NodeC-54601) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeC-54601: locality of key 'k4': NOT_LOCAL

      2015-03-15 15:12:07,371 INFO  (RemovalThread-main-NodeB-26306) [RemoveEntryWhileSingleOwnerStopsTest] main-NodeB-26306: locality of key 'k4': NOT_LOCAL

      2015-03-15 15:12:07,372 WARN  (main) [TransactionTable] ISPN000100: Stopping, but there are 0 local transactions and 1 remote transactions that did not finish in time.

      2015-03-15 15:12:07,372 WARN  (main) [PreferAvailabilityStrategy] ISPN000312: Cache ___defaultcache lost data because of graceful leaver main-NodeA-30343

      2015-03-15 15:12:07,377 INFO  (main) [JGroupsTransport] ISPN000080: Disconnecting JGroups channel ISPN

      2015-03-15 15:12:07,385 INFO  (RemovalThread-main-NodeB-26306) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: true

      2015-03-15 15:12:07,386 INFO  (RemovalThread-main-NodeC-54601) [RemoveEntryWhileSingleOwnerStopsTest] Remove() of key 'k4'; non-null value returned: true

       

       

      ###### numOwners = 2 ######

      Issue 3: 2 remove() invocations returned the old value for key 'k4'     (occurred more often than in numOwners=1 case)

      2015-03-15 15:21:29,772 INFO  (RemovalThread-main-NodeC-23833) [RemoveEntryWhilePrimaryOwnerStopsTest] main-NodeC-23833: locality of key 'k4': NOT_LOCAL

      2015-03-15 15:21:29,772 INFO  (RemovalThread-main-NodeB-35168) [RemoveEntryWhilePrimaryOwnerStopsTest] main-NodeB-35168: locality of key 'k4': LOCAL

      2015-03-15 15:21:29,773 INFO  (main) [RemoveEntryWhilePrimaryOwnerStopsTest] ######### Shutting down chosen node main-NodeA-27798... ###########

      2015-03-15 15:21:29,780 INFO  (RemovalThread-main-NodeC-23833) [RemoveEntryWhilePrimaryOwnerStopsTest] Remove() of key 'k4'; non-null value returned: true

      2015-03-15 15:21:29,781 INFO  (RemovalThread-main-NodeB-35168) [RemoveEntryWhilePrimaryOwnerStopsTest] Remove() of key 'k4'; non-null value returned: true