I recently ran into a very strange scenario in which the data grid was being read and written at very high frequency by multiple threads. The cache is set up in embedded, replicated mode, and all writes to the cache are wrapped in a transaction. Multiple values are locked and updated in a single transaction. However, reads from the cache are not wrapped in a transaction. Three workers perform the update logic (multiple get, lock and put calls) while two threads perform a traversal that looks up a subset of the values using keys stored in local memory. At around 5K updates per second, the lookup of a particular key sometimes returns null. The key that fails varies, and the failure does not seem to be related to any data pattern in the key. No data grid exception indicating a commit failure or a rollback has been captured.
I added logic to read the value back from the cache immediately after the commit, and that read returned the correct value. When the problem occurred, I traversed the keySet and the values of the grid and found that the key was missing from the keySet while its value could still be found among the values. I can confirm that cache.remove() was never called. I also tried writing the value back into the grid using the key stored in local memory, but subsequent cache.get(key) calls still returned null. A CacheEntryModifiedEvent listener was able to log the existence of the key and value. It seems that the key disappeared some time after high load was applied. The problem also occurs when the cluster runs as a single node.
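For anyone who wants to run the same check, here is a minimal sketch of the traversal described above. It is written against plain java.util.Map (the Infinispan Cache interface extends ConcurrentMap, so a cache instance can be passed in), and the names MissingKeyCheck, Finding and findMissing are hypothetical helpers made up for this illustration, not part of Infinispan. The main method uses a ConcurrentHashMap as a stand-in for the cache.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical diagnostic helper; not part of the Infinispan API.
public class MissingKeyCheck {

    /** One finding per expected key whose get() returned null. */
    public static final class Finding<K> {
        public final K key;
        public final boolean inKeySet; // was the key still listed in keySet()?

        Finding(K key, boolean inKeySet) {
            this.key = key;
            this.inKeySet = inKeySet;
        }

        @Override
        public String toString() {
            return "key=" + key + ", inKeySet=" + inKeySet;
        }
    }

    /**
     * Looks up every locally remembered key and records the ones for which
     * get() returns null, together with whether keySet() still contains them.
     */
    public static <K, V> List<Finding<K>> findMissing(Map<K, V> cache,
                                                      Collection<K> expectedKeys) {
        List<Finding<K>> findings = new ArrayList<Finding<K>>();
        for (K key : expectedKeys) {
            if (cache.get(key) == null) {
                findings.add(new Finding<K>(key, cache.keySet().contains(key)));
            }
        }
        return findings;
    }

    public static void main(String[] args) {
        // Stand-in for the cache; with Infinispan, pass the Cache instance instead.
        Map<String, String> cache = new ConcurrentHashMap<String, String>();
        cache.put("a", "1");
        cache.put("b", "2");

        // "c" is remembered locally but was never stored, so it is reported.
        for (Finding<String> f : findMissing(cache, Arrays.asList("a", "b", "c"))) {
            System.out.println(f);
        }
    }
}
```

Running this from one of the lookup threads at the moment a get returns null shows whether the key is missing from the keySet as well, which is the symptom described above.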
I am currently using 5.3.0-Final, and the grid is configured as follows:
<locking useLockStriping="false" concurrencyLevel="10000" />
<storeAsBinary enabled="true" storeKeysAsBinary="false" storeValuesAsBinary="true" defensive="true" />
<transaction transactionMode="TRANSACTIONAL" lockingMode="OPTIMISTIC" useEagerLocking="true" useSynchronization="true"
syncCommitPhase="true" syncRollbackPhase="false" />
<deadlockDetection enabled="true" spinDuration="10000" />
Does anybody have any idea what could be causing this problem?