I have 2 Infinispan 8.2.8.Final servers, and they are set up to talk to each other through RELAY2 and backup caches as described in the cross-site documentation - Client-server mode: https://access.redhat.com/documentation/en-US/Red_Hat_JBoss_Data_Grid/7.0/html/Administration_and_Configuration_Guide/sect-Configure_Cross-Datacenter_Replication.html#Configure_Cross-Datacentre_Replication
I have SYNC backup configured on my Infinispan caches. I have a simple Java application which connects to the Infinispan server through Hot Rod (RemoteCache). I am seeing a deadlock when both sites concurrently attempt to write a record to the same key "123".
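For reference, the backup part of my cache configuration looks roughly like this (cache and site names here are placeholders, not my exact clustered.xml):

```xml
<!-- clustered.xml on Site1; Site2 mirrors this with <backup site="SITE1" ...> -->
<distributed-cache name="myCache">
    <backups>
        <!-- SYNC strategy: the originating site waits for the remote site's ack -->
        <backup site="SITE2" strategy="SYNC"/>
    </backups>
</distributed-cache>
```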
I am attaching the files with thread dumps from both servers. I can also attach simple app if needed.
If I am analyzing the thread dumps correctly, what happened is:
- Site1 transaction1: cache.put("123", val)
- Site1 transaction1: lockManager.lock("123", ...) called from AbstractLockingInterceptor. Acquired "site1-lock".
- Site1 transaction1: BackupSender.backupWrite called for "123" to Site2
Concurrently with it, I have on site2:
- Site2 transaction2: cache.put("123", val);
- Site2 transaction2: lockManager.lock("123", ...) called from AbstractLockingInterceptor. Acquired "site2-lock".
- Site2 transaction2: BackupSender.backupWrite called for "123" to Site1
- In the meantime, Site2 received the backup from Site1 (triggered by Site1 transaction1). But BaseBackupReceiver on Site2 has to wait for the site2-lock held by Site2 transaction2, so it cannot continue. And Site1 transaction1 is waiting for the response from that BaseBackupReceiver, so it cannot continue either.
- In the meantime, Site1 received the backup from Site2 (triggered by Site2 transaction2). But BaseBackupReceiver on Site1 has to wait for the site1-lock held by Site1 transaction1, so it cannot continue. And Site2 transaction2 is waiting for the response from that BaseBackupReceiver, so it cannot continue either.
So we have a nice deadlock here.
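The cycle above can be sketched in plain Java, independent of Infinispan: each ReentrantLock below stands in for the per-key lock on one site, and the tryLock with a timeout stands in for the backup RPC that, in the real deadlock, blocks forever (class and variable names are mine, just for illustration):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class CrossSiteDeadlockSketch {
    // Stand-ins for the lock on key "123" held on each site.
    static final ReentrantLock site1Lock = new ReentrantLock();
    static final ReentrantLock site2Lock = new ReentrantLock();
    static final AtomicInteger timedOutBackups = new AtomicInteger();

    // One "transaction": lock the local key, then try to reach the remote site,
    // whose key lock is already held by the other transaction.
    static Thread writer(ReentrantLock local, ReentrantLock remote, CountDownLatch bothLocked) {
        return new Thread(() -> {
            local.lock();               // AbstractLockingInterceptor acquires the local key lock
            try {
                bothLocked.countDown();
                bothLocked.await();     // wait until BOTH sites hold their local lock
                // BackupSender.backupWrite: the backup needs the remote site's lock,
                // which the concurrent transaction there already holds -> cycle.
                if (remote.tryLock(500, TimeUnit.MILLISECONDS)) {
                    remote.unlock();    // never happens here; both attempts time out
                } else {
                    timedOutBackups.incrementAndGet(); // a plain lock() would block forever
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                local.unlock();
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch bothLocked = new CountDownLatch(2);
        Thread t1 = writer(site1Lock, site2Lock, bothLocked); // Site1 transaction1
        Thread t2 = writer(site2Lock, site1Lock, bothLocked); // Site2 transaction2
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println("timedOutBackups=" + timedOutBackups.get()); // prints timedOutBackups=2
    }
}
```

Both simulated backups time out because each one needs the lock the other transaction holds, which is exactly the wait cycle I see in the thread dumps.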
- Does BackupSender really need to hold the lock for "123" at the time when it sends the backup to the remote site? I understand why the lock is needed during the write to the "local" cluster cache, but IMO BackupSender doesn't need to run inside the lock. Or am I misunderstanding something? Isn't this a bug?
- I am aware of the workaround of using ASYNC backup, but I don't want to do that for now, because I need that when I call remoteCache.put on Site1, an immediate remoteCache.get on Site2 already sees the record. That doesn't work with ASYNC backup.
- Is there some other workaround besides ASYNC backup? I am seeing this behaviour for both transactional and non-transactional caches.
Is more info needed? I can attach the full clustered.xml files and the simple app if needed.