3 Replies Latest reply on Oct 18, 2017 3:26 PM by mposolda

    Cross-site setup: Does BackupSender needs to send backup with the acquired lock?




      I have two Infinispan 8.2.8.Final servers, set up to talk to each other through RELAY2 and backup caches as described in the cross-site documentation (client-server mode): https://access.redhat.com/documentation/en-US/Red_Hat_JBoss_Data_Grid/7.0/html/Administration_and_Configuration_Guide/sect-Configure_Cross-Datacenter_Replication.html#Configure_Cross-Datacentre_Replication


      I have SYNC backup configured on my Infinispan caches. I have a simple Java application that connects to the Infinispan server through Hot Rod (RemoteCache). I am seeing a deadlock when both sites concurrently attempt to write a record to the same key "123".


      I am attaching files with thread dumps from both servers. I can also attach the simple app if needed.


      If I am reading the thread dumps correctly, this is what happened:


      - Site1 transaction1:  cache.put("123", val)

      - Site1 transaction1: lockManager.lock("123", ...) called from AbstractLockingInterceptor. Acquired "site1-lock".

      - Site1 transaction1: BackupSender.backupWrite called for "123" to Site2


      Concurrently with it, I have on site2:


      - Site2 transaction2: cache.put("123", val);

      - Site2 transaction2: lockManager.lock("123", ...) called from AbstractLockingInterceptor. Acquired "site2-lock".

      - Site2 transaction2: BackupSender.backupWrite called for "123" to Site1



      - In the meantime, Site2 received the backup from Site1 (triggered by Site1 transaction1). But BaseBackupReceiver on Site2 has to wait for Site2 transaction2 to release the site2-lock, so it cannot continue. Site1 transaction1 is in turn waiting for the response from that BaseBackupReceiver, so it cannot continue either.

      - In the meantime, Site1 received the backup from Site2 (triggered by Site2 transaction2). But BaseBackupReceiver on Site1 has to wait for Site1 transaction1 to release the site1-lock, so it cannot continue. Site2 transaction2 is in turn waiting for the response from that BaseBackupReceiver, so it cannot continue either.


      So we have a nice deadlock here.
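The lock cycle described above can be reproduced with a toy model that uses only JDK locks. This is a minimal sketch, not Infinispan code: the class and method names are hypothetical, one `ReentrantLock` per site stands in for the lock on key "123", and a bounded `tryLock` stands in for the remote BaseBackupReceiver waiting for the key lock. Both "transactions" keep their local lock until both backup attempts have finished, so the outcome is deterministic.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Toy model only: plain JDK locks stand in for the per-key lock on each site.
class CrossSiteDeadlockSketch {

    /** Returns true if the simulated SYNC backup obtained the remote key lock in time. */
    static boolean syncPut(ReentrantLock localLock, ReentrantLock remoteLock,
                           CountDownLatch bothLocalLocksHeld,
                           CountDownLatch bothBackupsAttempted,
                           long timeoutMs) throws InterruptedException {
        localLock.lock(); // the local write takes the key lock first
        try {
            // Wait until the other site also holds its own lock (the concurrent case).
            bothLocalLocksHeld.countDown();
            bothLocalLocksHeld.await();

            // SYNC backup: the remote side needs its key lock while we still hold ours.
            boolean backedUp = remoteLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (backedUp) {
                remoteLock.unlock();
            }

            // Keep holding the local lock until both attempts are over,
            // so neither side can sneak in after the other one gives up.
            bothBackupsAttempted.countDown();
            bothBackupsAttempted.await();
            return backedUp;
        } finally {
            localLock.unlock();
        }
    }

    /** Runs one put per site at the same time; element i tells whether site i+1's backup succeeded. */
    static boolean[] runConcurrentPuts(long timeoutMs) throws Exception {
        ReentrantLock site1Lock = new ReentrantLock();
        ReentrantLock site2Lock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(2);
        CountDownLatch attempted = new CountDownLatch(2);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Boolean> tx1 = pool.submit(() -> syncPut(site1Lock, site2Lock, held, attempted, timeoutMs));
            Future<Boolean> tx2 = pool.submit(() -> syncPut(site2Lock, site1Lock, held, attempted, timeoutMs));
            return new boolean[] { tx1.get(), tx2.get() };
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        boolean[] result = runConcurrentPuts(200);
        System.out.println("site1 backup ok: " + result[0] + ", site2 backup ok: " + result[1]);
    }
}
```

In this model neither backup can get the remote lock, because each side holds the lock the other one is waiting for. Without the timeout on `tryLock`, both threads would block forever.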


      A few questions:

      - Does BackupSender really need to hold the lock for "123" at the time it sends the backup to the remote site? I understand why the lock is needed during the write to the "local" cluster cache. But BackupSender doesn't need to run within the lock IMO. Or am I misunderstanding something? Isn't this a bug?


      - I am aware of the workaround of using ASYNC backup, but I don't want to do that for now, because I need remoteCache.get on site2 to see the record immediately after I call remoteCache.put on site1. That doesn't work with ASYNC backup.


      - Is there some other workaround besides ASYNC backup? I am seeing this behaviour for both transactional and non-transactional caches.


      Is more info needed? I can attach the full clustered.xml files and the simple app if needed.



        • 1. Re: Cross-site setup: Does BackupSender needs to send backup with the acquired lock?

          The lock needs to be held during the x-site backup, because otherwise two concurrent writes on the same site could get reordered on the other side. With async, you just acquire a sequence id in JGroups (I think) within the lock and then you are safe.

          Theoretically it would be possible to send the message and wait for response *after* releasing the lock, but that would require complex changes. And more importantly, you'd end up with different values on each site; the value coming from the other side.


          Your use case is broken by design, sorry. If you want to get the same order of writes on both sides, one side has to do the write first.


          There's one workaround that could be acceptable, though: we have a flag, ZERO_LOCK_ACQUISITION_TIMEOUT, which fails the operation if it cannot acquire the lock immediately. You'd see spurious failures, but you'd not get the deadlock and each side would keep "its" value. If you're not afraid of going past the API (risking incompatibility over versions etc.) you could override o.i.xsite.BackupReceiver.handleRemoteCommand to add this flag only on the remote side; that way you'd avoid failing due to local concurrent access. And you could set the BackupFailurePolicy so that you ignore these failures.


          No guarantees.
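The effect of failing fast on the receiving side can be illustrated with a small stand-alone model. This is a sketch under assumptions and does not touch Infinispan internals: `Site`, `applyBackup` and `concurrentPuts` are hypothetical names, a `Semaphore` with one permit stands in for the key lock, and `tryAcquire()` with no wait plays the role of ZERO_LOCK_ACQUISITION_TIMEOUT on the remote side only.

```java
import java.util.concurrent.Semaphore;

// Toy model only: each Site holds one value and one key lock.
class FailFastBackupSketch {

    static final class Site {
        final Semaphore keyLock = new Semaphore(1);
        String value;

        /** Remote side: give up at once if the key is locked instead of waiting. */
        boolean applyBackup(String newValue) {
            if (!keyLock.tryAcquire()) {
                return false; // analogous to a zero lock acquisition timeout
            }
            try {
                value = newValue;
                return true;
            } finally {
                keyLock.release();
            }
        }
    }

    /**
     * Simulates two overlapping puts: both sites lock the key locally,
     * write their own value, then try to back it up to the other site.
     * Element 0 is the result of site1 -> site2, element 1 of site2 -> site1.
     */
    static boolean[] concurrentPuts(Site site1, Site site2) {
        site1.keyLock.acquireUninterruptibly(); // local lock taken by the put on site1
        site2.keyLock.acquireUninterruptibly(); // local lock taken by the put on site2
        try {
            site1.value = "from-site1";
            site2.value = "from-site2";
            boolean backupToSite2 = site2.applyBackup(site1.value);
            boolean backupToSite1 = site1.applyBackup(site2.value);
            return new boolean[] { backupToSite2, backupToSite1 };
        } finally {
            site2.keyLock.release();
            site1.keyLock.release();
        }
    }

    public static void main(String[] args) {
        Site site1 = new Site();
        Site site2 = new Site();
        boolean[] result = concurrentPuts(site1, site2);
        System.out.println("backup to site2 ok: " + result[0] + ", backup to site1 ok: " + result[1]);
        System.out.println("site1 = " + site1.value + ", site2 = " + site2.value);
    }
}
```

In the model, both backups are rejected immediately, nobody blocks, and each site keeps its own value, which matches the trade-off described above: no deadlock, but the sites can diverge unless the failure is handled.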

          • 2. Re: Cross-site setup: Does BackupSender needs to send backup with the acquired lock?

            Thanks Radim! I have a working prototype based on your suggestion. On the BackupReceiver side I have ZERO_LOCK_ACQUISITION_TIMEOUT, but I have the BackupFailurePolicy set to FAIL. So BackupSender is notified that the backup failed and throws an exception, which I can catch on the application side and retry the cache operation. This gives me both consistency and no deadlocks.


            The only downside is that I need to suppress the ERROR logging done by InvocationContextInterceptor (even if I catch the CacheException on the application side). It would be nice to have a flag similar to FAIL_SILENTLY that propagates the CacheException to the application but skips the logging of ERROR messages by Infinispan. FAIL_SILENTLY skips the logging, but doesn't propagate the exception to the application, which doesn't work for me. But that's another issue.


            Any chance that Infinispan could provide better OOTB support for the case when you have 2 sites connected just through the SYNC backup cache and both are trying to concurrently update the same key? Should I create a JIRA for this? TBH the case when all the parties are locked and waiting for each other (a de facto deadlock) looks to me like a bug, which Infinispan should handle better. Or is it just me?
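The catch-and-retry step on the application side, as described in this reply, might look roughly like the following. It is a generic sketch rather than the actual prototype: `withRetry` is a hypothetical helper, and it assumes the failed backup surfaces in the client as an unchecked exception. A small random jitter is added to the backoff so that two sites retrying the same key are less likely to collide again in lockstep.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Generic retry helper; nothing here is specific to Infinispan.
class RetryOnBackupFailureSketch {

    /** Runs the operation, retrying on unchecked exceptions up to maxAttempts times. */
    static <T> T withRetry(Supplier<T> operation, int maxAttempts, long backoffMs) {
        if (maxAttempts < 1 || backoffMs < 0) {
            throw new IllegalArgumentException("maxAttempts must be >= 1 and backoffMs >= 0");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt == maxAttempts) {
                    break; // no point in sleeping after the final attempt
                }
                // Linear backoff plus jitter to desynchronize the two sites.
                long jitter = ThreadLocalRandom.current().nextLong(backoffMs + 1);
                try {
                    Thread.sleep(backoffMs * attempt + jitter);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw lastFailure;
    }
}
```

With a Hot Rod client, the call site would be along the lines of `withRetry(() -> remoteCache.put("123", val), 5, 20)`; the attempt count and backoff here are illustrative values, not recommendations.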

            • 3. Re: Cross-site setup: Does BackupSender needs to send backup with the acquired lock?

              FYI: I've created a JIRA in the JDG project to support this case better: [JDG-1318] Deadlock in the cross-site setup with SYNC backups - JBoss Issue Tracker