
    A deadlock found in replication

    gaohoward

      A deadlock was just found in my last test run (which hangs because of it). Here is a piece of the stack trace showing it:

       

      Found one Java-level deadlock:

      =============================

      "Thread-1693":

        waiting for ownable synchronizer 0x00000007e55b47a0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),

        which is held by "Thread-28 (HornetQ-server-HornetQServerImpl::ReplicatedPagedFailoverTest/liveServer-623058272)"

      "Thread-28 (HornetQ-server-HornetQServerImpl::ReplicatedPagedFailoverTest/liveServer-623058272)":

        waiting for ownable synchronizer 0x00000007e4cd3868, (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),

        which is held by "Thread-1693"

      Java stack information for the threads listed above:

      ===================================================

      "Thread-1693":

          at sun.misc.Unsafe.park(Native Method)

          - parking to wait for  <0x00000007e55b47a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)

          at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)

          at org.hornetq.core.paging.impl.PagingStoreImpl.openNewPage(PagingStoreImpl.java:1080)

          at org.hornetq.core.paging.impl.PagingStoreImpl.forceAnotherPage(PagingStoreImpl.java:615)

          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.getPageInformationForSync(JournalStorageManager.java:528)

          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:422)

          at org.hornetq.core.server.impl.HornetQServerImpl$6.run(HornetQServerImpl.java:2694)

          at java.lang.Thread.run(Thread.java:662)

      "Thread-28 (HornetQ-server-HornetQServerImpl::ReplicatedPagedFailoverTest/liveServer-623058272)":

          at sun.misc.Unsafe.park(Native Method)

          - parking to wait for  <0x00000007e4cd3868> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)

          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:941)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1261)

          at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594)

          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.readLock(JournalStorageManager.java:882)

          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.deleteIncrementRecord(JournalStorageManager.java:2085)

          at org.hornetq.core.paging.cursor.impl.PageSubscriptionCounterImpl.delete(PageSubscriptionCounterImpl.java:185)

          - locked <0x00000007e55b5138> (a org.hornetq.core.paging.cursor.impl.PageSubscriptionCounterImpl)

          at org.hornetq.core.paging.cursor.impl.PageSubscriptionImpl.onPageModeCleared(PageSubscriptionImpl.java:255)

          at org.hornetq.core.paging.cursor.impl.PageCursorProviderImpl.onPageModeCleared(PageCursorProviderImpl.java:357)

          at org.hornetq.core.paging.impl.PagingStoreImpl.stopPaging(PagingStoreImpl.java:507)

          at org.hornetq.core.paging.cursor.impl.PageCursorProviderImpl.cleanup(PageCursorProviderImpl.java:460)

          - locked <0x00000007e55b48a0> (a org.hornetq.core.paging.cursor.impl.PageCursorProviderImpl)

          at org.hornetq.core.paging.cursor.impl.PageCursorProviderImpl$1.run(PageCursorProviderImpl.java:331)

          at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:98)

          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

          at java.lang.Thread.run(Thread.java:662)

      Found 1 deadlock.

      Heap

      PSYoungGen      total 576640K, used 497305K [0x00000007d6c00000, 0x00000007ffba0000, 0x0000000800000000)

        eden space 543744K, 85% used [0x00000007d6c00000,0x00000007f31925f8,0x00000007f7f00000)

        from space 32896K, 99% used [0x00000007f7f00000,0x00000007f9f13f20,0x00000007f9f20000)

        to   space 63296K, 0% used [0x00000007fbdd0000,0x00000007fbdd0000,0x00000007ffba0000)

      PSOldGen        total 220096K, used 72461K [0x0000000784400000, 0x0000000791af0000, 0x00000007d6c00000)

        object space 220096K, 32% used [0x0000000784400000,0x0000000788ac3520,0x0000000791af0000)

      PSPermGen       total 25152K, used 25075K [0x000000077f200000, 0x0000000780a90000, 0x0000000784400000)

        object space 25152K, 99% used [0x000000077f200000,0x0000000780a7ced0,0x0000000780a90000)

       

      Any ideas?

        • 1. Re: A deadlock found in replication
          gaohoward

          Ideally we should have a way to rearrange the two threads to avoid the deadlock. But as far as my understanding of the code goes, there isn't an easy way to do it. Instead, until someone comes up with a better idea, I have a general way to solve it. Here it is.

           

          As the deadlock happens while one thread is doing server replication, which is a server-wide task, and the other thread is just doing page cleanup work, which is assumed to be a regular task, we can create a map for holding the various kinds of critical tasks like replication. The thread that is performing a critical task grabs the write lock for it, and all other threads that could deadlock with the critical thread grab the read lock.

           

          So we can have a class like:

           

           

          import java.util.concurrent.locks.ReadWriteLock;
          import java.util.concurrent.locks.ReentrantReadWriteLock;
          
          public class CriticalEvent
          {
             // Identifier for the replication critical task; more kinds can be added later.
             public static final Byte START_REPLICATION = 0x01;
             
             private final ReadWriteLock lock = new ReentrantReadWriteLock();
             
             // A "regular" task only needs the shared (read) half of the lock, so any
             // number of regular tasks can run concurrently with each other.
             public static <T> T performRegular(CriticalEvent event, ServerTask<T> task) throws Exception
             {
                event.lock.readLock().lock();
                try
                {
                   return task.run();
                }
                finally
                {
                   event.lock.readLock().unlock();
                }
             }
             
             // A critical task (e.g. replication) takes the exclusive (write) half, so it
             // never overlaps any regular task registered against the same event.
             public static <T> T performCritical(CriticalEvent event, ServerTask<T> task) throws Exception
             {
                event.lock.writeLock().lock();
                try
                {
                   return task.run();
                }
                finally
                {
                   event.lock.writeLock().unlock();
                }
             }
             
             public static interface ServerTask<T>
             {
                public T run() throws Exception;
             }
          }
          

           

          Then we put a map in the HornetQServerImpl class:

           

           

             private final Map<Byte, CriticalEvent> criticalEvents = new HashMap<Byte, CriticalEvent>();
          

           

          Change the thread code doing replication to:

           

           

          CriticalEvent event = criticalEvents.get(CriticalEvent.START_REPLICATION);
          CriticalEvent.performCritical(event, new CriticalEvent.ServerTask<Object>()
             {
                public Object run() throws Exception
                {
                    storageManager.startReplication(replicationManager, pagingManager, getNodeID().toString(),
                                isFailBackRequest && configuration.isAllowAutoFailBack());
                    return null;
                }
             });
          

           

          and the other to:

           

           

          CriticalEvent e = storageManager.getServer().getCriticalEvent(CriticalEvent.START_REPLICATION);
          CriticalEvent.performRegular(e, new ServerTask<Object>()
             {
                public Object run() throws Exception
                {
                   cleanup();
                   return null;
                }
             });
          

           

          That will make the two threads mutually exclusive.
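           
          For this to work, the map would need to be populated at server start and exposed through an accessor like the getCriticalEvent() used above; neither exists today, they are part of this proposal. A minimal sketch, assuming the map lives in HornetQServerImpl:
           
          // Part of this proposal only, not existing HornetQ API: register one
          // CriticalEvent per critical task type (called from the server constructor
          // or start()), and hand it out to the code paths that need it.
          private void registerCriticalEvents()
          {
             criticalEvents.put(CriticalEvent.START_REPLICATION, new CriticalEvent());
          }
           
          public CriticalEvent getCriticalEvent(Byte eventType)
          {
             return criticalEvents.get(eventType);
          }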

           

          Comments?

          • 2. Re: A deadlock found in replication
            gaohoward

            Created a JIRA to track this:

             

            https://issues.jboss.org/browse/HORNETQ-1102

            • 3. Re: A deadlock found in replication
              borges

              Hi,

               

              While I still do not have a solution for it, I honestly find your proposal too complex; it creates a whole new infrastructure with interfaces and all.

               

              I would rather spend some time trying to find a simpler solution, which I reckon we may fail to do.

               

              [...]

               

              The core of the problem is that we have different code paths taking two locks in different orders. I think the correct way to approach this is to ensure that the locks are always taken in the right order.
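               
              To illustrate the point, here is a minimal sketch (made-up lock names, not HornetQ code): both paths must take the pair of locks in the same global order, otherwise each thread can end up holding one lock while waiting for the other, exactly as in the dump above.
               
              import java.util.concurrent.locks.ReentrantReadWriteLock;
               
              // Minimal sketch with made-up names. Both paths follow the same global
              // order (storageLock first, then pageLock), which removes the deadlock cycle.
              public class LockOrderingSketch
              {
                 private final ReentrantReadWriteLock storageLock = new ReentrantReadWriteLock();
                 private final ReentrantReadWriteLock pageLock = new ReentrantReadWriteLock();
               
                 public void replicationPath()
                 {
                    storageLock.writeLock().lock();   // 1st: storage
                    try
                    {
                       pageLock.writeLock().lock();   // 2nd: paging
                       try
                       {
                          // ... synchronize page files ...
                       }
                       finally
                       {
                          pageLock.writeLock().unlock();
                       }
                    }
                    finally
                    {
                       storageLock.writeLock().unlock();
                    }
                 }
               
                 public void cleanupPath()
                 {
                    // The deadlocking version took the page-level lock before the storage
                    // lock; keeping the same order as replicationPath() avoids the cycle.
                    storageLock.readLock().lock();    // 1st: storage
                    try
                    {
                       pageLock.writeLock().lock();   // 2nd: paging
                       try
                       {
                          // ... delete page counter records, leave page mode ...
                       }
                       finally
                       {
                          pageLock.writeLock().unlock();
                       }
                    }
                    finally
                    {
                       storageLock.readLock().unlock();
                    }
                 }
              }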

               

              For what it's worth, this is (somewhat) the same problem that already gave us trouble and was addressed in JournalStorageManager.addToPage(...) (line ~4220).

              • 4. Re: A deadlock found in replication
                gaohoward

                I agree that it is a complex solution and there should be a better one than that. I'll come back to this when I have a better understanding of the code.

                 

                Thanks

                • 5. Re: A deadlock found in replication
                  clebert.suconic

                  There are three locks: one from the StorageManager, one from the ReplicationManager, and one from the PagingStore.

                   

                   

                  The PagingStore one has to stay, as it's used to protect depaging and paging... But the ones from StorageManager / ReplicationManager could be a single lock called IOLock, and the same ReadWriteLock could be passed as a parameter, maybe?

                   

                   

                  On the PagingStore, during a write it should always take the IOLock as read-only; that would prevent the deadlock.
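                   
                  Something along these lines (just a sketch of the idea; the class and method names here are made up, not the actual API):
                   
                  import java.util.concurrent.locks.ReadWriteLock;
                  import java.util.concurrent.locks.ReentrantReadWriteLock;
                   
                  // Sketch of the IOLock idea above; names are made up for illustration.
                  public class IOLockSketch
                  {
                     private final ReadWriteLock ioLock = new ReentrantReadWriteLock();
                   
                     // Replication holds the exclusive half for the whole synchronization.
                     public void startReplication()
                     {
                        ioLock.writeLock().lock();
                        try
                        {
                           // ... synchronize journals and page files ...
                        }
                        finally
                        {
                           ioLock.writeLock().unlock();
                        }
                     }
                   
                     // A page write only needs the shared half, so normal paging keeps
                     // running concurrently with other writers but never overlaps replication.
                     public void writePage()
                     {
                        ioLock.readLock().lock();
                        try
                        {
                           // ... append to the current page / open a new page ...
                        }
                        finally
                        {
                           ioLock.readLock().unlock();
                        }
                     }
                  }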

                   

                   

                   

                  I would need to do some refactoring for that, if Francisco or Howard are not doing it before me... otherwise I will see if I can find some extra strength to do it this year still.

                  • 6. Re: A deadlock found in replication
                    gaohoward

                    Hi Clebert, if you have time to take this one I think that's best. I don't quite understand this part of the code, so I think the safest thing to do is to pass it on to you.

                    • 7. Re: A deadlock found in replication
                      clebert.suconic

                      This is a deadlock between startSynchronization and the cleanup that gets out of page mode,

                       

                      and there's one thing I don't understand...

                       

                       

                      StartSynchronization is holding these locks:

                       

                      storageManager.lock();

                           bindings.lock();

                           messageJournal.lock();

                               pagingManager.lock();

                                     trying to get a pageStore.lock();

                       

                       

                      While the cleanup:

                           pagingManager.lock();

                               ...   that means... the pagingManager lock is already taken here.

                       

                       

                       

                       

                      So, I think the issue here is stopping / starting the servers too fast in the testsuite. We should either guarantee that these components are final and not removed, or make sure that the server is really stopped before starting it again.

                       

                      Unless I am missing something there?

                       

                      @Howard: do you remember the test that deadlocked?

                      • 8. Re: A deadlock found in replication
                        gaohoward

                        I didn't write down the test at the moment of the hang, but I think it's ReplicatedPagedFailoverTest, as it's in the thread name in the stack.

                         

                        I attached the full thread dump fyi.

                         

                        Howard

                        • 9. Re: A deadlock found in replication
                          borges

                          Regardless of start/stop calls to the server, there are two locks being taken in different orders by two methods. While the likelihood of a deadlock in production is low, the race is pretty straightforward.

                           

                          I just pushed a commit (currently only at https://github.com/FranciscoBorges/hornetq) that fixes this. It is really just a matter of ensuring that these related locks are always taken in the right order.

                           

                          The (actual) problem posed by this case is reviewing /all/ calls from the paging system to the JournalStorageManager and making sure we do not have other instances of _locks taken in the wrong order_.

                          • 10. Re: A deadlock found in replication
                            clebert.suconic

                            The stop is resetting the pageManager to null. A further start will create a new pageManager, and it's possible the system will have different locks. I analyzed the thread dump and there is no way the system would get into that state with a single lock. Both cleanup and startReplication are holding the PageManager.lock(), which would be impossible unless one of the threads has a different instance of the lock.
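                             
                            In other words, something like this (illustration only, with made-up names), where the mutual exclusion silently disappears once the field is replaced:
                             
                            // Illustration only, made-up names: once stop() drops the old manager and
                            // start() creates a new one, a task still running against the old instance
                            // synchronizes on a different lock than a task started afterwards, so the
                            // two no longer exclude each other.
                            public class RestartHazardSketch
                            {
                               private volatile PagingManagerStub pagingManager;
                             
                               public synchronized void start()
                               {
                                  pagingManager = new PagingManagerStub();   // new instance => new lock
                               }
                             
                               public synchronized void stop()
                               {
                                  pagingManager = null;                      // old tasks may still hold the old lock
                               }
                             
                               static class PagingManagerStub
                               {
                                  private final Object lock = new Object();
                             
                                  void guardedWork(Runnable work)
                                  {
                                     synchronized (lock)
                                     {
                                        work.run();
                                     }
                                  }
                               }
                            }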

                             

                             

                            So, maybe you found another issue beyond that point (which is good, actually), but there's certainly an issue with setting these variables to null. We should review whether we still want to do that (not the pageManager, at least).

                            • 11. Re: A deadlock found in replication
                              borges

                              FWIW, this deadlock and another much like it were fixed in this commit: https://github.com/hornetq/hornetq/commit/c6e4cae5e008028e4cbc488eb43943fa41f3168e