
    A deadlock found in replication

    gaohoward

      A deadlock was just found in my last test run (which hangs because of it). Here is a piece of the stack trace showing it:

       

      Found one Java-level deadlock:

      =============================

      "Thread-1693":

        waiting for ownable synchronizer 0x00000007e55b47a0, (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),

        which is held by "Thread-28 (HornetQ-server-HornetQServerImpl::ReplicatedPagedFailoverTest/liveServer-623058272)"

      "Thread-28 (HornetQ-server-HornetQServerImpl::ReplicatedPagedFailoverTest/liveServer-623058272)":

        waiting for ownable synchronizer 0x00000007e4cd3868, (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync),

        which is held by "Thread-1693"

      Java stack information for the threads listed above:

      ===================================================

      "Thread-1693":

          at sun.misc.Unsafe.park(Native Method)

          - parking to wait for  <0x00000007e55b47a0> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)

          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)

          at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)

          at org.hornetq.core.paging.impl.PagingStoreImpl.openNewPage(PagingStoreImpl.java:1080)

          at org.hornetq.core.paging.impl.PagingStoreImpl.forceAnotherPage(PagingStoreImpl.java:615)

          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.getPageInformationForSync(JournalStorageManager.java:528)

          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:422)

          at org.hornetq.core.server.impl.HornetQServerImpl$6.run(HornetQServerImpl.java:2694)

          at java.lang.Thread.run(Thread.java:662)

      "Thread-28 (HornetQ-server-HornetQServerImpl::ReplicatedPagedFailoverTest/liveServer-623058272)":

          at sun.misc.Unsafe.park(Native Method)

          - parking to wait for  <0x00000007e4cd3868> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)

          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:941)

          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1261)

          at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594)

          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.readLock(JournalStorageManager.java:882)

          at org.hornetq.core.persistence.impl.journal.JournalStorageManager.deleteIncrementRecord(JournalStorageManager.java:2085)

          at org.hornetq.core.paging.cursor.impl.PageSubscriptionCounterImpl.delete(PageSubscriptionCounterImpl.java:185)

          - locked <0x00000007e55b5138> (a org.hornetq.core.paging.cursor.impl.PageSubscriptionCounterImpl)

          at org.hornetq.core.paging.cursor.impl.PageSubscriptionImpl.onPageModeCleared(PageSubscriptionImpl.java:255)

          at org.hornetq.core.paging.cursor.impl.PageCursorProviderImpl.onPageModeCleared(PageCursorProviderImpl.java:357)

          at org.hornetq.core.paging.impl.PagingStoreImpl.stopPaging(PagingStoreImpl.java:507)

          at org.hornetq.core.paging.cursor.impl.PageCursorProviderImpl.cleanup(PageCursorProviderImpl.java:460)

          - locked <0x00000007e55b48a0> (a org.hornetq.core.paging.cursor.impl.PageCursorProviderImpl)

          at org.hornetq.core.paging.cursor.impl.PageCursorProviderImpl$1.run(PageCursorProviderImpl.java:331)

          at org.hornetq.utils.OrderedExecutorFactory$OrderedExecutor$1.run(OrderedExecutorFactory.java:98)

          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

          at java.lang.Thread.run(Thread.java:662)

      Found 1 deadlock.

      Heap

      PSYoungGen      total 576640K, used 497305K [0x00000007d6c00000, 0x00000007ffba0000, 0x0000000800000000)

        eden space 543744K, 85% used [0x00000007d6c00000,0x00000007f31925f8,0x00000007f7f00000)

        from space 32896K, 99% used [0x00000007f7f00000,0x00000007f9f13f20,0x00000007f9f20000)

        to   space 63296K, 0% used [0x00000007fbdd0000,0x00000007fbdd0000,0x00000007ffba0000)

      PSOldGen        total 220096K, used 72461K [0x0000000784400000, 0x0000000791af0000, 0x00000007d6c00000)

        object space 220096K, 32% used [0x0000000784400000,0x0000000788ac3520,0x0000000791af0000)

      PSPermGen       total 25152K, used 25075K [0x000000077f200000, 0x0000000780a90000, 0x0000000784400000)

        object space 25152K, 99% used [0x000000077f200000,0x0000000780a7ced0,0x0000000780a90000)

       

      Any ideas?

        • 1. Re: A deadlock found in replication
          gaohoward

          Ideally we should have a way to rearrange the two threads to avoid the deadlock. But as far as my understanding of the code goes, there isn't an easy way to do it. Instead, until someone comes up with a better idea, I have a general way to solve it. Here it is.

           

          As the deadlock happens while one thread is doing server replication, which is a server-wide task, and the other thread is just doing page cleanup work, which is assumed to be a regular task, we can create a map for holding the various kinds of critical tasks like replication. The thread that is performing a critical task grabs the write lock for it, and all other threads that could deadlock with the critical thread grab the read lock.

           

          So we can have a class like:

           

           

          import java.util.concurrent.locks.ReadWriteLock;
          import java.util.concurrent.locks.ReentrantReadWriteLock;
          
          public class CriticalEvent
          {
             // Identifier for the replication critical task; more kinds can be added later.
             public static final Byte START_REPLICATION = 0x01;
             
             private final ReadWriteLock lock = new ReentrantReadWriteLock();
             
             // A "regular" task only needs the shared (read) half of the lock, so any
             // number of regular tasks can run concurrently with each other.
             public static <T> T performRegular(CriticalEvent event, ServerTask<T> task) throws Exception
             {
                event.lock.readLock().lock();
                try
                {
                   return task.run();
                }
                finally
                {
                   event.lock.readLock().unlock();
                }
             }
             
             // A critical task (e.g. replication) takes the exclusive (write) half, so it
             // never overlaps any regular task registered against the same event.
             public static <T> T performCritical(CriticalEvent event, ServerTask<T> task) throws Exception
             {
                event.lock.writeLock().lock();
                try
                {
                   return task.run();
                }
                finally
                {
                   event.lock.writeLock().unlock();
                }
             }
             
             public static interface ServerTask<T>
             {
                public T run() throws Exception;
             }
          }
          

           

          Then we put a map in the HornetQServerImpl class:

           

           

             private final Map<Byte, CriticalEvent> criticalEvents = new HashMap<Byte, CriticalEvent>();
          

           

          Change the thread code doing replication to:

           

           

          CriticalEvent event = criticalEvents.get(CriticalEvent.START_REPLICATION);
          CriticalEvent.performCritical(event, new CriticalEvent.ServerTask<Object>()
             {
                public Object run() throws Exception
                {
                    storageManager.startReplication(replicationManager, pagingManager, getNodeID().toString(),
                                isFailBackRequest && configuration.isAllowAutoFailBack());
                    return null;
                }
             });
          

           

          and the other to:

           

           

          CriticalEvent e = storageManager.getServer().getCriticalEvent(CriticalEvent.START_REPLICATION);
          CriticalEvent.performRegular(e, new ServerTask<Object>()
             {
                public Object run() throws Exception
                {
                   cleanup();
                   return null;
                }
             });
          

           

          That will make the two threads mutually exclusive.
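           
          For this to work, the map would need to be populated at server start and exposed through an accessor like the getCriticalEvent() used above; neither exists today, they are part of this proposal. A minimal sketch, assuming the map lives in HornetQServerImpl:
           
          // Part of this proposal only, not existing HornetQ API: register one
          // CriticalEvent per critical task type (called from the server constructor
          // or start()), and hand it out to the code paths that need it.
          private void registerCriticalEvents()
          {
             criticalEvents.put(CriticalEvent.START_REPLICATION, new CriticalEvent());
          }
           
          public CriticalEvent getCriticalEvent(Byte eventType)
          {
             return criticalEvents.get(eventType);
          }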

           

          Comments?

          • 2. Re: A deadlock found in replication
            gaohoward

            Created a JIRA to track this:

             

            https://issues.jboss.org/browse/HORNETQ-1102

            • 3. Re: A deadlock found in replication
              borges

              Hi,

               

              While I still do not have a solution for it, I honestly find your proposal too complex; it creates a whole new infrastructure with interfaces and all.

               

              I would rather spend some time trying to find a simpler solution, which I reckon we may fail to do.

               

              [...]

               

              The core of the problem is that we have different code paths taking two locks in different orders. I think the correct way to approach this is to ensure that the locks are always taken in the right order.
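               
              To illustrate the point, here is a minimal sketch (made-up lock names, not HornetQ code): both paths must take the pair of locks in the same global order, otherwise each thread can end up holding one lock while waiting for the other, exactly as in the dump above.
               
              import java.util.concurrent.locks.ReentrantReadWriteLock;
               
              // Minimal sketch with made-up names. Both paths follow the same global
              // order (storageLock first, then pageLock), which removes the deadlock cycle.
              public class LockOrderingSketch
              {
                 private final ReentrantReadWriteLock storageLock = new ReentrantReadWriteLock();
                 private final ReentrantReadWriteLock pageLock = new ReentrantReadWriteLock();
               
                 public void replicationPath()
                 {
                    storageLock.writeLock().lock();   // 1st: storage
                    try
                    {
                       pageLock.writeLock().lock();   // 2nd: paging
                       try
                       {
                          // ... synchronize page files ...
                       }
                       finally
                       {
                          pageLock.writeLock().unlock();
                       }
                    }
                    finally
                    {
                       storageLock.writeLock().unlock();
                    }
                 }
               
                 public void cleanupPath()
                 {
                    // The deadlocking version took the page-level lock before the storage
                    // lock; keeping the same order as replicationPath() avoids the cycle.
                    storageLock.readLock().lock();    // 1st: storage
                    try
                    {
                       pageLock.writeLock().lock();   // 2nd: paging
                       try
                       {
                          // ... delete page counter records, leave page mode ...
                       }
                       finally
                       {
                          pageLock.writeLock().unlock();
                       }
                    }
                    finally
                    {
                       storageLock.readLock().unlock();
                    }
                 }
              }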

               

              For what it's worth, this is (somewhat) the same problem that already gave us trouble and was addressed in JournalStorageManager.addToPage(...) (line ~4220).

              • 4. Re: A deadlock found in replication
                gaohoward

                I agree that it is a complex solution and there should be a better one than that. I'll come back to this when I have a better understanding of the code.

                 

                Thanks

                • 5. Re: A deadlock found in replication
                  clebert.suconic

                  There are three locks: one from the StorageManager, one from the ReplicationManager, and one from the PagingStore.

                   

                   

                  The PagingStore one has to stay, as it's used to protect depaging and paging... But the ones from StorageManager / ReplicationManager could be a single lock called IOLock, and the same ReadWriteLock could be passed as a parameter, maybe?

                   

                   

                  On the PagingStore, during a write it should always take the IOLock as read-only; that would prevent the deadlock.
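                   
                  Something along these lines (just a sketch of the idea; the class and method names here are made up, not the actual API):
                   
                  import java.util.concurrent.locks.ReadWriteLock;
                  import java.util.concurrent.locks.ReentrantReadWriteLock;
                   
                  // Sketch of the IOLock idea above; names are made up for illustration.
                  public class IOLockSketch
                  {
                     private final ReadWriteLock ioLock = new ReentrantReadWriteLock();
                   
                     // Replication holds the exclusive half for the whole synchronization.
                     public void startReplication()
                     {
                        ioLock.writeLock().lock();
                        try
                        {
                           // ... synchronize journals and page files ...
                        }
                        finally
                        {
                           ioLock.writeLock().unlock();
                        }
                     }
                   
                     // A page write only needs the shared half, so normal paging keeps
                     // running concurrently with other writers but never overlaps replication.
                     public void writePage()
                     {
                        ioLock.readLock().lock();
                        try
                        {
                           // ... append to the current page / open a new page ...
                        }
                        finally
                        {
                           ioLock.readLock().unlock();
                        }
                     }
                  }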

                   

                   

                   

                  I would need to do some refactoring for that, if Francisco or Howard are not doing it before me... otherwise I will see if I can find some extra strength to do it this year still.

                  • 6. Re: A deadlock found in replication
                    gaohoward

                    Hi Clebert, if you have time to take this one I think that's best. I don't quite understand this part of the code, so I think the safest thing to do is to pass it on to you.

                    • 7. Re: A deadlock found in replication
                      clebert.suconic

                      This is a deadlock between startSynchronization and the cleanup that gets out of page mode,

                       

                      and there's one thing I don't understand...

                       

                       

                      StartSynchronization is holding these locks:

                       

                      storageManager.lock();

                           bindings.lock();

                           messageJournal.lock();

                               pagingManager.lock();

                                     trying to get a pageStore.lock();

                       

                       

                      While the cleanup:

                           pagingManager.lock();

                               ...   that means... the pagingManager lock is already taken here.

                       

                       

                       

                       

                      So, I think the issue here is stopping / starting the servers too fast in the testsuite. We should either guarantee that these components are final and not removed, or make sure that the server is really stopped before starting it again.

                       

                      Unless I am missing something there?

                       

                      @Howard: do you remember the test that deadlocked?

                      • 8. Re: A deadlock found in replication
                        gaohoward

                        I didn't write down the test at the moment of the hang, but I think it's ReplicatedPagedFailoverTest, as it's in the thread name in the stack.

                         

                        I attached the full thread dump fyi.

                         

                        Howard

                        • 9. Re: A deadlock found in replication
                          borges

                          Regardless of start/stop calls to the server, there are two locks being taken in different orders by two methods. While the likelihood of a deadlock in production is low, the race is pretty straightforward.

                           

                          I just pushed a commit (currently only at https://github.com/FranciscoBorges/hornetq) that fixes this. It is really just a matter of ensuring that these related locks are always taken in the right order.

                           

                          The (actual) problem posed by this case is reviewing /all/ calls from the paging system to the JournalStorageManager and making sure we do not have other instances of _locks taken in the wrong order_.

                          • 10. Re: A deadlock found in replication
                            clebert.suconic

                            The stop is resetting the pageManager to null. A further start will create a new pageManager, and it's possible the system will have different locks. I analyzed the thread dump and there is no way the system would get into that state with a single lock. Both cleanup and startReplication are holding the PageManager.lock(), which would be impossible unless one of the threads has a different instance of the lock.
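                             
                            In other words, something like this (illustration only, with made-up names), where the mutual exclusion silently disappears once the field is replaced:
                             
                            // Illustration only, made-up names: once stop() drops the old manager and
                            // start() creates a new one, a task still running against the old instance
                            // synchronizes on a different lock than a task started afterwards, so the
                            // two no longer exclude each other.
                            public class RestartHazardSketch
                            {
                               private volatile PagingManagerStub pagingManager;
                             
                               public synchronized void start()
                               {
                                  pagingManager = new PagingManagerStub();   // new instance => new lock
                               }
                             
                               public synchronized void stop()
                               {
                                  pagingManager = null;                      // old tasks may still hold the old lock
                               }
                             
                               static class PagingManagerStub
                               {
                                  private final Object lock = new Object();
                             
                                  void guardedWork(Runnable work)
                                  {
                                     synchronized (lock)
                                     {
                                        work.run();
                                     }
                                  }
                               }
                            }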

                             

                             

                            So, maybe you found another issue beyond that point (which is good, actually), but there's certainly an issue with setting these variables to null. We should review whether we still want to do that (not the pageManager, at least).

                            • 11. Re: A deadlock found in replication
                              borges

                              FWIW, this deadlock and another much like it were fixed in this commit: https://github.com/hornetq/hornetq/commit/c6e4cae5e008028e4cbc488eb43943fa41f3168e