5 Replies Latest reply on Oct 21, 2011 12:31 AM by clebert.suconic

    NettyMultiThreadRandomReattachTest hang issue in HornetQ test suite

    gaohoward

      A quick look at Francisco's stack shows the test is waiting for the test thread to finish (thread.join())

       

      "main" prio=10 tid=0x089f8c00 nid=0x1962 in Object.wait() [0xb6dbc000]

         java.lang.Thread.State: WAITING (on object monitor)

          at java.lang.Object.wait(Native Method)

          - waiting on <0x64647fd8> (a org.hornetq.tests.integration.cluster.reattach.MultiThreadReattachSupport$1Runner)

          at java.lang.Thread.join(Thread.java:1186)

          - locked <0x64647fd8> (a org.hornetq.tests.integration.cluster.reattach.MultiThreadReattachSupport$1Runner)

          at java.lang.Thread.join(Thread.java:1239)

          at org.hornetq.tests.integration.cluster.reattach.MultiThreadReattachSupport.runMultipleThreadsFailoverTest(MultiThreadReattachSupport.java:163)

          at org.hornetq.tests.integration.cluster.reattach.MultiThreadRandomReattachTestBase.runTestMultipleThreads(MultiThreadRandomReattachTestBase.java:1291)

          at org.hornetq.tests.integration.cluster.reattach.MultiThreadRandomReattachTestBase.runTestMultipleThreads(MultiThreadRandomReattachTestBase.java:1282)

          at org.hornetq.tests.integration.cluster.reattach.MultiThreadRandomReattachTestBase.testA(MultiThreadRandomReattachTestBase.java:76)

          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

          at java.lang.reflect.Method.invoke(Method.java:597)

          at junit.framework.TestCase.runTest(TestCase.java:164)

          at junit.framework.TestCase.runBare(TestCase.java:130)

          at junit.framework.TestResult$1.protect(TestResult.java:106)

          at junit.framework.TestResult.runProtected(TestResult.java:124)

          at junit.framework.TestResult.run(TestResult.java:109)

          at junit.framework.TestCase.run(TestCase.java:120)

          at junit.framework.TestSuite.runTest(TestSuite.java:230)

          at junit.framework.TestSuite.run(TestSuite.java:225)

          at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420)

          at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911)

          at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768)

       

      The code is here:

       

      ...

       

      for (Runner thread : threads)

      {

         thread.join(); //line 163 in MultiThreadReattachSupport

      ...

       

      each thread is executing the method org.hornetq.tests.cluster.reattach.MultiThreadRandomReattachTestBase.doTestA(ClientSessionFactory, int)

       

      So I think some thread get stuck in this method somewhere. But so far I couldn't tell which thread in the stack is the one.

       

      (attached Francisco's stack)

        • 1. Re: NettyMultiThreadRandomReattachTest hang issue in HornetQ test suite
          gaohoward

          and I'd suggest re-run the test with some additional logs in doTestA(ClientSessionFactory, int) and trace where it gets stuck.

          • 2. Re: NettyMultiThreadRandomReattachTest hang issue in HornetQ test suite
            borges

            Howard, there is more than one thread being blocked.

             

            A resume of what I see in that stack-trace is:

             

            Thread 0xb3d88000 is holding locks that block two threads:

             

            1. tid=0x08ae8800, (lock=0x6464bb08)
            2. tid=0xb3d95400, (lock=0x64808f78)

             

            Thread  tid=0xb3d8cc00 (the likely involved in the dead-locking). Is blockED at NettyConnection.write(NettyConnection.java:182) (which uses an AtomicBoolean as lock mechanism, so it doesn't appear as blocked in the stack-trace). Now this thread is blocking:

            1. tid=0x08b4a800 (lock=0x6480c1a0)

             

            The question is who is blocking 0xb3d8cc00 through the AtomicBoolean? The stack won't say for sure, but there are two threads holding such a lock at NettyConnection:

            1. tid=0xb3d88000
            2. tid=0x08b49000

             

            Perhaps tid=0xb3d88000 is being held back while trying to write to the socket? (and holding everyone else back as well?).

             

            [...]

             

            From what I read, the exact behavior of yield is platform dependent. Since this hasn't yet been reproduced in 64b Sun-JVMs (but it happens immediatelly in 32b Sun-JVMs) it leads me to suspect it is the yield in NettyConnection that (somehow) causes the hang. It is not the call Thread.yield(), using Thread.sleep(0) (different but specified behavior) doesn't change anything. The test still hangs.

            • 3. Re: NettyMultiThreadRandomReattachTest hang issue in HornetQ test suite
              clebert.suconic

              From what I read, the exact behavior of yield is platform dependent. Since this hasn't yet been reproduced in 64b Sun-JVMs (but it happens immediately in 32b Sun-JVMs) it leads me to suspect it is the yield in NettyConnection that (somehow) causes the hang.

               

              Wouldn't this be easy to verify on the thread dump? One thread trying to yield while another is at a position where the lock is supposed to be released?

               

               

              But it's hard to determine where this lock is coming from... :-/

               

               

              I still suspect of a plain dead lock. Perhaps we could substitute (for test purposes) the writeLock by a regular synchronized object and see how it behaves, but that may create a heinzenbug

              • 4. Re: NettyMultiThreadRandomReattachTest hang issue in HornetQ test suite
                borges

                FWIW, I replaced the AtomicBoolean trick with an actual Lock (ReentrantLock, to be specific). The test doens't hang anymore.

                 

                I understand that the AtomicBoolean choice was made with performance in mind and after several benchmarks.

                 

                Are these tests documented somehow (checked in the source tree)? As we just saw, different JVMs are giving different results (32b Sun JVM hangs, while 64b Sun JVM (on faster machines) don't).

                 

                As far as I can tell, this is a case of performance choices hurting reliability. Without reproducible performance benchmarks I can't find a justification for not using a regular Lock there. Heck, even with reproducible benchmarks, having a fast program that may hang is IMO unacceptable.

                 

                I committed the change from AtomicBoolean to ReentrantLock (svn r11569). I think we should proceed to verify if that solves it with the IBM JDK as well, and if there is a performance impact or not.

                • 5. Re: NettyMultiThreadRandomReattachTest hang issue in HornetQ test suite
                  clebert.suconic

                  Can't we just use a Semaphore here, what will be the oldes form of Mutex I know? We don't reentrant control here, so I guess this would be more suitable.

                   

                   

                  Any objections?