6 Replies Latest reply on Mar 11, 2011 5:09 AM by galder.zamarreno

    New node can not join a running cluster

    changgeng

      Hi,

       

      I'm testing our system with 4.2.1.CR3, and the biggest problem issue I have now is that when there are some load(40 testing threads), the newly node can not join the cluster.

       

      When I dump the stack trace on the newly added node and analyzed the log file, I found the org.infinispan.distribution.JoinTask.retrieveOldConsistentHash method is timed out for the first time. Although the newly added node would call that method repeatedly, the coodinator would return null since it think a rehash is still in progress.

       

      I also spent some time analyzing why the first retrieveOldConsistentHash call would timed out, and I found that the coordinator did generate a valid response, and the newly added node received that response. During unmarshalling the response message, the OOB thread hangs somewhere, so I can find the thread is running with jstack.

       

       

      "OOB-32,ana-cluster,perf-srm2-50105" prio=10 tid=0x0000000058603000 nid=0x2063 in Object.wait() [0x000000004a29f000]

         java.lang.Thread.State: RUNNABLE

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadClassDescriptor(RiverUnmarshaller.java:905)

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadNewObject(RiverUnmarshaller.java:1183)

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:270)

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:207)

              at org.jboss.marshalling.AbstractUnmarshaller.readObject(AbstractUnmarshaller.java:85)

              at org.infinispan.remoting.transport.jgroups.JGroupsAddress$Externalizer.readObject(JGroupsAddress.java:70)

              at org.infinispan.marshall.jboss.ConstantObjectTable$ExternalizerAdapter.readObject(ConstantObjectTable.java:319)

              at org.infinispan.marshall.jboss.ConstantObjectTable.readObject(ConstantObjectTable.java:306)

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:357)

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:207)

              at org.jboss.marshalling.AbstractUnmarshaller.readObject(AbstractUnmarshaller.java:85)

              at org.infinispan.marshall.exts.LinkedListExternalizer.readObject(LinkedListExternalizer.java:55)

              at org.infinispan.marshall.jboss.ConstantObjectTable$ExternalizerAdapter.readObject(ConstantObjectTable.java:319)

              at org.infinispan.marshall.jboss.ConstantObjectTable.readObject(ConstantObjectTable.java:306)

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:357)

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:207)

              at org.jboss.marshalling.AbstractUnmarshaller.readObject(AbstractUnmarshaller.java:85)

              at org.infinispan.remoting.responses.SuccessfulResponse$Externalizer.readObject(SuccessfulResponse.java:63)

              at org.infinispan.marshall.jboss.ConstantObjectTable$ExternalizerAdapter.readObject(ConstantObjectTable.java:319)

              at org.infinispan.marshall.jboss.ConstantObjectTable.readObject(ConstantObjectTable.java:306)

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:357)

              at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:207)

              at org.jboss.marshalling.AbstractUnmarshaller.readObject(AbstractUnmarshaller.java:85)

              at org.infinispan.marshall.jboss.GenericJBossMarshaller.objectFromObjectStream(GenericJBossMarshaller.java:172)

              at org.infinispan.marshall.VersionAwareMarshaller.objectFromByteBuffer(VersionAwareMarshaller.java:115)

              at org.infinispan.remoting.transport.jgroups.MarshallerAdapter.objectFromByteBuffer(MarshallerAdapter.java:26)

              at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:500)

              at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:365)

              at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)

              at org.jgroups.JChannel.up(JChannel.java:1465)

              at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:954)

              at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:430)

              at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:265)

              at org.jgroups.protocols.FRAG2.up(FRAG2.java:190)

              at org.jgroups.protocols.FC.up(FC.java:503)

              at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)

              at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)

              at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:577)

              at org.jgroups.protocols.UNICAST.up(UNICAST.java:295)

              at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)

              at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)

              at org.jgroups.protocols.FD.up(FD.java:266)

              at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:269)

              at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)

              at org.jgroups.protocols.Discovery.up(Discovery.java:292)

              at org.jgroups.protocols.PING.up(PING.java:67)

              at org.jgroups.protocols.MPING.up(MPING.java:175)

              at org.jgroups.protocols.TP.passMessageUp(TP.java:1093)

              at org.jgroups.protocols.TP.access$100(TP.java:56)

              at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1633)

              at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1615)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:619)

       

         Locked ownable synchronizers:

              - <0x00002aaaee25ca38> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

       

      And the pstack shows following about this thread:

      Thread 221 (Thread 0x4a2a2940 (LWP 8291)):

      #0  0x000000308f40ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

      #1  0x00002ab1008b7807 in os::PlatformEvent::park() () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so

      #2  0x00002ab10097bf2a in ObjectMonitor::wait(long, bool, Thread*) () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so

      #3  0x00002ab100979330 in ObjectSynchronizer::waitUninterruptibly(Handle, long, Thread*) () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so

      #4  0x00002ab10067638b in instanceKlass::initialize_impl(instanceKlassHandle, Thread*) () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so

      #5  0x00002ab1006759ca in instanceKlass::initialize(Thread*) () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so

      #6  0x00002ab10069c49c in InterpreterRuntime::_new(JavaThread*, constantPoolOopDesc*, int) () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so

      #7  0x00002aaaae96a224 in ?? ()

      #8  0x00002aaad5146758 in ?? ()

      #9  0x00002aaaae96a1e8 in ?? ()

      #10 0x000000004a29fe40 in ?? ()

      #11 0x00002aabb53b44e5 in ?? ()

      #12 0x000000004a29ff18 in ?? ()

      #13 0x00002aabb53ca878 in ?? ()

      #14 0x0000000000000000 in ?? ()

       

      It seems the problem is related with jboss marshalling and the JVM class initialization process.

       

      This issue doesn't exist if there's no load on the existing cluster. I also attached some logentries from the coordinator and the newly added node.

        • 1. New node can not join a running cluster
          changgeng

          This should be a bug, but I'm not sure it's problem of jboss marshalling, infinispan, or jre itself. The appearance of this issue is a bit similar with the one discussed at http://forums.oracle.com/forums/thread.jspa?threadID=1158865&tstart=195 .

           

          If I manually create an instance of SimpleClassDescriptor before starting the cache manager and the caches on the newly added node, the retrieveOldConsistentHash method will always succeed. This may be hard to create a unit test to reproduce since it's related with class loading.

          • 2. New node can not join a running cluster
            galder.zamarreno

            Changgeng Li, that stacktrace shows a thread in Runnable state, so if the same stacktrace always appears, maybe there's a looping issue somewhere? That particular thread is not blocked at all, the thread state is RUNNABLE which means that it's running.

             

            "OOB-40,ana-cluster,perf-srm3-21615" prio=10 tid=0x00002aabc837d800 nid=0x6ec3 in Object.wait() [0x0000000049fc9000]

               java.lang.Thread.State: RUNNABLE

             

            It'll be interesting if you could share a full thread dump in the node that's supposed to be reading the response. In fact, I'd suggest taking several of them, 1 minute appart or so. Also can you get thread dumps from the node sending data to this node, just in case we're missing something here.

            • 3. New node can not join a running cluster
              dmlloyd

              Are you certain you're not running out of memory?  If the GC is spinning a lot it might show up like this.  There is no locking of this sort taking place by JBMAR that I'm aware of; this is a very strange issue.  RiverUnmarshaller certainly is never invoking wait().  I can only assume that the JVM is eliding stack trace entries (it does that sometimes, especially when lazy-linking classes); it might be possible to get a better view of what's going on threadwise via jconsole or a debugger though.

              • 4. New node can not join a running cluster
                changgeng

                Let me reproduce the issue on my Monday. Although the jstack shows that thread is in RUNNABLE state, it also shows it's 'in Object.wait()', and this is consistent with the output of pstack.

                • 5. New node can not join a running cluster
                  changgeng

                  We have a very big heap for JVM and it's not very likely to run out of memory. I will check whether there's problem with perm generation when I reproduce this issue.

                  • 6. New node can not join a running cluster
                    galder.zamarreno

                    Well, precisely because you have a big heap, you might be encountering long GC pauses, enable GC logging to verify it.