New node can not join a running cluster
changgeng Feb 23, 2011 4:51 AMHi,
I'm testing our system with 4.2.1.CR3, and the biggest problem issue I have now is that when there are some load(40 testing threads), the newly node can not join the cluster.
When I dump the stack trace on the newly added node and analyzed the log file, I found the org.infinispan.distribution.JoinTask.retrieveOldConsistentHash method is timed out for the first time. Although the newly added node would call that method repeatedly, the coodinator would return null since it think a rehash is still in progress.
I also spent some time analyzing why the first retrieveOldConsistentHash call would timed out, and I found that the coordinator did generate a valid response, and the newly added node received that response. During unmarshalling the response message, the OOB thread hangs somewhere, so I can find the thread is running with jstack.
"OOB-32,ana-cluster,perf-srm2-50105" prio=10 tid=0x0000000058603000 nid=0x2063 in Object.wait() [0x000000004a29f000]
java.lang.Thread.State: RUNNABLE
at org.jboss.marshalling.river.RiverUnmarshaller.doReadClassDescriptor(RiverUnmarshaller.java:905)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadNewObject(RiverUnmarshaller.java:1183)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:270)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:207)
at org.jboss.marshalling.AbstractUnmarshaller.readObject(AbstractUnmarshaller.java:85)
at org.infinispan.remoting.transport.jgroups.JGroupsAddress$Externalizer.readObject(JGroupsAddress.java:70)
at org.infinispan.marshall.jboss.ConstantObjectTable$ExternalizerAdapter.readObject(ConstantObjectTable.java:319)
at org.infinispan.marshall.jboss.ConstantObjectTable.readObject(ConstantObjectTable.java:306)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:357)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:207)
at org.jboss.marshalling.AbstractUnmarshaller.readObject(AbstractUnmarshaller.java:85)
at org.infinispan.marshall.exts.LinkedListExternalizer.readObject(LinkedListExternalizer.java:55)
at org.infinispan.marshall.jboss.ConstantObjectTable$ExternalizerAdapter.readObject(ConstantObjectTable.java:319)
at org.infinispan.marshall.jboss.ConstantObjectTable.readObject(ConstantObjectTable.java:306)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:357)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:207)
at org.jboss.marshalling.AbstractUnmarshaller.readObject(AbstractUnmarshaller.java:85)
at org.infinispan.remoting.responses.SuccessfulResponse$Externalizer.readObject(SuccessfulResponse.java:63)
at org.infinispan.marshall.jboss.ConstantObjectTable$ExternalizerAdapter.readObject(ConstantObjectTable.java:319)
at org.infinispan.marshall.jboss.ConstantObjectTable.readObject(ConstantObjectTable.java:306)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:357)
at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:207)
at org.jboss.marshalling.AbstractUnmarshaller.readObject(AbstractUnmarshaller.java:85)
at org.infinispan.marshall.jboss.GenericJBossMarshaller.objectFromObjectStream(GenericJBossMarshaller.java:172)
at org.infinispan.marshall.VersionAwareMarshaller.objectFromByteBuffer(VersionAwareMarshaller.java:115)
at org.infinispan.remoting.transport.jgroups.MarshallerAdapter.objectFromByteBuffer(MarshallerAdapter.java:26)
at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:500)
at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:365)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:771)
at org.jgroups.JChannel.up(JChannel.java:1465)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:954)
at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:430)
at org.jgroups.protocols.pbcast.STREAMING_STATE_TRANSFER.up(STREAMING_STATE_TRANSFER.java:265)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:190)
at org.jgroups.protocols.FC.up(FC.java:503)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:577)
at org.jgroups.protocols.UNICAST.up(UNICAST.java:295)
at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)
at org.jgroups.protocols.FD.up(FD.java:266)
at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:269)
at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
at org.jgroups.protocols.Discovery.up(Discovery.java:292)
at org.jgroups.protocols.PING.up(PING.java:67)
at org.jgroups.protocols.MPING.up(MPING.java:175)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1093)
at org.jgroups.protocols.TP.access$100(TP.java:56)
at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1633)
at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1615)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Locked ownable synchronizers:
- <0x00002aaaee25ca38> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
And the pstack shows following about this thread:
Thread 221 (Thread 0x4a2a2940 (LWP 8291)):
#0 0x000000308f40ab99 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00002ab1008b7807 in os::PlatformEvent::park() () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#2 0x00002ab10097bf2a in ObjectMonitor::wait(long, bool, Thread*) () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#3 0x00002ab100979330 in ObjectSynchronizer::waitUninterruptibly(Handle, long, Thread*) () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#4 0x00002ab10067638b in instanceKlass::initialize_impl(instanceKlassHandle, Thread*) () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#5 0x00002ab1006759ca in instanceKlass::initialize(Thread*) () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#6 0x00002ab10069c49c in InterpreterRuntime::_new(JavaThread*, constantPoolOopDesc*, int) () from /usr/lib64/jvm/jdk1.6.0_21/jre/lib/amd64/server/libjvm.so
#7 0x00002aaaae96a224 in ?? ()
#8 0x00002aaad5146758 in ?? ()
#9 0x00002aaaae96a1e8 in ?? ()
#10 0x000000004a29fe40 in ?? ()
#11 0x00002aabb53b44e5 in ?? ()
#12 0x000000004a29ff18 in ?? ()
#13 0x00002aabb53ca878 in ?? ()
#14 0x0000000000000000 in ?? ()
It seems the problem is related with jboss marshalling and the JVM class initialization process.
This issue doesn't exist if there's no load on the existing cluster. I also attached some logentries from the coordinator and the newly added node.