17 Replies. Latest reply on May 27, 2011 9:37 AM by galder.zamarreno

Node 3 fails to join the cluster after restart

kapilnayar1 Sep 7, 2010 4:37 PM

Hi,

I have a 3 node cluster (Node 1, 2 and 3).

After the 3rd node is restarted, it fails to rejoin the cluster and repeatedly logs:

[Rehasher-Host1217-11079] [JoinTask] Retrieved old consistent hash address list null

The debug logs from Nodes 1 and 2 and the failing Node 3 are attached.

Has anybody observed similar behaviour with 2 or 3 nodes, or does anyone have a clue about the resolution?

Thanks,

Kapil

1. Re: Node 3 fails to join the cluster after restart

galder.zamarreno Sep 8, 2010 9:50 AM (in response to kapilnayar1)

Maybe rehashing is not completing? You should get a thread dump and see what the rehasher thread is doing. Approximately how big is the in-memory state on the nodes?
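
Editor's note: a thread dump is normally taken from outside the process, for example with the JDK's jstack tool or by sending kill -3 to the JVM on Unix-like systems. Where that is awkward, the standard Thread.getAllStackTraces() call can produce a similar listing from inside the application. The following is only a minimal sketch: the class name RehasherThreadDump, the helper dumpThreads and the demo thread name "Rehasher-Demo" are made up for illustration and are not part of Infinispan.

```java
import java.util.Map;

final class RehasherThreadDump {

    // Returns a jstack-like listing of every live thread whose name starts
    // with the given prefix (hypothetical helper, plain JDK calls only).
    public static String dumpThreads(String namePrefix) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (!thread.getName().startsWith(namePrefix)) {
                continue;
            }
            out.append('"').append(thread.getName()).append("\" state=")
               .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("        at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo thread standing in for a rehasher that is stuck sleeping.
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(5000);
            } catch (InterruptedException ignored) {
                // interrupted at the end of the demo, nothing to clean up
            }
        }, "Rehasher-Demo");
        sleeper.setDaemon(true);
        sleeper.start();

        Thread.sleep(200); // give the demo thread time to reach sleep()
        System.out.print(dumpThreads("Rehasher"));
        sleeper.interrupt();
    }
}
```

In a real node the prefix would be whatever the rehasher thread is actually called in your logs ("Rehasher-" in the log line from the first post).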
2. Re: Node 3 fails to join the cluster after restart

kapilnayar1 Sep 15, 2010 2:39 PM (in response to galder.zamarreno)

Hi Galder,

The in-memory state is about 5 MB (50000 values of around 100 bytes each).

The thread dump for the rehasher and main threads shows:

"Rehasher-Host" daemon prio=2 tid=0x31356400 nid=0x132c waiting on condition [0x3444f000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.infinispan.distribution.JoinTask.retrieveOldCH(JoinTask.java:184)
        at org.infinispan.distribution.JoinTask.performRehash(JoinTask.java:83)
        at org.infinispan.distribution.RehashTask.call(RehashTask.java:52)
        at org.infinispan.distribution.RehashTask.call(RehashTask.java:32)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

"main" prio=6 tid=0x003b7c00 nid=0x1cf4 waiting on condition [0x009fe000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x05cad0d8> (a java.util.concurrent.FutureTask$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
        at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
        at java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at org.infinispan.distribution.DistributionManagerImpl.waitForJoinToComplete(DistributionManagerImpl.java:145)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:170)
        at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:852)
        at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:672)
        at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:574)
        at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:148)
        at org.infinispan.CacheDelegate.start(CacheDelegate.java:288)
        at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:446)
        at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:409)
        at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:386)

Thanks,
Kapil
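
Editor's note: the two stacks above suggest a simple shape. The rehasher thread sleeps inside a retry loop (JoinTask.retrieveOldCH) while the main thread parks in FutureTask.get waiting for that task to finish, so if the retried call keeps returning null, startup never completes. The sketch below models only that shape with plain JDK classes. It is an assumption-laden illustration, not Infinispan's code: JoinWaitSketch, pollUntilNonNull, joinCompletes and the 10 ms sleep are all hypothetical, and a timeout is added on the waiting side so that the hang becomes observable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class JoinWaitSketch {

    // Hypothetical stand-in for the retry loop: ask for the old address list
    // and sleep before asking again for as long as the answer is null.
    static <T> T pollUntilNonNull(Supplier<T> source, long sleepMillis) throws InterruptedException {
        T value = source.get();
        while (value == null) {
            Thread.sleep(sleepMillis);
            value = source.get();
        }
        return value;
    }

    // Runs the polling task on a separate thread and waits for it with a
    // timeout. Returns true if the task finished, false if the wait timed out.
    static boolean joinCompletes(Supplier<List<String>> oldAddressList, long timeoutMillis) {
        ExecutorService rehasher = Executors.newSingleThreadExecutor();
        try {
            Callable<List<String>> joinTask = () -> pollUntilNonNull(oldAddressList, 10);
            Future<List<String>> join = rehasher.submit(joinTask);
            join.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            rehasher.shutdownNow(); // interrupts the poller if it is still sleeping
        }
    }

    public static void main(String[] args) {
        // A source that always answers null models the stuck joiner.
        System.out.println(joinCompletes(() -> null, 200));
        // A source that answers immediately models a healthy join.
        System.out.println(joinCompletes(() -> Arrays.asList("Node1", "Node2"), 2000));
    }
}
```

In the dump above the main thread is in state WAITING (parking), which suggests an untimed wait, so nothing on the joining node breaks the loop by itself.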
3. Re: Node 3 fails to join the cluster after restart

sachinsukhtankar Sep 21, 2010 1:24 PM (in response to galder.zamarreno)

I get this issue too, or rather a more general problem: you can add nodes to a cluster and rehashing works, but once a node in the cluster dies or is shut down, it is not possible to add a new node or restart the same node. This is all on the same machine with different ports.

Are there any workarounds? This is quite a blocker for us.
4. Re: Node 3 fails to join the cluster after restart

vblagojevic Sep 22, 2010 11:38 AM (in response to sachinsukhtankar)

Would you please open a JIRA issue with all these details specified? Does it matter whether node 3 is shut down or crashes before a restart? How about if a completely unrelated node 4 is started, can it join without a problem? Let's get to the bottom of this one.

Cheers,
Vladimir
5. Re: Node 3 fails to join the cluster after restart

sachinsukhtankar Sep 22, 2010 3:54 PM (in response to vblagojevic)

I will create a JIRA issue with all the details. It doesn't matter whether there are 3 or 4 nodes: as soon as a node in the cluster is shut down, it is not possible to add a new node or restart the node that was shut down.
6. Re: Node 3 fails to join the cluster after restart

ntsankov Sep 23, 2010 5:36 AM (in response to kapilnayar1)

We had the same problem, and I can add this: if you shut down node2, for example, while node3 was waiting, node3 woke up and joined node1 in the cluster. Originally posted here: http://community.jboss.org/message/560180#560180

I no longer have the setup, so I can't test it again and capture the stack trace of the rehasher thread, but the situation seems much the same.
7. Re: Node 3 fails to join the cluster after restart

sachinsukhtankar Sep 23, 2010 1:40 PM (in response to ntsankov)

JIRA ticket for this:

https://jira.jboss.org/browse/ISPN-668
8. Re: Node 3 fails to join the cluster after restart

kapilnayar1 Sep 23, 2010 1:50 PM (in response to sachinsukhtankar)

I don't have the test setup to verify, but I suspect this happens when the coordinator node restarts (or is shut down).
9. Re: Node 3 fails to join the cluster after restart

vblagojevic Sep 27, 2010 2:08 PM (in response to kapilnayar1)

Guys, I need as many details as you can provide.

1) What Infinispan release did you use?
2) What did the setup look like? Was it bare Infinispan or with HotRod? If with HotRod, can you reproduce it with a bare setup, say the Infinispan GUI Demo?
3) Your configuration file?
4) Anything specific about your deployment: multiple Infinispan nodes on one physical machine, one per machine, etc.?

Thanks,
Vladimir
10. Re: Node 3 fails to join the cluster after restart

sachinsukhtankar Sep 27, 2010 3:25 PM (in response to kapilnayar1)

Yes! It is due to the coordinator node shutting down.

Here are the details:
Infinispan release: 4.1.0.FINAL
Setup: basic Infinispan with HotRod, different nodes on the same machine
Config file: attached to the JIRA issue
Scenario to reproduce: in a 3-node setup, if the first (coordinator) node goes down, a rehash happens on the other nodes. But if you try to restart the first node, or even start a new node (new port on the same machine), it cannot join the cluster. The error is: Retrieved old consistent hash address list null.

It is easy to reproduce with the Infinispan GUI Demo; I tried with 3 nodes. Start three nodes, add random entries, stop the coordinator node, and then try to add a node.

Hope this helps to reproduce.

Thanks,
11. Re: Node 3 fails to join the cluster after restart

vblagojevic Sep 27, 2010 5:52 PM (in response to sachinsukhtankar)

Aha, so when you say "stop the coordinator node", you actually invoked "Stop Cache" from the "Control Panel" on the coordinator node's GUI. Is that correct?

12. Re: Node 3 fails to join the cluster after restart

ntsankov Sep 28, 2010 4:41 AM (in response to vblagojevic)

Our setup was bare (no HotRod), on 3 different physical machines, with a single Infinispan node per machine, version 4.1.0-CR3.

The node was stopped by killing the Java program running it, using Ctrl-C in the console. As I mentioned on the JIRA issue, the OS was FreeBSD.

The following is the configuration file used. The JGroups configuration file is the provided example for TCP with TCPPING:

<infinispan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="urn:infinispan:config:4.0 http://www.infinispan.org/schemas/infinispan-config-4.1.xsd"
            xmlns="urn:infinispan:config:4.0">

   <global>
      <transport clusterName="infinispan-cluster"
                 distributedSyncTimeout="50000"
                 transportClass="org.infinispan.remoting.transport.jgroups.JGroupsTransport">
         <properties>
            <property name="configurationFile" value="jgroups-tcp-tcpping.xml" />
         </properties>
      </transport>
      <globalJmxStatistics enabled="true"/>
   </global>
   <default>
      <jmxStatistics enabled="true"/>
      <!--<expiration maxIdle="30000"/>-->
      <clustering mode="d">
      </clustering>
   </default>

</infinispan>

13. Re: Node 3 fails to join the cluster after restart

sachinsukhtankar Sep 28, 2010 1:59 PM (in response to vblagojevic)

I just killed the GUI; I did not use Stop Cache. For HotRod I used Ctrl-C.
14. Re: Node 3 fails to join the cluster after restart

kapilnayar1 Sep 29, 2010 8:55 AM (in response to vblagojevic)

I used it in embedded mode and observed similar behavior when the node was killed using Ctrl-C.
