17 Replies Latest reply on May 27, 2011 9:37 AM by galder.zamarreno

    Node 3 fails to join the cluster after restart

    kapilnayar1

      Hi,

       

      I have a 3 node cluster (Node 1, 2 and 3).

       

      After the 3rd node is restarted, it fails to rejoin the cluster and repeatedly logs:

      [Rehasher-Host1217-11079] [JoinTask] Retrieved old consistent hash address list null

       

      The debug logs from Node 1, Node 2, and the failing Node 3 are attached.

       

      Has anybody observed similar behaviour with 2 or 3 nodes, or does anyone have a clue about the resolution?

       

      Thanks,

      Kapil

        • 1. Re: Node 3 fails to join the cluster after restart
          galder.zamarreno

          Maybe rehashing is not completing? You should get a thread dump and see what the rehasher thread is doing. Approximately how big is the in-memory state on the nodes?
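For readers following along: a thread dump of a running JVM can usually be taken from outside with the JDK's `jstack <pid>` tool (or `kill -3 <pid>` on Unix-like systems). If it is easier to log the stacks from inside the application, the sketch below uses only the standard `Thread.getAllStackTraces()` call to pick out threads by name. The class name `RehasherDump` and the `"Rehasher"` prefix are illustrative assumptions based on the thread name in the log line quoted above, not part of Infinispan.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RehasherDump {

    // Collects the stack traces of all live threads in this JVM whose name
    // starts with the given prefix, keyed by thread name.
    static Map<String, StackTraceElement[]> stacksOf(String namePrefix) {
        Map<String, StackTraceElement[]> result = new LinkedHashMap<String, StackTraceElement[]>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            if (e.getKey().getName().startsWith(namePrefix)) {
                result.put(e.getKey().getName(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "Rehasher" is an assumed prefix, taken from the log line in the first post.
        for (Map.Entry<String, StackTraceElement[]> e : stacksOf("Rehasher").entrySet()) {
            System.out.println("\"" + e.getKey() + "\"");
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("        at " + frame);
            }
        }
    }
}
```

This only sees threads in the JVM it runs in, so it would have to be called from within the node that is stuck joining.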

          • 2. Re: Node 3 fails to join the cluster after restart
            kapilnayar1

            Hi Galder,

             

            The in-memory state is about 5MB (50000 values of around 100 bytes each).

             

            The thread dumps for the rehasher and main threads show:

             

            "Rehasher-Host" daemon prio=2 tid=0x31356400 nid=0x132c waiting on condition [0x3444f000]
               java.lang.Thread.State: TIMED_WAITING (sleeping)
                    at java.lang.Thread.sleep(Native Method)
                    at org.infinispan.distribution.JoinTask.retrieveOldCH(JoinTask.java:184)
                    at org.infinispan.distribution.JoinTask.performRehash(JoinTask.java:83)
                    at org.infinispan.distribution.RehashTask.call(RehashTask.java:52)
                    at org.infinispan.distribution.RehashTask.call(RehashTask.java:32)
                    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
                    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                    at java.lang.Thread.run(Thread.java:619)

             

            "main" prio=6 tid=0x003b7c00 nid=0x1cf4 waiting on condition [0x009fe000]
               java.lang.Thread.State: WAITING (parking)
                    at sun.misc.Unsafe.park(Native Method)
                    - parking to wait for  <0x05cad0d8> (a java.util.concurrent.FutureTask$Sync)
                    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
                    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
                    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
                    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
                    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
                    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
                    at org.infinispan.distribution.DistributionManagerImpl.waitForJoinToComplete(DistributionManagerImpl.java:145)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                    at java.lang.reflect.Method.invoke(Method.java:597)
                    at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:170)
                    at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:852)
                    at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:672)
                    at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:574)
                    at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:148)
                    at org.infinispan.CacheDelegate.start(CacheDelegate.java:288)
                    at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:446)
                    at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:409)
                    at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:386)

             

            Thanks,

            Kapil
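The two stacks above show a common shape: the rehasher thread sleeps inside a retry loop (`JoinTask.retrieveOldCH`), while `main` is parked in `FutureTask.get` waiting for that task to finish (`DistributionManagerImpl.waitForJoinToComplete`). The following is a minimal sketch of that pattern using only `java.util.concurrent`. It is not Infinispan's code; `JoinWaitSketch`, `oldAddressList`, and both method names are hypothetical stand-ins. It illustrates why the caller blocks for as long as the polled value stays null.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

public class JoinWaitSketch {

    // Hypothetical stand-in for the "old consistent hash address list".
    // While it is null, the polling task below never returns.
    static final AtomicReference<Object> oldAddressList = new AtomicReference<Object>();

    // Retry loop: sleep and re-check until the value becomes available.
    static Object pollUntilAvailable(long sleepMillis) throws InterruptedException {
        Object value;
        while ((value = oldAddressList.get()) == null) {
            Thread.sleep(sleepMillis);
        }
        return value;
    }

    // Runs the polling task on a worker thread and waits for it, but with a
    // timeout, so the caller can report a stuck join instead of parking forever.
    static boolean joinCompletesWithin(long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<Object> task = new Callable<Object>() {
            public Object call() throws InterruptedException {
                return pollUntilAvailable(50);
            }
        };
        Future<Object> join = executor.submit(task);
        try {
            join.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // the task is still looping: the join is stuck
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            return false;
        } finally {
            join.cancel(true);      // interrupts the sleeping worker
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("join completed: " + joinCompletesWithin(300));
    }
}
```

In the dump above, `main` calls `FutureTask.get` without a timeout, which is consistent with the node appearing to hang during `getCache()` for as long as the rehasher keeps retrying.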

            • 3. Re: Node 3 fails to join the cluster after restart
              sachinsukhtankar

              I get this issue too, or rather a more general problem: you can add nodes to a cluster and rehashing works, but once a node in the cluster dies or is shut down, it is not possible to add a new node or restart the same node. This is all on the same machine, using different ports.

               

              Are there any workarounds? This is quite a blocker for us.

              • 4. Re: Node 3 fails to join the cluster after restart
                vblagojevic

                Would you please open a JIRA issue with all these details specified? Does it matter whether node 3 is shut down or crashes before the restart? What if a completely unrelated node 4 is started: can it join without a problem? Let's get to the bottom of this one.

                 

                Cheers,

                Vladimir

                • 5. Re: Node 3 fails to join the cluster after restart
                  sachinsukhtankar

                  I will create a JIRA issue with all the details. It doesn't matter whether there are 3 or 4 nodes: as soon as a node in the cluster is shut down, it is not possible to add a new node or restart the node that was shut down.

                  • 6. Re: Node 3 fails to join the cluster after restart
                    ntsankov

                    We had the same problem, and I can add this: if, while node3 was waiting, you shut down node2 for example, node3 woke up and joined node1 in the cluster. Originally posted here: http://community.jboss.org/message/560180#560180

                     

                    I no longer have the setup, so I can't test it again and capture the stack trace of the rehasher thread, but the situation seems much the same.

                    • 7. Re: Node 3 fails to join the cluster after restart
                      sachinsukhtankar
                      • 8. Re: Node 3 fails to join the cluster after restart
                        kapilnayar1

                        I don't have the test setup to verify, but I suspect this happens when (if) the coordinator node restarts (or is shut down).

                        • 9. Re: Node 3 fails to join the cluster after restart
                          vblagojevic

                          Guys, I need as many details as you can provide.

                           

                          1) What Infinispan release did you use?

                          2) What did the setup look like? Was it bare Infinispan or with HotRod? If with HotRod, can you reproduce it with a bare setup, say the Infinispan GUI Demo?

                          3) Your configuration file?

                          4) Anything specific about your deployment? Multiple Infinispan nodes on one physical machine, one per machine, etc.?

                           

                          Thanks,

                          Vladimir

                          • 10. Re: Node 3 fails to join the cluster after restart
                            sachinsukhtankar

                            Yes!! It is due to the coordinator node shutting down.

                             

                            Here are the details -

                            Infinispan release - 4.1.0.FINAL

                            Setup - Basic Infinispan with HotRod, different nodes on the same machine

                            Config file - attached to jira issue

                            Scenario to reproduce - In a 3-node setup, if the first (coordinator) node goes down, a rehash happens on the other nodes. But if you try to restart the first node, or even start a new node (new port on the same machine), it cannot join the cluster. The error - Retrieved old consistent hash address list null.

                             

                            It is easy to reproduce with the Infinispan GUI Demo; I tried with 3 nodes. Start three nodes, add random entries, stop the coordinator node, and then try to add a node.

                             

                            Hope this helps to reproduce.

                             

                            Thanks,

                            • 11. Re: Node 3 fails to join the cluster after restart
                              vblagojevic

                              Aha, so when you say "stop the coordinator node", you actually invoked "Stop Cache" from the "Control Panel" on the coordinator node's GUI. Is that correct?

                              • 12. Re: Node 3 fails to join the cluster after restart
                                ntsankov

                                Our setup was bare (no HotRod), on 3 different physical machines, with a single Infinispan node per machine, version 4.1.0-CR3.

                                The node was stopped by killing the Java program running it, using Ctrl-C in the console. As I mentioned on the JIRA issue, the OS was FreeBSD.

                                The following is the configuration file used. The JGroups configuration file is the provided example for TCP with TCPPING.

                                 

                                <infinispan xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                                            xsi:schemaLocation="urn:infinispan:config:4.0 http://www.infinispan.org/schemas/infinispan-config-4.1.xsd"
                                            xmlns="urn:infinispan:config:4.0">

                                  <global>
                                    <transport clusterName="infinispan-cluster"
                                               distributedSyncTimeout="50000"
                                               transportClass="org.infinispan.remoting.transport.jgroups.JGroupsTransport">
                                      <properties>
                                        <property name="configurationFile" value="jgroups-tcp-tcpping.xml" />
                                      </properties>
                                    </transport>
                                    <globalJmxStatistics enabled="true"/>
                                  </global>
                                  <default>
                                    <jmxStatistics enabled="true"/>
                                    <!--<expiration maxIdle="30000"/>-->
                                    <clustering mode="d">
                                    </clustering>
                                  </default>

                                </infinispan>
                                
                                
                                • 13. Re: Node 3 fails to join the cluster after restart
                                  sachinsukhtankar

                                  I just killed the GUI; I did not use Stop Cache. For HotRod I used Ctrl-C.

                                  • 14. Re: Node 3 fails to join the cluster after restart
                                    kapilnayar1

                                    I used it in embedded mode and observed similar behavior when the node was killed using Ctrl-C.
