5 Replies Latest reply on Jun 15, 2016 12:59 PM by manojdunna

    Node cannot join cluster correctly after restart

    maruta.s

      Hello,

      We are facing a problem where, after an unexpected server restart, a node is not able to join the cluster again.

      Our environment is a 2-node cluster (Infinispan REST servers). One node (node1) was restarted, and after that it could not join the cluster until the second node (node2) was also restarted:

       

      Logs from node1:

      After the restart there are many errors regarding transactions:

       

      Sep 29, 2014 3:03:53 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport stop

      INFO: ISPN000080: Disconnecting and closing JGroups Channel

      Sep 29, 2014 3:03:54 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport stop

      INFO: ISPN000082: Stopping the RpcDispatcher

      Sep 29, 2014 3:05:20 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport start

      INFO: ISPN000078: Starting JGroups Channel

      Sep 29, 2014 3:05:21 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS984-56770|15] [LOUAPPWPS984-56770, LOUAPPWPS983-56765]

      Sep 29, 2014 3:05:21 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport startJGroupsChannelIfNeeded

      INFO: ISPN000079: Cache local address is LOUAPPWPS983-56765, physical addresses are [133.27.18.204:7800]

      Sep 29, 2014 3:05:21 AM org.infinispan.factories.GlobalComponentRegistry start

      INFO: ISPN000128: Infinispan version: Infinispan 'Delirium' 5.2.6.Final

      Sep 29, 2014 3:05:22 AM org.infinispan.factories.TransactionManagerFactory construct

      INFO: ISPN000161: Using a batchMode transaction manager

      Sep 29, 2014 3:05:22 AM org.infinispan.jmx.CacheJmxRegistration start

      INFO: ISPN000031: MBeans were successfully registered to the platform MBean server.

      Sep 29, 2014 3:05:23 AM org.infinispan.factories.TransactionManagerFactory construct

      INFO: ISPN000161: Using a batchMode transaction manager

      Sep 29, 2014 3:05:23 AM org.infinispan.jmx.CacheJmxRegistration start

      INFO: ISPN000031: MBeans were successfully registered to the platform MBean server.

      Sep 29, 2014 3:05:23 AM org.infinispan.factories.TransactionManagerFactory construct

      INFO: ISPN000161: Using a batchMode transaction manager

      Sep 29, 2014 3:05:23 AM org.infinispan.jmx.CacheJmxRegistration start

      INFO: ISPN000031: MBeans were successfully registered to the platform MBean server.

      Sep 29, 2014 3:06:18 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS983-56765|16] [LOUAPPWPS983-56765]

      Sep 29, 2014 3:06:38 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000093: Received new, MERGED cluster view: MergeView::[LOUAPPWPS983-56765|17] [LOUAPPWPS983-56765, LOUAPPWPS984-56770], subgroups=[LOUAPPWPS984-56770|15] [LOUAPPWPS984-56770], [LOUAPPWPS983-56765|16] [LOUAPPWPS983-56765]

      Sep 29, 2014 3:07:33 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport stop

      INFO: ISPN000080: Disconnecting and closing JGroups Channel

      Sep 29, 2014 3:07:33 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport stop

      INFO: ISPN000082: Stopping the RpcDispatcher

      Sep 29, 2014 3:08:45 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport start

      INFO: ISPN000078: Starting JGroups Channel

      Sep 29, 2014 3:08:46 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS984-56770|19] [LOUAPPWPS984-56770, LOUAPPWPS983-54866]

      Sep 29, 2014 3:08:47 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport startJGroupsChannelIfNeeded

      INFO: ISPN000079: Cache local address is LOUAPPWPS983-54866, physical addresses are [133.27.18.204:7800]

      Sep 29, 2014 3:08:47 AM org.infinispan.factories.GlobalComponentRegistry start

      INFO: ISPN000128: Infinispan version: Infinispan 'Delirium' 5.2.6.Final

      Sep 29, 2014 3:08:47 AM org.infinispan.factories.TransactionManagerFactory construct

      INFO: ISPN000161: Using a batchMode transaction manager

      Sep 29, 2014 3:08:47 AM org.infinispan.jmx.CacheJmxRegistration start

      INFO: ISPN000031: MBeans were successfully registered to the platform MBean server.

      Sep 29, 2014 3:12:47 AM org.apache.catalina.core.ApplicationContext log

      SEVERE: StandardWrapper.Throwable

      org.infinispan.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete() throws java.lang.InterruptedException on object of type StateTransferManagerImpl

          at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)

          at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:886)

          at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:657)

          at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:646)

          at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:549)

          at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:217)

          at org.infinispan.CacheImpl.start(CacheImpl.java:582)

          at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:686)

          at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:649)

          at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:545)

          at org.infinispan.rest.StartupListener$$anonfun$init$1.apply(StartupListener.scala:66)

          at org.infinispan.rest.StartupListener$$anonfun$init$1.apply(StartupListener.scala:65)

          at scala.collection.Iterator$class.foreach(Iterator.scala:727)

          at scala.collection.AbstractIterator.foreach(Iterator.scala:1156)

          at org.infinispan.rest.StartupListener.init(StartupListener.scala:65)

          at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1280)

          at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1193)

          at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:1088)

          at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:5176)

          at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5460)

          at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)

          at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)

          at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)

          at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:633)

          at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1120)

          at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1678)

          at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)

          at java.util.concurrent.FutureTask.run(Unknown Source)

          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

          at java.lang.Thread.run(Unknown Source)

      Caused by: org.infinispan.CacheException: Initial state transfer timed out for cache eu.ysoft.safeq.core.cache.entity.CacheableJobInfo_index on LOUAPPWPS983-54866

          at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:216)

          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

          at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

          at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

          at java.lang.reflect.Method.invoke(Unknown Source)

          at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:203)

          ... 30 more

      ....

       

      Followed by:

       

      Sep 29, 2014 4:29:06 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS983-54866|20] [LOUAPPWPS983-54866]

      Sep 29, 2014 4:29:07 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:10 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:12 AM org.infinispan.factories.TransactionManagerFactory construct

      INFO: ISPN000161: Using a batchMode transaction manager

      Sep 29, 2014 4:29:12 AM org.infinispan.jmx.CacheJmxRegistration start

      INFO: ISPN000031: MBeans were successfully registered to the platform MBean server.

      Sep 29, 2014 4:29:13 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:16 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:19 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:22 AM org.jgroups.logging.JDKLogImpl warn

       

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:46 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:49 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:52 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      ...

      Sep 29, 2014 4:30:44 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:30:47 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:30:50 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:30:52 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: LOUAPPWPS983-54866: no physical address for LOUAPPWPS984-56770, dropping message

      Sep 29, 2014 4:30:52 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: LOUAPPWPS983-54866: no physical address for LOUAPPWPS984-56770, dropping message

      Sep 29, 2014 4:30:53 AM org.jgroups.logging.JDKLogImpl warn

       

      ....

      still repeating

       

       

      Logs from node2:

      The second node was not restarted:

       

      Sep 29, 2014 3:05:21 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS984-56770|15] [LOUAPPWPS984-56770, LOUAPPWPS983-56765]

      Sep 29, 2014 3:06:38 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000093: Received new, MERGED cluster view: MergeView::[LOUAPPWPS983-56765|17] [LOUAPPWPS983-56765, LOUAPPWPS984-56770], subgroups=[LOUAPPWPS984-56770|15] [LOUAPPWPS984-56770], [LOUAPPWPS983-56765|16] [LOUAPPWPS983-56765]

      Sep 29, 2014 3:07:33 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS984-56770|18] [LOUAPPWPS984-56770]

      Sep 29, 2014 3:07:33 AM org.infinispan.topology.ClusterTopologyManagerImpl handleNewView

      ERROR: ISPN000196: Failed to recover cluster state after the current node became the coordinator

      java.lang.IllegalStateException: Trying to set a topology with invalid members for cache eu.ysoft.safeq.core.cache.entity.CacheableJobInfo_index: members = [LOUAPPWPS984-56770], topology = CacheTopology{id=37, currentCH=DefaultConsistentHash{numSegments=60, numOwners=1, members=[LOUAPPWPS984-56770, LOUAPPWPS983-56765]}, pendingCH=null}

          at org.infinispan.topology.ClusterCacheStatus.updateCacheTopology(ClusterCacheStatus.java:165)

          at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheStatusAfterMerge(ClusterTopologyManagerImpl.java:315)

          at org.infinispan.topology.ClusterTopologyManagerImpl.handleNewView(ClusterTopologyManagerImpl.java:236)

          at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.handleViewChange(ClusterTopologyManagerImpl.java:579)

          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

          at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

          at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

          at java.lang.reflect.Method.invoke(Unknown Source)

          at org.infinispan.notifications.AbstractListenerImpl$ListenerInvocation$1.run(AbstractListenerImpl.java:212)

          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

          at java.lang.Thread.run(Unknown Source)

       

      Sep 29, 2014 3:08:46 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS984-56770|19] [LOUAPPWPS984-56770, LOUAPPWPS983-54866]

      Sep 29, 2014 3:12:48 AM org.infinispan.statetransfer.StateProviderImpl getCacheTopology

      WARN: ISPN000211: Transactions were requested by node LOUAPPWPS983-54866 with topology 30, older than the local topology (35)

      Sep 29, 2014 3:12:48 AM org.infinispan.statetransfer.StateProviderImpl getCacheTopology

      WARN: ISPN000212: Segments were requested by node LOUAPPWPS983-54866 with topology 30, older than the local topology (35)

      Sep 29, 2014 12:42:27 PM org.infinispan.transaction.TransactionTable shutDownGracefully

      WARN: ISPN000100: Stopping, but there are 0 local transactions and 16 remote transactions that did not finish in time.

      Sep 29, 2014 12:42:27 PM org.infinispan.remoting.transport.jgroups.JGroupsTransport stop

      INFO: ISPN000080: Disconnecting and closing JGroups Channel

      Sep 29, 2014 12:42:29 PM org.jgroups.logging.JDKLogImpl warn

      WARNING: LOUAPPWPS984-56770: failed to collect all ACKs (expected=1) for view [LOUAPPWPS983-54866|20] after 2000ms, missing ACKs from [LOUAPPWPS983-54866]

      Sep 29, 2014 12:42:30 PM org.jgroups.logging.JDKLogImpl error

      SEVERE: couldn't deliver OOB message [dst: LOUAPPWPS984-56770, src: LOUAPPWPS983-54866 (3 headers), size=1396 bytes, flags=OOB|DONT_BUNDLE]

      java.lang.IllegalStateException: channel is not connected

          at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:615)

          at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:544)

          at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:391)

          at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:249)

          at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:598)

          at org.jgroups.JChannel.up(JChannel.java:707)

          at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1020)

          at org.jgroups.protocols.RSVP.up(RSVP.java:188)

          at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)

          at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)

          at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)

          at org.jgroups.protocols.pbcast.GMS.up(GMS.java:896)

          at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:245)

          at org.jgroups.protocols.UNICAST2.handleDataReceived(UNICAST2.java:765)

          at org.jgroups.protocols.UNICAST2.up(UNICAST2.java:420)

          at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:606)

          at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:143)

          at org.jgroups.protocols.FD.up(FD.java:253)

          at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)

          at org.jgroups.protocols.MERGE2.up(MERGE2.java:205)

          at org.jgroups.protocols.Discovery.up(Discovery.java:359)

          at org.jgroups.protocols.TP.passMessageUp(TP.java:1263)

          at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1825)

          at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1798)

          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

          at java.lang.Thread.run(Unknown Source)

       

       

       

      Until node1 receives the new view at 4:29, in which it is the only node in the cluster, node1 cannot read or write data to the Infinispan cache.

      After 4:29 it works, but the data are not replicated, as the cluster is probably disconnected.

       

      Could someone help me understand why the cluster is not working from the restart of node1 at 3:08 until 4:29?

      And how can we prevent or resolve this issue?

       

      Full logs and infinispan-rest-server configuration are attached.

       

      Thanks

      Marta

        • 1. Re: Node cannot join cluster correctly after restart
          wdfink

          You should attach the jgroups configuration as well.

          Are you using client/server or library mode, and which version is it?

          Also, a bit more detail about your environment might help.

          • 2. Re: Node cannot join cluster correctly after restart
            maruta.s

            I have attached the JGroups configuration for both nodes.

            We are using the infinispan-server-rest package, version 5.2.6.Final.

            This package is deployed to Tomcat as a REST service.

            If you need any further info, please ask.

             

            Marta

            • 3. Re: Node cannot join cluster correctly after restart
              rvansa

              The problem evidently starts when the initial state transfer times out - you should increase the initial state transfer timeout, or disable waiting for it completely.
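
              For illustration, that can be done with the programmatic API roughly like this - a minimal sketch, assuming the 5.2-style ConfigurationBuilder; the cache mode and values are placeholders, not your actual settings:

              import java.util.concurrent.TimeUnit;
              import org.infinispan.configuration.cache.CacheMode;
              import org.infinispan.configuration.cache.Configuration;
              import org.infinispan.configuration.cache.ConfigurationBuilder;

              public class StateTransferSettings {
                  public static Configuration build() {
                      return new ConfigurationBuilder()
                          .clustering()
                              .cacheMode(CacheMode.REPL_SYNC)             // placeholder; use your actual cache mode
                              .stateTransfer()
                                  .timeout(TimeUnit.MINUTES.toMillis(10)) // raise the initial state transfer timeout
                                  .awaitInitialTransfer(false)            // or don't block cache startup on it at all
                          .build();
                  }
              }

              In the XML used by the REST server, the same knobs should map to the timeout and awaitInitialTransfer attributes of the stateTransfer element; worth double-checking against the 5.2 schema.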

               

              We don't have any info on why the cluster membership changes, but I am somewhat surprised to see two members on the same machine: LOUAPPWPS983-54866 and LOUAPPWPS983-56765. It seems like there's some ghost of the former instance.


              Anyway, there have been many bug fixes to state transfer and consistency when the cluster is restarting - I'd suggest you upgrade to at least 6.0.1.

              • 4. Re: Node cannot join cluster correctly after restart
                maruta.s

                There are no cluster membership changes in this case. The reason for the different logical names of the channel is that we do not set the name, so JGroups generates one from the hostname and a random number when the channel is created.

                So after a restart the logical names are different, but this is not a problem.
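
                (For reference, a stable logical name can be assigned via the transport configuration if ever needed - a minimal sketch, assuming the 5.2 GlobalConfigurationBuilder API; the names are placeholders, and Infinispan may still append a random suffix for uniqueness:)

                import org.infinispan.configuration.global.GlobalConfiguration;
                import org.infinispan.configuration.global.GlobalConfigurationBuilder;

                public class TransportSettings {
                    public static GlobalConfiguration build() {
                        return new GlobalConfigurationBuilder()
                            .transport()
                                .defaultTransport()
                                .nodeName("node1")           // placeholder base name instead of the generated hostname-<random>
                                .clusterName("rest-cluster") // placeholder cluster name
                            .build();
                    }
                }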

                • 5. Re: Node cannot join cluster correctly after restart
                  manojdunna

                  belaban


                  Node not able to rejoin the cluster after restart

                   

                  Hi, I had the same problem.

                   

                  I configured my cluster in shared-store mode with unicast.

                   

                  When I initially start both nodes and access them individually, anything I update on Node1 is propagated to Node2 and vice versa,

                  but if I restart either node, it is not able to rejoin the cluster.

                  E.g.: if I restart Node2 and then perform any operation on Node1, it is not propagated to Node2;

                        and if I update anything on Node2, Node2 itself is not updated, but the change is propagated to Node1.

                  I am attaching the standalone-full_ha.xml of both nodes.

                   

                  Workaround 1: if I switch to replication mode instead of unicast, everything works fine.

                   

                  Workaround 2: if I clear the journal directory before restarting the second node, it also works fine.

                   

                  Attached are the logs from when I restart Node2, which is not able to join the cluster.