5 Replies Latest reply on Jun 15, 2016 12:59 PM by Manoj Dunna

    Node cannot join cluster correctly after restart

    Marta Sedlakova Newbie

      Hello,

      we are facing problem, that after some unexpected server restart, node is not able to join the cluster again.

      Our environment is 2 node cluster (infinispan rest servers). One node1 (server) was restarted and after that it cannot join the cluster until second node (node2) was restarted:

       

      Logs from node1:

      After restart there are many errors regarding transactions:

       

      Sep 29, 2014 3:03:53 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport stop

      INFO: ISPN000080: Disconnecting and closing JGroups Channel

      Sep 29, 2014 3:03:54 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport stop

      INFO: ISPN000082: Stopping the RpcDispatcher

      Sep 29, 2014 3:05:20 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport start

      INFO: ISPN000078: Starting JGroups Channel

      Sep 29, 2014 3:05:21 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS984-56770|15] [LOUAPPWPS984-56770, LOUAPPWPS983-56765]

      Sep 29, 2014 3:05:21 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport startJGroupsChannelIfNeeded

      INFO: ISPN000079: Cache local address is LOUAPPWPS983-56765, physical addresses are [133.27.18.204:7800]

      Sep 29, 2014 3:05:21 AM org.infinispan.factories.GlobalComponentRegistry start

      INFO: ISPN000128: Infinispan version: Infinispan 'Delirium' 5.2.6.Final

      Sep 29, 2014 3:05:22 AM org.infinispan.factories.TransactionManagerFactory construct

      INFO: ISPN000161: Using a batchMode transaction manager

      Sep 29, 2014 3:05:22 AM org.infinispan.jmx.CacheJmxRegistration start

      INFO: ISPN000031: MBeans were successfully registered to the platform MBean server.

      Sep 29, 2014 3:05:23 AM org.infinispan.factories.TransactionManagerFactory construct

      INFO: ISPN000161: Using a batchMode transaction manager

      Sep 29, 2014 3:05:23 AM org.infinispan.jmx.CacheJmxRegistration start

      INFO: ISPN000031: MBeans were successfully registered to the platform MBean server.

      Sep 29, 2014 3:05:23 AM org.infinispan.factories.TransactionManagerFactory construct

      INFO: ISPN000161: Using a batchMode transaction manager

      Sep 29, 2014 3:05:23 AM org.infinispan.jmx.CacheJmxRegistration start

      INFO: ISPN000031: MBeans were successfully registered to the platform MBean server.

      Sep 29, 2014 3:06:18 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS983-56765|16] [LOUAPPWPS983-56765]

      Sep 29, 2014 3:06:38 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000093: Received new, MERGED cluster view: MergeView::[LOUAPPWPS983-56765|17] [LOUAPPWPS983-56765, LOUAPPWPS984-56770], subgroups=[LOUAPPWPS984-56770|15] [LOUAPPWPS984-56770], [LOUAPPWPS983-56765|16] [LOUAPPWPS983-56765]

      Sep 29, 2014 3:07:33 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport stop

      INFO: ISPN000080: Disconnecting and closing JGroups Channel

      Sep 29, 2014 3:07:33 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport stop

      INFO: ISPN000082: Stopping the RpcDispatcher

      Sep 29, 2014 3:08:45 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport start

      INFO: ISPN000078: Starting JGroups Channel

      Sep 29, 2014 3:08:46 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS984-56770|19] [LOUAPPWPS984-56770, LOUAPPWPS983-54866]

      Sep 29, 2014 3:08:47 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport startJGroupsChannelIfNeeded

      INFO: ISPN000079: Cache local address is LOUAPPWPS983-54866, physical addresses are [133.27.18.204:7800]

      Sep 29, 2014 3:08:47 AM org.infinispan.factories.GlobalComponentRegistry start

      INFO: ISPN000128: Infinispan version: Infinispan 'Delirium' 5.2.6.Final

      Sep 29, 2014 3:08:47 AM org.infinispan.factories.TransactionManagerFactory construct

      INFO: ISPN000161: Using a batchMode transaction manager

      Sep 29, 2014 3:08:47 AM org.infinispan.jmx.CacheJmxRegistration start

      INFO: ISPN000031: MBeans were successfully registered to the platform MBean server.

      Sep 29, 2014 3:12:47 AM org.apache.catalina.core.ApplicationContext log

      SEVERE: StandardWrapper.Throwable

      org.infinispan.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete() throws java.lang.InterruptedException on object of type StateTransferManagerImpl

          at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)

          at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:886)

          at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:657)

          at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:646)

          at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:549)

          at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:217)

          at org.infinispan.CacheImpl.start(CacheImpl.java:582)

          at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:686)

          at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:649)

          at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:545)

          at org.infinispan.rest.StartupListener$$anonfun$init$1.apply(StartupListener.scala:66)

          at org.infinispan.rest.StartupListener$$anonfun$init$1.apply(StartupListener.scala:65)

          at scala.collection.Iterator$class.foreach(Iterator.scala:727)

          at scala.collection.AbstractIterator.foreach(Iterator.scala:1156)

          at org.infinispan.rest.StartupListener.init(StartupListener.scala:65)

          at org.apache.catalina.core.StandardWrapper.initServlet(StandardWrapper.java:1280)

          at org.apache.catalina.core.StandardWrapper.loadServlet(StandardWrapper.java:1193)

          at org.apache.catalina.core.StandardWrapper.load(StandardWrapper.java:1088)

          at org.apache.catalina.core.StandardContext.loadOnStartup(StandardContext.java:5176)

          at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5460)

          at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)

          at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)

          at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)

          at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:633)

          at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1120)

          at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1678)

          at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)

          at java.util.concurrent.FutureTask.run(Unknown Source)

          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

          at java.lang.Thread.run(Unknown Source)

      Caused by: org.infinispan.CacheException: Initial state transfer timed out for cache eu.ysoft.safeq.core.cache.entity.CacheableJobInfo_index on LOUAPPWPS983-54866

          at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:216)

          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

          at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

          at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

          at java.lang.reflect.Method.invoke(Unknown Source)

          at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:203)

          ... 30 more

      ....

       

      Followed by :

       

      Sep 29, 2014 4:29:06 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS983-54866|20] [LOUAPPWPS983-54866]

      Sep 29, 2014 4:29:07 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:10 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:12 AM org.infinispan.factories.TransactionManagerFactory construct

      INFO: ISPN000161: Using a batchMode transaction manager

      Sep 29, 2014 4:29:12 AM org.infinispan.jmx.CacheJmxRegistration start

      INFO: ISPN000031: MBeans were successfully registered to the platform MBean server.

      Sep 29, 2014 4:29:13 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:16 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:19 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:22 AM org.jgroups.logging.JDKLogImpl warn

       

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:46 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:49 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:29:52 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      ...

      Sep 29, 2014 4:30:44 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:30:47 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:30:50 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: I was suspected by LOUAPPWPS984-56770; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK

      Sep 29, 2014 4:30:52 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: LOUAPPWPS983-54866: no physical address for LOUAPPWPS984-56770, dropping message

      Sep 29, 2014 4:30:52 AM org.jgroups.logging.JDKLogImpl warn

      WARNING: LOUAPPWPS983-54866: no physical address for LOUAPPWPS984-56770, dropping message

      Sep 29, 2014 4:30:53 AM org.jgroups.logging.JDKLogImpl warn

       

      ....

      still repeating

       

       

      Logs from node2:

      The second node was not restarted:

       

      Sep 29, 2014 3:05:21 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS984-56770|15] [LOUAPPWPS984-56770, LOUAPPWPS983-56765]

      Sep 29, 2014 3:06:38 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000093: Received new, MERGED cluster view: MergeView::[LOUAPPWPS983-56765|17] [LOUAPPWPS983-56765, LOUAPPWPS984-56770], subgroups=[LOUAPPWPS984-56770|15] [LOUAPPWPS984-56770], [LOUAPPWPS983-56765|16] [LOUAPPWPS983-56765]

      Sep 29, 2014 3:07:33 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS984-56770|18] [LOUAPPWPS984-56770]

      Sep 29, 2014 3:07:33 AM org.infinispan.topology.ClusterTopologyManagerImpl handleNewView

      ERROR: ISPN000196: Failed to recover cluster state after the current node became the coordinator

      java.lang.IllegalStateException: Trying to set a topology with invalid members for cache eu.ysoft.safeq.core.cache.entity.CacheableJobInfo_index: members = [LOUAPPWPS984-56770], topology = CacheTopology{id=37, currentCH=DefaultConsistentHash{numSegments=60, numOwners=1, members=[LOUAPPWPS984-56770, LOUAPPWPS983-56765]}, pendingCH=null}

          at org.infinispan.topology.ClusterCacheStatus.updateCacheTopology(ClusterCacheStatus.java:165)

          at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheStatusAfterMerge(ClusterTopologyManagerImpl.java:315)

          at org.infinispan.topology.ClusterTopologyManagerImpl.handleNewView(ClusterTopologyManagerImpl.java:236)

          at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.handleViewChange(ClusterTopologyManagerImpl.java:579)

          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

          at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)

          at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)

          at java.lang.reflect.Method.invoke(Unknown Source)

          at org.infinispan.notifications.AbstractListenerImpl$ListenerInvocation$1.run(AbstractListenerImpl.java:212)

          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

          at java.lang.Thread.run(Unknown Source)

       

      Sep 29, 2014 3:08:46 AM org.infinispan.remoting.transport.jgroups.JGroupsTransport viewAccepted

      INFO: ISPN000094: Received new cluster view: [LOUAPPWPS984-56770|19] [LOUAPPWPS984-56770, LOUAPPWPS983-54866]

      Sep 29, 2014 3:12:48 AM org.infinispan.statetransfer.StateProviderImpl getCacheTopology

      WARN: ISPN000211: Transactions were requested by node LOUAPPWPS983-54866 with topology 30, older than the local topology (35)

      Sep 29, 2014 3:12:48 AM org.infinispan.statetransfer.StateProviderImpl getCacheTopology

      WARN: ISPN000212: Segments were requested by node LOUAPPWPS983-54866 with topology 30, older than the local topology (35)

      Sep 29, 2014 12:42:27 PM org.infinispan.transaction.TransactionTable shutDownGracefully

      WARN: ISPN000100: Stopping, but there are 0 local transactions and 16 remote transactions that did not finish in time.

      Sep 29, 2014 12:42:27 PM org.infinispan.remoting.transport.jgroups.JGroupsTransport stop

      INFO: ISPN000080: Disconnecting and closing JGroups Channel

      Sep 29, 2014 12:42:29 PM org.jgroups.logging.JDKLogImpl warn

      WARNING: LOUAPPWPS984-56770: failed to collect all ACKs (expected=1) for view [LOUAPPWPS983-54866|20] after 2000ms, missing ACKs from [LOUAPPWPS983-54866]

      Sep 29, 2014 12:42:30 PM org.jgroups.logging.JDKLogImpl error

      SEVERE: couldn't deliver OOB message [dst: LOUAPPWPS984-56770, src: LOUAPPWPS983-54866 (3 headers), size=1396 bytes, flags=OOB|DONT_BUNDLE]

      java.lang.IllegalStateException: channel is not connected

          at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:615)

          at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:544)

          at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:391)

          at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:249)

          at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:598)

          at org.jgroups.JChannel.up(JChannel.java:707)

          at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1020)

          at org.jgroups.protocols.RSVP.up(RSVP.java:188)

          at org.jgroups.protocols.FRAG2.up(FRAG2.java:181)

          at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)

          at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)

          at org.jgroups.protocols.pbcast.GMS.up(GMS.java:896)

          at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:245)

          at org.jgroups.protocols.UNICAST2.handleDataReceived(UNICAST2.java:765)

          at org.jgroups.protocols.UNICAST2.up(UNICAST2.java:420)

          at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:606)

          at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:143)

          at org.jgroups.protocols.FD.up(FD.java:253)

          at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)

          at org.jgroups.protocols.MERGE2.up(MERGE2.java:205)

          at org.jgroups.protocols.Discovery.up(Discovery.java:359)

          at org.jgroups.protocols.TP.passMessageUp(TP.java:1263)

          at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1825)

          at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1798)

          at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

          at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

          at java.lang.Thread.run(Unknown Source)

       

       

       

      Until the node 1 receives new view in 4:29, where it is the only node in cluster, the node1 cannot read or write data to infinispan cache.

      After 4:29 is uses, but the data are not replicated, as the cluster is probably disconnected.

       

      Could someone help me what is the reason, why after restart of  node1 in 3:08 to 4:29 the cluster is not working?

      And how to prevent/resolve this issue?

       

      Full logs and infinispan-rest-server configuration are attached.

       

      Thanks

      Marta