3 Replies Latest reply on Apr 9, 2019 8:32 AM by dan.berindei

    Infinispan node cannot join cache after crash

    nsahattchiev

      Hi,

       

      we have a distributed Infinispan cache running on 4 nodes, with global state and persistence enabled:

       

      <global-state>

          <persistent-location path="rocksdb/${localNodeId}/persistent" />
          <shared-persistent-location path="rocksdb/${localNodeId}/shared"/>
          <temporary-location path="rocksdb/${localNodeId}/tmp"/>
          <overlay-configuration-storage />
      </global-state>
      .....
      <distributed-cache name="EiwoDistributedCache" mode="SYNC" remote-timeout="300000" owners="2" segments="100">
          <locking concurrency-level="1000" acquire-timeout="60000"/>
          <transaction mode="NONE"/>

          <persistence passivation="false">
              <rocksdbStore:rocksdb-store preload="true" fetch-state="true" path="rocksdb/${localNodeId}/data/">
                  <rocksdbStore:expiration path="rocksdb/${localNodeId}/expired/"/>
              </rocksdbStore:rocksdb-store>
          </persistence>
          <indexing index="NONE"/>

          <state-transfer timeout="120000" await-initial-transfer="true"></state-transfer>
      </distributed-cache>

       

      After an out-of-memory error on node 2, the whole cluster was in an unstable state and we tried to restart it. Nodes 1, 2 and 4 started without any issues, but node 3 always failed with the following exception:

       

      org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl

           at org.infinispan.commons.util.SecurityActions.lambda$invokeAccessibly$0(SecurityActions.java:83)

           at org.infinispan.commons.util.SecurityActions.doPrivileged(SecurityActions.java:71)

           at org.infinispan.commons.util.SecurityActions.invokeAccessibly(SecurityActions.java:76)

           at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:185)

           at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:968)

           at org.infinispan.factories.AbstractComponentRegistry.lambda$invokePrioritizedMethods$6(AbstractComponentRegistry.java:703)

           at org.infinispan.factories.SecurityActions.lambda$run$1(SecurityActions.java:72)

           at org.infinispan.security.Security.doPrivileged(Security.java:44)

           at org.infinispan.factories.SecurityActions.run(SecurityActions.java:71)

           at org.infinispan.factories.AbstractComponentRegistry.invokePrioritizedMethods(AbstractComponentRegistry.java:696)

           at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:689)

           at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:607)

           at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:244)

           at org.infinispan.cache.impl.CacheImpl.start(CacheImpl.java:1051)

           at org.infinispan.cache.impl.AbstractDelegatingCache.start(AbstractDelegatingCache.java:421)

           at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:646)

           at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:591)

           at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:477)

           at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:463)

           at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:449)

           ..........

      Caused by: org.infinispan.topology.CacheJoinException: ISPN000410: Node eiwopoc-14554 attempting to join cache EiwoDistributedCache with incompatible state

           at org.infinispan.topology.ClusterCacheStatus.addMember(ClusterCacheStatus.java:233)

           at org.infinispan.topology.ClusterCacheStatus.doJoin(ClusterCacheStatus.java:692)

           at org.infinispan.topology.ClusterTopologyManagerImpl.handleJoin(ClusterTopologyManagerImpl.java:212)

           at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:178)

           at org.infinispan.topology.CacheTopologyControlCommand.invokeAsync(CacheTopologyControlCommand.java:160)

           at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.invokeReplicableCommand(GlobalInboundInvocationHandler.java:169)

           at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.runReplicableCommand(GlobalInboundInvocationHandler.java:150)

           at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.lambda$handleReplicableCommand$1(GlobalInboundInvocationHandler.java:144)

           at org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:212)

           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

           at java.lang.Thread.run(Thread.java:748)

       

      How can we get it up and running again without losing any data? We use Infinispan version 9.3.6.Final.

       

      Regards

      Nikolai

        • 1. Re: Infinispan node cannot join cache after crash
          rhn-support-abhati

          Hi,

           

          Which version of Infinispan are you using?

          I suspect that the node was not shut down properly, which might have caused the state transfer to time out.

          Can you attach your configuration files along with the server logs for all servers for detailed analysis?

          • 2. Re: Infinispan node cannot join cache after crash
            nsahattchiev

            Hi, you can find the version and configuration in my first post. After the out-of-memory problem the cluster was not stable (nodes could not see each other, we had a lot of JGroups timeout exceptions and so on). Therefore we shut down the whole cluster and tried to start it again. All nodes except node 3 started successfully.

             

            The logs are huge, and analysing them would be difficult and very time-consuming.

             

            I am just trying to figure out how to start node 3 again.

             

            This happened on a pre-prod system and we would like to know how to get the cluster back to a normal state.

            • 3. Re: Infinispan node cannot join cache after crash
              dan.berindei

              The failing node should start normally after you delete the persistent-location directory.
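
              For anyone who wants to script this step, here is a minimal sketch of one way to do it. It assumes the failing node is stopped and that you pass in the resolved persistent-location path from the configuration in the first post (rocksdb/${localNodeId}/persistent with the node id filled in). The class and method names are hypothetical. Note that it renames the directory to a timestamped backup instead of deleting it, which is a more cautious substitute for the deletion suggested above; remove the backup once the node has rejoined.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Infinispan: moves a node's global-state
// persistent-location directory out of the way so the node starts without it.
public class PersistentLocationReset {

    // Renames the directory to "<name>.bak-<millis>" next to it and returns the new path.
    // Assumption: the backup sits on the same filesystem, so this is a plain rename.
    static Path moveAside(Path persistentLocation) throws IOException {
        if (!Files.isDirectory(persistentLocation)) {
            throw new IOException("Not a directory: " + persistentLocation);
        }
        Path backup = persistentLocation.resolveSibling(
                persistentLocation.getFileName() + ".bak-" + System.currentTimeMillis());
        return Files.move(persistentLocation, backup);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PersistentLocationReset <persistent-location directory>");
            return;
        }
        // Run this only while the node is stopped.
        Path backup = moveAside(Paths.get(args[0]));
        System.out.println("Moved persistent location to " + backup);
    }
}
```

              The sketch touches only the directory it is given. The RocksDB store paths from the configuration (rocksdb/${localNodeId}/data/ and rocksdb/${localNodeId}/expired/) are separate directories and are left alone.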