3 Replies Latest reply on Apr 9, 2019 8:32 AM by dan.berindei

Infinispan node cannot join cache after crash

nsahattchiev Mar 20, 2019 11:49 AM

Hi,

We have a distributed Infinispan cache running on 4 nodes, with global state and persistence enabled:

<global-state>

    <persistent-location path="rocksdb/${localNodeId}/persistent" />
    <shared-persistent-location path="rocksdb/${localNodeId}/shared"/>
    <temporary-location path="rocksdb/${localNodeId}/tmp"/>
    <overlay-configuration-storage />
</global-state>
.....

<distributed-cache name="EiwoDistributedCache" mode="SYNC" remote-timeout="300000" owners="2" segments="100">
    <locking concurrency-level="1000" acquire-timeout="60000"/>
    <transaction mode="NONE"/>

    <persistence passivation="false">
        <rocksdbStore:rocksdb-store preload="true" fetch-state="true" path="rocksdb/${localNodeId}/data/">
            <rocksdbStore:expiration path="rocksdb/${localNodeId}/expired/"/>
        </rocksdbStore:rocksdb-store>
    </persistence>
    <indexing index="NONE"/>

    <state-transfer timeout="120000" await-initial-transfer="true"></state-transfer>
</distributed-cache>

After an out-of-memory error on node 2, the whole cluster was in an unstable state and we tried to restart it. Nodes 1, 2 and 4 started without any issues, but node 3 always failed with the following exception:

org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl

at org.infinispan.commons.util.SecurityActions.lambda$invokeAccessibly$0(SecurityActions.java:83)

at org.infinispan.commons.util.SecurityActions.doPrivileged(SecurityActions.java:71)

at org.infinispan.commons.util.SecurityActions.invokeAccessibly(SecurityActions.java:76)

at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:185)

at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:968)

at org.infinispan.factories.AbstractComponentRegistry.lambda$invokePrioritizedMethods$6(AbstractComponentRegistry.java:703)

at org.infinispan.factories.SecurityActions.lambda$run$1(SecurityActions.java:72)

at org.infinispan.security.Security.doPrivileged(Security.java:44)

at org.infinispan.factories.SecurityActions.run(SecurityActions.java:71)

at org.infinispan.factories.AbstractComponentRegistry.invokePrioritizedMethods(AbstractComponentRegistry.java:696)

at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:689)

at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:607)

at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:244)

at org.infinispan.cache.impl.CacheImpl.start(CacheImpl.java:1051)

at org.infinispan.cache.impl.AbstractDelegatingCache.start(AbstractDelegatingCache.java:421)

at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:646)

at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:591)

at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:477)

at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:463)

at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:449)

..........

Caused by: org.infinispan.topology.CacheJoinException: ISPN000410: Node eiwopoc-14554 attempting to join cache EiwoDistributedCache with incompatible state

at org.infinispan.topology.ClusterCacheStatus.addMember(ClusterCacheStatus.java:233)

at org.infinispan.topology.ClusterCacheStatus.doJoin(ClusterCacheStatus.java:692)

at org.infinispan.topology.ClusterTopologyManagerImpl.handleJoin(ClusterTopologyManagerImpl.java:212)

at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:178)

at org.infinispan.topology.CacheTopologyControlCommand.invokeAsync(CacheTopologyControlCommand.java:160)

at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.invokeReplicableCommand(GlobalInboundInvocationHandler.java:169)

at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.runReplicableCommand(GlobalInboundInvocationHandler.java:150)

at org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler.lambda$handleReplicableCommand$1(GlobalInboundInvocationHandler.java:144)

at org.infinispan.util.concurrent.BlockingTaskAwareExecutorServiceImpl$RunnableWrapper.run(BlockingTaskAwareExecutorServiceImpl.java:212)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

at java.lang.Thread.run(Thread.java:748)

How can we get it up and running again without losing any data? We use Infinispan version 9.3.6.Final.

Regards

Nikolai

1. Re: Infinispan node cannot join cache after crash

rhn-support-abhati Mar 21, 2019 5:19 AM (in response to nsahattchiev)

Hi,

Which version of Infinispan are you using?
I suspect that the node was not shut down properly, which might have caused the state transfer to time out.
Can you attach your configuration files along with the server logs for all servers for detailed analysis?
2. Re: Infinispan node cannot join cache after crash

nsahattchiev Mar 21, 2019 5:48 AM (in response to rhn-support-abhati)

Hi, you can find the version and configuration in my first post. After the out-of-memory problem the cluster was not stable (nodes could not see each other, we had a lot of JGroups timeout exceptions, and so on). Therefore we shut down the whole cluster and tried to start it again. All nodes except node 3 started successfully.

The logs are huge, and analysing them would be difficult and very time-consuming.

I am just trying to figure out how to start node 3 again.

This happened on a pre-prod system, and we would like to know how to get the cluster back to a normal state.
3. Re: Infinispan node cannot join cache after crash

dan.berindei Apr 9, 2019 8:32 AM (in response to nsahattchiev)

The failing node should start normally after you delete the persistent-location directory.
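
For anyone who wants to script that step, here is a minimal sketch in Python. It is a cautious variant of the advice above, not the same thing: it renames the directory to a timestamped backup instead of deleting it, so the old state can still be inspected or restored. It assumes the directory layout from the configuration in the first post (`rocksdb/${localNodeId}/persistent`) and that the failing node is stopped. The function name and the `node3` value are hypothetical placeholders, not taken from the thread.

```python
import shutil
import time
from pathlib import Path


def set_aside_persistent_state(base_dir, node_id):
    """Rename <base_dir>/rocksdb/<node_id>/persistent to a timestamped backup.

    Assumes the path layout from the posted <global-state> configuration.
    Returns the backup path, or None if the directory does not exist.
    """
    persistent = Path(base_dir) / "rocksdb" / str(node_id) / "persistent"
    if not persistent.is_dir():
        return None

    # Backup lands next to the original, e.g. persistent.bak-20190409-083200
    backup = persistent.with_name("persistent.bak-" + time.strftime("%Y%m%d-%H%M%S"))
    if backup.exists():
        raise FileExistsError(f"backup target already exists: {backup}")

    shutil.move(str(persistent), str(backup))
    return backup


if __name__ == "__main__":
    # "node3" is a placeholder for the failing node's localNodeId value.
    print(set_aside_persistent_state(".", "node3"))
```

The script only touches the `persistent` directory. The sibling `data`, `expired`, `shared` and `tmp` directories from the posted configuration are left as they are.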