11 Replies Latest reply on Jun 1, 2018 11:29 AM by sivarni2003

Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

enrico.olivelli Nov 17, 2014 3:31 AM

Hi,

here is my scenario.

I have only one DIST_SYNC cache. Most of the JVMs in the cluster are configured with capacityFactor = 0 (like the distributedlocalstorage=false property of Coherence) and some nodes are configured with capacityFactor > 0 (for instance 1000). We are talking about 100 nodes with capacityFactor = 0 and 4 nodes of the other kind; the whole cluster is inside one single "site/rack". Partition handling is off, numOwners is 1.

When all the nodes with capacityFactor > 0 go down, the cluster enters a degraded state from which it cannot recover, even after those nodes are restarted, without a full cluster restart.

If I enable partition handling, AvailabilityExceptions start to be thrown, which I think is the expected behaviour (see the Infinispan User Guide).

I think this is the problem, and that it is a bug:

14/11/17 09:27:25 WARN topology.CacheTopologyControlCommand: ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=shared, type=JOIN, sender=testserver1@xxxxxxx-22311, site-id=xxx, rack-id=xxx, machine-id=24 bytes, joinInfo=CacheJoinInfo{consistentHashFactory=org.infinispan.distribution.ch.impl.TopologyAwareConsistentHashFactory@78b791ef, hashFunction=MurmurHash3, numSegments=60, numOwners=1, timeout=120000, totalOrder=false, distributed=true}, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, throwable=null, viewId=3}
java.lang.IllegalArgumentException: A cache topology's pending consistent hash must contain all the current consistent hash's members
    at org.infinispan.topology.CacheTopology.<init>(CacheTopology.java:48)
    at org.infinispan.topology.CacheTopology.<init>(CacheTopology.java:43)
    at org.infinispan.topology.ClusterCacheStatus.startQueuedRebalance(ClusterCacheStatus.java:631)
    at org.infinispan.topology.ClusterCacheStatus.queueRebalance(ClusterCacheStatus.java:85)
    at org.infinispan.partionhandling.impl.PreferAvailabilityStrategy.onJoin(PreferAvailabilityStrategy.java:22)
    at org.infinispan.topology.ClusterCacheStatus.doJoin(ClusterCacheStatus.java:540)
    at org.infinispan.topology.ClusterTopologyManagerImpl.handleJoin(ClusterTopologyManagerImpl.java:123)
    at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:158)
    at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:140)
    at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:278)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
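
To make the exception message concrete, here is a minimal plain-Java sketch of the invariant it names: every member of the current consistent hash must also appear in the pending one. This is only a toy model. The class, method, and node names are hypothetical, and it assumes the check is a simple set containment, which the message suggests but the trace does not show. It also says nothing about how the two member lists diverged in this scenario.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Toy model of the invariant named in the exception message; not Infinispan's code. */
final class TopologyInvariantSketch {

    /** True when every member of the current hash also appears in the pending hash. */
    static boolean pendingCoversCurrent(List<String> currentMembers, List<String> pendingMembers) {
        return new HashSet<>(pendingMembers).containsAll(currentMembers);
    }

    /** Rejects a pending member list that drops a current member. */
    static void checkTopology(List<String> currentMembers, List<String> pendingMembers) {
        if (!pendingCoversCurrent(currentMembers, pendingMembers)) {
            throw new IllegalArgumentException("pending members " + pendingMembers
                    + " must contain all current members " + currentMembers);
        }
    }

    public static void main(String[] args) {
        // Hypothetical node names, chosen only for illustration.
        List<String> current = Arrays.asList("client1", "store1");

        // Pending list is a superset of the current one: accepted.
        checkTopology(current, Arrays.asList("client1", "store1", "store2"));

        try {
            // Pending list drops both current members: rejected.
            checkTopology(current, Arrays.asList("store2"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model, a joining node can only be added to a pending list that still includes everyone in the current list; a pending list built from a different set of members is rejected outright.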

After that error every "put" results in:

14/11/17 09:27:27 ERROR interceptors.InvocationContextInterceptor: ISPN000136: Execution error
org.infinispan.util.concurrent.TimeoutException: Timed out waiting for topology 1
    at org.infinispan.statetransfer.StateTransferLockImpl.waitForTransactionData(StateTransferLockImpl.java:93)
    at org.infinispan.interceptors.base.BaseStateTransferInterceptor.waitForTransactionData(BaseStateTransferInterceptor.java:96)
    at org.infinispan.statetransfer.StateTransferInterceptor.handleNonTxWriteCommand(StateTransferInterceptor.java:188)
    at org.infinispan.statetransfer.StateTransferInterceptor.visitPutKeyValueCommand(StateTransferInterceptor.java:95)
    at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:71)
    at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98)
    at org.infinispan.interceptors.CacheMgmtInterceptor.updateStoreStatistics(CacheMgmtInterceptor.java:148)
    at org.infinispan.interceptors.CacheMgmtInterceptor.visitPutKeyValueCommand(CacheMgmtInterceptor.java:134)
    at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:71)
    at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98)
    at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:102)
    at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:71)
    at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:35)
    at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:71)
    at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:333)
    at org.infinispan.cache.impl.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1576)
    at org.infinispan.cache.impl.CacheImpl.putInternal(CacheImpl.java:1054)
    at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:1046)
    at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:1646)
    at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:245)

This is the actual configuration:

GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
    .globalJmxStatistics()
    .allowDuplicateDomains(true)
    .cacheManagerName(instanceName)
    .transport()
    .defaultTransport()
    .clusterName(clustername)
    // UDP for my cluster, approx. 100 machines
    .addProperty("configurationFile", configurationFile)
    .machineId(instanceName)
    .siteId("site1")
    .rackId("rack1")
    .nodeName(serviceName + "@" + instanceName)
    .remoteCommandThreadPool().threadPoolFactory(CachedThreadPoolExecutorFactory.create())
    .build();

Configuration wildcard = new ConfigurationBuilder()
    .locking().lockAcquisitionTimeout(lockAcquisitionTimeout)
    .concurrencyLevel(10000).isolationLevel(IsolationLevel.READ_COMMITTED).useLockStriping(true)
    .clustering()
    .cacheMode(CacheMode.DIST_SYNC)
    .l1().lifespan(l1ttl)
    .hash().numOwners(numOwners).capacityFactor(capacityFactor)
    .partitionHandling().enabled(false)
    .stateTransfer().awaitInitialTransfer(false).timeout(initialTransferTimeout).fetchInMemoryState(false)
    .storeAsBinary().enabled(true).storeKeysAsBinary(false).storeValuesAsBinary(true)
    .jmxStatistics().enable()
    .unsafe().unreliableReturnValues(true)
    .build();

Should I report a bug in JIRA?

One workaround is to set capacityFactor = 1 instead of 0, but I do not want "simple nodes" (with less RAM) to become key owners.

For me this is a showstopper.
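
To put a rough number on the workaround above, here is a small arithmetic sketch. It assumes that the share of data a group of nodes owns is simply proportional to its total capacity factor, which ignores segment granularity (numSegments=60, numOwners=1 in this setup) and is not taken from Infinispan's implementation. The class and method names are hypothetical.

```java
/** Back-of-the-envelope arithmetic for the capacityFactor = 1 workaround; a simplification. */
final class WorkaroundShareSketch {

    /** Fraction of the total capacity held by the low-capacity group of nodes. */
    static double lowCapacityShare(int lowNodes, double lowFactor, int highNodes, double highFactor) {
        double low = lowNodes * lowFactor;
        double high = highNodes * highFactor;
        if (low + high == 0) {
            throw new IllegalArgumentException("total capacity is zero, so no share is defined");
        }
        return low / (low + high);
    }

    public static void main(String[] args) {
        // Numbers from the post: 100 "simple" nodes, 4 storage nodes with factor 1000.
        double withWorkaround = lowCapacityShare(100, 1.0, 4, 1000.0);
        double withoutWorkaround = lowCapacityShare(100, 0.0, 4, 1000.0);
        System.out.printf("simple-node share with factor 1: %.4f%n", withWorkaround);
        System.out.printf("simple-node share with factor 0: %.4f%n", withoutWorkaround);
    }
}
```

Under this proportional assumption the 100 simple nodes together would hold 100 / 4100 of the capacity, a little under 2.5%. That is small but not zero, which is exactly the objection to the workaround.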

1. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

wdfink Nov 17, 2014 2:58 PM (in response to enrico.olivelli)

From what I understand, this is not a bug.

If you set capacityFactor=0, the node acts as a client that holds no data: every get is looked up remotely, and every put tries to store the entry on another node.
If you now stop all nodes with capacityFactor>0, the cache can neither store nor retrieve any value. All data is lost.
So you need at least one active node with capacityFactor>0.

Does that make sense?
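
The behaviour described in this reply can be illustrated with a toy model in plain Java. It splits a fixed number of segments across nodes in proportion to their capacity factor, so nodes with factor 0 own nothing, and if every remaining node has factor 0 no segment has an owner at all. This is a deliberate simplification and not Infinispan's consistent-hash algorithm. The class, method, and node names are hypothetical; the value 60 is the numSegments shown in the log above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy proportional segment assignment; a simplification, not Infinispan's consistent hash. */
final class CapacityFactorSketch {

    /** Returns how many segments each node owns, proportional to its capacity factor. */
    static Map<String, Integer> assignSegments(Map<String, Double> capacityFactors, int numSegments) {
        double total = 0;
        for (double factor : capacityFactors.values()) {
            if (factor < 0) throw new IllegalArgumentException("capacity factor must be >= 0");
            total += factor;
        }

        Map<String, Integer> owned = new LinkedHashMap<>();
        for (String node : capacityFactors.keySet()) owned.put(node, 0);
        if (total == 0) return owned; // only factor-0 nodes left: nobody can own a segment

        // First pass: each node gets the floor of its proportional share.
        int assigned = 0;
        for (Map.Entry<String, Double> e : capacityFactors.entrySet()) {
            int share = (int) Math.floor(numSegments * e.getValue() / total);
            owned.put(e.getKey(), share);
            assigned += share;
        }
        // Second pass: hand leftover segments to nodes with a non-zero factor.
        while (assigned < numSegments) {
            for (Map.Entry<String, Double> e : capacityFactors.entrySet()) {
                if (assigned == numSegments) break;
                if (e.getValue() > 0) {
                    owned.merge(e.getKey(), 1, Integer::sum);
                    assigned++;
                }
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        Map<String, Double> cluster = new LinkedHashMap<>();
        cluster.put("client1", 0.0);
        cluster.put("client2", 0.0);
        cluster.put("store1", 1000.0);
        cluster.put("store2", 1000.0);
        System.out.println("healthy:  " + assignSegments(cluster, 60));

        // All storage nodes stopped: only capacityFactor=0 members remain.
        cluster.remove("store1");
        cluster.remove("store2");
        System.out.println("degraded: " + assignSegments(cluster, 60));
    }
}
```

In the degraded case every node owns zero segments, which matches the statement that the cache can neither store nor retrieve values until a node with capacityFactor>0 is present again.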
2. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

an1310 Nov 18, 2014 10:02 AM (in response to enrico.olivelli)

Do I understand this correctly -- the cluster doesn't recover when you re-add a node with a capacityFactor > 0? If so, Wolf-Dieter Fink's reply explains the behavior.
3. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

enrico.olivelli Nov 18, 2014 10:07 AM (in response to an1310)

Yes, my problem is that even after capacityFactor>0 nodes are re-added to the cluster, the system can't get back to a stable state.
4. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

an1310 Nov 18, 2014 10:13 AM (in response to enrico.olivelli)

Yes, that is almost assuredly a bug. If I have some time in the next few days, I could look at it if the team is busy (since I need that functionality too).
5. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

wdfink Nov 19, 2014 7:25 AM (in response to an1310)

I would expect all data to be lost if you have no node with capacityFactor>0, but after adding such a node the cluster should be able to recover and start caching.
If not, please file a JIRA here: Infinispan - JBoss Issue Tracker
6. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

enrico.olivelli Nov 19, 2014 8:30 AM (in response to wdfink)

[ISPN-4996] Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0 - JBoss Issue Tracker
7. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

enrico.olivelli Nov 20, 2014 5:09 AM (in response to enrico.olivelli)

Since this is a blocker for us, can someone evaluate the issue and schedule it for a release?
It would be very nice to have it fixed in a possible 7.0.3 release.

Thanks
8. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

wdfink Nov 21, 2014 5:59 AM (in response to enrico.olivelli)

Hello Enrico,

Infinispan is the community version; there is no SLA or guarantee that fixes will be available soon.
If you have such requirements, I would recommend getting a subscription and using JDG, which is the enterprise product based on Infinispan.
With it you get bug fixes and security updates for years for the version you use. Hot fixes are also provided if you have problems in production or are blocked.
9. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

sivarni2003 Jun 1, 2018 3:11 AM (in response to wdfink)

Hi,

May I know whether this issue has been fixed in later releases? I am facing the same issue: the cache is unable to recover after re-adding a node with capacityFactor > 0. This is a major setback for us. Any help is appreciated. Thanks
10. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

rvansa Jun 1, 2018 5:40 AM (in response to sivarni2003)

https://issues.jboss.org/browse/ISPN-4996 was not fixed according to JIRA; it might have been fixed along with some other changes, though. The best way to confirm would be to write a test case (using the most recent Infinispan release, of course).
11. Re: Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

sivarni2003 Jun 1, 2018 11:29 AM (in response to rvansa)

Thanks. I am using version 8.1.3 and the problem still persists. I have yet to check with the latest release.
