Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
jugglingcats Sep 4, 2014 6:14 PM

Hi, we have a three-node ISPN cluster running in production. It sustains around 80 cache reads per second, of which around 20 are loaded from the backing store (MongoDB). Most entries are modified and written back to the cache, with async write-behind to the store. ISPN version is 6.0.0.Final.
The cluster runs fine for a few days, but then, typically during periods of higher-than-normal load, it starts to fail badly, with all three nodes reporting errors:
org.infinispan.util.concurrent.TimeoutException: Timed out waiting for 15 seconds for valid responses from any of [Sender{address=xxxxxxx-ad3-prod-12362, responded=false}].
We can telnet from the node in question to the remote node on the jgroups port, so connectivity isn't an issue.
There is another log entry that appears much less often, and I'm not sure if it's related...
04-09-2014 16:36:16,635 WARN [INT-8,ISPN,vocento-ad2-prod-28582] org.jgroups.protocols.TCP - Discarding message because TCP send_queue is full and hasn't been releasing for 300 ms
Has anyone else seen these errors? Basically our cluster is dead at this point... 50% of responses to clients are taking over 15 seconds! It happens very suddenly - everything looks healthy, then the logs fill with errors :-(
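For reference, the 15 seconds in the exception matches Infinispan's default sync replication timeout (replTimeout = 15000 ms), and as far as I understand remote gets are synchronous RPCs even in DIST_ASYNC mode. One thing we're considering while we investigate is raising that timeout; a rough sketch of what we'd try, assuming clustering().sync().replTimeout(...) is the right knob in 6.x (happy to be corrected):

```java
import java.util.concurrent.TimeUnit;

import org.infinispan.configuration.cache.ConfigurationBuilder;

public class TimeoutSketch {
    // Sketch only: raise the sync RPC timeout from the 15s default.
    // This wouldn't fix the underlying stall, but might stop a brief
    // slowdown from cascading into cluster-wide TimeoutExceptions.
    static ConfigurationBuilder withLongerReplTimeout() {
        ConfigurationBuilder builder = new ConfigurationBuilder();
        builder.clustering()
            .sync().replTimeout(30, TimeUnit.SECONDS); // up from 15s
        return builder;
    }
}
```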
I was slightly worried to see this bug report filed recently: https://issues.jboss.org/browse/ISPN-4615.
Our cache is configured as follows:
ConfigurationBuilder builder = new ConfigurationBuilder();
if (isClustered()) {
    builder.clustering()
        .cacheMode(distributedCacheMode) // DIST_ASYNC
        .hash()
            .numOwners(distributedCacheOwners); // 1 owner
}
builder.locking().concurrencyLevel(concurrency); // 500
if (cacheInfo.persistence()) {
    builder.persistence()
        .addStore(MongoStoreConfigurationBuilder.class) // a custom MongoDB store implementation
            .shared(true)
            .mongoDatabase(db)
            .jsonStore(jsonStore)
            .persistentType(type)
            .metricsRegistry(metrics)
            .fetchPersistentState(true)
            .async().enabled(true)
                .modificationQueueSize(modificationQueueSize)
                .threadPoolSize(threadPoolSize);
}
builder.eviction()
    .maxEntries(defaultCacheSize) // 100,000
    .strategy(EvictionStrategy.LIRS);
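One thing I do wonder about is our numOwners = 1: with a single owner, a read of any non-local key is a synchronous remote get to exactly one node, so a single slow or blocked node would stall every reader of the keys it owns - which might match the sudden cluster-wide failures we see. A sketch of the change we're considering (same fluent API as our config above; note the second copy roughly doubles the memory used by cached entries):

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class OwnersSketch {
    // Sketch: with two owners a remote get has two candidate responders,
    // so one unresponsive node no longer blocks all reads of its keys.
    // Trade-off: roughly double the memory footprint for cached entries.
    static ConfigurationBuilder withTwoOwners() {
        ConfigurationBuilder builder = new ConfigurationBuilder();
        builder.clustering()
            .cacheMode(CacheMode.DIST_ASYNC)
            .hash()
                .numOwners(2); // was 1
        return builder;
    }
}
```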
Any help tracking down the source of this issue would be massively appreciated, as would advice on tuning options that might alleviate it.
Thanks!