6 Replies Latest reply on Sep 15, 2014 3:24 AM by rvansa

    Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)

    jugglingcats

  Hi, we have a three-node ISPN cluster running in production. It sustains around 80 cache reads per second, of which around 20 are loaded from the backing store (MongoDB). Most entries are modified and written back to the cache, with async write-behind to the store. ISPN version is 6.0.0.Final.

       

  The cluster runs fine for a few days but then, typically during periods of higher than normal load, it starts to fail badly, with all three nodes reporting errors:

       

      org.infinispan.util.concurrent.TimeoutException: Timed out waiting for 15 seconds for valid responses from any of [Sender{address=xxxxxxx-ad3-prod-12362, responded=false}].

       

      We can telnet from the node in question to the remote node on the jgroups port, so connectivity isn't an issue.

       

      There is another log entry that appears much less often, and I'm not sure if it's related...

       

      04-09-2014 16:36:16,635 WARN  [INT-8,ISPN,vocento-ad2-prod-28582] org.jgroups.protocols.TCP  - Discarding message because TCP send_queue is full and hasn't been releasing for 300 ms

       

  Has anyone else seen these errors? Basically our cluster is dead at this point... 50% of responses to clients take longer than 15 seconds! It happens very suddenly - everything looks healthy, then the logs are full of errors :-(

       

      I was slightly worried to see this bug report filed recently: https://issues.jboss.org/browse/ISPN-4615.

       

      Our cache is configured as follows:

       

        ConfigurationBuilder builder = new ConfigurationBuilder();
        if (isClustered()) {
            builder.clustering()
                    .cacheMode(distributedCacheMode) // DIST_ASYNC
                    .hash()
                    .numOwners(distributedCacheOwners); // 1 owner
        }

        builder.locking().concurrencyLevel(concurrency); // 500

        if (cacheInfo.persistence()) {
            builder.persistence()
                    .addStore(MongoStoreConfigurationBuilder.class) // a custom mongodb store implementation
                    .shared(true)
                    .mongoDatabase(db)
                    .jsonStore(jsonStore)
                    .persistentType(type)
                    .metricsRegistry(metrics)
                    .fetchPersistentState(true)
                    .async().enabled(true)
                    .modificationQueueSize(modificationQueueSize)
                    .threadPoolSize(threadPoolSize);
        }

        builder.eviction()
                .maxEntries(defaultCacheSize) // 100,000
                .strategy(EvictionStrategy.LIRS);

       

  Any help tracking down the source of this issue would be massively appreciated, as would advice on tuning options that might alleviate it.

       

      Thanks!

        • 1. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
          rvansa

          At this point, the most helpful thing for investigating that issue would be a dump of all threads' stacks (use jstack or kill -3).
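
          If it's awkward to attach jstack in production, you can also trigger an equivalent dump from inside the JVM via the standard JMX API - a minimal sketch:

              import java.lang.management.ManagementFactory;
              import java.lang.management.ThreadInfo;
              import java.lang.management.ThreadMXBean;

              // Prints every live thread's name, state and full stack trace,
              // similar to the output of jstack / kill -3.
              ThreadMXBean mx = ManagementFactory.getThreadMXBean();
              for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                  System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
                  for (StackTraceElement frame : info.getStackTrace()) {
                      System.out.println("    at " + frame);
                  }
              }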

           

          As a wild guess, I'd ask about your JGroups configuration - do you use the UFC protocol (not recommended with TCP, I think), and if so, what is your UFC.max_credits limit?

           

          Btw, Infinispan has a MongoDB cache store implementation at https://github.com/infinispan/infinispan-cachestore-mongodb - if your implementation is better, contributions are welcome.

          • 2. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
            jugglingcats

            Hi Radim, appreciate the reply...

             

            Below is our JGroups config. We use a custom ping class because we run in EC2 (no UDP multicast), and we use Mongo to coordinate cluster discovery.

             

            When we started, the Mongo cache store implementation hadn't been updated for ISPN 6.x. Also, we have some very specific requirements about how values are persisted - we want objects to be readable in Mongo (for reporting, etc.), so our cache store insists on objects being serializable by Jackson to BSON, munges property objects, uses ObjectId keys, and so on.
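
            The gist of the store's write path is something like this (not our actual code, just the approach, with names simplified):

                import com.fasterxml.jackson.databind.ObjectMapper;
                import com.mongodb.BasicDBObject;
                import com.mongodb.DBCollection;
                import java.util.Map;

                // Convert the cached POJO to a Map via Jackson so the document stays
                // human-readable in Mongo, then save it under its ObjectId key.
                ObjectMapper mapper = new ObjectMapper();
                Map<String, Object> fields = mapper.convertValue(value, Map.class);
                BasicDBObject doc = new BasicDBObject(fields);
                doc.put("_id", key); // key is an org.bson.types.ObjectId
                collection.save(doc);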

             

            <config xmlns="urn:org:jgroups"
                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                    xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd">
                <TCP
                        bind_addr="${jgroups.tcp.address:127.0.0.1}"
                        bind_port="${jgroups.tcp.port:7800}"
                        loopback="true"
                        port_range="30"
                        recv_buf_size="20000000"
                        send_buf_size="640000"
                        max_bundle_size="31k"
                        use_send_queues="true"
                        sock_conn_timeout="300"
                        enable_diagnostics="false"

                        bundler_type="old"

                        thread_pool.enabled="true"
                        thread_pool.min_threads="2"
                        thread_pool.max_threads="30"
                        thread_pool.keep_alive_time="60000"
                        thread_pool.queue_enabled="false"
                        thread_pool.queue_max_size="100"
                        thread_pool.rejection_policy="Discard"

                        oob_thread_pool.enabled="true"
                        oob_thread_pool.min_threads="2"
                        oob_thread_pool.max_threads="30"
                        oob_thread_pool.keep_alive_time="60000"
                        oob_thread_pool.queue_enabled="false"
                        oob_thread_pool.queue_max_size="100"
                        oob_thread_pool.rejection_policy="Discard"

                        internal_thread_pool.enabled="true"
                        internal_thread_pool.min_threads="1"
                        internal_thread_pool.max_threads="10"
                        internal_thread_pool.keep_alive_time="60000"
                        internal_thread_pool.queue_enabled="true"
                        internal_thread_pool.queue_max_size="100"
                        internal_thread_pool.rejection_policy="Discard"
                        />

                <com.evolok.ad.cache.infinispan.MongoPing/>

                <MERGE2 max_interval="30000"
                        min_interval="10000"/>
                <FD_SOCK/>
                <FD timeout="3000" max_tries="5"/>
                <VERIFY_SUSPECT timeout="1500"/>
                <pbcast.NAKACK2
                        use_mcast_xmit="false"
                        xmit_interval="1000"
                        xmit_table_num_rows="100" xmit_table_msgs_per_row="10000"
                        xmit_table_max_compaction_time="10000"
                        max_msg_batch_size="100"/>
                <UNICAST3
                        xmit_table_num_rows="20"
                        xmit_table_msgs_per_row="10000"
                        xmit_table_max_compaction_time="10000"
                        max_msg_batch_size="100"
                        conn_expiry_timeout="0"/>
                <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                               max_bytes="400000"/>
                <pbcast.GMS print_local_addr="false" join_timeout="7000" view_bundling="true"/>
                <tom.TOA/> <!-- TOA is only needed for total order transactions -->
                <UFC max_credits="2m" min_threshold="0.40"/>
                <MFC max_credits="2m" min_threshold="0.40"/>
                <FRAG2 frag_size="30k"/>
                <RSVP timeout="60000" resend_interval="500" ack_on_delivery="false"/>
            </config>

            • 3. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
              rvansa

              Without thread dumps we can't see where the threads are stuck, but the JGroups configuration has only a modest OOB thread pool - I'd recommend raising TCP.oob_thread_pool.max_threads to 500. Usually you won't have anywhere near that many threads running, but if something goes wrong and a few threads get stuck, there should be enough spare threads to keep the node responsive, and eventually the stuck operations can time out.
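
              I.e., in the TCP element of the config you posted, something like:

                  <TCP
                          ...
                          oob_thread_pool.enabled="true"
                          oob_thread_pool.min_threads="2"
                          oob_thread_pool.max_threads="500"
                          ...
                          />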

              • 4. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
                jugglingcats

                Hi Radim, I am guessing you need the thread dumps taken while the problem is actually happening. I will try to get some!

                 

                I am also seeing some other errors (shown below) at times. They don't coincide with the timeout issues and don't seem to affect the operation of the cluster, but they are still worrying.

                 

                These types of errors appear on all nodes, often but not always at around the same time:

                 

                08-09-2014 07:12:13,864 ERROR [OOB-5120,ISPN,vocento-ad3-prod-13816] org.jgroups.blocks.RequestCorrelator  - failed unmarshalling buffer into return value
                java.io.EOFException: Read past end of file
                        at org.jboss.marshalling.SimpleDataInput.eofOnRead(SimpleDataInput.java:126)
                        at org.jboss.marshalling.SimpleDataInput.readUnsignedByteDirect(SimpleDataInput.java:263)
                        at org.jboss.marshalling.SimpleDataInput.readUnsignedByte(SimpleDataInput.java:224)
                        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:213)
                  ...

                08-09-2014 07:12:35,249 ERROR [http-8080-80] org.infinispan.interceptors.InvocationContextInterceptor  - ISPN000136: Execution error
                org.infinispan.commons.CacheException: java.util.ConcurrentModificationException
                        at org.infinispan.commons.util.Util.rewrapAsCacheException(Util.java:581)
                        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:141)
                        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:524)
                        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:281)
                        at org.infinispan.interceptors.distribution.BaseDistributionInterceptor.invokeClusterGetCommandRemotely(BaseDistributionInterceptor.java:129)
                  ...
                Caused by: java.util.ConcurrentModificationException
                        at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
                        at java.util.HashMap$EntryIterator.next(HashMap.java:834)
                        at java.util.HashMap$EntryIterator.next(HashMap.java:832)
                        at org.infinispan.commons.marshall.MarshallUtil.marshallMap(MarshallUtil.java:34)
                        at org.infinispan.marshall.exts.MapExternalizer.writeObject(MapExternalizer.java:53)
                  ...

                08-09-2014 12:53:31,896 ERROR [http-8080-108] org.infinispan.interceptors.InvocationContextInterceptor  - ISPN000136: Execution error
                org.infinispan.commons.CacheException: java.io.EOFException: Read past end of file
                        at org.infinispan.commons.util.Util.rewrapAsCacheException(Util.java:581)
                        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:141)
                        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:524)
                  ...
                Caused by: java.io.EOFException: Read past end of file
                        at org.jboss.marshalling.SimpleDataInput.eofOnRead(SimpleDataInput.java:126)
                        at org.jboss.marshalling.SimpleDataInput.readUnsignedByteDirect(SimpleDataInput.java:263)
                        at org.jboss.marshalling.SimpleDataInput.readUnsignedByte(SimpleDataInput.java:224)
                        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:213)
                        at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:37)
                  ...

                 

                Any info on tracking down the root cause of these would be appreciated.

                 

                Thanks, Alfie.

                • 5. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
                  jugglingcats

                  This definitely happens when other parts of the infrastructure are struggling, e.g. Mongo, although it doesn't seem to take much to tip things into a cascading failure.

                   

                  We are going to increase the thread pool limits, but would also like to decrease the timeout from 15 seconds to something more sensible, like 1 second.

                   

                  Where is this timeout configured? Is it set on transport().distributedSyncTimeout() or somewhere else?

                   

                  Thanks.

                  • 6. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
                    rvansa

                    If you want to change the replication timeout, use ConfigurationBuilder.clustering().sync().replTimeout(). However, 1 second is not that sensible: if the remote node hits a full GC, the call will probably take longer than that, and you'd just end up repeating the request several times. When everything goes smoothly the timeout shouldn't need to fire at all, and when trouble (such as GC) does happen, 1 second won't be enough. Also, make sure that your lock acquisition timeout (ConfigurationBuilder.locking().lockAcquisitionTimeout()) is lower than the replication timeout, so that you can see the real cause instead of hiding everything under replication timeouts.
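
                    For example, a minimal sketch (the values are placeholders to tune):

                        import java.util.concurrent.TimeUnit;
                        import org.infinispan.configuration.cache.ConfigurationBuilder;

                        ConfigurationBuilder builder = new ConfigurationBuilder();
                        // Replication timeout: how long a synchronous remote call may block
                        // before failing with TimeoutException (the default is 15 seconds).
                        builder.clustering().sync().replTimeout(15, TimeUnit.SECONDS);
                        // Keep lock acquisition below the replication timeout so lock contention
                        // surfaces as its own error instead of being hidden by replication timeouts.
                        builder.locking().lockAcquisitionTimeout(10, TimeUnit.SECONDS);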

                     

                    Regarding the errors - I haven't seen those, but a quick Google search around the EOFException suggests that something you're trying to send over the network is not Serializable. And the CME - are you inserting keys containing maps into the cache? If so, are you sure you are not modifying the currently inserted entry from another thread?
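
                    A minimal sketch of the kind of race I mean (cache and key are placeholders):

                        import java.util.HashMap;
                        import java.util.Map;

                        Map<String, String> attrs = new HashMap<String, String>();
                        attrs.put("a", "1");
                        cache.put(key, attrs);   // with DIST_ASYNC the map may still be marshalled on another thread
                        attrs.put("b", "2");     // mutating it meanwhile can make MapExternalizer throw
                                                 // ConcurrentModificationException while iterating the map

                        // Safer: hand the cache a defensive copy and never touch it afterwards
                        cache.put(key, new HashMap<String, String>(attrs));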