6 Replies Latest reply on Sep 15, 2014 3:24 AM by rvansa

    Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)

    jugglingcats

  Hi, we have a three-node ISPN cluster running in production. It sustains around 80 cache reads per second, of which around 20 are loaded from the backing store (MongoDB). Most entries are modified and written back to the cache, with async write-behind to the store. ISPN version is 6.0.0.Final.

       

  The cluster runs fine for a few days but then, typically during periods of higher than normal load, it starts to fail badly, with all three nodes reporting errors:

       

      org.infinispan.util.concurrent.TimeoutException: Timed out waiting for 15 seconds for valid responses from any of [Sender{address=xxxxxxx-ad3-prod-12362, responded=false}].

       

      We can telnet from the node in question to the remote node on the jgroups port, so connectivity isn't an issue.

       

      There is another log entry that appears much less often, and I'm not sure if it's related...

       

      04-09-2014 16:36:16,635 WARN  [INT-8,ISPN,vocento-ad2-prod-28582] org.jgroups.protocols.TCP  - Discarding message because TCP send_queue is full and hasn't been releasing for 300 ms

       

  Has anyone else seen these errors? Basically our cluster is dead at this point... 50% of responses to clients take longer than 15 seconds! It happens very suddenly - everything looks healthy, then the logs are full of errors :-(

       

      I was slightly worried to see this bug report filed recently: https://issues.jboss.org/browse/ISPN-4615.

       

      Our cache is configured as follows:

       

        ConfigurationBuilder builder = new ConfigurationBuilder();
        if (isClustered()) {
            builder.clustering()
                    .cacheMode(distributedCacheMode) // DIST_ASYNC
                    .hash()
                    .numOwners(distributedCacheOwners); // 1 owner
        }

        builder.locking().concurrencyLevel(concurrency); // 500

        if (cacheInfo.persistence()) {
            builder.persistence()
                    .addStore(MongoStoreConfigurationBuilder.class) // a custom mongodb store implementation
                    .shared(true)
                    .mongoDatabase(db)
                    .jsonStore(jsonStore)
                    .persistentType(type)
                    .metricsRegistry(metrics)
                    .fetchPersistentState(true)
                    .async().enabled(true)
                    .modificationQueueSize(modificationQueueSize)
                    .threadPoolSize(threadPoolSize);
        }

        builder.eviction()
                .maxEntries(defaultCacheSize) // 100,000
                .strategy(EvictionStrategy.LIRS);

       

  Any help tracking down the source of this issue would be massively appreciated, as would advice on tuning options that might alleviate it.

       

      Thanks!

        • 1. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
          rvansa

          At this point, the most helpful thing for investigating that issue would be a dump of all threads' stacks (use jstack or kill -3).
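
          If it's awkward to attach jstack in production, you can also trigger an equivalent dump from inside the JVM via the standard JMX API - a minimal sketch:

              import java.lang.management.ManagementFactory;
              import java.lang.management.ThreadInfo;
              import java.lang.management.ThreadMXBean;

              // Prints every live thread's name, state and full stack trace,
              // similar to the output of jstack / kill -3.
              ThreadMXBean mx = ManagementFactory.getThreadMXBean();
              for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                  System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
                  for (StackTraceElement frame : info.getStackTrace()) {
                      System.out.println("    at " + frame);
                  }
              }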

           

          As a wild guess, I'd ask about your JGroups configuration - do you use the UFC protocol (not recommended with TCP, I think), and if so, what is your UFC.max_credits limit?

           

          Btw, Infinispan has a MongoDB cache store implementation at https://github.com/infinispan/infinispan-cachestore-mongodb - if your implementation is better, contributions are welcome.

          • 2. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
            jugglingcats

            Hi Radim, appreciate the reply...

             

            Below is our JGroups config. We use a custom ping class because we run in EC2 (no UDP multicast), and we use Mongo to coordinate cluster discovery.

             

            When we started, the Mongo cache store implementation hadn't been updated for ISPN 6.x. Also, we have some very specific requirements about how values are persisted - we want objects to be readable in Mongo (for reporting, etc.), so our cache store insists on objects being serializable by Jackson to BSON, munges property objects, uses ObjectId keys, and so on.
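
            The gist of the store's write path is something like this (not our actual code, just the approach, with names simplified):

                import com.fasterxml.jackson.databind.ObjectMapper;
                import com.mongodb.BasicDBObject;
                import com.mongodb.DBCollection;
                import java.util.Map;

                // Convert the cached POJO to a Map via Jackson so the document stays
                // human-readable in Mongo, then save it under its ObjectId key.
                ObjectMapper mapper = new ObjectMapper();
                Map<String, Object> fields = mapper.convertValue(value, Map.class);
                BasicDBObject doc = new BasicDBObject(fields);
                doc.put("_id", key); // key is an org.bson.types.ObjectId
                collection.save(doc);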

             

            <config xmlns="urn:org:jgroups"
                    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                    xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/JGroups-3.4.xsd">
                <TCP
                        bind_addr="${jgroups.tcp.address:127.0.0.1}"
                        bind_port="${jgroups.tcp.port:7800}"
                        loopback="true"
                        port_range="30"
                        recv_buf_size="20000000"
                        send_buf_size="640000"
                        max_bundle_size="31k"
                        use_send_queues="true"
                        sock_conn_timeout="300"
                        enable_diagnostics="false"

                        bundler_type="old"

                        thread_pool.enabled="true"
                        thread_pool.min_threads="2"
                        thread_pool.max_threads="30"
                        thread_pool.keep_alive_time="60000"
                        thread_pool.queue_enabled="false"
                        thread_pool.queue_max_size="100"
                        thread_pool.rejection_policy="Discard"

                        oob_thread_pool.enabled="true"
                        oob_thread_pool.min_threads="2"
                        oob_thread_pool.max_threads="30"
                        oob_thread_pool.keep_alive_time="60000"
                        oob_thread_pool.queue_enabled="false"
                        oob_thread_pool.queue_max_size="100"
                        oob_thread_pool.rejection_policy="Discard"

                        internal_thread_pool.enabled="true"
                        internal_thread_pool.min_threads="1"
                        internal_thread_pool.max_threads="10"
                        internal_thread_pool.keep_alive_time="60000"
                        internal_thread_pool.queue_enabled="true"
                        internal_thread_pool.queue_max_size="100"
                        internal_thread_pool.rejection_policy="Discard"
                        />

                <com.evolok.ad.cache.infinispan.MongoPing/>

                <MERGE2 max_interval="30000"
                        min_interval="10000"/>
                <FD_SOCK/>
                <FD timeout="3000" max_tries="5"/>
                <VERIFY_SUSPECT timeout="1500"/>
                <pbcast.NAKACK2
                        use_mcast_xmit="false"
                        xmit_interval="1000"
                        xmit_table_num_rows="100" xmit_table_msgs_per_row="10000"
                        xmit_table_max_compaction_time="10000"
                        max_msg_batch_size="100"/>
                <UNICAST3
                        xmit_table_num_rows="20"
                        xmit_table_msgs_per_row="10000"
                        xmit_table_max_compaction_time="10000"
                        max_msg_batch_size="100"
                        conn_expiry_timeout="0"/>
                <pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
                               max_bytes="400000"/>
                <pbcast.GMS print_local_addr="false" join_timeout="7000" view_bundling="true"/>
                <tom.TOA/> <!-- TOA is only needed for total order transactions -->
                <UFC max_credits="2m" min_threshold="0.40"/>
                <MFC max_credits="2m" min_threshold="0.40"/>
                <FRAG2 frag_size="30k"/>
                <RSVP timeout="60000" resend_interval="500" ack_on_delivery="false"/>
            </config>

            • 3. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
              rvansa

              Without thread dumps we can't see where the threads are stuck, but the JGroups configuration has only a modest OOB thread pool - I'd recommend raising TCP.oob_thread_pool.max_threads to 500. Usually you won't have anywhere near that many threads running, but if something goes wrong and a few threads get stuck, there should be enough spare threads to keep the node responsive, and eventually the stuck operations can time out.
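
              I.e., in the TCP element of the config you posted, something like:

                  <TCP
                          ...
                          oob_thread_pool.enabled="true"
                          oob_thread_pool.min_threads="2"
                          oob_thread_pool.max_threads="500"
                          ...
                          />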

              • 4. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
                jugglingcats

                Hi Radim, I am guessing you need the thread dumps taken while the problem is actually happening. I will try to get some!

                 

                I am also seeing some other errors (shown below) at times. They don't coincide with the timeout issues and don't seem to affect the operation of the cluster, but they are still worrying.

                 

                These types of errors appear on all nodes, often but not always at around the same time:

                 

                08-09-2014 07:12:13,864 ERROR [OOB-5120,ISPN,vocento-ad3-prod-13816] org.jgroups.blocks.RequestCorrelator  - failed unmarshalling buffer into return value
                java.io.EOFException: Read past end of file
                        at org.jboss.marshalling.SimpleDataInput.eofOnRead(SimpleDataInput.java:126)
                        at org.jboss.marshalling.SimpleDataInput.readUnsignedByteDirect(SimpleDataInput.java:263)
                        at org.jboss.marshalling.SimpleDataInput.readUnsignedByte(SimpleDataInput.java:224)
                        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:213)
                  ...

                08-09-2014 07:12:35,249 ERROR [http-8080-80] org.infinispan.interceptors.InvocationContextInterceptor  - ISPN000136: Execution error
                org.infinispan.commons.CacheException: java.util.ConcurrentModificationException
                        at org.infinispan.commons.util.Util.rewrapAsCacheException(Util.java:581)
                        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:141)
                        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:524)
                        at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:281)
                        at org.infinispan.interceptors.distribution.BaseDistributionInterceptor.invokeClusterGetCommandRemotely(BaseDistributionInterceptor.java:129)
                  ...
                Caused by: java.util.ConcurrentModificationException
                        at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
                        at java.util.HashMap$EntryIterator.next(HashMap.java:834)
                        at java.util.HashMap$EntryIterator.next(HashMap.java:832)
                        at org.infinispan.commons.marshall.MarshallUtil.marshallMap(MarshallUtil.java:34)
                        at org.infinispan.marshall.exts.MapExternalizer.writeObject(MapExternalizer.java:53)
                  ...

                08-09-2014 12:53:31,896 ERROR [http-8080-108] org.infinispan.interceptors.InvocationContextInterceptor  - ISPN000136: Execution error
                org.infinispan.commons.CacheException: java.io.EOFException: Read past end of file
                        at org.infinispan.commons.util.Util.rewrapAsCacheException(Util.java:581)
                        at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:141)
                        at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:524)
                  ...
                Caused by: java.io.EOFException: Read past end of file
                        at org.jboss.marshalling.SimpleDataInput.eofOnRead(SimpleDataInput.java:126)
                        at org.jboss.marshalling.SimpleDataInput.readUnsignedByteDirect(SimpleDataInput.java:263)
                        at org.jboss.marshalling.SimpleDataInput.readUnsignedByte(SimpleDataInput.java:224)
                        at org.jboss.marshalling.river.RiverUnmarshaller.doReadObject(RiverUnmarshaller.java:213)
                        at org.jboss.marshalling.AbstractObjectInput.readObject(AbstractObjectInput.java:37)
                  ...

                 

                Any info on tracking down the root cause of these would be appreciated.

                 

                Thanks, Alfie.

                • 5. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
                  jugglingcats

                  This definitely happens when other parts of the infrastructure are struggling, e.g. Mongo, although it doesn't seem to take much to tip things into a cascading failure.

                   

                  We are going to increase the thread pool limits, but would also like to decrease the timeout from 15 seconds to something more sensible, like 1 second.

                   

                  Where is this timeout configured? Is it set on transport().distributedSyncTimeout() or somewhere else?

                   

                  Thanks.

                  • 6. Re: Cluster issue with distributed cache (org.infinispan.util.concurrent.TimeoutException)
                    rvansa

                    If you want to change the replication timeout, use ConfigurationBuilder.clustering().sync().replTimeout(). However, 1 second is not that sensible: if the remote node hits a full GC, the call will probably take longer than that, and you'd just end up repeating the request several times. When everything goes smoothly the timeout shouldn't need to fire at all, and when trouble (such as GC) does happen, 1 second won't be enough. Also, make sure that your lock acquisition timeout (ConfigurationBuilder.locking().lockAcquisitionTimeout()) is lower than the replication timeout, so that you can see the real cause instead of hiding everything under replication timeouts.
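
                    For example, a minimal sketch (the values are placeholders to tune):

                        import java.util.concurrent.TimeUnit;
                        import org.infinispan.configuration.cache.ConfigurationBuilder;

                        ConfigurationBuilder builder = new ConfigurationBuilder();
                        // Replication timeout: how long a synchronous remote call may block
                        // before failing with TimeoutException (the default is 15 seconds).
                        builder.clustering().sync().replTimeout(15, TimeUnit.SECONDS);
                        // Keep lock acquisition below the replication timeout so lock contention
                        // surfaces as its own error instead of being hidden by replication timeouts.
                        builder.locking().lockAcquisitionTimeout(10, TimeUnit.SECONDS);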

                     

                    Regarding the errors - I haven't seen those, but a quick Google search around the EOFException suggests that something you're trying to send over the network is not Serializable. And the CME - are you inserting keys containing maps into the cache? If so, are you sure you are not modifying the currently inserted entry from another thread?
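
                    A minimal sketch of the kind of race I mean (cache and key are placeholders):

                        import java.util.HashMap;
                        import java.util.Map;

                        Map<String, String> attrs = new HashMap<String, String>();
                        attrs.put("a", "1");
                        cache.put(key, attrs);   // with DIST_ASYNC the map may still be marshalled on another thread
                        attrs.put("b", "2");     // mutating it meanwhile can make MapExternalizer throw
                                                 // ConcurrentModificationException while iterating the map

                        // Safer: hand the cache a defensive copy and never touch it afterwards
                        cache.put(key, new HashMap<String, String>(attrs));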