11 Replies Latest reply on Jan 27, 2014 2:12 PM by Jack Frost

    CacheException: Initial state transfer timed out for cache

    dex chen Novice

      I have a 3 node cluster with replication configuration. I am seeing the "Initial state transfer timed out exception" during the initial start up of the cluster. This is definitely related to the number of entries in the cache store (we used rdmb/apache derby as persistent store).  The time out exception happens when there are about 1 million cache entries.

      I already have a very big timeout value in the configure (see below).

       

      To me this is a critical scaling issue. I saw this timeout happens more frequently when the size of cluster (number of nodes) goes up to 5 or higher.

       

      Anyone has any suggestion to work around this?

       

      Here are the configuration for the cache:

      ....

      <default>

           <locking concurrencyLevel="5000" isolationLevel="READ_COMMITTED" lockAcquisitionTimeout="60000" useLockStriping="false" writeSkewCheck="false"/>

            <transaction cacheStopTimeout="30000" eagerLockSingleNode="false" lockingMode="OPTIMISTIC" syncCommitPhase="false" syncRollbackPhase="false" transactionManagerLookupClass="org.infinispan.transaction.lookup.JBossStandaloneJTAManagerLookup" useEagerLocking="false"/>

       

            <deadlockDetection enabled="true" spinDuration="1000"/>

            <jmxStatistics enabled="true"/>

       

          </default>

         <namedCache name="KEYCHAIN">

            <clustering mode="replication">

               <stateTransfer chunkSize="10000" fetchInMemoryState="true" timeout="400000"/>

               <async asyncMarshalling="true" replQueueInterval="1000" replQueueMaxElements="1000" useReplQueue="true"/>

       

            </clustering>

            <expiration lifespan="-1" maxIdle="-1" wakeUpInterval="-1"/>

       

            <loaders passivation="false" preload="false" shared="false">

                 <store class="org.infinispan.loaders.jdbc.stringbased.JdbcStringBasedCacheStore" fetchPersistentState="false" purgeOnStartup="false">

       

                    <properties>

                      <property name="stringsTableNamePrefix" value="ISPN_STRING_TABLE"/>

                      <property name="idColumnName" value="ID_COLUMN"/>

                      <property name="dataColumnName" value="DATA_COLUMN"/>

                      <property name="timestampColumnName" value="TIMESTAMP_COLUMN"/>

                      <property name="timestampColumnType" value="BIGINT"/>

                      <property name="connectionFactoryClass" value="org.infinispan.loaders.jdbc.connectionfactory.ManagedConnectionFactory"/>

                      <property name="datasourceJndiLocation" value="java:comp/env/jdbc/UP_DB"/>

       

                      <property name="idColumnType" value="VARCHAR(255)"/>

                      <property name="dataColumnType" value="BLOB"/>

                      <property name="dropTableOnExit" value="false"/>

                      <property name="createTableOnStart" value="true"/>

      ....

        • 2. Re: CacheException: Initial state transfer timed out for cache
          Dan Berindei Expert

          Can you try removing the RSVP protocol from the JGroups configuration?

           

          We have noticed a problem with state transfer timing out because RSVP ACK messages are blocked by other commands: [ISPN-3878] Unhandled failing ST cancel leads to deadlock, and we think removing the RSVP protocol should fix it. If you're not using UNICAST3 yet, I would recommend upgrading to that as well.

          • 3. Re: CacheException: Initial state transfer timed out for cache
            dex chen Novice

            Just give a update on what I ended up for now.

            I ended up to set the "awaitInitialTransfer="false" for the stateTransfer.  With this change, the initial startup of a cluster is faster and reliable (i.e. not seeing the timeout issue). Of course, I took the trade of the cache data sync.

            I also increased the chuckSize to 10K.

             

            Please do let me know if you see any issues with my configuration (awaitInitialTransfer=false).

             

            Dan, ISPN-3878 seems to match our problem. I'll try UNICAST3 when time allows to.

             

            Thanks.

            • 4. Re: CacheException: Initial state transfer timed out for cache
              Jack Frost Newbie

              I may need to open a new thread.  I seem to be having similar problems.  First let me say I tried setting awaitIntialTransfer=false and I commented out RSVP in the jgroups config.  And I have UNICAST3 enabled.  No change.

               

              When my cache has 500-1000 entries and a new node joins the cluster I get the Timeout error on the coordinator:

              Caused by: org.jgroups.TimeoutException: timeout sending message to jfrost-multi2-25342
                      at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:419)
                      at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:353)
                      at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:167)
              

               

              If I have something in excess of a thousand(I don't have the exact number). I also get a bunch of entries like those below before the timeout occurs.:

               

              137133 [ERROR] UDP: - JGRP000029: jfrost-multi-59483: failed sending message to jfrost-multi2-25342 (30077 bytes): java.lang.Exception: dest=/192.168.10.193:30198 (30080 bytes), headers: FRAG2: [id=1, frag_id=22, num_frags=63], UNICAST3: DATA, seqno=29, conn_id=1, UDP: [channel_name=ISPN]
              

               

              This doesn't bode well.  This doesn't seem like a complex case and I'm having trouble.  I can post my jgroups config file but it's pretty much the jgroups-udp.xml file supplies as a sample with Infinispan with the tweaks mentioned above.  If I can't get this simple case to work I may have to abandon the project.

               

              The really hard thing to deal with is the data seems to making it to the new node.   The error shows up sometime after the data was transferred.  Is this only a problem with initial state transfer or something that could show up anytime?

              • 5. Re: CacheException: Initial state transfer timed out for cache
                Pedro Ruivo Novice

                Hi Jack,

                 

                do you have a log file with the problem? can you upload it together with the configuration used (infinispan + jgroups)?

                 

                Cheers,

                Pedro

                • 6. Re: Re: CacheException: Initial state transfer timed out for cache
                  Jack Frost Newbie

                  Hey Pedro,

                   

                  Thanks for noticing my problem.

                   

                  For this run I was using Infinispan 6.0.0 embedded in a Tomcat 6 container running on a FreeBSD server.

                   

                  Attached you should find a reasonably short log.  It's running pretty much the defaults from the log4j.xml sample I found on the Infinispan site.  It is relatively easy to run the test so if you would like a different set of logging parameters let me know.

                   

                  In this particular case there were only 100 entries in the cache.  So the buffer problem does not seem to be present and as far as I can tell is not directly associated with this time out.  The key and value for the cache are very simple. The key is a serializable pair made up of two Integers and the value is an Integer.

                   

                  It is true that the timeout seems to occur in RSVP if it is be used.  RSVP is not in this particular run.  You can also see that awaitInitialResponse is also set false as per earlier recommendation as well as a 10K chunkSize.  I have the Transaction timeout set very high trying to ease debugging.  I would use something more reasonable in a live deployment.

                   

                  Below you will find the code patch I'm using for Infinispan config and I'm attaching the Jgroups XML config file.

                   

                  I've been checking both Inifinspan 5.3.0 and 6.0.0.  I5.3 uses jgroups  3.3.1 while 6.0.0 ships with jgroups 3.4.1.  The timeout seems to be common to both. 

                   

                  Let me know if you need any other info.

                   

                  /Jack

                   


                   

                        DefaultCacheManager cacheManager = new DefaultCacheManager(
                           GlobalConfigurationBuilder.defaultClusteredBuilder()
                              .globalJmxStatistics()
                                 .enable()
                              .transport()
                                 .addProperty("configurationFile", "jgroups-ee-udp.xml")
                              .build(),
                           new ConfigurationBuilder()
                              .jmxStatistics()
                                 .enable()
                              .clustering()
                                 .cacheMode(CacheMode.REPL_ASYNC)
                                 .stateTransfer()
                                    .fetchInMemoryState(true)
                                    .awaitInitialTransfer(false)
                                    .chunkSize(10240)
                                    .timeout(2, TimeUnit.MINUTES)
                              .transaction()
                                 .transactionMode(TransactionMode.TRANSACTIONAL)
                                 .transactionManagerLookup(new JBossStandaloneJTAManagerLookup())
                                 .completedTxTimeout(DateTimeUtil.MILLISECONDS_PER_MINUTE * 20)
                              .build());
                  
                  
                  • 7. Re: Re: CacheException: Initial state transfer timed out for cache
                    Pedro Ruivo Novice

                    Hi Jack,

                     

                    Unfortunately I'm unable to identify the problem from the logs. Can you reproduce the test with the following changes?

                     

                    * TRACE log in org.infinispan.remoting package and org.jgroups.protocol.UDP class

                    * It would be nice if you can take a thread dump of both servers during that 2 min window

                    * (optional) make each server to log in different files


                    Thanks


                    Pedro

                    • 8. Re: Re: Re: CacheException: Initial state transfer timed out for cache
                      Jack Frost Newbie

                      Hey Pedro,

                       

                      No problem.  I think I have the info you want.  It wasn't quite clear if you want the UDP separate from the infinispan logging or your really meant servers.  But that's what I did.    There are two nodes in the test.  The zip file contains an infinispan log, a udp log and a threaddump I triggered about 1 minute into the timeout period.

                       

                      Just to recap,  Node1 comes up creates the cache and loads some entries into the cache.  After pre-load is  over and node1 is stable I start node2.  Timeout is about 2 minutes later as per the configuration value.  Timeout is on Node1 the cluster coordinator.

                       

                      In getting the files It still seems to me like there is a cache entry size/count dependency.  In a few cases where entries have been <= 100 the timeout doesn't seem to occur.  Then there is a zone where the timeout occurs but all else seems fine.  Finally if I load up a large sample, >=5000 I'm hitting some kind of send failure which appears on the surface to be buffering related.  But I'm worried this is a red herring as the 5.3 release reports "No buffers available" which is known to sometimes mask other low level problems.  6.0 seems to be getting the same error but the error message logged is of a slightly different form that doesn't report the cause message. 

                       

                      Anyway, hope these files have something interesting.  Let me know if you need more.

                       

                      /Jack

                      • 9. Re: Re: Re: CacheException: Initial state transfer timed out for cache
                        dex chen Novice

                        I am interested to see what root cause of this is too.  We have seen similar issues in our deployments since ISPN 5.1 times, and the problem definitely depends upon the size of cluster and cache entries. In other words, this happens more often in bigger clusters (4 or 5 nodes) and big number of cache entries. Our configuration uses TCP.

                        • 10. Re: Re: Re: Re: CacheException: Initial state transfer timed out for cache
                          Pedro Ruivo Novice

                          Hi Jack,

                           

                          I took a look at the logs and I think that is a network problem. I noticed that node2 is not receiving any messages from node1 and the latter is always trying to retry. Do you have some kind of firewall? Also, I've noticed that the small messages (<1 KB) are passing correctly (note that the state transfer messages are > 30 KB).

                           

                          My suggestion is to use UUPerf in that 2 nodes to try to find out the problem. It mimics the state transfer.

                           

                          java -cp /path/to/jgroups.jar org.jgroups.tests.perf.UUPerf -props /path/to/jgroups-ee-udp.xml  
                          

                           

                          When it starts, it shows some options and you can configure the number of messages to send (default 1) and the size of them (default 4.5 MB).

                           

                          Other suggestion that I can give is to increase the buffers in jgroups configuration file and/or in the operating system.

                           

                          if you more need help let me know.

                           

                          Cheers,

                          Pedro

                          • 11. Re: Re: Re: Re: CacheException: Initial state transfer timed out for cache
                            Jack Frost Newbie

                            Thanks Pedro.

                             

                            I've had our sysadmin check the buffer settings but I'll try the test program and see if anything else comes to light.  I also had the jgroups buffers turned up higher at one point but was worried that I had over done it.  I'll make them a little larger and see what happens.

                             

                            It's starting to seem like my two problems are related.  As I mentioned when there are lots of cache entries(> 5000) I start getting buffer errors from Infinispan/jgroups.   It's seems like there is a process that is using buffers and not returning them as the starvation seems to get worse as time system keeps trying to send messages.   Seems like that could be linked to a process that isn't running properly.

                             

                            I'll get back with any results.

                             

                            /Jack