12 Replies Latest reply on Feb 7, 2011 4:09 PM by guytom19

    Timeout during initial state transfer

    sirianni

      What is a reasonable expected time for a state transfer operation (expressed in terms of cache size)?  For a reasonably small cache (~50 entries), I'm seeing a 13-second wait for state transfer.  If I set the timeout below 10 seconds, the state transfer times out with:

       

      {code}
      Caused by: java.util.concurrent.TimeoutException: Could not obtain exclusive processing lock
          at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:71) ~[infinispan-core.jar:4.1.0.FINAL]
          at org.infinispan.statetransfer.StateTransferManagerImpl.generateTransactionLog(StateTransferManagerImpl.java:202) ~[infinispan-
      {code}

       

      This makes me suspect that acquiring the cache lock on the state-providing node accounts for most of the 13-second state transfer time.  What could be causing this long wait?  High contention for the cache lock on the node providing the state?  Any suggestions on how to diagnose the bottleneck and address the problem?

       

      Here is my XML configuration for reference:

      {code:xml}
      <infinispan
          xmlns="urn:infinispan:config:4.1"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="urn:infinispan:config:4.1 http://www.infinispan.org/schemas/infinispan-config-4.1.xsd"
      >
         
          <global>
              <globalJmxStatistics jmxDomain="infinispan.dfm" enabled="true"/>
              <transport transportClass="org.infinispan.remoting.transport.jgroups.JGroupsTransport">
                  <properties>
                      <property name="configurationFile" value="jgroups/infinispan-stack.xml"/>
                  </properties>
              </transport>
          </global>
         
          <default>
              <clustering mode="replication">
                  <async/>
                  <stateRetrieval fetchInMemoryState="true" timeout="20000" />
              </clustering>
              <jmxStatistics enabled="true"/>
              <locking concurrencyLevel="64" useLockStriping="false"/>
          </default>
         
      </infinispan>
      {code}
        • 1. Re: Timeout during initial state transfer
          cbo_

          I suggest you try the following settings, which are working out pretty well for us at this point.  It seems you are interested in a replicated cache using async (which isn't really fully async, btw).  Anyway, try these if you are interested:

           

          in the default section:

               <locking
                      useLockStriping="false"
                />

          either on your named cache section or in your default:
                   <async asyncMarshalling="true" useReplQueue="true" replQueueInterval="100" replQueueMaxElements="100"/>
                   <stateRetrieval timeout="2000000" fetchInMemoryState="true" alwaysProvideInMemoryState="true"/>
          I should mention that some of the above settings would only be valid if you are using 4.1.0.FINAL or newer.
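
          For reference, here is a sketch of how these fragments might sit together inside the <default> section of the configuration posted at the top of the thread. The placement is an assumption based on that configuration, and the attribute values are simply the ones quoted above, not tuned recommendations.

```xml
<default>
    <clustering mode="replication">
        <!-- replication queue settings as quoted above -->
        <async asyncMarshalling="true" useReplQueue="true"
               replQueueInterval="100" replQueueMaxElements="100"/>
        <!-- timeout is in milliseconds -->
        <stateRetrieval timeout="2000000" fetchInMemoryState="true"
                        alwaysProvideInMemoryState="true"/>
    </clustering>
    <jmxStatistics enabled="true"/>
    <locking concurrencyLevel="64" useLockStriping="false"/>
</default>
```

          The same <async> and <stateRetrieval> elements could instead go under a named cache if only some caches need them.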

          • 2. Re: Timeout during initial state transfer
            sirianni

            Thanks for the suggestion.  I am using 4.1.0.FINAL.

             

            I'm curious why you have the stateRetrieval timeout set so high (2,000,000 ms, roughly 33 min).  If I understand correctly, all cache writes are basically blocked during stateRetrieval, so this could effectively halt processing on your cluster for up to about 33 min...

            • 3. Re: Timeout during initial state transfer
              galder.zamarreno

              Eric, why don't you run your test again and, during those 10 seconds, get some thread dumps on both the node starting up and the node serving the state, which is the coordinator of the cluster (the first node started)?

               

              How big is the cluster? Are you sending requests to the node serving the state while requesting the state?

               

              Are you starting many caches in parallel, or only one?

              • 4. Timeout during initial state transfer
                guytom19

                Hi,

                 

                Was this ever resolved?

                 

                We're encountering a very similar issue, as my co-worker mentioned here: http://community.jboss.org/thread/162088?tstart=0

                 

                We have 3 cache instances trying to get their state when a server loads.

                 

                Under very light stress (much less than production...) we encounter this timeout problem.

                 

                Any direction would be helpful, if someone would like to get thread dumps we'll be happy to post them.

                 

                Thanks

                Guy

                • 5. Timeout during initial state transfer
                  sirianni

                  No - we never resolved the issue.  We have since disabled the caches due to this problem and have not yet gotten back around to finding the root cause, as Galder suggested above...

                  • 6. Timeout during initial state transfer
                    guytom19

                    That's worrisome, I'd expect the state transfer to be a basic capability.

                     

                    Do you mean you disabled the caches completely (i.e. stopped using them), or that you somehow disable the cache during the state transfer?

                    • 7. Re: Timeout during initial state transfer
                      sirianni

                      We disabled the state transfer so that we could proceed with development work.

                       

                      {code:xml}
                          <default>
                              <clustering mode="replication">
                                  <async/>
                                  <!-- TODO seeing some timeouts with stateRetrieval - need to diagnose before reenabling
                                       <stateRetrieval fetchInMemoryState="true" timeout="20000" numRetries="1"/>
                                  -->
                              </clustering>
                          </default>
                      {code}

                       

                       

                      We are going to need to fix this before we go to production.  I agree this is troubling.  Hopefully you can make some progress on the issue that will help us out as well.

                      • 8. Timeout during initial state transfer
                        cbo_

                        Sorry I had not responded to your earlier post.  I think the reason my timeout value was set so high is a leftover from when we too were seeing some lengthy state transfer times.  At the moment I can not recall all the small adjustments that led us to where we are, but I recommend you try similar settings as we are getting very respectable state transfers at this point.  In a 2 node replicated configuration using 4.2.0.FINAL I am seeing state transfers of some reasonably wide value objects with the following timings:

                         

                        10,000 entries   ==> 8.27 seconds

                        100,000 entries ==> 42.65 seconds

                         

                        I would like to add that since we are using the replication queue, we are able to modify the cache during state transfer processing.  Those deltas are added to the queue during that time.

                         

                        I will indicate the latest settings we are using in case you want to give this another try:

                         

                              <clustering mode="replication">
                                 <async asyncMarshalling="true" useReplQueue="true" replQueueInterval="10" replQueueMaxElements="100" />
                                 <stateRetrieval timeout="2000000" fetchInMemoryState="false" alwaysProvideInMemoryState="true"/>
                              </clustering>

                         

                         

                        And, leave the useLockStriping="false" as indicated by each of us above.  You can reduce the timeout value on stateRetrieval based on my timings above. 

                         

                        The final thing to investigate, I guess, would be whether you have a "clean" transport cluster (i.e. jgroups).  I recall some things we were doing early on that were causing a sort of interference in the cluster, so we went to a separation.  For example, across 2 machines we had several apps that each have a cache.  I will name the apps using letters A, B, C, and the machines will be numbered 1 and 2 for clarity.  So we want A1 to share its cache with A2.  We want B1 to share with B2, and so on.  To accomplish this we created a separate transport cluster for app A, a separate one for app B, and so on.  This is driven by your transport clusterName= as well as the corresponding jgroups config/file.
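
                        A minimal sketch of that separation, assuming hypothetical cluster names (app-a-cluster, app-b-cluster) and hypothetical JGroups file names. Each fragment belongs in the <global> section of that app's own Infinispan configuration file; A1 and A2 would share the first, B1 and B2 the second.

```xml
<!-- Configuration used by both instances of app A (A1 and A2) -->
<global>
    <transport clusterName="app-a-cluster"
               transportClass="org.infinispan.remoting.transport.jgroups.JGroupsTransport">
        <properties>
            <!-- hypothetical file name; one JGroups stack per app -->
            <property name="configurationFile" value="jgroups/app-a-stack.xml"/>
        </properties>
    </transport>
</global>

<!-- Configuration used by both instances of app B (B1 and B2) -->
<global>
    <transport clusterName="app-b-cluster"
               transportClass="org.infinispan.remoting.transport.jgroups.JGroupsTransport">
        <properties>
            <!-- hypothetical file name; one JGroups stack per app -->
            <property name="configurationFile" value="jgroups/app-b-stack.xml"/>
        </properties>
    </transport>
</global>
```

                        If the stacks use UDP multicast, giving each JGroups file its own multicast address or port is probably also worthwhile, so that one app's nodes do not receive and discard the other app's traffic.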

                         

                        Hope it helps.

                        • 9. Timeout during initial state transfer
                          dror76

                          I'm also suffering from this issue and I would like to understand something about the recent configuration you posted -

                          what is the meaning of "stateRetrieval" with fetchInMemoryState="false" ?

                           

                          Thanks

                          • 10. Timeout during initial state transfer
                            cbo_

                            It was an oversight on my part to post it that way.  We initially set both sides of an app to "false" for this setting, but we override that value to "true" for an application that is coming up and needs to retrieve state.  This is not related to a bug or limitation within Infinispan, but rather just the way we want things to function.  We maintain control over which side of an application is "in charge", and therefore don't want to bother transferring state to the "in charge" side in case it starts up after its partner has entries in the cache.  In that case we call clear() on the cache anyway, so a state transfer would simply waste time.
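
                            Expressed as configuration, the two sides might look roughly like the sketch below. How the override is actually applied is not stated in this thread, so the two variants are shown side by side as an assumption; the attribute values are the ones posted earlier.

```xml
<!-- "In charge" side: does not fetch state on startup, but can still provide it -->
<clustering mode="replication">
    <async asyncMarshalling="true" useReplQueue="true"
           replQueueInterval="10" replQueueMaxElements="100"/>
    <stateRetrieval timeout="2000000" fetchInMemoryState="false"
                    alwaysProvideInMemoryState="true"/>
</clustering>

<!-- Joining side: fetches in-memory state from the partner that is already running -->
<clustering mode="replication">
    <async asyncMarshalling="true" useReplQueue="true"
           replQueueInterval="10" replQueueMaxElements="100"/>
    <stateRetrieval timeout="2000000" fetchInMemoryState="true"
                    alwaysProvideInMemoryState="true"/>
</clustering>
```

                            The alwaysProvideInMemoryState="true" setting is what lets the "in charge" side serve state to its partner even though it never fetches state itself.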

                            • 11. Timeout during initial state transfer
                              manik

                              There are a couple of alternative techniques to state transfer, including using a ClusteredCacheLoader.  This loads state lazily on first access, and gives you immediate startup.
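
                              A rough sketch of that alternative, assuming the loader meant here is org.infinispan.loaders.cluster.ClusterCacheLoader and that it takes a remoteCallTimeout property in milliseconds (the value below is arbitrary). Check both against the documentation for your Infinispan version before relying on them.

```xml
<default>
    <clustering mode="replication">
        <async/>
        <!-- no bulk state transfer on startup -->
        <stateRetrieval fetchInMemoryState="false"/>
    </clustering>
    <loaders>
        <!-- on a local miss, asks the other cluster members for the entry -->
        <loader class="org.infinispan.loaders.cluster.ClusterCacheLoader">
            <properties>
                <!-- assumed value, in milliseconds -->
                <property name="remoteCallTimeout" value="20000"/>
            </properties>
        </loader>
    </loaders>
</default>
```

                              The trade-off is that startup no longer waits on a full transfer, but the first read of each key pays for a remote lookup.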

                              • 12. Timeout during initial state transfer
                                guytom19

                                We'll look into this option.

                                 

                                We already resolved one problem: it seems that one of our caches was constantly empty, and that caused a problem during state transfer. This seems like some edge-case bug.

                                 

                                Anyway, we're still seeing other problems and a VERY slow state transfer, although we use the same configuration Craig suggested above.