12 Replies Latest reply on Feb 7, 2011 4:09 PM by guytom19

Timeout during initial state transfer

sirianni Oct 4, 2010 9:31 PM

What is a reasonable expected time for a state transfer operation (expressed in terms of cache size)? For a reasonably small cache (~50 entries), I'm seeing a 13 second wait time for state transfer. If I set the timeout below 10 seconds, the state transfer times out at:

{code}
Caused by: java.util.concurrent.TimeoutException: Could not obtain exclusive processing lock
    at org.infinispan.remoting.transport.jgroups.JGroupsDistSync.acquireProcessingLock(JGroupsDistSync.java:71) ~[infinispan-core.jar:4.1.0.FINAL]
    at org.infinispan.statetransfer.StateTransferManagerImpl.generateTransactionLog(StateTransferManagerImpl.java:202) ~[infinispan-
{code}

This makes me suspect that the acquiring of the cache lock on the state-providing node is taking up the majority of the 13 second state transfer time. What could be causing this long wait time? High contention for the cache lock on the node providing the state? Any suggestions on how I should go about diagnosing the bottleneck and addressing this problem?

Here is my XML configuration for reference:

{code:xml}
<infinispan
    xmlns="urn:infinispan:config:4.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="urn:infinispan:config:4.1 http://www.infinispan.org/schemas/infinispan-config-4.1.xsd"
>
    
    <global>
        <globalJmxStatistics jmxDomain="infinispan.dfm" enabled="true"/>
        <transport transportClass="org.infinispan.remoting.transport.jgroups.JGroupsTransport">
            <properties>
                <property name="configurationFile" value="jgroups/infinispan-stack.xml"/>
            </properties>
        </transport>
    </global>
    
    <default>
        <clustering mode="replication">
            <async/>
            <stateRetrieval fetchInMemoryState="true" timeout="20000" />
        </clustering>
        <jmxStatistics enabled="true"/>
        <locking concurrencyLevel="64" useLockStriping="false"/>
    </default>
    
</infinispan>
{code}

1. Re: Timeout during initial state transfer

cbo_ Oct 5, 2010 8:01 AM (in response to sirianni)

I suggest you may try the following settings which are working out pretty well for us at this point. Seems you are interested in a replicated cache and using async (which isn't really fully async btw). Anyway, try these if you are interested:

in the default section:

{code:xml}
<locking useLockStriping="false"/>
{code}

either on your named cache section or in your default:

{code:xml}
<async asyncMarshalling="true" useReplQueue="true" replQueueInterval="100" replQueueMaxElements="100"/>
<stateRetrieval timeout="2000000" fetchInMemoryState="true" alwaysProvideInMemoryState="true"/>
{code}

I should mention that some of the above settings would only be valid if you are using 4.1.0.FINAL or newer.

2. Re: Timeout during initial state transfer

sirianni Oct 5, 2010 8:59 AM (in response to cbo_)

Thanks for the suggestion. I am using 4.1.0.FINAL.

I'm curious why you have the stateRetrieval timeout set so high (2000000 ms, roughly 33 min). If I understand correctly, all cache writes are basically blocked during state retrieval, so this could effectively halt processing on your cluster for up to half an hour...
3. Re: Timeout during initial state transfer

galder.zamarreno Oct 15, 2010 12:14 PM (in response to sirianni)

Eric, why don't you run your test again and during those 10 seconds, get some thread dumps both on the node starting up and the node serving the state which is the coordinator of the cluster (first node started)?

How big is the cluster? Are you sending requests to the node serving the state while requesting the state?

Are you starting many caches in parallel, or only one?
4. Timeout during initial state transfer

guytom19 Feb 6, 2011 11:31 AM (in response to sirianni)

Hi,

Was this ever resolved?

We're encountering a very similar issue as my co-worker mentioned here http://community.jboss.org/thread/162088?tstart=0

We have 3 cache instances trying to get their state when a server loads.

Under very light stress (much less than production...) we encounter this timeout problem.

Any direction would be helpful, if someone would like to get thread dumps we'll be happy to post them.

Thanks
Guy
5. Timeout during initial state transfer

sirianni Feb 6, 2011 12:14 PM (in response to guytom19)

No - we never resolved the issue. We have since disabled the caches due to this problem and have not yet gotten back around to attempting to root cause as Galder suggested above...
6. Timeout during initial state transfer

guytom19 Feb 6, 2011 1:45 PM (in response to sirianni)

That's worrisome, I'd expect the state transfer to be a basic capability.

Do you mean you disabled the caches completely (i.e. stopped using them), or that you somehow disable the cache during the state transfer?

7. Re: Timeout during initial state transfer

sirianni Feb 7, 2011 8:25 AM (in response to guytom19)

We disabled the state transfer so that we could proceed with development work.

{code:xml}
    <default>
        <clustering mode="replication">
            <async/>
            <!-- TODO seeing some timeouts with stateRetrieval - need to diagnose before reenabling
                 <stateRetrieval fetchInMemoryState="true" timeout="20000" numRetries="1"/> 
            -->
        </clustering>
    </default>
{code}

We are going to need to fix this before going to production. I agree this is troubling. Hopefully you can make some progress on the issue that will help us out as well.

8. Timeout during initial state transfer

cbo_ Feb 7, 2011 9:06 AM (in response to sirianni)

Sorry I had not responded to your earlier post. I think the reason my timeout value was set so high is a leftover from when we too were seeing some lengthy state transfer times. At the moment I can not recall all the small adjustments that led us to where we are, but I recommend you try similar settings as we are getting very respectable state transfers at this point. In a 2 node replicated configuration using 4.2.0.FINAL I am seeing state transfers of some reasonably wide value objects with the following timings:

10,000 entries   ==> 8.27 seconds
100,000 entries ==> 42.65 seconds

I would like to add that since we are using the replication queue, we are able to modify the cache during state transfer processing. Those deltas are added to the queue during that time.

I will indicate the latest settings we are using in case you want to give this another try:

{code:xml}
<clustering mode="replication">
    <async asyncMarshalling="true" useReplQueue="true" replQueueInterval="10" replQueueMaxElements="100"/>
    <stateRetrieval timeout="2000000" fetchInMemoryState="false" alwaysProvideInMemoryState="true"/>
</clustering>
{code}

And, leave the useLockStriping="false" as indicated by each of us above. You can reduce the timeout value on stateRetrieval based on my timings above.

The final thing to investigate, I guess, would be whether you have a "clean" transport cluster (i.e. jgroups). I recall some things we were doing early on that were causing interference in the cluster, so we went to a separation. For example, across 2 machines we had several apps that each have a cache. I will name the apps using letters A, B, C, and the machines will be numbered 1 and 2 for clarity. We want A1 to share its cache with A2, B1 to share with B2, and so on. To accomplish this we created a separate transport cluster for app A, a separate one for app B, and so on. This is driven by your transport clusterName= as well as the corresponding jgroups config/file.
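
As a rough sketch of that separation (the cluster names and jgroups file names below are made up for illustration, not taken from our actual setup), the global section for app A on machines 1 and 2 might look like this:

{code:xml}
<!-- App A on machines 1 and 2: hypothetical cluster name and stack file -->
<global>
    <transport clusterName="app-A-cluster"
               transportClass="org.infinispan.remoting.transport.jgroups.JGroupsTransport">
        <properties>
            <property name="configurationFile" value="jgroups/app-A-stack.xml"/>
        </properties>
    </transport>
</global>
{code}

and app B on the same two machines would use its own name and stack file:

{code:xml}
<!-- App B on machines 1 and 2: hypothetical cluster name and stack file -->
<global>
    <transport clusterName="app-B-cluster"
               transportClass="org.infinispan.remoting.transport.jgroups.JGroupsTransport">
        <properties>
            <property name="configurationFile" value="jgroups/app-B-stack.xml"/>
        </properties>
    </transport>
</global>
{code}

If the stacks use multicast, it is probably also worth giving each jgroups file its own multicast address or port, so that the clusters do not receive (and then discard) each other's traffic.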

Hope it helps.
9. Timeout during initial state transfer

dror76 Feb 7, 2011 10:24 AM (in response to cbo_)

I'm also suffering from this issue and I would like to understand something about the recent configuration you posted -
what is the meaning of "stateRetrieval" with fetchInMemoryState="false" ?

Thanks
10. Timeout during initial state transfer

cbo_ Feb 7, 2011 10:33 AM (in response to dror76)

It was an oversight on my part to post it that way. We set both sides of an app initially to "false" for this setting, but we override that value to "true" in the case of an application that is coming up and needs to retrieve state. This is not related to a bug or limitation within Infinispan, but rather just the way we want things to function. We are maintaining control over which side of an application is "in charge" and therefore don't want to bother transferring state to the "in charge" side in case it starts up after its partner has entries in the cache. In that case we call clear() on the cache anyway, so a state transfer would simply waste time.
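
To illustrate the two roles, here is how the settings would look if they were written out as static XML. This is only a sketch: in practice the override described above happens at startup rather than in two separate files, and the timeout value is simply carried over from the earlier post. The "in charge" side never pulls state but still serves it:

{code:xml}
<!-- "In charge" side: does not fetch state on startup, but provides it to joiners -->
<clustering mode="replication">
    <async asyncMarshalling="true" useReplQueue="true" replQueueInterval="10" replQueueMaxElements="100"/>
    <stateRetrieval timeout="2000000" fetchInMemoryState="false" alwaysProvideInMemoryState="true"/>
</clustering>
{code}

while the side that is coming up and needs the existing entries ends up with the equivalent of:

{code:xml}
<!-- Joining side: fetches in-memory state from its partner on startup -->
<clustering mode="replication">
    <async asyncMarshalling="true" useReplQueue="true" replQueueInterval="10" replQueueMaxElements="100"/>
    <stateRetrieval timeout="2000000" fetchInMemoryState="true" alwaysProvideInMemoryState="true"/>
</clustering>
{code}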
11. Timeout during initial state transfer

manik Feb 7, 2011 1:47 PM (in response to sirianni)

There are a couple of alternative techniques to state transfer, including using a ClusteredCacheLoader. This loads state lazily on first access, and gives you immediate startup.
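
A minimal sketch of what that could look like in the 4.x XML schema, assuming the loader meant here is org.infinispan.loaders.cluster.ClusterCacheLoader. The remoteCallTimeout value is only an example, and the rest of the default section is copied from the configuration in the original post:

{code:xml}
<default>
    <clustering mode="replication">
        <async/>
        <!-- No eager state transfer; entries are looked up on peers on demand -->
        <stateRetrieval fetchInMemoryState="false"/>
    </clustering>
    <loaders>
        <loader class="org.infinispan.loaders.cluster.ClusterCacheLoader">
            <properties>
                <!-- Example value: how long (ms) to wait for a peer to answer a lookup -->
                <property name="remoteCallTimeout" value="20000"/>
            </properties>
        </loader>
    </loaders>
</default>
{code}

The trade-off is that the first read of each key on a freshly started node pays for a remote call instead of the whole cache being copied up front.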
12. Timeout during initial state transfer

guytom19 Feb 7, 2011 4:09 PM (in response to sirianni)

We'll look into this option.

We already resolved one problem: it seems that one of our caches was constantly empty, and that caused a problem during state transfer. This seems like some edge case bug.

Anyway, we're still seeing other problems and a VERY slow state transfer, although we use the same configuration Craig suggested above.
