I suggest you try the following settings, which are working out pretty well for us at this point. It seems you are interested in a replicated cache using async (which isn't really fully async, by the way). Anyway, try these if you are interested.
In the default section:
<locking useLockStriping="false"/>
Either in your named cache section or in your default:
<async asyncMarshalling="true" useReplQueue="true" replQueueInterval="100" replQueueMaxElements="100"/>
<stateRetrieval timeout="2000000" fetchInMemoryState="true" alwaysProvideInMemoryState="true"/>
I should mention that some of the above settings are only valid if you are using 4.1.0.FINAL or newer.
Thanks for the suggestion. I am using 4.1.0.FINAL.
I'm curious why you have the stateRetrieval timeout set so high (2000000 ms is roughly 33 minutes). If I understand correctly, all cache writes are basically blocked during stateRetrieval, so this could effectively halt processing on your cluster for over half an hour...
Eric, why don't you run your test again and, during those 10 seconds, take some thread dumps on both the node starting up and the node serving the state, which is the coordinator of the cluster (the first node started)?
How big is the cluster? Are you sending requests to the node serving the state while requesting the state?
Are you starting many caches in parallel, or only one?
Was this ever resolved?
We're encountering a very similar issue as my co-worker mentioned here http://community.jboss.org/thread/162088?tstart=0
We have 3 cache instances trying to get their state when a server loads.
Under very light stress (much less than production...) we encounter this timeout problem.
Any direction would be helpful, if someone would like to get thread dumps we'll be happy to post them.
No - we never resolved the issue. We have since disabled the caches due to this problem and have not yet gotten back around to attempting to root cause as Galder suggested above...
That's worrisome; I'd expect state transfer to be a basic capability.
You mean you disabled the caches completely (i.e., stopped using them)? Or did you somehow disable the cache during the state transfer?
We disabled the state transfer so that we could proceed with development work.
<!-- TODO: seeing some timeouts with stateRetrieval - need to diagnose before re-enabling
<stateRetrieval fetchInMemoryState="true" timeout="20000" numRetries="1"/>
-->
We are going to need to fix this before we go to production. I agree this is troubling. Hopefully you can make some progress on the issue that will help us out as well.
Sorry I had not responded to your earlier post. I think my timeout value was set so high as a leftover from when we too were seeing some lengthy state transfer times. At the moment I cannot recall all the small adjustments that led us to where we are, but I recommend you try settings similar to ours, as we are getting very respectable state transfers at this point. In a 2-node replicated configuration using 4.2.0.FINAL, I am seeing state transfers of some reasonably wide value objects with the following timings:
10,000 entries ==> 8.27 seconds
100,000 entries ==> 42.65 seconds
I would like to add that since we are using the replication queue, we are able to modify the cache during the state transfer processing. Those deltas are added to the queue during that time.
I will indicate the latest settings we are using in case you want to give this another try:
<async asyncMarshalling="true" useReplQueue="true" replQueueInterval="10" replQueueMaxElements="100" />
<stateRetrieval timeout="2000000" fetchInMemoryState="false" alwaysProvideInMemoryState="true"/>
And leave useLockStriping="false" as indicated by each of us above. You can reduce the timeout value on stateRetrieval based on my timings above.
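Putting the pieces from this thread together, the default section would look roughly like this. This is a sketch based on the snippets above against the 4.x schema; size the timeout to your own measured transfer times:

```xml
<default>
   <!-- Disable lock striping, as suggested earlier in the thread -->
   <locking useLockStriping="false"/>
   <clustering mode="replication">
      <!-- Batch replication through the queue so writes can proceed
           while a joining node is still receiving state -->
      <async asyncMarshalling="true" useReplQueue="true"
             replQueueInterval="10" replQueueMaxElements="100"/>
      <!-- Timeout is in ms; 2000000 is very generous, reduce it based
           on the transfer timings measured above -->
      <stateRetrieval timeout="2000000" fetchInMemoryState="false"
                      alwaysProvideInMemoryState="true"/>
   </clustering>
</default>
```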
The final thing to investigate, I guess, would be whether you have a "clean" transport cluster (i.e. JGroups). I recall some things we were doing early on that caused interference in the cluster, so we moved to a separation.
For example, across 2 machines we had several apps that each have a cache. I will name the apps using letters A, B, C, and the machines will be numbered 1 and 2 for clarity. We want A1 to share its cache with A2, B1 to share with B2, and so on. To accomplish this, we created a separate transport cluster for app A, another for app B, and so on. This is driven by your transport clusterName= as well as the corresponding JGroups config/file.
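Concretely, the separation described above comes down to giving each app its own global transport section. A sketch (the cluster name and JGroups file name here are made up for illustration):

```xml
<!-- App A's configuration: its own cluster name and JGroups stack,
     so A1/A2 traffic never mixes with app B's cluster -->
<global>
   <transport clusterName="appA-cluster">
      <properties>
         <!-- Each app points at its own JGroups configuration file -->
         <property name="configurationFile" value="jgroups-appA.xml"/>
      </properties>
   </transport>
</global>
```

App B's configuration would use a different clusterName (and typically different ports in its JGroups file), so the two clusters discover only their own members.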
Hope it helps.
I'm also suffering from this issue and I would like to understand something about the recent configuration you posted -
what is the meaning of "stateRetrieval" with fetchInMemoryState="false" ?
It was an oversight on my part to post it that way. We initially set both sides of an app to "false" for this setting, but we override the value to "true" for the application instance that is coming up and needs to retrieve state. This is not related to a bug or limitation in Infinispan; it is just the way we want things to function. We maintain control over which side of an application is "in charge" and therefore don't want to bother transferring state to the "in charge" side in case it starts up after its partner already has entries in the cache. In that case we call clear() on the cache anyway, so a state transfer would simply waste time.
There are a couple of alternative techniques to state transfer, including using a ClusteredCacheLoader. This loads state lazily on first access, and gives you immediate startup.
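For reference, the ClusteredCacheLoader is configured roughly like this in 4.x. A sketch; the remoteCallTimeout value is just an illustration:

```xml
<!-- Instead of eager stateRetrieval, lazily pull missing entries from
     other nodes on first access via the ClusteredCacheLoader -->
<loaders>
   <loader class="org.infinispan.loaders.cluster.ClusteredCacheLoader">
      <properties>
         <!-- How long (ms) to wait for a remote node to answer a load request -->
         <property name="remoteCallTimeout" value="20000"/>
      </properties>
   </loader>
</loaders>
```

The trade-off is that the node starts immediately, but each first read of a key pays a remote call.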
We'll look into this option.
We already resolved one problem: it seems that one of our caches was constantly empty, and that caused a problem during state transfer. This looks like some edge-case bug.
Anyway, we're still seeing other problems and a VERY slow state transfer, even though we use the same configuration Craig suggested above.