11 Replies Latest reply on Jan 27, 2014 2:12 PM by jfrost001

CacheException: Initial state transfer timed out for cache

dex80526 Dec 30, 2013 3:11 PM

I have a 3 node cluster with replication configuration. I am seeing the "Initial state transfer timed out exception" during the initial start up of the cluster. This is definitely related to the number of entries in the cache store (we used rdmb/apache derby as persistent store). The time out exception happens when there are about 1 million cache entries.

I already have a very big timeout value in the configure (see below).

To me this is a critical scaling issue. I saw this timeout happens more frequently when the size of cluster (number of nodes) goes up to 5 or higher.

Anyone has any suggestion to work around this?

Here are the configuration for the cache:

....

</default>

</clustering>

....

1. Re: CacheException: Initial state transfer timed out for cache

mircea.markus Jan 9, 2014 10:00 AM (in response to dex80526)

try increase the chunksSize?
Actions
2. Re: CacheException: Initial state transfer timed out for cache

dan.berindei Jan 16, 2014 6:29 AM (in response to mircea.markus)

Can you try removing the RSVP protocol from the JGroups configuration?

We have noticed a problem with state transfer timing out because RSVP ACK messages are blocked by other commands: [ISPN-3878] Unhandled failing ST cancel leads to deadlock, and we think removing the RSVP protocol should fix it. If you're not using UNICAST3 yet, I would recommend upgrading to that as well.
Actions
3. Re: CacheException: Initial state transfer timed out for cache

dex80526 Jan 16, 2014 3:43 PM (in response to dan.berindei)

Just give a update on what I ended up for now.
I ended up to set the "awaitInitialTransfer="false" for the stateTransfer. With this change, the initial startup of a cluster is faster and reliable (i.e. not seeing the timeout issue). Of course, I took the trade of the cache data sync.
I also increased the chuckSize to 10K.

Please do let me know if you see any issues with my configuration (awaitInitialTransfer=false).

Dan, ISPN-3878 seems to match our problem. I'll try UNICAST3 when time allows to.

Thanks.
Actions
4. Re: CacheException: Initial state transfer timed out for cache

jfrost001 Jan 21, 2014 8:19 PM (in response to dex80526)
I may need to open a new thread. I seem to be having similar problems. First let me say I tried setting awaitIntialTransfer=false and I commented out RSVP in the jgroups config. And I have UNICAST3 enabled. No change.

When my cache has 500-1000 entries and a new node joins the cluster I get the Timeout error on the coordinator:
Caused by: org.jgroups.TimeoutException: timeout sending message to jfrost-multi2-25342 at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:419) at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:353) at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:167)

If I have something in excess of a thousand(I don't have the exact number). I also get a bunch of entries like those below before the timeout occurs.:

137133 [ERROR] UDP: - JGRP000029: jfrost-multi-59483: failed sending message to jfrost-multi2-25342 (30077 bytes): java.lang.Exception: dest=/192.168.10.193:30198 (30080 bytes), headers: FRAG2: [id=1, frag_id=22, num_frags=63], UNICAST3: DATA, seqno=29, conn_id=1, UDP: [channel_name=ISPN]

This doesn't bode well. This doesn't seem like a complex case and I'm having trouble. I can post my jgroups config file but it's pretty much the jgroups-udp.xml file supplies as a sample with Infinispan with the tweaks mentioned above. If I can't get this simple case to work I may have to abandon the project.

The really hard thing to deal with is the data seems to making it to the new node. The error shows up sometime after the data was transferred. Is this only a problem with initial state transfer or something that could show up anytime?
Actions
5. Re: CacheException: Initial state transfer timed out for cache

pruivo Jan 23, 2014 10:39 AM (in response to jfrost001)

Hi Jack,

do you have a log file with the problem? can you upload it together with the configuration used (infinispan + jgroups)?

Cheers,
Pedro
Actions
6. Re: Re: CacheException: Initial state transfer timed out for cache

jfrost001 Jan 23, 2014 6:35 PM (in response to pruivo)
Hey Pedro,

Thanks for noticing my problem.

For this run I was using Infinispan 6.0.0 embedded in a Tomcat 6 container running on a FreeBSD server.

Attached you should find a reasonably short log. It's running pretty much the defaults from the log4j.xml sample I found on the Infinispan site. It is relatively easy to run the test so if you would like a different set of logging parameters let me know.

In this particular case there were only 100 entries in the cache. So the buffer problem does not seem to be present and as far as I can tell is not directly associated with this time out. The key and value for the cache are very simple. The key is a serializable pair made up of two Integers and the value is an Integer.

It is true that the timeout seems to occur in RSVP if it is be used. RSVP is not in this particular run. You can also see that awaitInitialResponse is also set false as per earlier recommendation as well as a 10K chunkSize. I have the Transaction timeout set very high trying to ease debugging. I would use something more reasonable in a live deployment.

Below you will find the code patch I'm using for Infinispan config and I'm attaching the Jgroups XML config file.

I've been checking both Inifinspan 5.3.0 and 6.0.0. I5.3 uses jgroups 3.3.1 while 6.0.0 ships with jgroups 3.4.1. The timeout seems to be common to both.

Let me know if you need any other info.

/Jack

DefaultCacheManager cacheManager = new DefaultCacheManager( GlobalConfigurationBuilder.defaultClusteredBuilder() .globalJmxStatistics() .enable() .transport() .addProperty("configurationFile", "jgroups-ee-udp.xml") .build(), new ConfigurationBuilder() .jmxStatistics() .enable() .clustering() .cacheMode(CacheMode.REPL_ASYNC) .stateTransfer() .fetchInMemoryState(true) .awaitInitialTransfer(false) .chunkSize(10240) .timeout(2, TimeUnit.MINUTES) .transaction() .transactionMode(TransactionMode.TRANSACTIONAL) .transactionManagerLookup(new JBossStandaloneJTAManagerLookup()) .completedTxTimeout(DateTimeUtil.MILLISECONDS_PER_MINUTE * 20) .build());

infinispan.log.zip 2.6 KB

jgroups-ee-udp.xml 2.8 KB
Actions
7. Re: Re: CacheException: Initial state transfer timed out for cache

pruivo Jan 24, 2014 6:03 AM (in response to jfrost001)

Hi Jack,

Unfortunately I'm unable to identify the problem from the logs. Can you reproduce the test with the following changes?

* TRACE log in org.infinispan.remoting package and org.jgroups.protocol.UDP class
* It would be nice if you can take a thread dump of both servers during that 2 min window
* (optional) make each server to log in different files

Thanks

Pedro
Actions
8. Re: Re: Re: CacheException: Initial state transfer timed out for cache

jfrost001 Jan 24, 2014 3:04 PM (in response to pruivo)
Hey Pedro,

No problem. I think I have the info you want. It wasn't quite clear if you want the UDP separate from the infinispan logging or your really meant servers. But that's what I did. There are two nodes in the test. The zip file contains an infinispan log, a udp log and a threaddump I triggered about 1 minute into the timeout period.

Just to recap, Node1 comes up creates the cache and loads some entries into the cache. After pre-load is over and node1 is stable I start node2. Timeout is about 2 minutes later as per the configuration value. Timeout is on Node1 the cluster coordinator.

In getting the files It still seems to me like there is a cache entry size/count dependency. In a few cases where entries have been <= 100 the timeout doesn't seem to occur. Then there is a zone where the timeout occurs but all else seems fine. Finally if I load up a large sample, >=5000 I'm hitting some kind of send failure which appears on the surface to be buffering related. But I'm worried this is a red herring as the 5.3 release reports "No buffers available" which is known to sometimes mask other low level problems. 6.0 seems to be getting the same error but the error message logged is of a slightly different form that doesn't report the cause message.

Anyway, hope these files have something interesting. Let me know if you need more.

/Jack

timeout-data.zip 26.2 KB
Actions
9. Re: Re: Re: CacheException: Initial state transfer timed out for cache

dex80526 Jan 24, 2014 3:22 PM (in response to jfrost001)

I am interested to see what root cause of this is too. We have seen similar issues in our deployments since ISPN 5.1 times, and the problem definitely depends upon the size of cluster and cache entries. In other words, this happens more often in bigger clusters (4 or 5 nodes) and big number of cache entries. Our configuration uses TCP.
Actions
10. Re: Re: Re: Re: CacheException: Initial state transfer timed out for cache

pruivo Jan 27, 2014 5:28 AM (in response to jfrost001)
Hi Jack,

I took a look at the logs and I think that is a network problem. I noticed that node2 is not receiving any messages from node1 and the latter is always trying to retry. Do you have some kind of firewall? Also, I've noticed that the small messages (<1 KB) are passing correctly (note that the state transfer messages are > 30 KB).

My suggestion is to use UUPerf in that 2 nodes to try to find out the problem. It mimics the state transfer.

java -cp /path/to/jgroups.jar org.jgroups.tests.perf.UUPerf -props /path/to/jgroups-ee-udp.xml

When it starts, it shows some options and you can configure the number of messages to send (default 1) and the size of them (default 4.5 MB).

Other suggestion that I can give is to increase the buffers in jgroups configuration file and/or in the operating system.

if you more need help let me know.

Cheers,
Pedro
Actions
11. Re: Re: Re: Re: CacheException: Initial state transfer timed out for cache

jfrost001 Jan 27, 2014 2:12 PM (in response to pruivo)

Thanks Pedro.

I've had our sysadmin check the buffer settings but I'll try the test program and see if anything else comes to light. I also had the jgroups buffers turned up higher at one point but was worried that I had over done it. I'll make them a little larger and see what happens.

It's starting to seem like my two problems are related. As I mentioned when there are lots of cache entries(> 5000) I start getting buffer errors from Infinispan/jgroups. It's seems like there is a process that is using buffers and not returning them as the starvation seems to get worse as time system keeps trying to send messages. Seems like that could be linked to a process that isn't running properly.

I'll get back with any results.

/Jack
Actions

Go to original post