Reducing blocking/non-availability time during state transfer after node added/removed (DIST_SYNC)
bjs Nov 15, 2012 6:28 AMWe're doing some basic testing to evaluate the robustness of Infinispan 5.1.6, with 4 machines running a DIST_SYNC configuration (2 owners per key) and 1,000,000 entries (approx 1 GB of Java strings in total), in a single cache, with one of the 4 machines acting as a client continually performing .get() and .replace() operations to measure throughput.
Using a simple programmatic configuration with no jgroups XML file (see below), we're seeing availability pauses (i.e. the client blocks while performing an operation so throughput falls to zero) of about 10 seconds just after the client detects that one of the remote nodes has been killed or restarted. We'd really like to reduce this if possible, ideally we'd like Infinispan to be able to continue to execute operations (even if throughput is a bit lower than normal during state transfer), since I'd expect Infinispan to be able to successfully and immediately service the operation using the other node that contains the key (since it's already detected that the original node was down).
Worse, when we use the default in-the-box etc/jgroups-tcp.xml config file (which I hoped to tweak to improve the performance) the behaviour is radically different and after a node is killed/restarted we see operations block for approx 90 seconds, which is not very available at all. I don't know what differences between the default jgroups-tcp.xml file and the default behaviour when jgroups isn't explicitly configured would cause this. I also tried the etc/config-samples/jgroups-tcp.xml file and that gave a pause of about 60 seconds. With this configuration, we often see "TimeoutException: Timed out waiting for the state transfer to end", e.g.
2012-11-14 17:42:54.812 1.0s INFO [main] : Throughput measurement: 7889
2012-11-14 17:42:56.834 2.0s INFO [main] : Throughput measurement: 1339
2012-11-14 17:42:56.837 INFO [Incoming-2,thug-14972] : InfiniSpan ViewChanged clusterSize=3; addr=thug-14972; old=[vandal-25742, thug-16117, goth-17820, thug-14972]; new=[thug-16117, goth-17820, thug-14972]; id=4
2012-11-14 17:43:50.441 53.6s ERROR [main] : Exception replacing key/value: table_rahj:343703abcd/myuwwkrxnfmqgeebeoapezsdzspmqcxjtgdyxkrpvmwmmpmpylwrkvmeozgboqayhufojcmxghpteqrgfnzdjsjggwxhtnskcwajiwsnlzxiffloprezetukwwhtwxiverrcitacuefyjktxafwaxbdxlixxdwxhjctsalrmgbcuspziiavkadexabdiwxhqjbhgyzdccbexypedctcmvzumetfsaiegjuepkgoxxauidbyflwycilhptaoobmpldegyrtktcflqwlxgjpihcjarszbnuojsimcbolzheqrqjlbqorrepnhuagxqyganbbptjtizbjdfmukyxxbmileaksrbjavldthitldfowirhhruatgshxbxtidyjofgpqrnabgdknitcmfllyppaymgdgfwvvvpjncqhxuaukuhwruwvtdajqopxhacjxvqohkuovpmxoxeveypmwbwfzizedvmxsfhhoslqlukxgoavchvdvgyloealfpunupqkwlqctlulpliywrbgdwbxvtdlgtxlhahcnppkbyxahzbgizpsprwjwiymmxiznckjdrhiujhpemagerzwhavdbqxptyyvxfvicehzhqchghqaaaigsxhhuvarwdfkpistrdevhkgomneqlahxgkgcxlhvykmkflxyhqrpletzpmpmebilnlhavokalutlssyaxalnmmtxmeeyrubmfjwtavbkqcnaaanhzafpwcsrhjasdhqycfutuswyzyodvpbsxyqhuueksvynyimboqjlhdjfhdnwbruxzbvoualqdqlrqdjweuplityexmrfylrjshcpctlmkswaswerngnsyixujjtygzizhhycfjkolhcsckjzypplxyblitgphvkjlgpfexzsfhttjklvlbxpacesklbamhvcwhqchtaymjmklckgklfhpellbxhhukpnktlvuhmmzhrxyqsxuepfcym343703_43707abcd:
org.infinispan.util.concurrent.TimeoutException: Timed out waiting for the state transfer to end
at org.infinispan.interceptors.StateTransferLockInterceptor.handleWithRetries(StateTransferLockInterceptor.java:223)
at org.infinispan.interceptors.StateTransferLockInterceptor.handleWriteCommand(StateTransferLockInterceptor.java:181)
at org.infinispan.interceptors.StateTransferLockInterceptor.visitReplaceCommand(StateTransferLockInterceptor.java:162)
at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:66)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)
at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:132)
at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:91)
at org.infinispan.commands.AbstractVisitor.visitReplaceCommand(AbstractVisitor.java:72)
at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:66)
at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:345)
at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1006)
at org.infinispan.CacheImpl.replace(CacheImpl.java:754)
at org.infinispan.CacheImpl.replace(CacheImpl.java:747)
at org.infinispan.CacheSupport.replace(CacheSupport.java:148)
at com.apama.test.correlator.storage.distributed.provider.AbstractDistStorageProvider.casReplace(AbstractDistStorageProvider.java:26)
at Test.go(Test.java:148)
at Test.main(Test.java:85)
2012-11-14 17:43:50.442 INFO [main] : Throughput measurement: 0
2012-11-14 17:43:51.443 1.0s INFO [main] : Throughput measurement: 9312
Is unresponsiveness/blocking during state transfer the expected/intended behaviour of Infinispan, or is this something that can be configured away?
Our current config looks like this:
GlobalConfiguration globalConfig = new GlobalConfigurationBuilder() .clusteredDefault() .transport().defaultTransport() .addProperty("configurationFile", "etc/jgroups-tcp.xml") // explicitly use default jgroups config file, or comment this line to use default programmatic configuration .clusterName("myCluster") .build(); Configuration config = new ConfigurationBuilder() .clustering() .cacheMode(CacheMode.DIST_SYNC) .l1() .disable() .hash() .numOwners(2) // 2 replicas .build(); mgr = new DefaultCacheManager(globalConfig, config);
Many thanks, Ben