1 Reply Latest reply on Nov 15, 2012 9:46 AM by sannegrinovero

removed (DIST_SYNC)

bjs Nov 15, 2012 6:28 AM

We're doing some basic testing to evaluate the robustness of Infinispan 5.1.6, with 4 machines running a DIST_SYNC configuration (2 owners per key) and 1,000,000 entries (approx 1 GB of Java strings in total), in a single cache, with one of the 4 machines acting as a client continually performing .get() and .replace() operations to measure throughput.

Using a simple programmatic configuration with no jgroups XML file (see below), we're seeing availability pauses (i.e. the client blocks while performing an operation so throughput falls to zero) of about 10 seconds just after the client detects that one of the remote nodes has been killed or restarted. We'd really like to reduce this if possible, ideally we'd like Infinispan to be able to continue to execute operations (even if throughput is a bit lower than normal during state transfer), since I'd expect Infinispan to be able to successfully and immediately service the operation using the other node that contains the key (since it's already detected that the original node was down).

Worse, when we use the default in-the-box etc/jgroups-tcp.xml config file (which I hoped to tweak to improve the performance) the behaviour is radically different and after a node is killed/restarted we see operations block for approx 90 seconds, which is not very available at all. I don't know what differences between the default jgroups-tcp.xml file and the default behaviour when jgroups isn't explicitly configured would cause this. I also tried the etc/config-samples/jgroups-tcp.xml file and that gave a pause of about 60 seconds. With this configuration, we often see "TimeoutException: Timed out waiting for the state transfer to end", e.g.

2012-11-14 17:42:54.812 1.0s INFO [main] : Throughput measurement: 7889

2012-11-14 17:42:56.834 2.0s INFO [main] : Throughput measurement: 1339

2012-11-14 17:42:56.837 INFO [Incoming-2,thug-14972] : InfiniSpan ViewChanged clusterSize=3; addr=thug-14972; old=[vandal-25742, thug-16117, goth-17820, thug-14972]; new=[thug-16117, goth-17820, thug-14972]; id=4

2012-11-14 17:43:50.441 53.6s ERROR [main] : Exception replacing key/value: table_rahj:343703abcd/myuwwkrxnfmqgeebeoapezsdzspmqcxjtgdyxkrpvmwmmpmpylwrkvmeozgboqayhufojcmxghpteqrgfnzdjsjggwxhtnskcwajiwsnlzxiffloprezetukwwhtwxiverrcitacuefyjktxafwaxbdxlixxdwxhjctsalrmgbcuspziiavkadexabdiwxhqjbhgyzdccbexypedctcmvzumetfsaiegjuepkgoxxauidbyflwycilhptaoobmpldegyrtktcflqwlxgjpihcjarszbnuojsimcbolzheqrqjlbqorrepnhuagxqyganbbptjtizbjdfmukyxxbmileaksrbjavldthitldfowirhhruatgshxbxtidyjofgpqrnabgdknitcmfllyppaymgdgfwvvvpjncqhxuaukuhwruwvtdajqopxhacjxvqohkuovpmxoxeveypmwbwfzizedvmxsfhhoslqlukxgoavchvdvgyloealfpunupqkwlqctlulpliywrbgdwbxvtdlgtxlhahcnppkbyxahzbgizpsprwjwiymmxiznckjdrhiujhpemagerzwhavdbqxptyyvxfvicehzhqchghqaaaigsxhhuvarwdfkpistrdevhkgomneqlahxgkgcxlhvykmkflxyhqrpletzpmpmebilnlhavokalutlssyaxalnmmtxmeeyrubmfjwtavbkqcnaaanhzafpwcsrhjasdhqycfutuswyzyodvpbsxyqhuueksvynyimboqjlhdjfhdnwbruxzbvoualqdqlrqdjweuplityexmrfylrjshcpctlmkswaswerngnsyixujjtygzizhhycfjkolhcsckjzypplxyblitgphvkjlgpfexzsfhttjklvlbxpacesklbamhvcwhqchtaymjmklckgklfhpellbxhhukpnktlvuhmmzhrxyqsxuepfcym343703_43707abcd:

org.infinispan.util.concurrent.TimeoutException: Timed out waiting for the state transfer to end

at org.infinispan.interceptors.StateTransferLockInterceptor.handleWithRetries(StateTransferLockInterceptor.java:223)

at org.infinispan.interceptors.StateTransferLockInterceptor.handleWriteCommand(StateTransferLockInterceptor.java:181)

at org.infinispan.interceptors.StateTransferLockInterceptor.visitReplaceCommand(StateTransferLockInterceptor.java:162)

at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:66)

at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)

at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:132)

at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:91)

at org.infinispan.commands.AbstractVisitor.visitReplaceCommand(AbstractVisitor.java:72)

at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:66)

at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:345)

at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1006)

at org.infinispan.CacheImpl.replace(CacheImpl.java:754)

at org.infinispan.CacheImpl.replace(CacheImpl.java:747)

at org.infinispan.CacheSupport.replace(CacheSupport.java:148)

at com.apama.test.correlator.storage.distributed.provider.AbstractDistStorageProvider.casReplace(AbstractDistStorageProvider.java:26)

at Test.go(Test.java:148)

at Test.main(Test.java:85)

2012-11-14 17:43:50.442 INFO [main] : Throughput measurement: 0

2012-11-14 17:43:51.443 1.0s INFO [main] : Throughput measurement: 9312

Is unresponsiveness/blocking during state transfer the expected/intended behaviour of Infinispan, or is this something that can be configured away?

Our current config looks like this:

            GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
                              .clusteredDefault()
                              .transport().defaultTransport()
                              .addProperty("configurationFile", "etc/jgroups-tcp.xml") // explicitly use default jgroups config file, or comment this line to use default programmatic configuration
                              .clusterName("myCluster")
                              .build();
          Configuration config = new ConfigurationBuilder()
                              .clustering()
                                        .cacheMode(CacheMode.DIST_SYNC)
                                        .l1()
                                                  .disable()
                                        .hash()
                                                  .numOwners(2) // 2 replicas
                              .build();
          mgr = new DefaultCacheManager(globalConfig, config);

Many thanks, Ben

1. Re: Reducing blocking/non-availability time during state transfer after node added/removed (DIST_SYNC)

sannegrinovero Nov 15, 2012 9:46 AM (in response to bjs)

Hi Ben,
that's a limitation of Infinispan 5.1: the process of rebalancing entries which is triggered after a new cluster view is installed is called state transfer. The most interesting new feature of Infinispan 5.2 is what we called NBST: Non-blocking-state-transfer.

5.2 wasn't released yet as a final but you can test the latest beta.

Here you can find the design documentation about the new feature:
https://community.jboss.org/wiki/Non-BlockingStateTransferV2

You might still experience someslowdowns after a node is killed as the network will be under stress by the state transfer being performed, but the applications won't be blocked.
Actions