1 Reply Latest reply on Nov 15, 2012 9:46 AM by sannegrinovero

    Reducing blocking/non-availability time during state transfer after node added/removed (DIST_SYNC)

    bjs

  We're doing some basic testing to evaluate the robustness of Infinispan 5.1.6. The setup is 4 machines running a DIST_SYNC configuration (2 owners per key) with 1,000,000 entries (approx 1 GB of Java strings in total) in a single cache; one of the 4 machines acts as a client, continually performing .get() and .replace() operations to measure throughput.
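      For reference, the client loop is essentially the following (a simplified sketch of our test, not the actual code — names like `ThroughputProbe` are our own; since Infinispan's `Cache` implements `java.util.concurrent.ConcurrentMap`, the same loop runs unchanged against a `ConcurrentHashMap` stand-in):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class ThroughputProbe {
    // Perform get()+replace() round-trips for roughly `millis` ms and
    // return the number of completed operations in that interval.
    static long measure(ConcurrentMap<String, String> cache, int keys, long millis) {
        long ops = 0;
        long deadline = System.currentTimeMillis() + millis;
        int i = 0;
        while (System.currentTimeMillis() < deadline) {
            String key = "key-" + (i++ % keys);
            String old = cache.get(key);
            if (old != null) {
                cache.replace(key, old, old); // CAS-style replace, as in the test
            }
            ops++;
        }
        return ops;
    }

    public static void main(String[] args) {
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        for (int i = 0; i < 1000; i++) {
            cache.put("key-" + i, "value-" + i);
        }
        System.out.println("Throughput measurement: " + measure(cache, 1000, 200));
    }
}
```

      In the real test the `ConcurrentMap` is the Infinispan `Cache`, so a blocked replicated operation stalls this loop and throughput drops to zero.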

       

      Using a simple programmatic configuration with no JGroups XML file (see below), we're seeing availability pauses of about 10 seconds (i.e. the client blocks while performing an operation, so throughput falls to zero) just after the client detects that one of the remote nodes has been killed or restarted. We'd really like to reduce this if possible. Ideally, Infinispan would continue to execute operations during state transfer (even at somewhat reduced throughput): I'd expect it to be able to service each operation immediately using the other node that owns the key, since it has already detected that the original node is down.
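      In the meantime we're considering a client-side workaround: wrap each cache operation in a bounded retry loop so that a transient timeout is retried rather than propagated. This is only a sketch of our own (`RetryingOps` and its parameters are not an Infinispan API); since Infinispan's TimeoutException is a RuntimeException, the real code would catch only that specific type rather than all RuntimeExceptions:

```java
import java.util.function.Supplier;

public class RetryingOps {
    // Retry `op` up to `maxAttempts` times with a fixed pause between
    // attempts, so a transient timeout during state transfer does not
    // reach the caller. Real code should catch only the cache's
    // TimeoutException, not every RuntimeException.
    static <T> T withRetries(Supplier<T> op, int maxAttempts, long backoffMillis) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw last;
                }
            }
        }
        throw last;
    }
}
```

      Usage would look like `withRetries(() -> cache.replace(key, oldVal, newVal), 5, 500)`, trading latency for availability during the transfer window.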

       

      Worse, when we use the default out-of-the-box etc/jgroups-tcp.xml config file (which I had hoped to tweak to improve performance), the behaviour is radically different: after a node is killed/restarted, operations block for approx 90 seconds, which is not very available at all. I don't know what difference between the default jgroups-tcp.xml file and the default behaviour when JGroups isn't explicitly configured would cause this. I also tried the etc/config-samples/jgroups-tcp.xml file, which gave a pause of about 60 seconds. With this configuration we often see "TimeoutException: Timed out waiting for the state transfer to end", e.g.

      2012-11-14 17:42:54.812    1.0s INFO  [main] : Throughput measurement: 7889

      2012-11-14 17:42:56.834    2.0s INFO  [main] : Throughput measurement: 1339

      2012-11-14 17:42:56.837         INFO  [Incoming-2,thug-14972] : InfiniSpan ViewChanged clusterSize=3; addr=thug-14972; old=[vandal-25742, thug-16117, goth-17820, thug-14972]; new=[thug-16117, goth-17820, thug-14972]; id=4

      2012-11-14 17:43:50.441   53.6s ERROR [main] : Exception replacing key/value: table_rahj:343703abcd/myuwwkrxnfmqgeebeoapezsdzspmqcxjtgdyxkrpvmwmmpmpylwrkvmeozgboqayhufojcmxghpteqrgfnzdjsjggwxhtnskcwajiwsnlzxiffloprezetukwwhtwxiverrcitacuefyjktxafwaxbdxlixxdwxhjctsalrmgbcuspziiavkadexabdiwxhqjbhgyzdccbexypedctcmvzumetfsaiegjuepkgoxxauidbyflwycilhptaoobmpldegyrtktcflqwlxgjpihcjarszbnuojsimcbolzheqrqjlbqorrepnhuagxqyganbbptjtizbjdfmukyxxbmileaksrbjavldthitldfowirhhruatgshxbxtidyjofgpqrnabgdknitcmfllyppaymgdgfwvvvpjncqhxuaukuhwruwvtdajqopxhacjxvqohkuovpmxoxeveypmwbwfzizedvmxsfhhoslqlukxgoavchvdvgyloealfpunupqkwlqctlulpliywrbgdwbxvtdlgtxlhahcnppkbyxahzbgizpsprwjwiymmxiznckjdrhiujhpemagerzwhavdbqxptyyvxfvicehzhqchghqaaaigsxhhuvarwdfkpistrdevhkgomneqlahxgkgcxlhvykmkflxyhqrpletzpmpmebilnlhavokalutlssyaxalnmmtxmeeyrubmfjwtavbkqcnaaanhzafpwcsrhjasdhqycfutuswyzyodvpbsxyqhuueksvynyimboqjlhdjfhdnwbruxzbvoualqdqlrqdjweuplityexmrfylrjshcpctlmkswaswerngnsyixujjtygzizhhycfjkolhcsckjzypplxyblitgphvkjlgpfexzsfhttjklvlbxpacesklbamhvcwhqchtaymjmklckgklfhpellbxhhukpnktlvuhmmzhrxyqsxuepfcym343703_43707abcd:

      org.infinispan.util.concurrent.TimeoutException: Timed out waiting for the state transfer to end

                at org.infinispan.interceptors.StateTransferLockInterceptor.handleWithRetries(StateTransferLockInterceptor.java:223)

                at org.infinispan.interceptors.StateTransferLockInterceptor.handleWriteCommand(StateTransferLockInterceptor.java:181)

                at org.infinispan.interceptors.StateTransferLockInterceptor.visitReplaceCommand(StateTransferLockInterceptor.java:162)

                at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:66)

                at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:116)

                at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:132)

                at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:91)

                at org.infinispan.commands.AbstractVisitor.visitReplaceCommand(AbstractVisitor.java:72)

                at org.infinispan.commands.write.ReplaceCommand.acceptVisitor(ReplaceCommand.java:66)

                at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:345)

                at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1006)

                at org.infinispan.CacheImpl.replace(CacheImpl.java:754)

                at org.infinispan.CacheImpl.replace(CacheImpl.java:747)

                at org.infinispan.CacheSupport.replace(CacheSupport.java:148)

                at com.apama.test.correlator.storage.distributed.provider.AbstractDistStorageProvider.casReplace(AbstractDistStorageProvider.java:26)

                at Test.go(Test.java:148)

                at Test.main(Test.java:85)

      2012-11-14 17:43:50.442         INFO  [main] : Throughput measurement: 0

      2012-11-14 17:43:51.443    1.0s INFO  [main] : Throughput measurement: 9312

       

       

      Is unresponsiveness/blocking during state transfer the expected/intended behaviour of Infinispan, or is this something that can be configured away?

       

      Our current config looks like this:

            GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
                    .clusteredDefault()
                    .transport().defaultTransport()
                    // explicitly use the default JGroups config file, or comment
                    // out this line to use the default programmatic configuration
                    .addProperty("configurationFile", "etc/jgroups-tcp.xml")
                    .clusterName("myCluster")
                    .build();
            Configuration config = new ConfigurationBuilder()
                    .clustering()
                            .cacheMode(CacheMode.DIST_SYNC)
                            .l1()
                                    .disable()
                            .hash()
                                    .numOwners(2) // 2 replicas
                    .build();
            mgr = new DefaultCacheManager(globalConfig, config);
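      We're also experimenting with shortening the relevant timeouts so that failures surface (and can be retried) sooner rather than blocking for tens of seconds. This is only a sketch based on our reading of the 5.1 ConfigurationBuilder API (I may have the method names slightly wrong), and the timeout values below are guesses to experiment with, not recommendations:

```java
            Configuration config = new ConfigurationBuilder()
                    .clustering()
                            .cacheMode(CacheMode.DIST_SYNC)
                            .sync()
                                    .replTimeout(10000)  // fail sync RPCs faster than the default
                            .stateTransfer()
                                    .timeout(60000)      // bound how long operations wait on state transfer
                            .l1()
                                    .disable()
                            .hash()
                                    .numOwners(2)
                    .build();
```

      Shorter timeouts don't remove the blocking, of course; they only bound how long a single operation can stall before we get an exception we can handle.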

       

      Many thanks, Ben