7 Replies Latest reply on May 10, 2017 5:26 PM by purushos

    Issues with Infinispan Partition Handling

    purushos

      Hi,

       

      We have an Infinispan server cluster (cluster size 3). We have defined a cache container and a replicated cache with partition handling enabled in the default cluster configuration file (clustered.xml). We observed that when we shut down 2 of the Infinispan servers and bring them back up, they fail to rejoin the cluster with a state transfer timeout error. The state transfer timeout configured for the cache is 5 minutes. The error occurs even when the cache is empty, so increasing the timeout should not help.

       

      We also verified that the issue occurs only when partition handling is enabled. The Infinispan server version which we have deployed is 8.2.6.

       

      We use TCPPING as the discovery protocol to form the cluster.

       

      Cache configuration:

       

          <transport lock-timeout="60000"/>

          <replicated-cache name="default" mode="SYNC" batching="true">

              <partition-handling enabled="true"/>

              <locking isolation="REPEATABLE_READ"/>

              <state-transfer timeout="300000"/>

          </replicated-cache>

       

      The following exception is observed in the nodes which fail to rejoin the cluster:

       

      27-Apr-2017 22:21:34,985 ERROR [org.jboss.msc.service.fail] (MSC service thread 1-7) <:> MSC000001: Failed to start service jboss.datagrid-infinispan.vsd.default: org.jboss.msc.service.StartException in service jboss.datagrid-infinispan.vsd.default: Failed to start service

              at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1904) [jboss-msc-1.2.6.Final.jar:1.2.6.Final]

              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_111]

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_111]

              at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_111]

      Caused by: org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete() throws java.lang.Exception on object of type StateTransferManagerImpl

              at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:172)

              at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:859)

              at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:628)

              at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:617)

              at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:542)

              at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:238)

              at org.infinispan.cache.impl.CacheImpl.start(CacheImpl.java:862)

              at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:635)

              at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:585)

              at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:451)

              at org.infinispan.manager.impl.AbstractDelegatingEmbeddedCacheManager.getCache(AbstractDelegatingEmbeddedCacheManager.java:133)

              at org.infinispan.server.infinispan.SecurityActions$5.run(SecurityActions.java:130)

              at org.infinispan.server.infinispan.SecurityActions$5.run(SecurityActions.java:127)

              at org.infinispan.security.Security.doPrivileged(Security.java:76)

              at org.infinispan.server.infinispan.SecurityActions.doPrivileged(SecurityActions.java:63)

              at org.infinispan.server.infinispan.SecurityActions.startCache(SecurityActions.java:135)

              at org.jboss.as.clustering.infinispan.subsystem.CacheService.start(CacheService.java:86)

              at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1948) [jboss-msc-1.2.6.Final.jar:1.2.6.Final]

              at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1881) [jboss-msc-1.2.6.Final.jar:1.2.6.Final]

              ... 3 more

      Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache default on vsd-kmurthy-set2-node1.mv.nuagenetworks.net

              at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:217)

              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_111]

              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_111]

              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_111]

              at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_111]

              at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:168)

              ... 21 more

        • 1. Re: Issues with Infinispan Partition Handling
          pruivo

          Hi,

           

          Could you attach your clustered.xml file? I'll take a look.

           

          Cheers,

          Pedro

          • 2. Re: Issues with Infinispan Partition Handling
            purushos

            Hi Pedro,

             

            Please find the configuration file attached.

             

            Thanks,

            Purush

            • 3. Re: Issues with Infinispan Partition Handling
              purushos

              Hi Pedro,

               

              Please ignore the configuration which I uploaded earlier. Please refer to the attached configuration, which has partition handling enabled in all our custom caches. The problem is observed when partition handling is enabled in our cache. Sorry about the confusion.

               

              Thanks,

              Purush

              • 4. Re: Issues with Infinispan Partition Handling
                pruivo

                Hi,

                 

                I had some time today and I checked your configuration and I was able to reproduce it locally.

                I've created a JIRA to track this bug: [ISPN-7800] Cluster always in Degraded Mode - JBoss Issue Tracker

                 

                Feel free to comment here or in the JIRA if you have more info or questions.

                 

                Cheers,

                Pedro

                • 5. Re: Issues with Infinispan Partition Handling
                  purushos

                  Hi Pedro,

                   

                  Thanks! Are there any workarounds for this? We would prefer to enable partition handling to ensure data consistency. However, if there is a way for the cache client to have a listener that reliably monitors the cluster size, we can implement a check on the cache client to avoid cache updates when the current cluster size is less than half of the actual (expected) cluster size. With an embedded cache client, we could monitor the cluster size using ViewChanged and Merged listeners.
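
                  A minimal sketch of the client-side guard described above, assuming the expected cluster size is known up front. QuorumGuard and allowsWrites are hypothetical names, not part of Infinispan. The sketch uses a strict majority, which matches the condition above for odd cluster sizes such as 3 but also blocks writes at exactly half for even sizes. In an embedded client, the current size would presumably come from a ViewChanged or Merged listener callback.

```java
/**
 * Hypothetical client-side guard: allows cache updates only while the
 * current cluster view contains a strict majority of the expected nodes.
 */
final class QuorumGuard {
    private final int expectedClusterSize;

    QuorumGuard(int expectedClusterSize) {
        if (expectedClusterSize <= 0) {
            throw new IllegalArgumentException("expectedClusterSize must be positive");
        }
        this.expectedClusterSize = expectedClusterSize;
    }

    /**
     * @param currentViewSize number of members in the latest cluster view
     * @return true if more than half of the expected nodes are present
     */
    boolean allowsWrites(int currentViewSize) {
        // Integer arithmetic avoids rounding issues with size / 2.
        return 2 * currentViewSize > expectedClusterSize;
    }

    public static void main(String[] args) {
        QuorumGuard guard = new QuorumGuard(3);
        for (int size = 0; size <= 3; size++) {
            System.out.println(size + " of 3 nodes: writes allowed = " + guard.allowsWrites(size));
        }
    }
}
```

                  With a cluster of 3, this permits updates with 2 or 3 members in the view and blocks them with 1 or 0.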

                   

                  Also, if you suspect that this is a bug introduced in 8.2.6, we could downgrade Infinispan to an older release until the issue is fixed.

                   

                  Thanks,

                  Purush

                  • 6. Re: Issues with Infinispan Partition Handling
                    dan.berindei

                    Hi Purush

                     

                    I'm afraid this is more or less how partition handling is supposed to work: once a cache enters degraded mode, new nodes cannot join it, and the only way it can exit degraded mode without user intervention is with a merge. In your case, there is no merge: the restarted nodes have completely different JGroups addresses, so they join the cluster as if they had never been started before.

                     

                    We have an enhancement request to use the persistent UUID and exit degraded mode automatically after a node restart, but we haven't started working on it yet: [ISPN-5290] Better automatic merge for caches with enabled partition handling.

                     

                    There are two possible workarounds:

                    1. Change the cache's availability mode after you've restarted the nodes, either via JMX or via the CLI.
                    2. Try to avoid entering degraded mode. Normally, when you stop a node, it will send a leave request to the coordinator, and the coordinator will know to keep the caches available as long as no data has been lost. But if the coordinator also stops, the new coordinator will not know about the leave requests received by the old coordinator, so a cache may enter degraded mode even when it would be safe to stay available.
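
                    The first workaround could be scripted with the standard javax.management API. This is only a sketch under assumptions: the ObjectName pattern (domain jboss.datagrid-infinispan, container "vsd", cache "default(repl_sync)") and the writable cacheAvailability attribute are assumed here and should be confirmed with a JMX browser such as JConsole. AvailabilityReset and its methods are hypothetical names. How the MBeanServerConnection is obtained depends on the server's remote JMX setup and is left out.

```java
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

/** Hypothetical helper that switches a cache back to AVAILABLE over JMX. */
final class AvailabilityReset {

    /**
     * Builds the cache MBean name. The pattern is an assumption; verify the
     * real name in a JMX browser before relying on it.
     */
    static ObjectName cacheObjectName(String container, String cache, String mode) {
        try {
            return new ObjectName("jboss.datagrid-infinispan:type=Cache"
                    + ",name=" + ObjectName.quote(cache + "(" + mode + ")")
                    + ",manager=" + ObjectName.quote(container)
                    + ",component=Cache");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Invalid cache or container name", e);
        }
    }

    /**
     * Writes the (assumed) cacheAvailability attribute. Run this only after
     * the restarted nodes are back and you have judged that it is safe.
     */
    static void makeAvailable(MBeanServerConnection connection, ObjectName cacheName)
            throws Exception {
        connection.setAttribute(cacheName, new Attribute("cacheAvailability", "AVAILABLE"));
    }

    public static void main(String[] args) {
        // Only prints the name that would be targeted; no connection is opened.
        System.out.println(cacheObjectName("vsd", "default", "repl_sync"));
    }
}
```

                    The attribute would have to be set for every cache that entered degraded mode, not just the default cache.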
                    • 7. Re: Issues with Infinispan Partition Handling
                      purushos

                      Hi Dan,

                       

                      Thanks for the explanation. Any idea when the enhancement request would be implemented?

                       

                      We have seen very occasionally that both the restarted nodes were able to join the cluster successfully. Could this be due to one node (which was up and running) within the cluster being the coordinator for the other 2 restarted nodes and hence the cache never became Degraded?

                       

                      Also, is there anything we could implement within our cache client as a workaround until the partition handling issue is fixed? Can we have a cluster view listener within the Hot Rod client?

                       

                      Thanks,

                      Purush