7 Replies Latest reply on May 10, 2017 5:26 PM by purushos

    Issues with Infinispan Partition Handling




      We have an Infinispan server cluster (cluster size 3). We have defined a cache container and a replicated cache with partition handling enabled in the default cluster configuration file (clustered.xml). We observed that when we shut down 2 of the Infinispan servers and bring them back up, they fail to rejoin the cluster due to a state transfer timeout error. The state transfer timeout we have configured for the cache is 5 minutes. The error occurs even when the cache is empty, so there is no reason to increase the timeout.


      We also verified that the issue occurs only when partition handling is enabled. The Infinispan server version we have deployed is 8.2.6.


      We use TCPPING as the discovery protocol to form the cluster.


      Cache configuration:


          <transport lock-timeout="60000"/>

          <replicated-cache name="default" mode="SYNC" batching="true">

              <partition-handling enabled="true"/>

              <locking isolation="REPEATABLE_READ"/>

              <state-transfer timeout="300000"/>

          </replicated-cache>



      The following exception is observed in the nodes which fail to rejoin the cluster:


      27-Apr-2017 22:21:34,985 ERROR [org.jboss.msc.service.fail] (MSC service thread 1-7) <:> MSC000001: Failed to start service jboss.datagrid-infinispan.vsd.default: org.jboss.msc.service.StartException in service jboss.datagrid-infinispan.vsd.default: Failed to start service

              at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1904) [jboss-msc-1.2.6.Final.jar:1.2.6.Final]

              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_111]

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_111]

              at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_111]

      Caused by: org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete() throws java.lang.Exception on object of type StateTransferManagerImpl

              at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:172)

              at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:859)

              at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:628)

              at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:617)

              at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:542)

              at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:238)

              at org.infinispan.cache.impl.CacheImpl.start(CacheImpl.java:862)

              at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:635)

              at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:585)

              at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:451)

              at org.infinispan.manager.impl.AbstractDelegatingEmbeddedCacheManager.getCache(AbstractDelegatingEmbeddedCacheManager.java:133)

              at org.infinispan.server.infinispan.SecurityActions$5.run(SecurityActions.java:130)

              at org.infinispan.server.infinispan.SecurityActions$5.run(SecurityActions.java:127)

              at org.infinispan.security.Security.doPrivileged(Security.java:76)

              at org.infinispan.server.infinispan.SecurityActions.doPrivileged(SecurityActions.java:63)

              at org.infinispan.server.infinispan.SecurityActions.startCache(SecurityActions.java:135)

              at org.jboss.as.clustering.infinispan.subsystem.CacheService.start(CacheService.java:86)

              at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1948) [jboss-msc-1.2.6.Final.jar:1.2.6.Final]

              at org.jboss.msc.service.ServiceControllerImpl$StartTask.run(ServiceControllerImpl.java:1881) [jboss-msc-1.2.6.Final.jar:1.2.6.Final]

              ... 3 more

      Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache default on vsd-kmurthy-set2-node1.mv.nuagenetworks.net

              at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:217)

              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.8.0_111]

              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) [rt.jar:1.8.0_111]

              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.8.0_111]

              at java.lang.reflect.Method.invoke(Method.java:498) [rt.jar:1.8.0_111]

              at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:168)

              ... 21 more

        • 1. Re: Issues with Infinispan Partition Handling



          Could you attach your clustered.xml file? I'll take a look.




          • 2. Re: Issues with Infinispan Partition Handling

            Hi Pedro,


            Please find the configuration file attached.




            • 3. Re: Issues with Infinispan Partition Handling

              Hi Pedro,


              Please ignore the configuration which I uploaded earlier. Please refer to the attached configuration, which has partition handling enabled in all our custom caches. The problem is observed when partition handling is enabled in our cache. Sorry about the confusion caused.




              • 4. Re: Issues with Infinispan Partition Handling



                I had some time today, so I checked your configuration and was able to reproduce the issue locally.

                I've created a JIRA to track this bug: [ISPN-7800] Cluster always in Degraded Mode - JBoss Issue Tracker


                Feel free to comment here or in the JIRA if you need more info or have questions.




                • 5. Re: Issues with Infinispan Partition Handling

                  Hi Pedro,


                  Thanks! Are there any workarounds for this? We would prefer to keep partition handling enabled to ensure data consistency. However, if there is a way for the cache client to reliably monitor the cluster size via a listener, we could have the client avoid cache updates whenever the cluster size drops below a majority of the actual cluster size. With an embedded cache client, we could monitor the cluster size using ViewChanged and Merged listeners.
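                  As a minimal sketch of what we have in mind (the class and method names below are our own, hypothetical helpers, not Infinispan API): a small quorum guard whose onViewChange method would be invoked from a @ViewChanged / @Merged listener with the new view size, and which blocks writes whenever the view falls below a strict majority of the expected cluster size.

```java
// Hypothetical helper: decide whether writes should be allowed, based on the
// current JGroups view size relative to the expected full cluster size.
// Intended to be driven from an embedded @ViewChanged / @Merged listener.
public class QuorumGuard {
    private final int expectedClusterSize;
    private volatile boolean writesAllowed = true;

    public QuorumGuard(int expectedClusterSize) {
        this.expectedClusterSize = expectedClusterSize;
    }

    // A strict majority of the expected cluster is required to keep writing:
    // e.g. for a 3-node cluster, at least 2 nodes must be in the view.
    public static boolean hasMajority(int currentViewSize, int expectedClusterSize) {
        return currentViewSize >= expectedClusterSize / 2 + 1;
    }

    // Call from the listener each time the cluster view changes or merges.
    public void onViewChange(int newViewSize) {
        writesAllowed = hasMajority(newViewSize, expectedClusterSize);
    }

    // The cache client checks this flag before performing updates.
    public boolean writesAllowed() {
        return writesAllowed;
    }
}
```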


                  Also, if you suspect this is a bug introduced in 8.2.6, we could downgrade Infinispan to an older release until the issue is fixed.




                  • 6. Re: Issues with Infinispan Partition Handling

                    Hi Purush


                    I'm afraid this is more or less how partition handling is supposed to work: once a cache enters degraded mode, new nodes cannot join it, and the only way it can exit degraded mode without user intervention is a merge. In your case, there is no merge: the restarted nodes have completely different JGroups addresses, so they join the cluster as if they had never been started before.


                    We have an enhancement request to use the persistent UUID and exit degraded mode automatically after a node restart, but we haven't started working on it yet: [ISPN-5290] Better automatic merge for caches with enabled partition handling.


                    There are two possible workarounds:

                    1. Change the cache's availability mode after you've restarted the nodes, either via JMX or via the CLI.
                    2. Try to avoid entering degraded mode. Normally, when you stop a node, it will send a leave request to the coordinator, and the coordinator will know to keep the caches available as long as no data has been lost. But if the coordinator also stops, the new coordinator will not know about the leave requests received by the old coordinator, so a cache may enter degraded mode even when it would be safe to stay available.
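                    For the first workaround, a rough JMX sketch is below. Note that the MBean domain, the ObjectName pattern, the service URL, and the writable "CacheAvailability" attribute name are all assumptions on my part for a JDG/Infinispan server setup; verify the actual names your server exposes (for example with jconsole) before relying on this.

```java
import javax.management.Attribute;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch: force a degraded cache back to AVAILABLE over JMX after the
// restarted nodes are back up. All names below are assumptions -- check
// them against what your server actually registers (e.g. in jconsole).
public class CacheAvailabilityFlip {
    // Assumed ObjectName for the "default" replicated cache; verify locally.
    public static final String CACHE_MBEAN =
        "jboss.datagrid-infinispan:type=Cache,name=\"default(repl_sync)\","
        + "manager=\"clustered\",component=Cache";

    public static void main(String[] args) throws Exception {
        // Assumed management endpoint; adjust protocol/host/port for your
        // server (the remoting-jmx protocol needs the JBoss client jars).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:remoting-jmx://localhost:9999");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection server = connector.getMBeanServerConnection();
            ObjectName cache = new ObjectName(CACHE_MBEAN);
            // Read the current availability, then flip the cache back.
            String before = (String) server.getAttribute(cache, "CacheAvailability");
            server.setAttribute(cache, new Attribute("CacheAvailability", "AVAILABLE"));
            System.out.println(before + " -> AVAILABLE");
        }
    }
}
```

This must be done on every cache that entered degraded mode; the equivalent embedded call, if you ever run in library mode, is AdvancedCache#setAvailability(AvailabilityMode.AVAILABLE).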
                    • 7. Re: Issues with Infinispan Partition Handling

                      Hi Dan,


                      Thanks for the explanation. Any idea when the enhancement request will be implemented?


                      We have very occasionally seen both restarted nodes rejoin the cluster successfully. Could this be because one node (which was up and running) within the cluster remained the coordinator for the other 2 restarted nodes, and hence the cache never became degraded?


                      Also, is there anything we could implement within our cache client as a workaround until the partition handling issue is fixed? Can we have a cluster view listener within the Hot Rod client?