
    STATE_TRANSFER timeout after View Change (MERGE)

    ggarciao

      Hello everyone,

       

      I have a very frustrating situation that I haven't managed to solve. Here is the scenario: we have a cluster of N members running several embedded DIST caches (owners=2) and a few REPL caches.
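
      For reference, this is roughly what our cache setup amounts to (a minimal programmatic sketch; the cache names are made up and our real configuration lives in the JBoss subsystem XML):

      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.ConfigurationBuilder;
      import org.infinispan.configuration.global.GlobalConfigurationBuilder;
      import org.infinispan.manager.DefaultCacheManager;
      import org.infinispan.manager.EmbeddedCacheManager;

      public class CacheSetup {
          public static EmbeddedCacheManager start() {
              // Clustered cache manager; the transport defaults to JGroups.
              EmbeddedCacheManager cm = new DefaultCacheManager(
                      new GlobalConfigurationBuilder().clusteredDefault().build());

              // Distributed cache with two owners per key, as in our setup.
              cm.defineConfiguration("dist-cache", new ConfigurationBuilder()
                      .clustering().cacheMode(CacheMode.DIST_SYNC)
                      .hash().numOwners(2)
                      .build());

              // One of the few replicated caches.
              cm.defineConfiguration("repl-cache", new ConfigurationBuilder()
                      .clustering().cacheMode(CacheMode.REPL_SYNC)
                      .build());
              return cm;
          }
      }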

       

      We want to upgrade each member of the cluster without service interruption, so here is what we do:

      • We split the cluster into partitions of size m, giving us N/m partitions (with m < N, of course)
      • We stop each partition in turn, upgrade it, and restart it (see the sketch after this list)
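
      Spelled out, the procedure is just this loop (a sketch only; the stopPartition/upgradePartition/startPartition helpers are hypothetical placeholders for our deployment tooling):

      public class RollingUpgrade {
          interface PartitionOps {
              void stopPartition(int p);
              void upgradePartition(int p);
              void startPartition(int p);
          }

          public static void upgradeCluster(PartitionOps ops, int clusterSize, int partitionSize) {
              int partitions = clusterSize / partitionSize; // N/m partitions of size m
              for (int p = 0; p < partitions; p++) {
                  ops.stopPartition(p);    // mod_cluster keeps requests away meanwhile
                  ops.upgradePartition(p);
                  ops.startPartition(p);   // members rejoin; state transfer should rebalance
              }
          }
      }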

      FYI: we are using JBoss AS as the application server, so the partitions are server groups. Thanks to mod_cluster, we prevent requests from reaching the 'unstable/restarting' partition.

       

      This always fails. During the first partition upgrade, we see the following in the logs:

      • Infinispan considers the restarted partition a 'lost/disconnected' partition. Based on this log, we assume that Infinispan is trying to handle a cluster partition:

      [Server:server-one-QA] 11:10:46,010 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-6,server-one-QA-44409, site-id=, rack-id=, machine-id=group1) ISPN000093: Received new, MERGED cluster view for channel QA-CLUSTER: MergeView::[server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2|9] (5) [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2, server-one-QA-44409, site-id=null, rack-id=null, machine-id=group1, server-four-QA-28029, site-id=null, rack-id=null, machine-id=group2, server-five-QA-53748, site-id=null, rack-id=null, machine-id=group3, server-six-QA-41240, site-id=null, rack-id=null, machine-id=group3], 2 subgroups: [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2|7] (4) [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2, server-four-QA-28029, site-id=null, rack-id=null, machine-id=group2, server-five-QA-53748, site-id=null, rack-id=null, machine-id=group3, server-six-QA-41240, site-id=null, rack-id=null, machine-id=group3], [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2|8] (5) [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2, server-four-QA-28029, site-id=null, rack-id=null, machine-id=group2, server-five-QA-53748, site-id=null, rack-id=null, machine-id=group3, server-six-QA-41240, site-id=null, rack-id=null, machine-id=group3, server-one-QA-44409, site-id=null, rack-id=null, machine-id=group1]
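
      To watch these merges from application code, a cache-manager listener along these lines should work (a minimal sketch using Infinispan's @ViewChanged/@Merged notifications; register it with cacheManager.addListener(new MergeLogger())):

      import org.infinispan.notifications.Listener;
      import org.infinispan.notifications.cachemanagerlistener.annotation.Merged;
      import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
      import org.infinispan.notifications.cachemanagerlistener.event.MergeEvent;
      import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;

      @Listener
      public class MergeLogger {
          @ViewChanged
          public void onViewChanged(ViewChangedEvent e) {
              System.out.printf("View changed: %s -> %s%n", e.getOldMembers(), e.getNewMembers());
          }

          @Merged
          public void onMerge(MergeEvent e) {
              // Fired when JGroups installs a MergeView like the one above, i.e.
              // when the restarted members are treated as a returning partition.
              System.out.printf("Merge detected, subgroups: %s%n", e.getSubgroupsMerged());
          }
      }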

      • After a while, the first node of the new partition starts throwing a TimeoutException because a member of the other partition does not respond:

      Caused by: org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl

              at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:170) [infinispan-commons-7.0.0.Final.jar:7.0.0.Final]

              at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:869) [infinispan-core-7.0.0.Final.jar:7.0.0.Final]

              at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:638) [infinispan-core-7.0.0.Final.jar:7.0.0.Final]

      ...

      Caused by: org.jgroups.TimeoutException: timeout waiting for response from server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2, request: org.jgroups.blocks.UnicastRequest@2ce9ff0f, req_id=2, mode=GET_ALL, target=server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2

              at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:429) [jgroups-3.6.0.Final.jar:3.6.0.Final]

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:372) [infinispan-core-7.0.0.Final.jar:7.0.0.Final]

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:167) [infinispan-core-7.0.0.Final.jar:7.0.0.Final]

              ... 157 more
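
      The timeout fires in StateTransferManagerImpl.start(), i.e. while the joining cache waits for its initial state transfer. Raising the state-transfer timeout only buys time rather than fixing an unresponsive member, but for completeness this is how it could be done (the 5-minute value is arbitrary; the default is 4 minutes):

      import java.util.concurrent.TimeUnit;

      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.Configuration;
      import org.infinispan.configuration.cache.ConfigurationBuilder;

      public class StateTransferTuning {
          public static Configuration distWithLongerStateTransfer() {
              return new ConfigurationBuilder()
                      .clustering().cacheMode(CacheMode.DIST_SYNC)
                      .hash().numOwners(2)
                      // Give the initial state transfer more time to complete.
                      .stateTransfer().timeout(5, TimeUnit.MINUTES)
                      .build();
          }
      }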

       

      • And a lot of these JGroups warnings:

      11:14:54,614 WARN  [org.jgroups.protocols.pbcast.GMS] (Incoming-9,server-one-QA-44409, site-id=, rack-id=, machine-id=group1) server-one-QA-44409, site-id=, rack-id=, machine-id=group1: failed to collect all ACKs (expected=5) for view [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2|19] after 2000ms, missing 5 ACKs from (5) server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2, server-one-QA-44409, site-id=null, rack-id=null, machine-id=group1, server-four-QA-28029, site-id=null, rack-id=null, machine-id=group2, server-five-QA-53748, site-id=null, rack-id=null, machine-id=group3, server-six-QA-41240, site-id=null, rack-id=null, machine-id=group3
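
      The 2000ms in that warning is the GMS view_ack_collection_timeout. Normally one would raise it on the GMS element of the JGroups stack XML; purely as an illustration, a programmatic equivalent might look like this (the setter name is an assumption on our part):

      import org.jgroups.JChannel;
      import org.jgroups.protocols.pbcast.GMS;

      public class GmsTuning {
          public static void raiseViewAckTimeout(JChannel channel) {
              GMS gms = (GMS) channel.getProtocolStack().findProtocol(GMS.class);
              if (gms != null) {
                  // Assumed setter for the view_ack_collection_timeout property
                  // (default 2000ms, matching the warning above).
                  gms.setViewAckCollectionTimeout(5000);
              }
          }
      }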


      If our strategy is supposed to work, what can we do? If it is not... what can we do to upgrade a cluster without interrupting the service?
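
      In case it helps frame an answer: one option we could add to our tooling is to wait, between partitions, until the restarted members are back in the view before touching the next one (a minimal sketch; the expected size and timeout are illustrative):

      import java.util.concurrent.TimeUnit;

      import org.infinispan.manager.EmbeddedCacheManager;

      public class ClusterWait {
          // Poll the view until the restarted members have rejoined, or give up.
          public static boolean waitForClusterSize(EmbeddedCacheManager cm,
                                                   int expectedSize,
                                                   long timeoutMs) throws InterruptedException {
              long deadline = System.currentTimeMillis() + timeoutMs;
              while (System.currentTimeMillis() < deadline) {
                  if (cm.getMembers() != null && cm.getMembers().size() >= expectedSize) {
                      return true;
                  }
                  TimeUnit.SECONDS.sleep(1);
              }
              return false;
          }
      }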

       

      Thanks in advance!