
    STATE_TRANSFER timeout after View Change (MERGE)

    ggarciao

      Hello everyone,

       

      I have a very frustrating situation that I haven't managed to solve. Here is the scenario: we have a cluster of N members running several embedded DIST caches (owners=2) and a few REPL caches.
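
      For reference, this is roughly what our cache setup amounts to (a minimal programmatic sketch; the cache names are made up and our real configuration lives in the JBoss subsystem XML):

      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.ConfigurationBuilder;
      import org.infinispan.configuration.global.GlobalConfigurationBuilder;
      import org.infinispan.manager.DefaultCacheManager;
      import org.infinispan.manager.EmbeddedCacheManager;

      public class CacheSetup {
          public static EmbeddedCacheManager start() {
              // Clustered cache manager; the transport defaults to JGroups.
              EmbeddedCacheManager cm = new DefaultCacheManager(
                      new GlobalConfigurationBuilder().clusteredDefault().build());

              // Distributed cache with two owners per key, as in our setup.
              cm.defineConfiguration("dist-cache", new ConfigurationBuilder()
                      .clustering().cacheMode(CacheMode.DIST_SYNC)
                      .hash().numOwners(2)
                      .build());

              // One of the few replicated caches.
              cm.defineConfiguration("repl-cache", new ConfigurationBuilder()
                      .clustering().cacheMode(CacheMode.REPL_SYNC)
                      .build());
              return cm;
          }
      }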

       

      We want to upgrade each member of the cluster without service interruption, so here is what we do:

      • We split the cluster into partitions of size m, giving us N/m partitions (with m < N, of course)
      • We stop each partition in turn, upgrade it, and restart it (see the sketch after this list)
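
      Spelled out, the procedure is just this loop (a sketch only; the stopPartition/upgradePartition/startPartition helpers are hypothetical placeholders for our deployment tooling):

      public class RollingUpgrade {
          interface PartitionOps {
              void stopPartition(int p);
              void upgradePartition(int p);
              void startPartition(int p);
          }

          public static void upgradeCluster(PartitionOps ops, int clusterSize, int partitionSize) {
              int partitions = clusterSize / partitionSize; // N/m partitions of size m
              for (int p = 0; p < partitions; p++) {
                  ops.stopPartition(p);    // mod_cluster keeps requests away meanwhile
                  ops.upgradePartition(p);
                  ops.startPartition(p);   // members rejoin; state transfer should rebalance
              }
          }
      }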

      FYI: we are using JBoss AS as the application server, so the partitions are server groups. Thanks to mod_cluster, we prevent requests from reaching the 'unstable/restarting' partition.

       

      This always fails. During the first partition upgrade, we see the following in the logs:

      • Infinispan considers the restarted partition a 'lost/disconnected' partition. Based on this log, we assume that Infinispan is trying to handle a cluster partition:

      [Server:server-one-QA] 11:10:46,010 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-6,server-one-QA-44409, site-id=, rack-id=, machine-id=group1) ISPN000093: Received new, MERGED cluster view for channel QA-CLUSTER: MergeView::[server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2|9] (5) [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2, server-one-QA-44409, site-id=null, rack-id=null, machine-id=group1, server-four-QA-28029, site-id=null, rack-id=null, machine-id=group2, server-five-QA-53748, site-id=null, rack-id=null, machine-id=group3, server-six-QA-41240, site-id=null, rack-id=null, machine-id=group3], 2 subgroups: [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2|7] (4) [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2, server-four-QA-28029, site-id=null, rack-id=null, machine-id=group2, server-five-QA-53748, site-id=null, rack-id=null, machine-id=group3, server-six-QA-41240, site-id=null, rack-id=null, machine-id=group3], [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2|8] (5) [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2, server-four-QA-28029, site-id=null, rack-id=null, machine-id=group2, server-five-QA-53748, site-id=null, rack-id=null, machine-id=group3, server-six-QA-41240, site-id=null, rack-id=null, machine-id=group3, server-one-QA-44409, site-id=null, rack-id=null, machine-id=group1]
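
      To watch these merges from application code, a cache-manager listener along these lines should work (a minimal sketch using Infinispan's @ViewChanged/@Merged notifications; register it with cacheManager.addListener(new MergeLogger())):

      import org.infinispan.notifications.Listener;
      import org.infinispan.notifications.cachemanagerlistener.annotation.Merged;
      import org.infinispan.notifications.cachemanagerlistener.annotation.ViewChanged;
      import org.infinispan.notifications.cachemanagerlistener.event.MergeEvent;
      import org.infinispan.notifications.cachemanagerlistener.event.ViewChangedEvent;

      @Listener
      public class MergeLogger {
          @ViewChanged
          public void onViewChanged(ViewChangedEvent e) {
              System.out.printf("View changed: %s -> %s%n", e.getOldMembers(), e.getNewMembers());
          }

          @Merged
          public void onMerge(MergeEvent e) {
              // Fired when JGroups installs a MergeView like the one above, i.e.
              // when the restarted members are treated as a returning partition.
              System.out.printf("Merge detected, subgroups: %s%n", e.getSubgroupsMerged());
          }
      }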

      • After a while, the first node of the new partition starts throwing a TimeoutException because a member of the other partition does not respond:

      Caused by: org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl

              at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:170) [infinispan-commons-7.0.0.Final.jar:7.0.0.Final]

              at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:869) [infinispan-core-7.0.0.Final.jar:7.0.0.Final]

              at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:638) [infinispan-core-7.0.0.Final.jar:7.0.0.Final]

      ...

      Caused by: org.jgroups.TimeoutException: timeout waiting for response from server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2, request: org.jgroups.blocks.UnicastRequest@2ce9ff0f, req_id=2, mode=GET_ALL, target=server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2

              at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:429) [jgroups-3.6.0.Final.jar:3.6.0.Final]

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:372) [infinispan-core-7.0.0.Final.jar:7.0.0.Final]

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:167) [infinispan-core-7.0.0.Final.jar:7.0.0.Final]

              ... 157 more
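
      The timeout fires in StateTransferManagerImpl.start(), i.e. while the joining cache waits for its initial state transfer. Raising the state-transfer timeout only buys time rather than fixing an unresponsive member, but for completeness this is how it could be done (the 5-minute value is arbitrary; the default is 4 minutes):

      import java.util.concurrent.TimeUnit;

      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.Configuration;
      import org.infinispan.configuration.cache.ConfigurationBuilder;

      public class StateTransferTuning {
          public static Configuration distWithLongerStateTransfer() {
              return new ConfigurationBuilder()
                      .clustering().cacheMode(CacheMode.DIST_SYNC)
                      .hash().numOwners(2)
                      // Give the initial state transfer more time to complete.
                      .stateTransfer().timeout(5, TimeUnit.MINUTES)
                      .build();
          }
      }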

       

      • And a lot of these JGroups warnings:

      11:14:54,614 WARN  [org.jgroups.protocols.pbcast.GMS] (Incoming-9,server-one-QA-44409, site-id=, rack-id=, machine-id=group1) server-one-QA-44409, site-id=, rack-id=, machine-id=group1: failed to collect all ACKs (expected=5) for view [server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2|19] after 2000ms, missing 5 ACKs from (5) server-three-QA-10596, site-id=null, rack-id=null, machine-id=group2, server-one-QA-44409, site-id=null, rack-id=null, machine-id=group1, server-four-QA-28029, site-id=null, rack-id=null, machine-id=group2, server-five-QA-53748, site-id=null, rack-id=null, machine-id=group3, server-six-QA-41240, site-id=null, rack-id=null, machine-id=group3
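
      The 2000ms in that warning is the GMS view_ack_collection_timeout. Normally one would raise it on the GMS element of the JGroups stack XML; purely as an illustration, a programmatic equivalent might look like this (the setter name is an assumption on our part):

      import org.jgroups.JChannel;
      import org.jgroups.protocols.pbcast.GMS;

      public class GmsTuning {
          public static void raiseViewAckTimeout(JChannel channel) {
              GMS gms = (GMS) channel.getProtocolStack().findProtocol(GMS.class);
              if (gms != null) {
                  // Assumed setter for the view_ack_collection_timeout property
                  // (default 2000ms, matching the warning above).
                  gms.setViewAckCollectionTimeout(5000);
              }
          }
      }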


      If our strategy is supposed to work, what can we do? If it is not... what can we do to upgrade a cluster without interrupting the service?
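
      In case it helps frame an answer: one option we could add to our tooling is to wait, between partitions, until the restarted members are back in the view before touching the next one (a minimal sketch; the expected size and timeout are illustrative):

      import java.util.concurrent.TimeUnit;

      import org.infinispan.manager.EmbeddedCacheManager;

      public class ClusterWait {
          // Poll the view until the restarted members have rejoined, or give up.
          public static boolean waitForClusterSize(EmbeddedCacheManager cm,
                                                   int expectedSize,
                                                   long timeoutMs) throws InterruptedException {
              long deadline = System.currentTimeMillis() + timeoutMs;
              while (System.currentTimeMillis() < deadline) {
                  if (cm.getMembers() != null && cm.getMembers().size() >= expectedSize) {
                      return true;
                  }
                  TimeUnit.SECONDS.sleep(1);
              }
              return false;
          }
      }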

       

      Thanks in advance!