3 Replies Latest reply on Jul 3, 2017 5:52 AM by Dan Berindei

    Stale Node on Cluster Causing Issues

    Thomas Hartwell Newbie

      I'm using Infinispan 8.2.6 in embedded, replicated mode on a cluster of around 50 nodes. The caches are asymmetric: only 2 nodes start the cache for any given data set. I have an issue where a node somehow became the sole data owner, shows up in our logs as an "abrupt leaver", and is now disrupting our ability to reset the cache. I'm having trouble even locating this node. For some reason it did not start with our naming convention, just a UUID, which tells me nothing about where this node might be running in the cluster.
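For context, our setup corresponds roughly to the following embedded configuration (a minimal sketch, not our real config; the cluster name and cache name here are placeholders):

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class EmbeddedSetup {
    public static void main(String[] args) {
        GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
        global.transport().clusterName("ISPN");

        // Every node joins the cluster with a cache manager...
        DefaultCacheManager manager = new DefaultCacheManager(global.build());

        // ...but the setup is asymmetric: only the 2 owner nodes for a given
        // data set define and start the replicated cache below.
        ConfigurationBuilder cfg = new ConfigurationBuilder();
        cfg.clustering().cacheMode(CacheMode.REPL_SYNC);

        manager.defineConfiguration("cacheA:customerId", cfg.build());
        manager.getCache("cacheA:customerId"); // starts the cache on this node only
    }
}
```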


      A few days ago, app1 logged


      app1.log.3:2017-06-18 15:32:37.552 ERROR 54709 --- [transport-thread-app1-p4-t17] org.infinispan.CLUSTER                   : [Context=cacheA:customerId:91]ISPN000313: Lost data because of abrupt leavers [app3-28900, d3d40625-97d8-b582-2990-42902e8027fb]


      app1.log:2017-06-21 15:44:04.339 TRACE 19174 --- [OOB-29,ISPN,app1-53740] o.i.r.i.GlobalInboundInvocationHandler   : Attempting to execute non-CacheRpcCommand: CacheTopologyControlCommand{cache=cacheA:customerId, type=CH_UPDATE, sender=app2-62589, joinInfo=null, topologyId=440, rebalanceId=413, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[d3d40625-97d8-b582-2990-42902e8027fb: 256]}, pendingCH=null, availabilityMode=null, actualMembers=[d3d40625-97d8-b582-2990-42902e8027fb], throwable=null, viewId=8450} [sender=app2-62589]


      At this point, we cannot reset the cache, which we normally do from app1, by calling






      since, when these calls are attempted, a TimeoutException occurs:



      2017-06-22 08:59:37.877 TRACE 34340 --- [timeout-thread-app1-p3-t1] o.i.r.t.jgroups.JGroupsTransport         : Responses: Responses{

      app3-35454: sender=app3-35454, retval=CacheNotFoundResponse, received=true, suspected=false

      app3-46656: sender=app3-46656, retval=CacheNotFoundResponse, received=true, suspected=false

      app3-16612: sender=app3-16612, retval=CacheNotFoundResponse, received=true, suspected=false

      app3-7522: null

      app3-28597: sender=app3-28597, retval=CacheNotFoundResponse, received=true, suspected=false



      The end result is that we cannot modify this cache at the moment. Any ideas on how to locate this node when all I have is a random identifier? Why would it interfere like this, and is it possible to forcibly remove this node from the cluster, or at least exclude it from the steps of resetting the cache?


      Thanks in advance for any help,


        • 1. Re: Stale Node on Cluster Causing Issues
          Galder Zamarreño Master

          Is this running standalone or on top of WildFly or another application server?


          Has this happened multiple times? In any case, I'd suggest you upgrade to the latest Infinispan release, which is 9.0.3.Final.


          I see some TRACE messages in your comment, so maybe you can attach/link those TRACE logs?


          I don't know whether such an asymmetric setup would work as expected. We don't normally encourage it.


          I've seen UUID style addresses before but I'm not sure what they represent. I'll check with my colleagues and see whether they can comment here.

          • 2. Re: Stale Node on Cluster Causing Issues
            Galder Zamarreño Master

            The UUIDs should not be a problem. Logical names are only for pretty-printing; underneath, UUIDs are used.
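If the raw UUIDs in the logs are a nuisance, a logical name can be set on each node at startup via the transport configuration (a sketch assuming programmatic configuration; the "app1-node1" name is a placeholder for your own convention, and with XML the equivalent is the transport's node-name attribute):

```java
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class NamedNode {
    public static void main(String[] args) {
        GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
        // Assign a logical name following your naming convention; a node that
        // starts without one may show up in the logs as a bare UUID.
        global.transport()
              .clusterName("ISPN")
              .nodeName("app1-node1");

        DefaultCacheManager manager = new DefaultCacheManager(global.build());
        System.out.println(manager.getAddress()); // logical name, not a UUID
    }
}
```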


            What is more problematic are those "CacheNotFoundResponse" entries that appear in the responses. They mean that the cache being looked up can't be found on those nodes. This is likely due to the asymmetric setup you have in place.


            We did some work in the past to get asymmetric setups working, so it should work. I'd try with the latest Infinispan release available, 9.0.3.Final.

            • 3. Re: Stale Node on Cluster Causing Issues
              Dan Berindei Expert

              Thomas, Galder is right, UUIDs are not a problem. But they do have a tendency of appearing in the logs only after something has gone wrong...


              The CacheNotFoundResponses are ignored; the only responses that matter are the ones from members of the cache. That being said, replicated mode will always broadcast writes to all the nodes in the cluster, not just to the nodes that have the cache running, so it's quite inefficient to run a replicated cache with 2 nodes in a 50-node cluster.
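To illustrate that point (a toy model of the triage logic, not Infinispan's actual response-handling code): CacheNotFoundResponse entries can be dropped outright, and only a null entry, i.e. a member that never answered within remote-timeout, blocks the call.

```java
import java.util.*;

public class ResponseTriage {
    // Given one response per cluster member, return the members whose
    // missing (null) responses are what actually stall the RPC.
    static List<String> blockingMembers(Map<String, String> responses) {
        List<String> blocking = new ArrayList<>();
        for (Map.Entry<String, String> e : responses.entrySet()) {
            if (e.getValue() == null) {
                blocking.add(e.getKey()); // timed out: the real problem
            }
            // "CacheNotFoundResponse" entries are harmless and simply skipped
        }
        return blocking;
    }

    public static void main(String[] args) {
        Map<String, String> responses = new LinkedHashMap<>();
        responses.put("app3-35454", "CacheNotFoundResponse");
        responses.put("app3-7522", null); // no reply before remote-timeout
        responses.put("app3-28597", "CacheNotFoundResponse");
        System.out.println(blockingMembers(responses)); // [app3-7522]
    }
}
```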


              To me, the problematic line is app3-7522: null, because it means node app3-7522 did not send a response back before remote-timeout expired. Do you have any exceptions in the app3-7522 logs, or can you get some thread dumps to see how the OOB/remote-executor thread pools are doing?