Stale Node on Cluster Causing Issues
thartwell Jun 22, 2017 10:12 AM

I'm using Infinispan 8.2.6 in embedded, replicated mode on a cluster of around 50 nodes. The cluster is asymmetric: only 2 nodes start the cache for any given data set. One node somehow became the data owner, appears in our logs as an "abrupt leaver", and is now disrupting our ability to reset the cache. I'm having trouble even locating this node: for some reason it did not start with our naming convention and shows only as a UUID, which tells me nothing about where the node is running in the cluster.
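For context, this is roughly how each owner node starts its cache. Names and configuration values here are illustrative, not our exact setup; the one detail I'll call out is that we set a logical node name via transport().nodeName(...), which is what normally keeps members from showing up as raw UUIDs like the one above:

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class CacheBootstrap {

    public static Cache<String, String> startOwnerCache(String nodeName) {
        // Clustered global config with an explicit logical node name,
        // e.g. "app3-28900"; a node started without this shows as a UUID.
        GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
        global.transport().nodeName(nodeName);

        DefaultCacheManager manager = new DefaultCacheManager(global.build());

        // Synchronous replicated cache definition for one data set.
        ConfigurationBuilder cfg = new ConfigurationBuilder();
        cfg.clustering().cacheMode(CacheMode.REPL_SYNC);
        manager.defineConfiguration("cacheA:customerId", cfg.build());

        // Only the two designated owner nodes call getCache() for this name,
        // which is what makes the cluster asymmetric for this cache.
        return manager.getCache("cacheA:customerId");
    }
}
```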
A few days ago, app1 logged:
app1.log.3:2017-06-18 15:32:37.552 ERROR 54709 --- [transport-thread-app1-p4-t17] org.infinispan.CLUSTER : [Context=cacheA:customerId:91]ISPN000313: Lost data because of abrupt leavers [app3-28900, d3d40625-97d8-b582-2990-42902e8027fb]
app1.log:2017-06-21 15:44:04.339 TRACE 19174 --- [OOB-29,ISPN,app1-53740] o.i.r.i.GlobalInboundInvocationHandler : Attempting to execute non-CacheRpcCommand: CacheTopologyControlCommand{cache=cacheA:customerId, type=CH_UPDATE, sender=app2-62589, joinInfo=null, topologyId=440, rebalanceId=413, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[d3d40625-97d8-b582-2990-42902e8027fb: 256]}, pendingCH=null, availabilityMode=null, actualMembers=[d3d40625-97d8-b582-2990-42902e8027fb], throwable=null, viewId=8450} [sender=app2-62589]
At this point, we cannot reset the cache, which we normally do from app1 by calling:
AdvancedCache<Object, Object> cache =
        manager.getCache().getAdvancedCache().withFlags(Flag.FORCE_SYNCHRONOUS);
cache.clear();
cache.putAll(newEntriesFromDb);
since, when these calls are attempted, a TimeoutException occurs:
2017-06-22 08:59:37.877 TRACE 34340 --- [timeout-thread-app1-p3-t1] o.i.r.t.jgroups.JGroupsTransport : Responses: Responses{
app3-35454: sender=app3-35454, retval=CacheNotFoundResponse, received=true, suspected=false
app3-46656: sender=app3-46656, retval=CacheNotFoundResponse, received=true, suspected=false
app3-16612: sender=app3-16612, retval=CacheNotFoundResponse, received=true, suspected=false
app3-7522: null
app3-28597: sender=app3-28597, retval=CacheNotFoundResponse, received=true, suspected=false
...
The end result is that we cannot modify this cache at the moment. Any ideas on how to locate this node when all I have is a random identifier? Why would it interfere like this? Is it possible to forcibly remove this node from the cluster, or at least exclude it from the cache-reset steps?
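In case it helps narrow things down, here is a sketch of how the cluster view can be dumped from any running node (assumes direct access to the EmbeddedCacheManager; a member started without a logical name should print as its raw UUID, like the d3d40625-... address in the logs above):

```java
import org.infinispan.manager.EmbeddedCacheManager;
import org.infinispan.remoting.transport.Address;

public class MemberDump {

    // Print every member the local node currently sees. Members with a
    // logical name print as e.g. "app3-28900"; a node started without one
    // prints as a bare UUID.
    public static void dumpMembers(EmbeddedCacheManager manager) {
        for (Address member : manager.getMembers()) {
            boolean self = member.equals(manager.getAddress());
            System.out.println(member + (self ? " (this node)" : ""));
        }
    }
}
```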
Thanks in advance for any help,
Tom