1 Reply Latest reply on Mar 6, 2018 10:34 AM by Galder Zamarreño

    Infinispan 7.2.5 node leaving a cluster and not rejoining

    Saravana Kumar S Newbie

      We have hit a peculiar issue: in a cluster of 3 Infinispan nodes, one node left the cluster even though it was up and running continuously, and when it tried to rejoin the cluster the other nodes discarded its message.

      The rejoin attempts repeated continuously, and an exception was thrown every 4 minutes. The CPU usage and GC times of the node that left the cluster were very high afterwards, even though the client wasn't communicating with it. The other nodes were still able to cover the client requests through HA, though.


      We need to understand the situations/scenarios where such issues can occur.


      Logs for understanding -

      From server.log we found that although the ABCDE-2 instance was up and running, it dropped out of the cluster.



      2018-02-21 00:30:12,793 WARN [org.jgroups.protocols.pbcast.NAKACK2] (Incoming-1,ABCDElTopologyCacheCluster,ABCDE-2-ABCDEltopologyservice-2286) JGRP000011: ABCDE-2-ABCDEltopologyservice-2286: dropped message 635 from non-member ABCDE-1-ABCDEltopologyservice-24247 (view=[ABCDE-2-ABCDEltopologyservice-2286|3] (1) [ABCDE-2-ABCDEltopologyservice-2286])
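In that warning, the view printed by ABCDE-2 contains only ABCDE-2 itself, which is why the message from ABCDE-1 is reported as coming from a non-member. A minimal Python sketch for pulling the view out of such JGRP000011 lines follows. It assumes the line format quoted above; the function name `parse_dropped_message` and the regular expression are illustrative and not part of JGroups or Infinispan.

```python
import re

# Matches the "dropped message" part of a JGRP000011 warning, for example:
#   dropped message 635 from non-member X (view=[creator|3] (1) [member1, member2])
# The pattern is an assumption derived from the single log line quoted above.
VIEW_RE = re.compile(
    r"dropped message (?P<seqno>\d+) from non-member (?P<sender>\S+) "
    r"\(view=\[(?P<creator>[^|\]]+)\|(?P<view_id>\d+)\] "
    r"\((?P<size>\d+)\) \[(?P<members>[^\]]*)\]\)"
)


def parse_dropped_message(line):
    """Return details of a JGRP000011 warning, or None if the line does not match."""
    match = VIEW_RE.search(line)
    if match is None:
        return None
    members = [name.strip() for name in match.group("members").split(",") if name.strip()]
    return {
        "sender": match.group("sender"),
        "seqno": int(match.group("seqno")),
        "view_id": int(match.group("view_id")),
        "members": members,
        # True would mean the sender is listed in the view after all.
        "sender_in_view": match.group("sender") in members,
    }
```

Running this over the server.log of each node shows which view each node held at a given time, which helps confirm whether the cluster split into separate views.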



      At 00:30:12, CPU was 94%.

      A GC time of 524 milliseconds was observed at 00:28:11, which eventually grew to 35+ seconds after an hour.



      We observed that ABCDE-2 tried to rejoin the cluster every 4 minutes and failed each time, while the other two instances remained in the cluster.

      2018-02-21 00:35:40,896 WARN [org.infinispan.remoting.inboundhandler.NonTotalOrderPerCacheInboundInvocationHandler] (remote-thread--p3-t13) ISPN000071: Caught exception when handling command StateResponseCommand{cache=ABCDE, origin=ABCDEtopologyservice-24247, topologyId=9}: org.infinispan.util.concurrent.TimeoutException: Timed out applying state

      The exception above was repeated every 4 minutes.
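The repeat interval can be checked directly against server.log. The sketch below is a rough helper, assuming every matching line starts with a timestamp in the `2018-02-21 00:35:40,896` format seen in the logs above; the function name `gaps_between` is hypothetical.

```python
from datetime import datetime


def gaps_between(lines, marker="ISPN000071"):
    """Return the time gaps (timedelta objects) between successive lines containing marker."""
    stamps = []
    for line in lines:
        if marker in line:
            # The first 23 characters hold the timestamp, e.g. "2018-02-21 00:35:40,896".
            text = line.lstrip()[:23]
            stamps.append(datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f"))
    # Pair each timestamp with the next one and subtract.
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Passing the lines of server.log to `gaps_between` gives the spacing between the ISPN000071 warnings, which can then be compared with the state transfer timeout configured for the cache.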