0 Replies Latest reply on Dec 11, 2014 8:17 AM by mohammedisaakhan

    "One or more nodes have left" exception while querying (get, replaceWithVersion) the cache

    mohammedisaakhan

      Hi,

       

      We are using Infinispan 6.0.2.Final with the Hot Rod client in our application. We have 3 nodes and are running a test with about 30 million entries in the cache and about 300 million requests being processed.

       

      After a few hours of execution, we get the following errors:

       

      1) Failed to recover cluster state after the current node became the coordinator

      2) org.infinispan.remoting.transport.jgroups.SuspectException: One or more nodes have left the cluster while replicating command PrepareCommand

      3) Message send failed due to timeout

      4) Suspect messages, although the nodes were active

       

      There were no crashes and all the nodes are active. However, it seems that some node appeared to leave the cluster (deduced from error #2), and after that the cluster misbehaves. Most requests return null for cache queries, although the data is present in the nodes and the nodes are up and active. We have written a debug script that queries each node individually, and each node responds. But when we run the Hot Rod client configured with all node IPs/ports, only one node seems to respond and the other 2 nodes do not.
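
      For illustration, a per-node check of the kind described above could look roughly like the sketch below. It is not our actual debug script: the host names are placeholders, port 11222 is assumed because it is the usual Hot Rod default, and the check only confirms that each node accepts a TCP connection. It does not speak the Hot Rod protocol or read any cache entries.

```python
import socket

# Hypothetical node addresses (placeholders, not taken from our setup).
# 11222 is assumed here as the usual Hot Rod default port.
NODES = [
    ("node1.example", 11222),
    ("node2.example", 11222),
    ("node3.example", 11222),
]


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_all(nodes, timeout=2.0):
    """Probe each node on its own and map "host:port" to reachability."""
    return {f"{host}:{port}": probe(host, port, timeout) for host, port in nodes}
```

      Calling `probe_all(NODES)` returns one entry per node, so a node that is up but unreachable from the client machine shows up separately from the others.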

       

      Could you tell me why errors 2 and 3 occur? Are these known issues? Have they been fixed in 7.x?

       

      This breaks the system quite often. Any suggestions for a solution would be appreciated.