3 Replies Latest reply on Sep 16, 2014 7:49 AM by Radim Vansa

    CacheException: Initial state transfer timed out for cache

    Alfie Kirkpatrick Newbie

      Hi, I am seeing the following error when starting a new node in a cluster. The cache in question is REPL_SYNC. I have read a few posts about this error, but they either don't match our scenario or concern bugs that have since been fixed.

       

      Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache active_rule on ip-172-31-9-161-1572
              at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:199)
              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
              at java.lang.reflect.Method.invoke(Method.java:606)
              at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:183)

       

      I am going to work around this by setting awaitInitialTransfer(false), since this cache has a backing store.
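
      For reference, a minimal sketch of that workaround as a configuration fragment, assuming the programmatic ConfigurationBuilder API is used. The class name ActiveRuleCacheConfig is made up for illustration, the timeout value only mirrors the 4 minutes observed in this thread, and the backing store configuration is left out because its details are not given here.

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

// Hypothetical holder class; only the builder calls matter.
public class ActiveRuleCacheConfig {

    // Builds a REPL_SYNC configuration whose cache start does not block
    // until the initial state transfer has completed.
    public static Configuration build() {
        return new ConfigurationBuilder()
            .clustering()
                .cacheMode(CacheMode.REPL_SYNC)
                .stateTransfer()
                    // Do not wait for initial state when the cache starts.
                    .awaitInitialTransfer(false)
                    // State transfer timeout in milliseconds (assumed value: 4 minutes).
                    .timeout(240000L)
            .build();
    }
}
```

      With this setting the cache should become usable on the joining node before it has received all entries, so early reads may miss in memory and would need to fall through to the backing store.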

       

      What slightly confuses me about the exception is that '172-31-9-161' is the node that is starting. Is it in a deadlock waiting for itself? Or is this just information about which node had a problem getting the initial state?

       

      Some further information...

      - There are three nodes in the cluster

      - This doesn't happen all the time. Normally new nodes are able to join without problems, but once it has got into this state, this node is consistently unable to join, timing out after 4 minutes with the same error

      - The two healthy nodes were under load when I first tried to join the third node, but the problem persists even after removing the load

      - We are using mongo ping to coordinate cluster members (running in EC2)

      - After this failure, the cluster is in a strange state. The cluster size is reported as 3, but I am seeing NoClassDefFoundError errors both on the node in question and on the other nodes (which report the remote exception from the cluster member). It's as if the node is half in the cluster but didn't initialise properly. This is likely because my app is Spring-based and the Spring context startup is abandoned after the initial timeout, so other initialisation never happens. Still an odd error though (examples below)

       

      2014-09-16 09:43:37,945 ERROR TimeScheduler3            | failed executing task UNICAST3: RetransmitTask (interval=500 ms)
      java.lang.NoClassDefFoundError: org/jgroups/util/Table$Missing
              at org.jgroups.util.Table.getMissing(Table.java:572)
              at org.jgroups.protocols.UNICAST3.triggerXmit(UNICAST3.java:1311)
              at org.jgroups.protocols.UNICAST3$RetransmitTask.run(UNICAST3.java:1289)
              at org.jgroups.util.TimeScheduler3$Task.run(TimeScheduler3.java:277)
              at org.jgroups.util.TimeScheduler3$RecurringTask.run(TimeScheduler3.java:308)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.ClassNotFoundException: org.jgroups.util.Table$Missing
              at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1718)
              at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1569)
              ... 8 more

      2014-09-16 09:43:37,960 ERROR TCP                       | JGRP000030: ip-172-31-9-161-35145: failed handling incoming message: java.lang.NoClassDefFoundError: org/jgroups/protocols/TP$BatchHandler

       

      Thanks, Alfie.