3 Replies Latest reply on Sep 16, 2014 7:49 AM by rvansa

    CacheException: Initial state transfer timed out for cache

    jugglingcats

      Hi, I am seeing the following error when starting a new node in a cluster. The cache in question is REPL_SYNC. I read a few posts about this error, but they don't seem to match our scenario, or they concerned bugs that have since been fixed.

       

      Caused by: org.infinispan.commons.CacheException: Initial state transfer timed out for cache active_rule on ip-172-31-9-161-1572

              at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:199)

              at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

              at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

              at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

              at java.lang.reflect.Method.invoke(Method.java:606)

              at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:183)

       

      I am going to work around this by setting awaitInitialTransfer(false), since this cache has a backing store.
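For reference, here is a minimal sketch of how that workaround can be configured programmatically (assuming the Infinispan ConfigurationBuilder API; the class name is mine):

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class NonBlockingJoinConfig {
    public static Configuration build() {
        return new ConfigurationBuilder()
            .clustering()
                .cacheMode(CacheMode.REPL_SYNC)
                .stateTransfer()
                    // Don't block cache startup until the initial state
                    // arrives; the backing store covers reads while state
                    // transfer completes in the background.
                    .awaitInitialTransfer(false)
            .build();
    }
}
```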

       

      What slightly confuses me about the exception is that '172-31-9-161' is the node that is starting. Is it deadlocked waiting for itself, or does the message simply identify the node that had a problem getting the initial state?

       

      Some further information...

      - There are three nodes in the cluster

      - This doesn't happen all the time. Normally new nodes can join without a problem, but once this node got into this state it has been consistently unable to join, timing out after 4 minutes with the same error

      - The two healthy nodes were under load when I first tried to join the third node, but after removing the load the join problem persists

      - We are using a MongoDB-based PING protocol for cluster discovery (running in EC2)

      - After this failure, the cluster is in a strange state. The cluster size is reported as 3, but I am seeing NoClassDefFoundError exceptions both on the node in question and on the other nodes (which report the remote exception from the cluster member). It's as if the node is half in the cluster but didn't initialise properly. This is likely because my app is Spring-based and the Spring context startup is abandoned after the initial timeout, so other initialisation does not happen. Still an odd error, though (examples below)

       

      2014-09-16 09:43:37,945 ERROR TimeScheduler3            | failed executing task UNICAST3: RetransmitTask (interval=500 ms)

      java.lang.NoClassDefFoundError: org/jgroups/util/Table$Missing

              at org.jgroups.util.Table.getMissing(Table.java:572)

              at org.jgroups.protocols.UNICAST3.triggerXmit(UNICAST3.java:1311)

              at org.jgroups.protocols.UNICAST3$RetransmitTask.run(UNICAST3.java:1289)

              at org.jgroups.util.TimeScheduler3$Task.run(TimeScheduler3.java:277)

              at org.jgroups.util.TimeScheduler3$RecurringTask.run(TimeScheduler3.java:308)

              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

              at java.lang.Thread.run(Thread.java:745)

      Caused by: java.lang.ClassNotFoundException: org.jgroups.util.Table$Missing

              at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1718)

              at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1569)

              ... 8 more

      2014-09-16 09:43:37,960 ERROR TCP                       | JGRP000030: ip-172-31-9-161-35145: failed handling incoming message: java.lang.NoClassDefFoundError: org/jgroups/protocols/TP$BatchHandler

       

      Thanks, Alfie.

        • 1. Re: CacheException: Initial state transfer timed out for cache
          rvansa

          There's a default timeout of 4 minutes for rebalancing the cluster, i.e. transferring the data from the other 2 nodes to the newly joining one. The exception looks correct: the node with the given IP did not manage to receive all the data within the timeout.

           

          You can either disable waiting for the data as you've already done, or increase this timeout (StateTransferConfigurationBuilder.timeout()).
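For completeness, the timeout route might look like this (a sketch using the programmatic API; the 10-minute value is arbitrary and purely illustrative):

```java
import java.util.concurrent.TimeUnit;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;

public class LongerStateTransferTimeout {
    public static Configuration build() {
        return new ConfigurationBuilder()
            .clustering()
                .cacheMode(CacheMode.REPL_SYNC)
                .stateTransfer()
                    // Raise the initial state transfer timeout above the
                    // 4-minute default (10 minutes is an example value).
                    .timeout(10, TimeUnit.MINUTES)
            .build();
    }
}
```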

          • 2. Re: CacheException: Initial state transfer timed out for cache
            jugglingcats

             Would state transfer also apply to DIST caches where the new node becomes an owner, i.e. will it try to transfer entries from an existing owner when the topology changes? We have 100,000 entries on the main DIST cache, and that many were loaded, so that could easily explain the timeout.

             

            Thanks, Alfie.

            • 3. Re: CacheException: Initial state transfer timed out for cache
              rvansa

              Yes, in DIST mode with 2 owners on a cluster of 3 nodes, each node should hold 2/3 of the data (it is the primary owner for 1/3 of the data and holds another 1/3 as backup for the other two nodes). Initial state transfer happens for both DIST and REPL; in DIST you'd transfer ~66k entries, and with REPL all 100k.
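The arithmetic behind those numbers can be checked with a quick sketch (pure Java; the helper name is mine, and integer division rounds down):

```java
public class OwnershipMath {
    // Expected entries held per node in DIST mode: each entry has
    // `owners` copies spread across `nodes` members, so on average a
    // node holds total * owners / nodes entries.
    static long entriesPerNode(long totalEntries, int owners, int nodes) {
        return totalEntries * owners / nodes;
    }

    public static void main(String[] args) {
        // 100,000 entries, 2 owners, 3 nodes -> ~66k entries per node,
        // versus all 100,000 on every node under REPL.
        System.out.println(entriesPerNode(100_000, 2, 3)); // prints 66666
    }
}
```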