6 Replies Latest reply on Jul 21, 2011 2:06 PM by galder.zamarreno

    Replication Timeout just after restarting hotrod server modules

    boudhomme

      Hi,

      I'm testing Infinispan (4.2.1.FINAL) with 3 Hot Rod server modules, but I randomly get a "Replication Timeout" just after restarting the modules with a cleaned cache store (CTRL-C; rm -rf /FileCacheStoreLocation/*; then startserver.sh -c .... on each server). I also noticed that this error occurs when the cache was heavily loaded.

       

      Why am I getting a "Replication Timeout" even though the cache is clean (no entries in the cache) and the modules have just restarted?

       

       

      The log from one node:

       

       

      2011-05-24 11:07:44,460 INFO  [org.infinispan.server.core.Main$] (main) Start main with args: -l, 10.50.1.60, -r, hotrod, -c, ../etc/qualif.xml
      2011-05-24 11:07:46,097 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) Starting JGroups Channel
      2011-05-24 11:07:46,098 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) Unable to use any JGroups configuration mechanisms provided in properties {}.  Using default JGroups configuration!
      2011-05-24 11:07:50,050 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) Received new cluster view: [NB01VMLX-8734|0] [NB01VMLX-8734]
      2011-05-24 11:07:50,064 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (InfinispanServer-Main) Cache local address is NB01VMLX-8734, physical addresses are [10.50.1.60:39637]
      2011-05-24 11:07:50,152 INFO  [org.infinispan.loaders.decorators.AsyncStore] (InfinispanServer-Main) Async cache loader starting org.infinispan.loaders.decorators.AsyncStore@51017ca1
      2011-05-24 11:07:50,270 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-1,NB01VMLX-8734) Received new cluster view: [NB01VMLX-8734|1] [NB01VMLX-8734, NB02VMLX-28677]
      2011-05-24 11:07:50,436 INFO  [org.infinispan.loaders.decorators.AsyncStore] (InfinispanServer-Main) Async cache loader starting org.infinispan.loaders.decorators.AsyncStore@76d78df0
      2011-05-24 11:07:50,518 INFO  [org.infinispan.remoting.rpc.RpcManagerImpl] (InfinispanServer-Main) Trying to fetch state from NB02VMLX-28677
      2011-05-24 11:07:50,528 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-2,NB01VMLX-8734) Received new cluster view: [NB01VMLX-8734|2] [NB01VMLX-8734, NB02VMLX-28677, NB03VMLX-19057]
      2011-05-24 11:07:50,528 INFO  [org.infinispan.distribution.DistributionManagerImpl] (Incoming-2,NB01VMLX-8734) Detected a view change.  Member list changed from [NB01VMLX-8734, NB02VMLX-28677] to [NB01VMLX-8734, NB02VMLX-28677, NB03VMLX-19057]
      2011-05-24 11:07:50,528 INFO  [org.infinispan.distribution.DistributionManagerImpl] (Incoming-2,NB01VMLX-8734) This is a JOIN event!  Wait for notification from new joiner NB03VMLX-19057
      2011-05-24 11:07:50,635 INFO  [org.infinispan.remoting.rpc.RpcManagerImpl] (InfinispanServer-Main) Successfully retrieved and applied state from NB02VMLX-28677
      2011-05-24 11:08:00,790 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (InfinispanServer-Main) Execution error: 
      org.infinispan.util.concurrent.TimeoutException: Replication timeout for NB02VMLX-28677
          at org.infinispan.remoting.transport.AbstractTransport.parseResponseAndAddToResponseList(AbstractTransport.java:49)
          at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:419)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:102)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:126)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:231)
          at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:218)
          at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:201)
          at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:194)
          at org.infinispan.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:115)
          at org.infinispan.interceptors.ReplicationInterceptor.visitPutKeyValueCommand(ReplicationInterceptor.java:78)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:80)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.LockingInterceptor.visitPutKeyValueCommand(LockingInterceptor.java:297)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:80)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.CacheStoreInterceptor.visitPutKeyValueCommand(CacheStoreInterceptor.java:203)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:80)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.CacheLoaderInterceptor.visitPutKeyValueCommand(CacheLoaderInterceptor.java:81)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:80)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:132)
          at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:58)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:80)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.TxInterceptor.enlistWriteAndInvokeNext(TxInterceptor.java:184)
          at org.infinispan.interceptors.TxInterceptor.visitPutKeyValueCommand(TxInterceptor.java:132)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:80)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.CacheMgmtInterceptor.visitPutKeyValueCommand(CacheMgmtInterceptor.java:113)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:80)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:132)
          at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:58)
          at org.infinispan.interceptors.IsMarshallableInterceptor.visitPutKeyValueCommand(IsMarshallableInterceptor.java:86)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:80)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:118)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:95)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:62)
          at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:58)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:80)
          at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:273)
          at org.infinispan.CacheDelegate.putIfAbsent(CacheDelegate.java:459)
          at org.infinispan.CacheSupport.putIfAbsent(CacheSupport.java:40)
          at org.infinispan.server.hotrod.HotRodServer$$anonfun$1.apply(HotRodServer.scala:88)
          at org.infinispan.server.hotrod.HotRodServer$$anonfun$1.apply(HotRodServer.scala:78)
          at org.infinispan.server.hotrod.HotRodServer.isViewUpdated(HotRodServer.scala:105)
          at org.infinispan.server.hotrod.HotRodServer.org$infinispan$server$hotrod$HotRodServer$$updateTopologyView(HotRodServer.scala:100)
          at org.infinispan.server.hotrod.HotRodServer.addSelfToTopologyView(HotRodServer.scala:78)
          at org.infinispan.server.hotrod.HotRodServer.startTransport(HotRodServer.scala:66)
          at org.infinispan.server.core.AbstractProtocolServer.start(AbstractProtocolServer.scala:70)
          at org.infinispan.server.hotrod.HotRodServer.start(HotRodServer.scala:48)
          at org.infinispan.server.core.Main$.boot(Main.scala:114)
          at org.infinispan.server.core.Main$$anon$1.call(Main.scala:68)
          at org.infinispan.server.core.Main$$anon$1.call(Main.scala:65)
          at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
          at java.util.concurrent.FutureTask.run(FutureTask.java:138)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          at java.lang.Thread.run(Thread.java:662)
      2011-05-24 11:09:27,807 INFO  [org.infinispan.server.core.ShutdownHook] (ShutdownHookThread) Posting Shutdown Request to the server...
      2011-05-24 11:09:27,887 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (StopThread) Disconnecting and closing JGroups Channel
      2011-05-24 11:09:28,265 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (StopThread) Stopping the RpcDispatcher
      

       

      The configuration file used by all nodes:

       

      <?xml version="1.0" encoding="UTF-8"?>
      <infinispan
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="urn:infinispan:config:4.2 http://www.infinispan.org/schemas/infinispan-config-4.2.xsd"
            xmlns="urn:infinispan:config:4.2">
         
         <global>
            <transport clusterName="qualif"/>
            <globalJmxStatistics enabled="true"/>
         </global>
      
         <default>
            <jmxStatistics enabled="true"/>
            <clustering mode="distribution">
               <l1 enabled="true" lifespan="60000"/>
          <hash numOwners="2" />
          <sync replTimeout="20000"/>
            </clustering>
          <loaders preload="true" shared="true">
              <loader class="org.infinispan.loaders.file.FileCacheStore">
               <async enabled="true" />
              <properties>
              <property name="location" value="/mnt/nas" />
              </properties>
              </loader>
          </loaders>
          
         </default>
      </infinispan>
      

       

       

      The bandwidth between the nodes is 2 GBps.

       

      Thanks !

        • 1. Re: Replication Timeout just after restarting hotrod server modules
          galder.zamarreno

          That's basically the topology information that's being sent around the cluster. The topology information lets clients load balance and fail over between the different nodes in the cluster. When the topology changes, every node in the cluster gets the update and can then send it back to clients as part of the response.

           

          I don't know why you get the replication timeout. Check what the other nodes show in their logs. You could also take a thread dump to see if the node gets stuck somewhere...

          • 2. Re: Replication Timeout just after restarting hotrod server modules
            boudhomme

            Thanks for your reply, I will try to investigate this point later...

            • 3. Re: Replication Timeout just after restarting hotrod server modules
              gerbszt

              I've got the very same problem. I'm using the latest version, 5.0.0.CR4. Sometimes, when starting 3 nodes concurrently, I receive similar exceptions. In my case there is also a 10-second lock acquisition timeout:

              2011-06-07 15:06:59,527 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (OOB-12,frontendCluster,front-26-64057) ISPN00136: Execution error

              org.infinispan.util.concurrent.TimeoutException: Unable to acquire lock after [10 seconds] on key [view] for requestor [Thread[OOB-12,frontendCluster,front-26-64057,5,Thread Pools]]! Lock held by [(another thread)]

               

              I think the problem is that during initialization the HotRodServer replaces the topology view, which probably requires a cluster-wide lock:

              HotRodServer.scala:111: val updated = topologyCache.replace("view", currentView, newView)

               

              If this line of code is executed at the same time on three or more nodes in the cluster, it can hardly end in anything but a deadlock. Infinispan is smart enough to throw an exception after 10 seconds, but unfortunately the exception aborts the initialization procedure and leaves the node unusable. The net result is that I can't simultaneously restart 3 or more nodes of my 30-node cluster. It's not ready to go into production.
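              The contention pattern described above can be sketched in plain Java (a hedged illustration only, not Infinispan code: a ConcurrentHashMap stands in for the replicated topology cache, and threads stand in for the starting nodes). Each thread retries a conditional replace of the "view" entry until its update wins; every failed attempt means another node changed the view first. Locally the retries resolve in microseconds, but in the cluster each attempt has to take a distributed lock, which is presumably where the 10-second timeouts come from.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class ViewReplaceContention {
    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the replicated topology cache; "view" holds the current topology.
        ConcurrentHashMap<String, String> topologyCache = new ConcurrentHashMap<>();
        topologyCache.put("view", "view-0");

        int nodes = 3;
        CountDownLatch done = new CountDownLatch(nodes);
        for (int i = 0; i < nodes; i++) {
            final String self = "node-" + i;
            new Thread(() -> {
                // Each "node" keeps retrying the conditional replace until its
                // update wins; a failed replace means another node changed the
                // view first, forcing a re-read and another round.
                while (true) {
                    String current = topologyCache.get("view");
                    if (topologyCache.replace("view", current, current + "+" + self)) {
                        break;
                    }
                }
                done.countDown();
            }).start();
        }
        done.await();
        // All three updates eventually land, but only one replace succeeds per round.
        System.out.println(topologyCache.get("view"));
    }
}
```

              Infinispan's topologyCache.replace(...) follows the same conditional-replace contract, but each attempt is a cluster-wide operation, so the window for contention is far larger than in this local sketch.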

               

               

              Is there any way to disable the topology view cache if my clients don't use it?

              • 4. Re: Replication Timeout just after restarting hotrod server modules
                boudhomme

                In fact, the lock exception also occurs when I try to push something into the cache after the TimeoutException.

                 

                 

                 

                Jacek Gerbszt wrote:

                 

                It's not ready to go into production.

                 

                Same conclusion for me. Infinispan could be a great application if it provided something like a dashboard with a health indicator. As it stands, starting the Hot Rod server is a bit tricky (changing the JVM's Xms option is not documented at all in Infinispan... although this is an in-memory cache!).

                But I think the biggest limitation is the eviction mechanism, which is based on an arbitrary entry count rather than on memory usage; that number is impossible for me to estimate.
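                For reference, the 4.2 configuration schema exposes exactly that count-based eviction. The fragment below is a sketch only, extending the default section of the configuration quoted earlier, and the maxEntries value is the kind of arbitrary guess described above:

```xml
<default>
   <!-- Sketch only: eviction in the 4.2 schema is bounded by an entry count
        (maxEntries), not by heap usage, so the value below is a guess the
        operator has to make. -->
   <eviction strategy="LRU" maxEntries="100000" wakeUpInterval="5000"/>
</default>
```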

                • 5. Re: Replication Timeout just after restarting hotrod server modules
                  manik

                  I would strongly advise you to try a recent release candidate of 5.0 and provide feedback.  :-)  A lot has changed since 4.2.x.

                  • 6. Re: Replication Timeout just after restarting hotrod server modules
                    galder.zamarreno

                    @Jacek, it's true that the Hot Rod server tries to update the view on startup. We're looking to improve that, as tracked by ISPN-448. In the meantime, I'd suggest staggering the startup of your nodes.