9 Replies. Latest reply: Nov 26, 2014 9:09 PM by Ryan tom

    Distributed Task Failover on Node Failure

    Ovidiu Feodorov Master

      According to the current documentation (https://docs.jboss.org/author/display/ISPN/Infinispan+Distributed+Execution+Framework#InfinispanDistributedExecutionFramework-Distributedtaskfailoverandmigration), an Infinispan cluster should detect a node failure and migrate a distributed task currently running on that node to the next suitable node.

       

      I have tried simulating this scenario with 5.1.2.FINAL (and I have reasons to suspect 5.1.4.FINAL behaves similarly):

       

      1) three node cluster (A-***, B-*** and C-***)

      2) a distributed task submitted from A-*** to all nodes in parallel with submitEverywhere(distributedCallable), with no input keys

      3) killing the node B-*** - not the one that initiated the callable - while the task was running on it.

       

      The node failure has been detected by the cluster, which performed a view change, but instead of the expected result (three futures that return valid results, albeit one not computed on the node that died, but on a backup node), I have seen:

       

      > got response from C-17379

       

      Exception in thread "main" java.util.concurrent.ExecutionException: org.infinispan.CacheException: SuspectedException

              at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)

              at java.util.concurrent.FutureTask.get(FutureTask.java:83)

              at org.infinispan.distexec.DefaultExecutorService$DistributedRunnableFuture.get(DefaultExecutorService.java:557)

              at com.novaordis.playground.infinispan.command.LaunchDistributedCallable.execute(LaunchDistributedCallable.java:117)

              at com.novaordis.playground.infinispan.Main.readCommandsFromCommandLineAndPassThemToNode(Main.java:78)

              at com.novaordis.playground.infinispan.Main.main(Main.java:39)

      Caused by: org.infinispan.CacheException: SuspectedException

              at org.infinispan.util.Util.rewrapAsCacheException(Util.java:524)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:168)

              at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:478)

              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:148)

              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:169)

              at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:219)

              at org.infinispan.remoting.rpc.RpcManagerImpl.access$000(RpcManagerImpl.java:78)

              at org.infinispan.remoting.rpc.RpcManagerImpl$1.call(RpcManagerImpl.java:249)

              at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)

              at java.util.concurrent.FutureTask.run(FutureTask.java:138)

              at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

              at java.lang.Thread.run(Thread.java:662)

      Caused by: SuspectedException

              at org.jgroups.blocks.MessageDispatcher.sendMessage(MessageDispatcher.java:349)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processSingleCall(CommandAwareRpcDispatcher.java:263)

              at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommand(CommandAwareRpcDispatcher.java:163)

              ... 11 more
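
      Until the expected behavior is clarified, the caller can at least avoid losing the results of the surviving nodes. The following is only a minimal sketch of that idea, not Infinispan's failover mechanism: a plain java.util.concurrent.ExecutorService stands in for the distributed executor, and the names FailoverSketch, collectWithFallback and task are hypothetical. It assumes the task is safe to run a second time on the calling node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FailoverSketch {

    /**
     * Waits for every future in turn. If one fails with an ExecutionException,
     * the fallback task is run in the calling thread instead of aborting the
     * whole collection loop.
     */
    static <T> List<T> collectWithFallback(List<Future<T>> futures, Callable<T> fallback)
            throws Exception {
        List<T> results = new ArrayList<T>();
        for (Future<T> future : futures) {
            try {
                results.add(future.get());
            } catch (ExecutionException e) {
                // In the scenario above the cause would be the CacheException
                // wrapping SuspectedException; this sketch retries on any failure.
                results.add(fallback.call());
            }
        }
        return results;
    }

    /** Hypothetical stand-in for the distributed callable; fails on demand. */
    static Callable<String> task(final String name, final boolean fail) {
        return new Callable<String>() {
            public String call() {
                if (fail) {
                    throw new IllegalStateException("node " + name + " died");
                }
                return "result from " + name;
            }
        };
    }

    public static void main(String[] args) throws Exception {
        // Three local threads play the role of nodes A, B and C.
        ExecutorService pool = Executors.newFixedThreadPool(3);
        try {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            futures.add(pool.submit(task("A", false)));
            futures.add(pool.submit(task("B", true))); // simulates the killed node
            futures.add(pool.submit(task("C", false)));

            List<String> results =
                    collectWithFallback(futures, task("B-retried-locally", false));
            System.out.println(results);
        } finally {
            pool.shutdown();
        }
    }
}
```

      With the real API, the list returned by submitEverywhere(...) would take the place of the locally built list of futures. Whether a local retry is acceptable depends on the task being idempotent.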

       

       

      Is my understanding of the distributed task migration mechanism correct, and are my expectations valid?

       

      If not, could you please point me in the right direction? What exactly does "task migration" mean, and what result is expected for the scenario presented above?

       

      If so, is this a feature that is not implemented yet (as the documentation seems to suggest)?

       

      I have a command-line testing tool that makes simulating all these scenarios easy, and I will be delighted to share it with the dev team if they believe this case is worth investigating and getting to the bottom of.

       

      This thread is related to https://community.jboss.org/message/731545; here I have taken failure detection out of the picture, since it works fine with proper tuning.

       

      Thanks,
      Ovidiu