9 Replies Latest reply on Dec 2, 2014 6:58 PM by rtom

Infinispan Distributed Task failover

rtom Nov 25, 2014 8:00 PM

I'm trying to leverage Infinispan's distributed task failover and I can't seem to get it to work. A little background on what I am trying to do:

I have two server nodes in a cluster. These two nodes share a distributed cache and the cache contains information used to run a task. I am trying to implement a failover feature on them so that if a task is running on server #1 and it goes down, server #2 will be able to pick up the task and complete it.

1) I first create a DistributedCallable object (MyJob):

MyJob myJob = new MyJob(param1, param2);

2) Then I create a DistributedExecutorService and a DistributedTaskBuilder and configure it with the provided Infinispan random node failover policy:

DistributedExecutorService execService = new DefaultExecutorService(cacheManager.getCache());

DistributedTaskBuilder<Boolean> taskBuilder = execService.createDistributedTaskBuilder(myJob);

taskBuilder = taskBuilder.failoverPolicy(DefaultExecutorService.RANDOM_NODE_FAILOVER);

3) I build the DistributedTask and then run it using the DistributedExecutorService:

DistributedTask<Boolean> distTask = taskBuilder.build();

execService.submit(distTask);

During my tests, I do see that the DistributedTask gets sent to either server #1 or server #2, and that the distributed cache is being updated properly by both servers. However, when I try testing for failover, it doesn't seem to work.

For example:

When the task is running on server #1 (I set the task to sleep for about 20 seconds), I kill server #1, yet I don't see the task being re-run or picked up by server #2. The same happens the other way round when the task is running on server #2.

I'm not sure if I'm missing anything; I've set this up according to the Infinispan 7.0.x user guide.

Any help would be much appreciated!

1. Re: Infinispan Distributed Task failover

vblagojevic Nov 28, 2014 3:37 PM (in response to rtom)

Hi Ryan,

Let's first confirm that the task fails over because of a task exception; then we can try killing a node and see why it is not failing over in that scenario. In your task, simply throw an Exception, and the task should fail over to another node. After the first failover the task will not fail over again, because RANDOM_NODE_FAILOVER allows only one failover.
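
For illustration, the "only one failover" behaviour described here can be modelled in plain Java. This is a toy sketch, not Infinispan's implementation: `NodeTask`, `runWithOneFailover`, and the node names are hypothetical and exist only for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of "at most one failover"; not Infinispan code. */
final class SingleFailoverDemo {

    /** Hypothetical stand-in for a task that is told which node it runs on. */
    interface NodeTask<T> {
        T runOn(String node);
    }

    /** Runs the task on the first node and, if that attempt throws, once more on the second node. */
    static <T> T runWithOneFailover(List<String> nodes, NodeTask<T> task) {
        try {
            return task.runOn(nodes.get(0));
        } catch (RuntimeException firstFailure) {
            // Single failover: if this second attempt also throws, the exception reaches the caller.
            return task.runOn(nodes.get(1));
        }
    }

    public static void main(String[] args) {
        // The task fails on server1 only, so the second attempt on server2 succeeds.
        NodeTask<Boolean> failsOnServer1 = node -> {
            if (node.equals("server1")) {
                throw new IllegalStateException("simulated task failure on " + node);
            }
            return Boolean.TRUE;
        };
        Boolean result = runWithOneFailover(Arrays.asList("server1", "server2"), failsOnServer1);
        System.out.println("Result after failover: " + result);
    }
}
```

A task that throws on every node would still surface its exception to the caller after the second attempt, which matches the single-failover limit mentioned above.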

Let us know,
Vladimir
2. Re: Infinispan Distributed Task failover

rtom Nov 30, 2014 7:24 PM (in response to vblagojevic)

Hi Vladimir,

It doesn't seem like failover works when the task throws an exception. I've set up the task so that it throws an exception if it runs on server #1 (server #2 is the originator node). Server #2 is able to catch the exception, but the task does not re-run. Do I have to use one of the provided server modules (e.g. Hot Rod or Memcached) in order to get failover working?

Thanks,
Ryan
3. Re: Infinispan Distributed Task failover

nadirx Dec 1, 2014 3:02 AM (in response to rtom)

Hi Ryan,

The server modules have nothing to do with distributed tasks, so they wouldn't help you (you also cannot yet run distexec and map/reduce tasks over Hot Rod; hopefully soon).
It seems that the failover tests only treat an exception during execution on one node as a "failure", and do not cover topology changes. We'll verify that and let you know as soon as possible.

Tristan
4. Re: Infinispan Distributed Task failover

vblagojevic Dec 1, 2014 10:15 AM (in response to nadirx)

Ryan,

When you submit your callable task to the distributed executor service, do you obtain the future and invoke get on it? If possible, please provide the entire code snippet of your client application.

Regards,
Vladimir
5. Re: Re: Infinispan Distributed Task failover

rtom Dec 1, 2014 4:09 PM (in response to vblagojevic)

Hi Vladimir,

Yes, I do obtain the future and invoke get on it. The code snippet is below; let me know if you need more info or need me to elaborate:

// UsageReportJobManager is the callable task; it returns a Boolean
UsageReportJobManager usageReportingProcess = new UsageReportJobManager();
DistributedExecutorService execService = new DefaultExecutorService(cacheManager.getCache());
DistributedTaskBuilder<Boolean> taskBuilder = execService.createDistributedTaskBuilder(usageReportingProcess);
taskBuilder = taskBuilder.failoverPolicy(DefaultExecutorService.RANDOM_NODE_FAILOVER);
DistributedTask<Boolean> distTask = taskBuilder.build();
Future<Boolean> future = execService.submit(distTask);
try {
    Boolean test = future.get();
} catch (Exception e) {
    System.out.println("Caught distributed task exception");
}

When the non-originator node throws an exception while running the callable task, the originator node does catch the exception.
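
As a side note on catching the exception at the originator: `Future.get()` wraps whatever the task threw in an `ExecutionException`, so the underlying failure is available via `getCause()`. The sketch below shows this with a plain `java.util.concurrent` executor rather than Infinispan's `DistributedExecutorService`; the class and method names (`FutureCauseDemo`, `failureCause`) are made up for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Shows how to recover the real failure behind an ExecutionException. */
final class FutureCauseDemo {

    /** Waits for the future; returns null on success, otherwise the throwable that ended the wait. */
    static Throwable failureCause(Future<?> future) {
        try {
            future.get();
            return null;
        } catch (ExecutionException e) {
            // The exception thrown inside the task is the cause of the wrapper.
            return e.getCause();
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report the interruption itself.
            Thread.currentThread().interrupt();
            return e;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> failingTask = () -> {
                throw new IllegalStateException("simulated task failure");
            };
            Throwable cause = failureCause(executor.submit(failingTask));
            System.out.println("Task failed with: " + cause);
        } finally {
            executor.shutdown();
        }
    }
}
```

Logging the cause instead of a fixed message makes it easier to tell a task exception apart from a node failure.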

Also, to check that I understood Tristan's reply correctly: a "failure" only occurs when an exception is thrown while the callable task is running, but if the node shuts down or crashes before the task completes, that is not considered a failure?

Thanks for your help with this so far, it's greatly appreciated!

- Ryan
6. Re: Re: Infinispan Distributed Task failover

vblagojevic Dec 1, 2014 4:38 PM (in response to rtom)

Ryan,

It should fail over even in the case of a shutdown or crash. What is the FD protocol setup in the JGroups stack used for this cache manager? Set the timeout for your task to at least 60 seconds (DistributedTaskBuilder#timeout) and then make sure that the FD setup detects failure fairly quickly, certainly in less than the task timeout. This should interrupt the remote call waiting for that task and raise an exception, and the task should then fail over.

No worries, we appreciate your help as well. We'll get to the bottom of this one!

Regards,
Vladimir
7. Re: Infinispan Distributed Task failover

rtom Dec 2, 2014 2:56 AM (in response to vblagojevic)

My FD protocol in the JGroups stack is:

   <FD_SOCK/>
   <FD timeout="3000" max_tries="5"/>
   <VERIFY_SUSPECT timeout="1500"/>

Does this seem right to you?

Thanks,
Ryan
8. Re: Infinispan Distributed Task failover

vblagojevic Dec 2, 2014 2:26 PM (in response to rtom)

Ryan,

I revisited the codebase for failover and wrote another unit test, and it should work provided that a SuspectException gets raised for the node kill/crash. How are you killing this node in your test scenario? Are you sending it a signal, or are you pulling the network cable? I wrote a test of my own and the task fails over without a problem. You can send me an email at vblagoje at redhat.com and I'll send the test to you. Also, as far as node crash detection goes, see FD versus FD_SOCK for more details.

With the settings above, the worst case for node crash detection is (3000 * 5) ms, which is 15 seconds, plus another 2 seconds or so for the suspect verification: around 17 seconds in total. Your callable should run for at least this long in order to have a chance of failing over. Therefore, in your test setup, set your Callable to sleep for 30 seconds; the node crash should be detected within 20. Then see if failover occurs in this case. Also set the task timeout to some large value, say 1 minute.
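
The estimate can be written out as a small calculation. This is only a sketch of the arithmetic in this reply, assuming the worst case is the FD timeout multiplied by max_tries plus the VERIFY_SUSPECT timeout; `FailureDetectionEstimate` and `worstCaseMillis` are names invented for the example.

```java
/** Back-of-the-envelope estimate of worst-case crash detection time. */
final class FailureDetectionEstimate {

    /** Assumes every FD probe times out and VERIFY_SUSPECT then waits its full timeout. */
    static long worstCaseMillis(long fdTimeoutMillis, int fdMaxTries, long verifySuspectTimeoutMillis) {
        return fdTimeoutMillis * fdMaxTries + verifySuspectTimeoutMillis;
    }

    public static void main(String[] args) {
        // FD timeout="3000" max_tries="5", VERIFY_SUSPECT timeout="1500"
        long detectionMillis = worstCaseMillis(3000, 5, 1500);
        long taskSleepMillis = 30_000;   // suggested sleep inside the callable
        long taskTimeoutMillis = 60_000; // suggested task timeout

        System.out.println("Worst-case detection: " + detectionMillis + " ms");
        System.out.println("Crash detected before task ends: " + (detectionMillis < taskSleepMillis));
        System.out.println("Task ends before its timeout: " + (taskSleepMillis < taskTimeoutMillis));
    }
}
```

With the posted stack this gives 16500 ms, so a 30 second sleep and a 60 second task timeout leave room on both sides.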

Let us know,
Vladimir
9. Re: Infinispan Distributed Task failover

rtom Dec 2, 2014 6:58 PM (in response to vblagojevic)

Hi Vladimir,

Based on your reply, I have done the following:

1) I have set the FD parameter in JGroups stack to: <FD timeout="1000" max_tries="4"/> (4 seconds)
2) I have set the DistributedCallable to sleep for 30 seconds
3) The task timeout is set to 1 minute - taskBuilder = taskBuilder.timeout(60, TimeUnit.SECONDS);

I initiate the task on Server 1, and I see it get pushed over to Server 2 for execution. When I see that the task enters its sleep state, I send a "kill -9" signal to kill the Tomcat process on Server 2.

On Server 1 I do see that it picks up that Server 2 has been killed/crashed (SuspectException):

Caught distributed task exception
Exception: java.util.concurrent.ExecutionException: org.infinispan.remoting.transport.jgroups.SuspectException: Node srv01-60081 was suspected
Exception type: class java.util.concurrent.ExecutionException
Exception message: org.infinispan.remoting.transport.jgroups.SuspectException: Node srv01-60081 was suspected

The only remaining problem is that the task (callable) is not re-run.

Thanks,
Ryan