-
1. Re: Infinispan Distributed Task failover
vblagojevic Nov 28, 2014 3:37 PM (in response to rtom)1 of 1 people found this helpfulHi Ryan,
Lets first try to see that task fails over due to task exception and then we can try to kill node and see why it is not failing over in that scenario. In your task simply raise an instance of Exception and the task should failover to another node. After the first failover the task will not failover any more because RANDOM_NODE_FAILOVER allows only one failover.
Let us know,
Vladimir
-
2. Re: Infinispan Distributed Task failover
rtom Nov 30, 2014 7:24 PM (in response to vblagojevic)Hi Vladimir,
It doesn't seem like failover works when the task throws an exception. I've set the task so that if it is run on server #1 (server 2 is the originator node) it will throw an exception. Server #2 is able to catch the exception, but the task does not re-run. Do I have to use one of the provided server modules (i.e. Hot Rod/Memcached etc...) in order to get failover working?
Thanks,
Ryan
-
3. Re: Infinispan Distributed Task failover
nadirx Dec 1, 2014 3:02 AM (in response to rtom)1 of 1 people found this helpfulHi Ryan,
the server modules have nothing to do with distributed tasks, so they wouldn't help you (you also cannot yet run distexec and map/reduce tasks over hotrod... hopefully soon).
It seems like the failover tests only consider "failure" an exception during the execution on one node, and not on topology changes. We'll verify that and will let you know asap.
Tristan
-
4. Re: Infinispan Distributed Task failover
vblagojevic Dec 1, 2014 10:15 AM (in response to nadirx)Ryan,
When you submit your callable task to the distributor executor service, do you obtain the future and invoke get on it? If possible, please provide the entire code snippet of your client application.
Regards,
Vladimir
-
5. Re: Re: Infinispan Distributed Task failover
rtom Dec 1, 2014 4:09 PM (in response to vblagojevic)Hi Vladimir,
Yes I do obtain and invoke get on the future object. I have the code snippet below, let me know if you need more info or need me to elaborate:
UsageReportJobManager usageReportingProcess = new UsageReportJobManager(); // UsageReportJobManager is the callable task, returns a Boolean
DistributedExecutorService execService = new DefaultExecutorService(cacheManager.getCache());
DistributedTaskBuilder<Boolean> taskBuilder = execService.createDistributedTaskBuilder(usageReportingProcess);
taskBuilder = taskBuilder.failoverPolicy(DefaultExecutorService.RANDOM_NODE_FAILOVER);
DistributedTask<Boolean> distTask = taskBuilder.build();
Future<Boolean> future = execService.submit(distTask);
try {
Boolean test = future.get();
}
catch (Exception e) {
System.out.println("Caught distributed task exception");
}
When the non-originator node throws an exception while running the callable task, the originator node does catch the exception.
Also, to see if I understood Tristan's reply correctly, a "failure" only occurs when an exception is thrown during the run of the callable task, but if the node shuts down/crashes while the task is not yet complete, this won't be considered a failure?
Thanks for your guys' help in this so far, it's greatly appreciated!
- Ryan
-
6. Re: Re: Infinispan Distributed Task failover
vblagojevic Dec 1, 2014 4:38 PM (in response to rtom)Ryan,
It should do the failover even in the case of shutdown/crash. What is your FD protocol setup in JGroups stack used for this cache manager? Set the timeout for your task to 60 seconds at least (DistributedTaskBuilder#timeout) and then make sure that FD setup detects failure rather quickly, certainly less that task timeout. This should interrupt the remote call waiting for that task, raise an exception and task should fail over.
No worries, we appreciate your help as well. We'll get to the bottom of this one!
Regards,
Vladimir
-
7. Re: Infinispan Distributed Task failover
rtom Dec 2, 2014 2:56 AM (in response to vblagojevic)My FD protocol in the JGroups stack is:
<FD_SOCK/>
<FD timeout="3000" max_tries="5"/>
<VERIFY_SUSPECT timeout="1500"/>
Does this seem right to you?
Thanks,
Ryan
-
8. Re: Infinispan Distributed Task failover
vblagojevic Dec 2, 2014 2:26 PM (in response to rtom)Ryan,
I revisited the codebase for failover, wrote another unit test, and it should work given that SuspectException gets raised for the node kill/crash. How are you killing this node in your test scenario? Are you sending it a sig or are you pulling the network cable? I wrote a test of my own and the task fails over without a problem. You can send me an email at vblagoje at redhat.com and I'll send the test to you. Also as far as node crash detection goes see FD versus FD_SOCK for more details.
In the case above the worst case scenario for node crash detection will be (3000 * 5) which is 15 seconds and then another 2 seconds or so for the suspect verification - in total around 17 sec. Your callable should run at least this much in order to have a chance of failover. Therefore, in your test setup, set your Callable to sleep for 30 seconds, node crash should be detected within 20, and see if failover occurs in this case. Also set task timeout to some large value say 1 minute.
Let us know,
Vladimir
-
9. Re: Infinispan Distributed Task failover
rtom Dec 2, 2014 6:58 PM (in response to vblagojevic)Hi Vladimir,
According to your reply I have done the following:
1) I have set the FD parameter in JGroups stack to: <FD timeout="1000" max_tries="4"/> (4 seconds)
2) I have set the DistributedCallable to sleep for 30 seconds
3) The task timeout is set to 1 minute - taskBuilder = taskBuilder.timeout(60, TimeUnit.SECONDS);
I initiate the task on Server 1, and I see it get pushed over to Server 2 for execution. When I see that the task enters its sleep state, I send a "kill -9" signal to kill the TomCat process on Server 2.
On Server 1 I do see that it picks up that Server 2 has been killed/crashed (SuspectException):
Caught distributed task exception
Exception: java.util.concurrent.ExecutionException: org.infinispan.remoting.transport.jgroups.SuspectException: Node srv01-60081 was suspected
Exception type: class java.util.concurrent.ExecutionException
Exception message: org.infinispan.remoting.transport.jgroups.SuspectException: Node srv01-60081 was suspected
The only thing now is that the task (callable) is not re-run.
Thanks,
Ryan