4 Replies Latest reply on Sep 14, 2007 11:00 AM by jeffj55374

JobExecutor, maxLockTime, and the LockMonitorThread: What's

jeffj55374 Sep 13, 2007 4:46 PM

Hi,
I'm having a hard time finding any detailed documentation about the maxLockTime of the JobExecutor.

So I resorted to looking at the code, but I'm still not sure I understand its intent and implementation.

Background: We are using jBPM to implement data processing flows. We use async nodes and have multiple JobExecutor threads running. Some of the async steps may take hours (it's an automated process, so Node.executeAction() chews on the data for a long time). This is being done with jBPM 3.2.1 and the jar that is attached to http://jira.jboss.com/jira/browse/JBPM-1042

During this we discovered that if the node takes longer than maxLockTime (which in the defaults was set to 600000 ms, i.e. 10 minutes), the executeAction() for the node completes successfully, but jBPM still chooses to roll back the transaction, even though no exception was thrown and we properly took a transition to the next node.

Question: If the execute method on the Node completes w/o error and did its job and moved on to the next node, why does it make sense to rollback the process instance?

Our workaround is to just set maxLockTime to Integer.MAX_VALUE.

See the code below, from JobExecutorThread.java:

protected void executeJob(Job job) {
  JbpmContext jbpmContext = jbpmConfiguration.createJbpmContext();
  try {
    JobSession jobSession = jbpmContext.getJobSession();
    job = jobSession.loadJob(job.getId());

    try {
      log.debug("executing job "+job);
      if (job.execute(jbpmContext)) {
        jobSession.deleteJob(job);
      }

    } catch (Exception e) {
      log.debug("exception while executing '"+job+"'", e);
      StringWriter sw = new StringWriter();
      e.printStackTrace(new PrintWriter(sw));
      job.setException(sw.toString());
      job.setRetries(job.getRetries()-1);
    }

    // if this job is locked too long
    long totalLockTimeInMillis = System.currentTimeMillis() - job.getLockTime().getTime();
    if (totalLockTimeInMillis>maxLockTime) {
      jbpmContext.setRollbackOnly();
    }

  } finally {
    try {
      jbpmContext.close();
    } catch (RuntimeException e) {
      log.error("problem committing job execution transaction", e);
      throw e;
    }
  }
}
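
To make the timing check at the end of executeJob() concrete, here is a minimal standalone sketch of just that comparison. The class and method names (LockTimeCheck, exceedsMaxLockTime) are made up for illustration and are not part of jBPM; only the arithmetic mirrors the snippet above.

```java
import java.util.Date;

// Hypothetical helper, NOT part of jBPM: isolates the comparison that
// executeJob() performs after the job has already run.
class LockTimeCheck {

    // Returns true when the job has been locked longer than maxLockTime.
    // In the quoted code this is the condition that leads to setRollbackOnly(),
    // regardless of whether job.execute() succeeded.
    static boolean exceedsMaxLockTime(Date lockTime, long nowMillis, long maxLockTimeMillis) {
        long totalLockTimeInMillis = nowMillis - lockTime.getTime();
        return totalLockTimeInMillis > maxLockTimeMillis;
    }

    public static void main(String[] args) {
        long maxLockTime = 600000L;      // the 10 minute default mentioned above
        Date lockedAt = new Date(0L);    // arbitrary lock timestamp for the demo

        // Job finished 5 minutes after it was locked: transaction is kept.
        System.out.println(exceedsMaxLockTime(lockedAt, 5 * 60 * 1000L, maxLockTime));

        // Job finished successfully 2 hours after it was locked: still flagged.
        System.out.println(exceedsMaxLockTime(lockedAt, 2 * 60 * 60 * 1000L, maxLockTime));
    }
}
```

The point of the sketch is that the check depends only on elapsed time since the lock was taken, so a long but successful step is treated the same way as a stuck one.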

While reading the code to try to figure this out, I noticed the following things that I didn't understand. Any insight into what I'm missing would be great.

1. JobExecutor.start() creates a new instance of LockMonitorThread. But I can't find where that thread is ever started. Doesn't seem to make sense to create the thread and never start it. Is the intent that the thread be running?

public synchronized void start() {
  if (! isStarted) {
    log.debug("starting thread group '"+name+"'...");
    for (int i=0; i<nbrOfThreads; i++) {
      startThread();
    }
    isStarted = true;
  } else {
    log.debug("ignoring start: thread group '"+name+"' is already started'");
  }

  lockMonitorThread = new LockMonitorThread(jbpmConfiguration, lockMonitorInterval, maxLockTime, lockBufferTime);
}

2. But if the LockMonitorThread were running, I think it would lead to some potentially harmful side effects. It updates a job's lock, but the JobExecutorThread that is running the job is going to continue processing, firing events and taking the transition to the next node until a wait state is reached. As far as I can tell, no code ever looks at the lock on the job. To me it looks like it would simply reset the lock owner and the time so that another instance of JobExecutorThread could acquire the job even though it is still running in another thread.

From LockMonitorThread:

protected void unlockOverdueJobs() {
  JbpmContext jbpmContext = jbpmConfiguration.createJbpmContext();
  try {
    JobSession jobSession = jbpmContext.getJobSession();

    Date treshold = new Date(System.currentTimeMillis()-maxLockTime-lockBufferTime);
    List jobsWithOverdueLockTime = jobSession.findJobsWithOverdueLockTime(treshold);
    Iterator iter = jobsWithOverdueLockTime.iterator();
    while (iter.hasNext()) {
      Job job = (Job) iter.next();
      // unlock
      log.debug("unlocking "+job+ " owned by thread "+job.getLockOwner());
      job.setLockOwner(null);
      job.setLockTime(null);
      jobSession.saveJob(job);
    }

  } finally {
    try {
      jbpmContext.close();
    } catch (RuntimeException e) {
      log.error("problem committing job execution transaction", e);
      throw e;
    }
  }
}
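
The concern in point 2 can be illustrated with a small in-memory simulation. Everything here is a stand-in: FakeJob and UnlockSketch are hypothetical names, the lockBufferTime value is arbitrary, and the sketch assumes that findJobsWithOverdueLockTime() selects jobs whose lock time is older than the threshold (the query itself is not shown above).

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the jBPM classes.
class UnlockSketch {

    static class FakeJob {
        String lockOwner;
        Date lockTime;
        final boolean stillRunning; // whether some executor thread is still working on it

        FakeJob(String lockOwner, Date lockTime, boolean stillRunning) {
            this.lockOwner = lockOwner;
            this.lockTime = lockTime;
            this.stillRunning = stillRunning;
        }
    }

    // Mirrors the loop in unlockOverdueJobs(): clear the lock on every job whose
    // lock time is older than the threshold (assumed query semantics).
    // Note that stillRunning is never consulted, just as in the quoted code.
    static List<FakeJob> unlockOverdue(List<FakeJob> jobs, long nowMillis,
                                       long maxLockTime, long lockBufferTime) {
        Date threshold = new Date(nowMillis - maxLockTime - lockBufferTime);
        List<FakeJob> unlocked = new ArrayList<>();
        for (FakeJob job : jobs) {
            if (job.lockTime != null && job.lockTime.before(threshold)) {
                job.lockOwner = null;
                job.lockTime = null;
                unlocked.add(job);
            }
        }
        return unlocked;
    }

    public static void main(String[] args) {
        long now = 3 * 60 * 60 * 1000L; // pretend three hours have passed
        List<FakeJob> jobs = new ArrayList<>();
        FakeJob longRunning = new FakeJob("executor-1", new Date(0L), true);
        jobs.add(longRunning);

        unlockOverdue(jobs, now, 600000L, 5000L);

        // The job is still being processed, yet it no longer has an owner,
        // so a second executor thread would be free to pick it up.
        System.out.println("stillRunning=" + longRunning.stillRunning
                + ", lockOwner=" + longRunning.lockOwner);
    }
}
```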

1. Re: JobExecutor, maxLockTime, and the LockMonitorThread: Wha

kukeltje Sep 13, 2007 9:26 PM (in response to jeffj55374)

Question: If the execute method on the Node completes w/o error and did its job and moved on to the next node, why does it make sense to rollback the process instance?

Isn't this the same as how transaction timeouts work in J2EE? Rolling back something that just took too long but finished without an error.
2. Re: JobExecutor, maxLockTime, and the LockMonitorThread: Wha

estaub Sep 14, 2007 7:34 AM (in response to jeffj55374)

>> JobExecutor.start() creates a new instance of LockMonitorThread. But I can't find where that thread is ever started.

Ditto.

-Ed Staub
3. Re: JobExecutor, maxLockTime, and the LockMonitorThread: Wha

kukeltje Sep 14, 2007 8:53 AM (in response to jeffj55374)

but IS it started?
4. Re: JobExecutor, maxLockTime, and the LockMonitorThread: Wha

jeffj55374 Sep 14, 2007 11:00 AM (in response to jeffj55374)

but IS it started?

Nope. I ran our application in the Eclipse debugger and did not see any reference to LockMonitorThread. I also generated a thread dump of our application, LockMonitorThread does not appear there either. (The dump file is 1800+ lines, I can email it to anybody that is interested in double checking)
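
For anyone who wants to reproduce the observation without jBPM, the following sketch shows the general Java behaviour: a Thread object that is constructed but never started stays in state NEW and is not listed among the live threads. The class name and thread name are made up for the example; this says nothing about jBPM internals beyond what the start() snippet above shows.

```java
import java.util.Map;

// Standalone illustration using only the JDK; names are hypothetical.
class UnstartedThreadSketch {

    // Returns true if a live thread with the given name exists, which is
    // roughly what one would look for in a thread dump.
    static boolean appearsInThreadDump(String threadName) {
        Map<Thread, StackTraceElement[]> liveThreads = Thread.getAllStackTraces();
        for (Thread t : liveThreads.keySet()) {
            if (t.getName().equals(threadName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Constructed but never started, like lockMonitorThread in start().
        Thread monitor = new Thread(() -> { }, "LockMonitorThread-sketch");

        System.out.println(monitor.getState());   // NEW
        System.out.println(monitor.isAlive());    // false
        System.out.println(appearsInThreadDump("LockMonitorThread-sketch")); // false
    }
}
```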

Note that we are not using a J2EE server. We are running this as a standalone Java application.

So with respect to the comment:
Isn't this the same as how transaction timeouts work in J2EE? Rolling back something that just took too long but finished without an error.

Maybe so, but we aren't using this in a J2EE server. In any case, this isn't a huge issue for us at the moment, since we can simply increase maxLockTime to Integer.MAX_VALUE.

Philosophical commentary follows (feel free to stop reading)
So here is my opinion. Any comments and feedback would be appreciated. This is mostly intended to help me understand the jBPM design intent with respect to rolling back successfully completed activities simply because they took too long.

So if the node completes successfully, the transaction shouldn't roll back. Since the timeout doesn't cause the thread (execution context / token) to be aborted, the system is doing the work anyhow. You aren't saving anything by considering it a failure. If the intent is that it should be considered an error when a sequence of nodes takes too long to process, then you should probably abort the processing somehow rather than let it continue. As it stands, if the LockMonitorThread were actually running, another thread would acquire the job and start running it. Now you have two threads actively processing the same path. (Maybe this is a moot point, since I think some of the async / JobExecutor stuff is changing post V3.2.1?)

It wouldn't be so bad if everything our process did was inside the same database and transaction and didn't have any external side effects that are difficult or impossible to roll back. With this behavior we would need to add our own code to detect that maxLockTime was exceeded and that jBPM is going to roll things back, and then compensate or reverse the external actions. For example, one of our steps loads millions of rows of data; it is impractical to do that in a single transaction and roll the data back. Our code doesn't get any notification that the rollback occurred, so we have no opportunity to execute compensating code. Maybe the design intent is that everything should be done in such a manner that it is controlled by a transaction monitor. That is generally a bunch of non-trivial work for non-database related activities.
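
A rough sketch of the kind of self-check described above, purely as an illustration of the idea and not something jBPM provides. CompensatingStep, runAndCompensateIfOverdue and the injected clock are all hypothetical; the lock timestamp would have to be obtained from the job somehow, which is not shown here.

```java
import java.util.function.LongSupplier;

// Hypothetical wrapper, NOT a jBPM API: runs a step and, if the job's lock is
// already older than maxLockTime when the step finishes, runs compensation
// because the surrounding transaction is expected to be rolled back.
class CompensatingStep {

    // Returns true if compensation was executed.
    static boolean runAndCompensateIfOverdue(Runnable work, Runnable compensation,
                                             long lockTimeMillis, long maxLockTimeMillis,
                                             LongSupplier clock) {
        work.run();
        long totalLockTimeInMillis = clock.getAsLong() - lockTimeMillis;
        if (totalLockTimeInMillis > maxLockTimeMillis) {
            compensation.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        long maxLockTime = 600000L; // the 10 minute default
        long lockedAt = 0L;
        // Fake clock: pretend the work finished two hours after the lock was taken.
        LongSupplier twoHoursLater = () -> 2 * 60 * 60 * 1000L;

        boolean compensated = runAndCompensateIfOverdue(
                () -> System.out.println("loading rows..."),
                () -> System.out.println("undoing the load"),
                lockedAt, maxLockTime, twoHoursLater);
        System.out.println("compensated = " + compensated);
    }
}
```

Even this is only approximate: the step's own check happens before jBPM's check, so a step that finishes just under the limit could still be rolled back a moment later without compensation having run.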

So again, I still don't see the value in having a maxLockTime for a job if you are going to let the job run to completion w/o aborting it and allow the system to attempt to process the path again even though it is still currently being processed. This could lead to all kinds of chaos.