11 Replies Latest reply on Jul 31, 2019 8:52 PM by cfang

    Restarting a job fails immediately with no reason why

    richardmoore

      I have a job that has failed and when I restart it in the failed step the job fails again with no logging messages as to why.

       

      Below is the jsl.

      • copyDiffFileToGdgWeeklyDiff ends with COMPLETED and moves to the deliverDiffFile step.
      • The deliverDiffFile fails as follows

      2019-07-29 14:13:22.631 jberet-1                        MoveFileBatchlet - INFO  Move file: /awg/t1/app/data/tmp/GC/AWG_WRLDGCITMC.TXT

      2019-07-29 14:13:22.631 jberet-1                        MoveFileBatchlet - INFO  Move to: /awg/t1/app/data/mft/inbound/retail/AWG_WRLDGCITMC.TXT

      2019-07-29 14:13:22.631 jberet-1                        MoveFileBatchlet - WARN  File does not exist.

      2019-07-29 14:13:22.631 jberet-1                        MoveFileBatchlet - INFO  Ending with a status of FILE_DOES_NOT_EXIST

      2019-07-29 14:13:22.658 jberet-1                             CawaManager - INFO  CAWA job user status updated: Ending AIS_Weekly_UniversalItemChangesGot.deliverDiffFile = FILE_DOES_NOT_EXIST

      2019-07-29 14:13:22.658 jberet-1                         JobStepListener - INFO  =======================================================================

      2019-07-29 14:13:22.658 jberet-1                         JobStepListener - INFO  Ending step: AIS_Weekly_UniversalItemChangesGot.deliverDiffFile = FILE_DOES_NOT_EXIST

      2019-07-29 14:13:22.658 jberet-1                         JobStepListener - INFO  =======================================================================

      2019-07-29 14:13:23.485 jberet-1                        JobStatsListener - INFO  Ending job: AIS_Weekly_UniversalItemChangesGot = FAILED

      ...

      2019-07-29 14:13:23.491 jberet-1                        JobStatsListener - INFO  Step: deliverDiffFile

      2019-07-29 14:13:23.491 jberet-1                        JobStatsListener - INFO  Execution Id: 62801

      2019-07-29 14:13:23.491 jberet-1                        JobStatsListener - INFO  Start Time: Mon Jul 29 14:13:22 CDT 2019

      2019-07-29 14:13:23.491 jberet-1                        JobStatsListener - INFO  End Time: Mon Jul 29 14:13:22 CDT 2019

      2019-07-29 14:13:23.491 jberet-1                        JobStatsListener - INFO  Exit Status: FILE_DOES_NOT_EXIST

      2019-07-29 14:13:23.491 jberet-1                        JobStatsListener - INFO  Batch Status: COMPLETED

      2019-07-29 14:13:23.491 jberet-1                        JobStatsListener - INFO  To restart in this step use: -DrestartId=43101 -DrestartStep=deliverDiffFile

      2019-07-29 14:13:23.492 jberet-1                        JobStatsListener - INFO  =======================================================================================

      2019-07-29 14:13:23.523 jberet-1                             CawaManager - INFO  CAWA job user status updated: Ending AIS_Weekly_UniversalItemChangesGot = FAILED

       

         <step id="copyDiffFileToGdgWeeklyDiff">

           &step-listeners;

            <batchlet ref="copyFileToGdg">

               <properties>

                  <property name="source" value="#{systemProperties['APP_DATA_TMP_FCLTY']}/#{jobProperties['diffFile']}"/>

                  <property name="target" value="#{jobProperties['gdgDirectoryWeeklyDiff']}"/>

                  <property name="processEmptyFile" value="true"/>

               </properties>

            </batchlet>

            <next on="COMPLETED" to="deliverDiffFile"/>

            <fail on="*" exit-status="FAILED"/>

         </step>

       

         <step id="deliverDiffFile">

            &step-listeners;

            <batchlet ref="moveFile">

               <properties>

                  <property name="file" value="#{systemProperties['APP_DATA_TMP_FCLTY']}/#{jobProperties['diffFile']}"/>

                  <property name="destination" value="#{systemProperties['APP_DATA_MFT_INBOUND']}/retail/#{jobProperties['diffFile']}"/>

                  <property name="overwrite" value="true" />

               </properties>

            </batchlet>

            <next on="FILE_MOVED" to="deleteDiffFile"/>

            <fail on="*" exit-status="FAILED"/>

         </step>

       

         <step id="deleteDiffFile">

            &step-listeners;

            <batchlet ref="deleteFile">

               <properties>

                  <property name="file" value="#{systemProperties['APP_DATA_TMP_FCLTY']}/#{jobProperties['diffFile']}"/>

               </properties>

            </batchlet>

            <end on="FILE_DOES_NOT_EXIST"/>

            <end on="FILE_DELETED"/>

            <fail on="*" exit-status="FAILED"/>

         </step>

       

      I put the file out there and then restart it I get the following giving no reason as to why it failed -

      2019-07-29 14:14:48.837 jberet-1                        JobStatsListener - INFO  Ending job: AIS_Weekly_UniversalItemChangesGot = FAILED

      2019-07-29 14:14:48.837 jberet-1                        JobStatsListener - INFO  =======================================================================================

      2019-07-29 14:14:48.837 jberet-1                        JobStatsListener - INFO                                 R U N      S T A T S

      2019-07-29 14:14:48.837 jberet-1                        JobStatsListener - INFO  =======================================================================================

      2019-07-29 14:14:48.837 jberet-1                        JobStatsListener - INFO  Job: AIS_Weekly_UniversalItemChangesGot

      2019-07-29 14:14:48.837 jberet-1                        JobStatsListener - INFO  Execution Id: 43121

      2019-07-29 14:14:48.837 jberet-1                        JobStatsListener - INFO  Parm: restartStep = 'deliverDiffFile'

      2019-07-29 14:14:48.838 jberet-1                        JobStatsListener - INFO  Parm: restartId = '43101'

      2019-07-29 14:14:48.841 jberet-1                        JobStatsListener - INFO  Start Time: Mon Jul 29 14:14:47 CDT 2019

      2019-07-29 14:14:48.842 jberet-1                        JobStatsListener - INFO  End Time: Mon Jul 29 14:14:48 CDT 2019

      2019-07-29 14:14:48.842 jberet-1                        JobStatsListener - INFO  Exit Status: FAILED

      2019-07-29 14:14:49.251 jberet-1                        JobStatsListener - INFO  =======================================================================================

      2019-07-29 14:14:49.278 jberet-1                             CawaManager - INFO  CAWA job user status updated: Ending AIS_Weekly_UniversalItemChangesGot = FAILED

        • 1. Re: Restarting a job fails immediately with no reason why
          cfang

          It could be during restart, JBeret didn't find the step to restart and so it just failed.  Is the step deliverDiffFile a top-level step, ie, not a step inside a flow?  For it to restart at this step, this step must be a top-level one.

           

          I noticed there are these settings in your restart process:

          To restart in this step use: -DrestartId=43101 -DrestartStep=deliverDiffFile

           

          How do these properties affect your restart (modify the restart-related data)?

          • 2. Re: Restarting a job fails immediately with no reason why
            richardmoore

            Here is how that is used to restart a failed job.

             

            Properties jobParameters = new Properties();

            jobParameters.setProperty(restartStep, System.getProperty(restartStep));

            jobParameters.setProperty(restartId, System.getProperty(restartId));

            long restartExecutionId = Long.valueOf(System.getProperty("restartId"));

            JobOperator operator = BatchRuntime.getJobOperator();

            long jobExecutionId = operator.restart(restartExecutionId, jobParameters);

            • 3. Re: Restarting a job fails immediately with no reason why
              richardmoore

              Forgot to answer the other questions - I checked the database and the step is there, and this job has no flows.

              • 4. Re: Restarting a job fails immediately with no reason why
                cfang

                Your step has these transition elements:

                      <next on="FILE_MOVED" to="deleteDiffFile"/>

                      <fail on="*" exit-status="FAILED"/>

                 

                So this step will succeed only if the batchlet (MoveFileBatchlet) process method returns "FILE_MOVED", which will transition to deleteDiffFile step.  For all other return values from batchlet process method, it will always fail.

                 

                I suspect during restart, the return value from process() method is still not "FILE_MOVED", so the restart failed.  Can you log the MoveFile batchlet process() mehtod return value just before this method ends?

                • 5. Re: Restarting a job fails immediately with no reason why
                  richardmoore

                  I do record the status information before I exit the process method. That is what has me confused, I don't see any of my messages in the log, see previous snapshot of log, it just goes straight to failed.

                   

                  private static final String operation = "move";

                   

                  @Override

                  public String process() throws Exception {

                  log.info(capitalizeFirstCharacter(operation) + " file: " + file);

                  log.info(capitalizeFirstCharacter(operation) + " to: " + destination);

                  exitStatus = move(file, destination, preserveFileDate, overwrite);

                  log.info("Ending with a status of " + exitStatus);

                  return exitStatus.name();

                  }

                  • 6. Re: Restarting a job fails immediately with no reason why
                    cfang

                    I guess the restart failed during jobListener.beforeJob() method.  The MoveFileBatchlet looks not being called at all, nor was any step listener called.

                     

                    Can you check jobListener's beforeJob() method?  Maybe some error occurred there, but then there should be some stacktrace, unless the job listener swallowed it.

                    • 7. Re: Restarting a job fails immediately with no reason why
                      richardmoore

                      I have two AbstractJobListener classes assigned. Both should produce a message and should not swallow up an exception. Check them to see if you see an issue.

                       

                      public class JobStatsListener extends AbstractJobListener {

                       

                      @Inject

                      private JobContext jobContext;

                       

                      @Override

                      public void beforeJob() throws Exception {

                      String status = "Starting job: " + jobContext.getJobName();

                      if (!System.getProperty("batchJobname").equals(jobContext.getJobName())) {

                      String msg = "Batch jobname does not equal JSL jobname (job@id)";

                      log.fatal(msg);

                      throw new RuntimeException(msg);

                      }

                      log.info(status);

                      }

                       

                       

                       

                      public class JobRestartListener extends AbstractJobListener {

                       

                      @Inject

                      private JobContext jobContext;

                       

                      @Override

                      public void beforeJob() throws Exception {

                      JobContextImpl jci = (JobContextImpl) jobContext;

                       

                      String restartPosition = System.getProperty(JavaBatchExecutor.restartStep);

                      if (restartPosition != null) {

                      log.info("Restarting in " + restartPosition);

                      jci.getJobExecution().setRestartPosition(restartPosition);

                      } else {

                      String startPosition = System.getProperty(JavaBatchExecutor.startStep);

                      if (startPosition != null) {

                      log.info("Starting in " + startPosition);

                      jci.getJobExecution().setRestartPosition(startPosition);

                      } else {

                      log.warn("No starting or restarting position could be found in the executor.");

                      }

                      }

                      }

                      • 8. Re: Restarting a job fails immediately with no reason why
                        richardmoore

                        I have a question unrelated to this - is there a way, if so how, to give suggestions for enhancements to the product?

                        1. Can the JobExecutionNotMostRecentException be modified to also provide what the most recent id is?
                        2. Generate jsl from a programatic job definition?
                        • 9. Re: Restarting a job fails immediately with no reason why
                          cfang

                          Please file 2 separate issues at Issue Navigator - JBoss Issue Tracker , and we will evaulate them.

                          • 10. Re: Restarting a job fails immediately with no reason why
                            cfang

                            I don't see anythong obviously wrong.  So there is output from jobListener.afterJob(), but there is no output from its beforeJob() method?

                             

                            You get certain values from system properties during restart (which were set at the end of last job execution).  Is the restart in the same java process as the original job execution?

                             

                            Can you create a reproducer app so I can debug it?  Or you can attach a debugger to jberet source code and examine the restart process.

                            • 11. Re: Restarting a job fails immediately with no reason why
                              cfang

                              After the restart failed, can you check the job repository and see if this restart execution has step execution?  If so, is it the expected step, and what step execution data are there?  Does the step execution exception column has value?