3 Replies Latest reply on Mar 20, 2015 11:35 AM by jamezp

    JBeret Job Repositories

    jamezp

      In JBeret there is an AbstractJobRepository that stores instances of jobs, job instances and job executions. In WildFly a JobRepository has the lifecycle of the batch subsystem. This means that for every batch job started, no matter the storage type, the job, job instance and job execution are never available for GC and live in memory until the server is reloaded or restarted. At some point we'd end up with an OOME.

       

      I think we have three options to solve this.

      1. Require the user to purge the jobs with org.jberet.repository.PurgeBatchlet. This may not be user friendly, but it could be set up in a scheduled EJB (see the sketch after this list).
      2. In WildFly, create a new repository each time the job repository is requested. This adds overhead for some job repositories, like the JdbcJobRepository, which would need to look up configuration files, check whether the tables exist, reload the data via queries, etc.
      3. Fix the AbstractJobRepository in some way so it does not keep instances of every job, or have each job repository implement its own persistence for the jobs, job instances and job executions. This would require removing the global maps that store the job information from the AbstractJobRepository. For something like the JdbcJobRepository some sort of caching maps could be used; eviction would need to happen at some point, though.
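
      As a rough sketch of option 1, a scheduled EJB could kick off a purge job periodically. Only org.jberet.repository.PurgeBatchlet is real here; the job XML name, schedule and class are made up for illustration, and the job XML itself would contain a single step running the batchlet.

      import java.util.Properties;

      import javax.batch.operations.JobOperator;
      import javax.batch.runtime.BatchRuntime;
      import javax.ejb.Schedule;
      import javax.ejb.Singleton;

      @Singleton
      public class PurgeJobScheduler {

          // Fires every night at 2 AM and starts the (hypothetical)
          // "purge-repository" job, whose XML runs the PurgeBatchlet.
          @Schedule(hour = "2", persistent = false)
          public void purge() {
              final JobOperator operator = BatchRuntime.getJobOperator();
              operator.start("purge-repository", new Properties());
          }
      }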

       

      I'm curious to see what others think of these solutions, or whether they have other ideas. I'd personally lean towards option 3, as option 2 would be expensive when doing things like querying jobs. Option 1 would also be okay, but it puts the responsibility on the user to ensure jobs get purged.

       

      For completeness, here are the definitions of a job instance and a job execution. A job itself is essentially an object that describes the data in the job XML.

       

      Job Instance:

      A JobInstance refers to the concept of a logical job run. Let's consider a batch job that should be run once at the end of the day, such as an 'EndOfDay' job. There is one 'EndOfDay' Job, but each individual run of the Job must be tracked separately. In the case of this job, there will be one logical JobInstance per day. For example, there will be a January 1st run, and a January 2nd run. If the January 1st run fails the first time and is run again the next day, it is still the January 1st run. Usually this corresponds with the data it is processing as well, meaning the January 1st run processes data for January 1st, etc. Therefore, each JobInstance can have multiple executions (JobExecution is discussed in more detail below); one or many JobInstances corresponding to a particular Job can be running at a given time.

       

      The definition of a JobInstance has absolutely no bearing on the data that will be loaded. It is entirely up to the ItemReader implementation used to determine how data will be loaded. For example, in the EndOfDay scenario, there may be a column on the data that indicates the 'effective date' or 'schedule date' to which the data belongs. So, the January 1st run would only load data from the 1st, and the January 2nd run would only use data from the 2nd. Because this determination will likely be a business decision, it is left up to the ItemReader to decide. What using the same JobInstance will determine, however, is whether or not the 'state' from previous executions will be available to the new run. Using a new JobInstance will mean 'start from the beginning' and using an existing instance will generally mean 'start from where you left off'.
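
      To make that concrete, a reader for the EndOfDay scenario might look something like the following. TradeRecord, TradeDao and the property name are hypothetical; only the spec types are real.

      import java.io.Serializable;
      import java.util.Iterator;

      import javax.batch.api.BatchProperty;
      import javax.batch.api.chunk.AbstractItemReader;
      import javax.inject.Inject;
      import javax.inject.Named;

      @Named
      public class EndOfDayItemReader extends AbstractItemReader {

          // The 'schedule date' this run is responsible for, passed in as a
          // job property; the property name is illustrative.
          @Inject
          @BatchProperty(name = "scheduleDate")
          private String scheduleDate;

          // TradeRecord and TradeDao are hypothetical types for this sketch.
          private Iterator<TradeRecord> records;

          @Override
          public void open(final Serializable checkpoint) throws Exception {
              // The reader alone decides which data belongs to this run: here,
              // only rows whose effective date matches the schedule date.
              records = TradeDao.findByEffectiveDate(scheduleDate).iterator();
          }

          @Override
          public Object readItem() {
              return records.hasNext() ? records.next() : null; // null ends the input
          }
      }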

 

      Job Execution:

      A JobExecution refers to the technical concept of a single attempt to run a Job. Each time a job is started or restarted, a new JobExecution is created, belonging to the same JobInstance.
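
      In code, assuming a job XML named 'end-of-day' (hypothetical), the relationship looks like this:

      import java.util.Properties;

      import javax.batch.operations.JobOperator;
      import javax.batch.runtime.BatchRuntime;
      import javax.batch.runtime.JobInstance;

      public class RestartExample {

          public void runAndRestart() {
              final JobOperator operator = BatchRuntime.getJobOperator();
              final Properties params = new Properties();
              params.setProperty("scheduleDate", "2015-01-01");

              // First attempt: creates a new JobInstance plus its first JobExecution.
              final long firstExecutionId = operator.start("end-of-day", params);

              // Suppose that execution FAILED. Restarting it creates a second
              // JobExecution belonging to the same JobInstance.
              final long secondExecutionId = operator.restart(firstExecutionId, params);

              // Both executions fall under the one logical "January 1st" run.
              final JobInstance instance = operator.getJobInstance(secondExecutionId);
              System.out.println(operator.getJobExecutions(instance).size()); // 2
          }
      }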

       

      --

      James R. Perkins

        • 1. Re: JBeret Job Repositories
          jamezp

          I'll try to explain my preferred solution, number 3 from above, in a bit more detail. I'll start with what the spec says a job repository is.

          7.4 Job Repository

           

          A job repository holds information about jobs currently running and jobs that have run in the past. The JobOperator interface provides access to this repository. The repository contains job instances, job executions, and step executions. For further information on this content, see sections 10.9.8, 10.9.9, 10.9.10, respectively.

 

          Note the implementation of the job repository is outside the scope of this specification.

           

          As an implementation, JBeret has a job repository interface. The JobRepository has an abstract implementation that contains 3 maps for storing job information. There are currently 4 JobRepository implementations: InMemoryJobRepository, InfinispanJobRepository, JdbcJobRepository and MongoJobRepository. Each job repository extends the AbstractJobRepository, which has those 3 instance maps. The job repository is returned in the BatchEnvironment and then used throughout JBeret to create job instances and executions. Generally speaking, the BatchEnvironment, as well as the JobRepository it returns, is created once. For example, in WildFly (after I fix this bug) if the user selects a JDBC repository, only one JdbcJobRepository will be created and used for all applications deployed to WildFly.
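
          In simplified form, the shape is roughly this; field names, key types and the use of the spec interfaces are illustrative rather than JBeret's exact code:

          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.ConcurrentMap;

          import javax.batch.runtime.JobExecution;
          import javax.batch.runtime.JobInstance;

          public abstract class AbstractJobRepositorySketch {

              // A single repository instance lives as long as the batch subsystem,
              // so these maps are effectively static: every job, job instance and
              // job execution ever created stays strongly reachable.
              protected final ConcurrentMap<String, Object> jobs = new ConcurrentHashMap<>(); // job name -> job model
              protected final ConcurrentMap<Long, JobInstance> jobInstances = new ConcurrentHashMap<>();
              protected final ConcurrentMap<Long, JobExecution> jobExecutions = new ConcurrentHashMap<>();
          }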

           

          The issue is that using a single JobRepository instance essentially makes those maps in the AbstractJobRepository static maps. The entries in them will not be available for GC until the server is reloaded or until the user invokes the PurgeBatchlet.

           

          I'd argue that all the job repositories, with the exception of the InMemoryJobRepository, should likely not extend the AbstractJobRepository, as using those maps in a static way is essentially a memory leak. For the JdbcJobRepository we could use some kind of caching to hopefully make queries a bit faster, or maybe some kind of weak-reference-valued map. I haven't put a lot of thought into how it should be implemented at this point; I was hoping to get some ideas from the group here.
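
          For the weak-reference-valued map idea, a minimal sketch (not JBeret code) could look like this; on a miss, a persistent repository would reload the entry from its backing store:

          import java.lang.ref.ReferenceQueue;
          import java.lang.ref.WeakReference;
          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.ConcurrentMap;

          public class WeakValueMap<K, V> {

              private final ConcurrentMap<K, KeyedReference<K, V>> map = new ConcurrentHashMap<>();
              private final ReferenceQueue<V> queue = new ReferenceQueue<>();

              public void put(final K key, final V value) {
                  prune();
                  map.put(key, new KeyedReference<>(key, value, queue));
              }

              public V get(final K key) {
                  prune();
                  final KeyedReference<K, V> ref = map.get(key);
                  return ref == null ? null : ref.get();
              }

              // Drop entries whose values the GC has already collected.
              private void prune() {
                  KeyedReference<?, ?> ref;
                  while ((ref = (KeyedReference<?, ?>) queue.poll()) != null) {
                      map.remove(ref.key, ref);
                  }
              }

              // A weak reference that remembers its key so pruning can find the entry.
              private static final class KeyedReference<K, V> extends WeakReference<V> {
                  final K key;

                  KeyedReference(final K key, final V value, final ReferenceQueue<V> queue) {
                      super(value, queue);
                      this.key = key;
                  }
              }
          }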

           

          Hopefully that explains my concerns a bit more.

           

          --

          James R. Perkins

          • 2. Re: JBeret Job Repositories
            cfang

             +1 for option 3. We could move the bulk of AbstractRepository into InMemoryRepository itself, and morph AbstractRepository into something like AbstractPersistentRepository, which could be extended by the JDBC and MongoDB ones. I think the ultimate caching solution would be Infinispan, so the InfinispanRepository would be the best fit for those who want fully configurable caching, once it's integrated into WildFly. For now, for the JDBC and Mongo repositories, it seems a good idea to implement some weak-reference-based caching.
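
             Roughly this shape (class names follow the comment above; nothing here exists in JBeret as written):

             // Sketch only: a stand-in for org.jberet.repository.JobRepository.
             interface JobRepository { /* createJobExecution, getJobInstances, ... */ }

             // Keeps the in-memory maps that currently live in AbstractRepository;
             // for this implementation they are the storage itself, not a leak.
             class InMemoryRepository implements JobRepository { }

             // Common base for repositories with a real backing store: no global
             // maps, only a bounded or weak-reference cache in front of the store.
             abstract class AbstractPersistentRepository implements JobRepository { }

             class JdbcRepository extends AbstractPersistentRepository { }
             class MongoRepository extends AbstractPersistentRepository { }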

            • 3. Re: JBeret Job Repositories
              jamezp

               Okay, excellent. Thanks, Cheng. I've never used it before so I don't know how well it works, but I noticed Guava has some kind of CacheBuilder. Or we could just write our own weak-value map.
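
               If the Guava route pans out, something like this is what I had in mind (the wrapper class and size cap are just for illustration):

               import javax.batch.runtime.JobExecution;

               import com.google.common.cache.Cache;
               import com.google.common.cache.CacheBuilder;

               public class ExecutionCache {

                   // Values are held weakly, so a finished execution that nothing
                   // else references can be collected; a size cap also bounds the
                   // live set.
                   private final Cache<Long, JobExecution> executions = CacheBuilder.newBuilder()
                           .weakValues()
                           .maximumSize(10_000)
                           .build();

                   public void put(final long executionId, final JobExecution execution) {
                       executions.put(executionId, execution);
                   }

                   public JobExecution get(final long executionId) {
                       // On a miss the repository would reload from its backing store.
                       return executions.getIfPresent(executionId);
                   }
               }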

               

              --

              James R. Perkins