A general, extensible set of Python and Bash scripts to submit any job to the HTCondor batch system.
This package focuses on the use of Condor with CMSSW via the LPC or CMS Connect.
Currently in an alpha state of development; caveat emptor.
This recipe assumes the use of CMSSW. This repository is treated as a CMSSW package, to take advantage of automatic path setup via scram.
```
cmsrel [CMSSW_VERSION]
cd [CMSSW_VERSION]/src
cmsenv
git clone git@github.com:kpedro88/CondorProduction.git Condor/Production
scram b
$CMSSW_BASE/src/Condor/Production/scripts/postInstall.sh [options]
```
The script `./scripts/postInstall.sh` handles common post-installation actions. It can be called at the end of the installation script for packages that use CondorProduction.
Its options are:
- `-d [dir]`: location of CondorProduction (default = `$CMSSW_BASE/src/Condor/Production`)
- `-b [dir]`: batch directory for `linkScripts.py`
- `-c`: run `cacheAll.py`
- `-p`: (re)install python3 bindings for htcondor
- `-h`: display help message and exit
The actions it performs are:
- If `-b` is provided, go to that directory and link to any scripts (list provided in the `.prodconfig` file; see Configuration) that will be reused when submitting jobs. (If the job directory is tracked in git, one can add the symlinked file names to a `.gitignore` file in the directory to exclude them from the repository.)
- If `-c` is provided, call `cacheAll.py` to mark unneeded directories as excluded from the CMSSW tarball (see CMSSW tarball creation).
- If `-p` is provided and python3 is available from CMSSW, the correct htcondor bindings for the system Condor will be installed. The CMSSW virtualenv will be enabled if not already in use.
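For example, a typical invocation from the package area might look like the following (the batch directory path is illustrative):
```
$CMSSW_BASE/src/Condor/Production/scripts/postInstall.sh -b $CMSSW_BASE/src/MyAnalysis/test/condor -c -p
```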
Job submission (and several related functions) is handled by a class called `jobSubmitter`, with an associated driver script `submitJobs.py`.
The `jobSubmitter` class as provided here is intended to be used as a base class, extended by the user as necessary.
The class is designed with many modular functions that can be overridden by the user.
If the user wishes to add to the base functionality rather than just overriding it, the base class function can be called using the Python `super()` function.
If the user wishes to change or remove default options, `jobSubmitter` has a member function `removeOptions()` to simplify the operation of removing options from the option parser.
All defined options are added to `jobSubmitter` as member variables for easier access in its member functions.

Because the logic of job organization is highly variable and task-dependent, the `generateSubmission()` function is left unimplemented.
The user must implement this function to fill the list of job prototypes.
However, a function `generatePerJob()` is provided to perform common operations while creating job prototypes.
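A minimal sketch of a user extension is shown below; the import path and the exact behavior of `generatePerJob()` are assumptions based on the description above, so consult the class itself for the real interface:
```
# sketch of a user-defined submitter; import path and generatePerJob() usage are assumptions
from Condor.Production.jobSubmitter import jobSubmitter, protoJob

class mySubmitter(jobSubmitter):
    def generateSubmission(self):
        # create one prototype covering 10 jobs for a hypothetical task
        job = protoJob()
        job.name = "myTask"
        job.nums = list(range(10))
        job.njobs = len(job.nums)
        self.generatePerJob(job)  # assumed to fill common fields (jdl, patterns, queue, ...)
        self.protoJobs.append(job)
```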
There are several modes of operation for `jobSubmitter`: count, prepare, submit, missing, and clean.
- count (`-c, --count`): count the expected number of jobs to submit.
- prepare (`-p, --prepare`): prepare JDL file(s) and any associated job inputs.
- submit (`-s, --submit`): submit jobs to Condor.
- missing (`-m, --missing`): check for missing jobs.
- clean (`-l, --clean`): clean up log files.
The modes count, submit, and missing are treated as mutually exclusive. The mode prepare can be run with any of them. (If prepare is not run before submit, the job submission will fail.)
At least one mode must be specified to run the script.
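With a driver script based on `submitJobs.py` (invocation illustrative), a typical workflow is to count, then prepare and submit:
```
python3 submitJobs.py -c      # count the expected number of jobs
python3 submitJobs.py -p -s   # prepare the JDL file(s) and submit to Condor
```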
Internally, `jobSubmitter` stores job information in a list `protoJobs`, where each entry is an instance of the class `protoJob`:
- `name`: base name for the set of jobs (see submit mode)
- `chainName`: alternate name if jobs are run in a chain (see job chains and missing mode)
- `nums`: list of job numbers in this set (i.e. `$(Process)` values)
- `njobs`: total number of jobs (used in count mode)
- `jdl`: JDL filename for this set of jobs
- `queue`: queue command for this set of jobs
- `patterns`: `OrderedDict` of find/replace pairs to create the JDL from the template
- `appends`: list of strings to append to the JDL
The `protoJob` class also has a function `makeName(num)` to make an individual job name by combining the job base name and a given job number.
This function is important to match job names with finished or running jobs in missing mode, and may also be used for other purposes.
The user can override this function if desired:
```
def makeNameNew(self, num):
    # construct and return the individual job name from the base name and job number
    ...

protoJob.makeName = makeNameNew
```
This mode simply counts the number of jobs expected. This is useful if testing some automatic job splitting algorithm or other criteria to add or remove jobs from a list, or just trying to plan a large submission of jobs.
This mode creates JDL files from the template file (by default, jobExecCondor.jdl).
It uses the `patterns` specified in each `protoJob` to perform find-replace operations on the template file to create a real, usable JDL file. (Currently, regular expressions are not supported.)
It then appends any requested additions (in the `protoJob` `appends`) to the end of the real JDL file.
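Conceptually, the JDL creation works like the following standalone sketch (the placeholder names, values, and output filename are illustrative, not the actual template contents):
```
from collections import OrderedDict

# hypothetical find/replace pairs and appended lines, as they might appear in a protoJob
patterns = OrderedDict([("JOBNAME", "myTask"), ("MEMORY", "2000")])
appends = ["+MyCustomAttribute = 1"]

with open("jobExecCondor.jdl") as f:
    jdl = f.read()
for find, replace in patterns.items():
    jdl = jdl.replace(find, replace)  # plain string replacement; regexes are not supported
jdl += "\n".join(appends) + "\n"
with open("jobExecCondor_myTask.jdl", "w") as f:
    f.write(jdl)
```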
The submit mode calls the `condor_submit` command for each JDL file.

Typically, a user may split a given task into a large number of jobs to run in parallel.
For efficiency, it is best to submit a set of jobs to condor all at once, using the `Queue N` syntax.
For each value 0 ≤ n < N, a job will be submitted; Condor internally uses the variable `$(Process)` to store the value of n.
The value of `$(Process)` can be used to differentiate each individual job in the set.

To allow reuse of the same JDL in case of resubmitting just a few removed jobs (see Missing mode below),
the `Queue N` line in the JDL is commented out, and instead the `-queue` option of `condor_submit` is used.
In case this is not desired or possible for some reason (e.g. due to an old version or wrapper of `condor_submit`),
the option `-q, --no-queue-arg` can be used.
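In other words, instead of an active `Queue N` line in the JDL, the submission is effectively equivalent to something like (job name and count illustrative):
```
condor_submit -queue 10 jobExecCondor_myTask.jdl
```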
The missing mode looks at both output files (from finished jobs) and running jobs (in the Condor queue) to determine if any jobs have been removed.
It has an option to make a resubmission script with a specified name: `-r, --resub [script_name.sh]`.
(Otherwise, it will just print a list of missing jobs.)
It also has an option `-u, --user [username]` to specify which user's jobs to check in the Condor queue.
The default value for `user` can be specified in the `.prodconfig` file (see Configuration).
The option `-q, --no-queue-arg` can also be used here; in this case, the JDL file will be modified with the list of jobs to be resubmitted (instead of using `-queue`).
In case job chains are used, running jobs may have different names than the output files from finished jobs.
The `protoJob.chainName` attribute is available to convert between the different naming schemes.
To facilitate in-place updates of output files, date ranges can be specified using the `--min-date` and `--max-date` options.
This mode also relies on knowledge of HTCondor collectors and schedulers. Values for the LPC and CMS Connect are specified in the default `.prodconfig` file (see Configuration).
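A typical missing-mode invocation with a user driver script (names illustrative) is:
```
python3 submitJobs.py -m -r resub.sh   # list missing jobs and write a resubmission script
./resub.sh                             # resubmit them
```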
Similar to missing mode, clean mode looks at output files and running jobs to determine which jobs have finished.
(Finished jobs have output files and are not currently running.)
It puts all the log files from the finished jobs into a compressed tar archive called `log_#.tar.gz`, where `#` is incremented based on the presence of any other log archives in the specified directory.
By default the current directory `.` is specified as the output location of the log archive, but this can be changed to any local or remote directory with the option `--clean-dir [dir]`.
A typical Condor job consists of several steps:
1. Environment setup
2. Run executable/script
3. Transfer output (stageout)
One focus of this package is standardizing the environment setup step (step1). Accordingly, `jobSubmitter` has a separate set of member functions focusing on step1. Any subsequent steps should be provided by the user; only a basic amount of setup is provided in advance. (It is assumed by default that steps 2 and 3 will be combined in a script `step2.sh`, but this assumption can be changed easily.)
The Condor executable script is `jobExecCondor.sh`, which runs the subroutine scripts `step1.sh`, `step2.sh` (user provided), etc. Each subroutine script is sourced and the command line arguments are reused (processed by bash `getopts`). Because of this, a special syntax is used with `getopts` to avoid failing on an unknown option.
The executable script also provides some bash helper functions.
- `getFromClassAd`: parses information from the Condor ClassAds for each job. This can be used, for example, to check the number of requested CPUs when running a multicore job.
- `stageOut`: copies a local file to a storage element using `xrdcp`. It can retry a specified number of times, with a specified wait time that increases with each retry. This helps avoid job failure from a temporarily interrupted connection or unavailable storage element. It can also clean up by removing local files (upon successful or unsuccessful copying). Staging out can be disabled for intermediate jobs in chains (see Job chains below) by passing the argument `--intermediate` to `jobSubmitter`, which corresponds to the argument `-I` for `jobExecCondor.sh`.
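A rough sketch of how these helpers might appear in a user-provided `step2.sh` is shown below; the argument conventions for `getFromClassAd` and `stageOut`, as well as the payload command, are assumptions for illustration (see `jobExecCondor.sh` for the real interfaces):
```
# read the number of requested CPUs from the job ClassAd (attribute name assumed)
NCPUS=$(getFromClassAd RequestCpus)

# run the user payload (placeholder command)
cmsRun mytask_cfg.py numThreads=${NCPUS} || exit $?

# copy the output to a storage element with retries (option names assumed)
stageOut -x "output.root" -o "root://cmseos.fnal.gov//store/user/${USER}/myTask/output.root"
```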
The form of these scripts is tightly coupled with the operations of `jobSubmitter`.
Therefore, by default the scripts are not specified as command-line arguments in Python, but instead as a member variable in the constructor of `jobSubmitter` (and must be changed explicitly in any extension of the class by users). The script names are passed to `jobExecCondor.sh` using the `-S` flag.
Jobs can be run in a Singularity container using the cmssw-env script.
The name of the container (e.g. `/cvmfs/unpacked.cern.ch/registry.hub.docker.com/cmssw/slc6:latest`) is passed to `jobSubmitter` using the `--env` flag and to `jobExecCondor.sh` using the `-E` flag.
(Passing other arguments to the script is currently not supported, due to conflicts arising from limitations in Bash `getopts`.)
There are existing CMSSW containers available at `/cvmfs/unpacked.cern.ch/registry.hub.docker.com/cmssw/`; other Singularity containers hosted on CVMFS can also be used.
Please note, this functionality deliberately does not use the built-in Condor Singularity support,
in order to handle job chains in which each job may use a different container (or no container).
The default step1 should be sufficient for most users. It allows a CMSSW environment to be initialized in one of 3 ways:
- transfer: copy CMSSW tarball from job directory via Condor's `Transfer_Input_Files`.
- xrdcp: copy CMSSW tarball from specified directory using xrootd.
- cmsrel: create a new CMSSW area with the specified release and scram architecture.
The Python arguments for the default step1 are:
- `-k, --keep`: keep existing tarball (don't run a `tar` command)
- `-n, --no-voms`: don't check for a VOMS grid proxy (proxy line is removed from JDL template, CMSSW environment via xrdcp not allowed)
- `-t, --cmssw-method [method]`: how to get CMSSW env: transfer, cmsrel, or address for xrdcp (default = transfer)
- `--scram-workaround`: avoid using `scram b ProjectRename` when setting up a CMSSW tarball
The arguments for the default step1.sh are:
- `-C [CMSSW_X_Y_Z]`: CMSSW release version
- `-L [arg]`: `SCRAM_ARCH` value (if using cmsrel method or workaround)
- `-X [arg]`: CMSSW location (if using xrdcp method)
The tar command (in checkVomsTar.sh) uses the flags `--exclude-vcs` and `--exclude-caches-all` to reduce the size of the CMSSW tarball.
The first flag, `--exclude-vcs`, drops directories like `.git` that may be large but don't contain any useful information for jobs.
The second flag, `--exclude-caches-all`, drops any directory containing a `CACHEDIR.TAG` file.
A script cacheAll.py is provided to expedite the process of using `CACHEDIR.TAG` files.
Directories to cache (or uncache) can be specified in `.prodconfig`. Environment variables used in the directory names will be expanded.
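For example, a `[caches]` section in `.prodconfig` might look like this (directory names illustrative):
```
[caches]
$CMSSW_BASE/src/MyAnalysis/test/plots = 1
$CMSSW_BASE/src/MyAnalysis/data/large_inputs = 0
```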
As noted, subsequent steps should be implemented and provided by the user.
Some default Python arguments are provided in case the user is using the default JDL template file:
- `--jdl [filename]`: name of JDL template file for job
- `--disk [amount]`: amount of disk space to request for job [kB] (default = 10000000)
- `--memory [amount]`: amount of memory to request for job [MB] (default = 2000)
- `--cpus [number]`: number of CPUs (threads) for job (default = 1)
- `--sites [list]`: comma-separated list of sites for global pool running (if using CMS Connect) (default from `.prodconfig`)
- `--env [args]`: run in Singularity environment using cmssw-env (default = None)
A few other Python arguments are not explicitly included in the default setup, but may often be added by users:
- `-o, --output [dir]`: path to output directory
- `-v, --verbose`: enable verbose output (could be a bool or an int, if different verbosity levels are desired)

By default, missing mode will try to get a list of output files from the `output` option, if it exists.
One shell argument is effectively reserved if the user wants to use the job management tools:
- `-x [redir]`: xrootd redirector address or site name (for reading input files)
Python
"Mode of operation" options:
- `-c, --count`: count the expected number of jobs to submit
- `-p, --prepare`: prepare JDL file(s) and any associated job inputs
- `-s, --submit`: submit jobs to Condor
- `-m, --missing`: check for missing jobs
- `--min-date`: minimum date for files in missing mode
- `--max-date`: maximum date for files in missing mode
- `-r, --resub [script_name.sh]`: create resubmission script
- `-l, --clean`: clean up log files
- `--clean-dir`: output dir for log file .tar.gz (default = ".")
- `-u, --user [username]`: view jobs from this user (default from `.prodconfig`)
- `-q, --no-queue-arg`: don't use `-queue` argument in `condor_submit`
Default step1 options:
- `-k, --keep`: keep existing tarball (don't run a `tar` command)
- `-n, --no-voms`: don't check for a VOMS grid proxy (proxy line is removed from JDL template, CMSSW environment via xrdcp not allowed)
- `-t, --cmssw-method [method]`: how to get CMSSW env: transfer, cmsrel, or address for xrdcp (default = transfer)
- `--scram-workaround`: avoid using `scram b ProjectRename` when setting up a CMSSW tarball
Default extra options:
- `--jdl [filename]`: name of JDL template file for job
- `--disk [amount]`: amount of disk space to request for job [kB] (default = 1000000)
- `--memory [amount]`: amount of memory to request for job [MB] (default = 2000)
- `--cpus [number]`: number of CPUs (threads) for job (default = 1)
- `--sites [list]`: comma-separated list of sites for global pool running (if using CMS Connect) (default from `.prodconfig`)
- `--env [args]`: args to run job in Singularity environment using cmssw-env (default = None)
- `--intermediate`: specify that this is an intermediate job in a chain to disable staging out
- `--singularity [image]`: specify singularity image for job (default = "")
"Reserved", but not actually used by default:
- `-o, --output [dir]`: path to output directory
- `-v, --verbose`: enable verbose output (could be a bool or an int, if different verbosity levels are desired)
Shell
"Mode of operation" options:
- `-S [scripts]`: comma-separated list of subroutine scripts to run
- `-E [args]`: args to run job in Singularity environment using cmssw-env
- `-I`: indicate that this job is an intermediate job in a chain (see Job chains below); this disables `stageOut` if used in Step2
Default step1 options:
- `-C [CMSSW_X_Y_Z]`: CMSSW release version
- `-L [arg]`: `SCRAM_ARCH` value (if using cmsrel method or workaround)
- `-X [arg]`: CMSSW location (if using xrdcp method)
Default extra options:
none
"Reserved", but not actually used by default:
- `-x [redir]`: xrootd redirector address or site name (for reading input files)
Jobs submitted with `jobSubmitter` and the default JDL file are set up so that they are held if they exit unsuccessfully (with a signal or non-zero exit code). Failed jobs therefore stay in the queue and can be released to run again, assuming the problem is understood. (Problems reading input files over xrootd are common.)
The Python script manageJobs.py can list information about jobs (held or otherwise) and release them if desired. It uses a number of command line options to specify how to display job information, modify jobs, and change job statuses:
- `-c, --coll [collector]`: view jobs from this collector (use collector of current machine by default)
- `-u, --user [username]`: view jobs from this user (submitter) (default taken from `.prodconfig`)
- `-a, --all`: view jobs from all schedulers (use scheduler of current machine by default)
- `-h, --held`: view only held jobs
- `-r, --running`: view only running jobs
- `-i, --idle`: view only idle jobs
- `-f, --finished [n]`: view only n finished jobs
- `-t, --stuck`: view only stuck jobs (subset of running)
- `-g, --grep [patterns]`: view jobs with [comma-separated list of strings] in the job name or hold reason
- `-v, --vgrep [patterns]`: view jobs without [comma-separated list of strings] in the job name or hold reason
- `-o, --stdout`: print stdout filenames instead of job names
- `-n, --num`: print job numbers along with names
- `-x, --xrootd [redir]`: edit the xrootd redirector (or site name) of the job input
- `-e, --edit [edit]`: edit the ClassAds of the job (JSON dict format)
- `-s, --resubmit`: resubmit (release) the selected jobs
- `-k, --kill`: remove the selected jobs
- `-d DIR, --dir=DIR`: directory for stdout files (used for backup when resubmitting) (default taken from `.prodconfig`)
- `-w, --why`: show why a job was held
- `-m, --matched`: show site and machine to which the job matched (for CMS Connect)
- `-p, --progress`: show job progress (time and nevents)
- `-X, --xrootd-resubmit`: resubmit the jobs based on where the input files are located
- `-B BLACKLISTEDSITES, --blacklist-sites=BLACKLISTEDSITES`: comma-separated list of global pool sites to reject
- `-C INPUTFILECLASSAD, --input-file-classad=INPUTFILECLASSAD`: HTCondor ClassAd which contains the input file(s) being used within the job (used in combination with `-X`)
- `-D, --dry-run`: don't actually resubmit any jobs (used in combination with `-X`)
- `-K LOGKEY, --log_key=LOGKEY`: key to use to find the correct line(s) in the log file (used in combination with `-X`)
- `-L LOGPATH, --log_path=LOGPATH`: path to the job logs (used in combination with `-X`, default = `pwd`)
- `-U, --prefer-us-sites`: prefer reading inputs from US sites over others (used in combination with `-X`)
- `-V, --verbose`: be more verbose when printing out the resubmission information for each job (used in combination with `-X`)
- `--add-sites=ADDSITES`: comma-separated list of global pool sites to add
- `--rm-sites=RMSITES`: comma-separated list of global pool sites to remove
- `--stuck-threshold [num]`: threshold in hours to define stuck jobs (default = 12)
- `--ssh`: internal option if script is run recursively over ssh
- `--help`: show help message and exit
The options `-h`, `-i`, `-r`, `-f` are exclusive. The options `-s` and `-k` are also exclusive. The option `-a` is currently only supported at the LPC (where each interactive node has its own scheduler). The script can ssh to each node and run itself to modify the jobs on that node (because each scheduler can only be accessed for write operations from its respective node).
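For example (invocation and pattern illustrative):
```
python3 manageJobs.py -h -w              # list held jobs and why they were held
python3 manageJobs.py -h -g myTask -s    # release (resubmit) held jobs whose names contain "myTask"
```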
The option `-X` will resubmit the jobs (just like `-s`), except that it will tell the job to get its input file from a specific site based on the list of sites where that file is located and some user preferences. Therefore, the options `-X` and `-s` are exclusive, as are the options `-X` and `-k`. The options `-B`, `-C`, `-D`, `-K`, `-L`, `-U`, and `-V` are only applicable when using the option `-X`. There are two ways in which the program can find the appropriate input file(s) for the job:
- specify an HTCondor ClassAd name, where the ClassAd contains a comma-separated list of files (`-C`).
- specify the location of a set of log files and a key with which to parse the log files (`-L`/`-K`). It is expected that the log will contain a single line with a comma-separated list of files.
The `-x` and `-X` options are similar. If just the `-x` option is used, all of the resubmitted jobs will try to access the input files from a user-specified location. If just the `-X` option is used, the user doesn't specify a particular location, but a set of loose preferences (see below). If both the `-X` and `-x` options are used, then the jobs will preferentially read their input from the site found automatically, with the redirector or site name given by `-x` as a fallback if no suitable site is found.
When using the `-X` option, a user may indicate a preference for specific sites in two ways:
- The user can broadly prefer sites located in the United States using the `-U` option.
- The user can specify a list of preferred sites using the attribute `preferredsites` in the `.prodconfig` file (see Configuration).
Limitation: the `-X` option relies upon dasgoclient for finding the site information for a given file. It is therefore limited by the accuracy of DAS and only works for centrally produced/tracked CMS input files.
Multiple jobs can be chained together in order to run in series.
(This may be useful, for example, to avoid storing large intermediate output files, in order to reduce disk usage requirements.)
A script createChain.py is provided to create these chains from individual jobs.
It produces an aggregated input tarball and a corresponding JDL file that can be submitted using `condor_submit` and that executes jobExecCondorChain.sh.
The aggregate input tarball contains a directory for each job, named (in order) `job0`, `job1`, etc.
Each job is executed in its own directory, with its own environment.
This regular structure can be used during the preparation of the individual jobs; for example, `job1` can refer to the expected output file from `job0` using the relative path `../job0/[file]`.
The python script's options are:
- `-h, --help`: show help message and exit
- `-n NAME, --name NAME`: name for chain job (required)
- `-j JDLS [JDLS ...], --jdls JDLS [JDLS ...]`: full paths to JDL files (at least one required)
- `-l LOG, --log LOG`: log name prefix from first job (will be replaced w/ chain job name)
- `-c, --checkpoint`: enable checkpointing (if a job fails, save output files from previous job in chain)
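For example, chaining two previously prepared jobs might look like (paths illustrative):
```
python3 createChain.py -n myChain -c \
    -j /path/to/jobA/jobExecCondor_stepA.jdl /path/to/jobB/jobExecCondor_stepB.jdl
```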
The shell script's options are:
- `-J [jobname]`: name for chain job
- `-N [number]`: number of jobs in chain
- `-P [process]`: process number (used to substitute for `$(Process)` if found in individual job arguments)
- `-C`: enable checkpointing (see above)
Several caveats currently apply:
- The argument `-q, --no-queue-arg` should be used when preparing individual jobs.
- It is recommended to use the `xrdcp` method for transferring the CMSSW environment in Step1 when preparing individual jobs.
- The aggregate input tarball must be transferred via Condor, not via xrdcp.
Both job submission and management can be configured by a shared config file called `.prodconfig`.
A default version is provided in this repository. It is parsed by `parseConfig.py` using the Python ConfigParser. The config parsing looks for `.prodconfig` in the following locations, in order (later files supersede earlier ones):
1. `.prodconfig` from this repository
2. `.prodconfig` from current directory (e.g. user's job submission directory)
3. `.prodconfig` from user's home area
Expected categories and values:
- `common`:
  - `user = [username]`: typically specified in user's home area
  - `scripts = [script(s)]`: list of scripts to link from this repository to user job directory (comma-separated list)
- `submit`:
  - `sites = [sites]`: sites for global pool submission (comma-separated list)
- `manage`:
  - `dir = [dir]`: backup directory for logs from failing jobs
  - `defaultredir = [dir]`: default xrootd redirector (if using site name)
  - `preferredsites = [site(s)]`: list of preferred sites for input file specific job resubmission (comma-separated list, descending order of preference)
- `collectors`:
  - `[name] = [server(s)]`: name and associated collector server(s) (comma-separated list)
- `schedds`:
  - `[name] = [server(s)]`: name and associated schedd server(s) (comma-separated list)
- `caches`:
  - `[dir] = [val]`: directory and cache status (1 = cache, 0 = uncache) (one entry per directory)
The name used for the collector and associated schedd(s) must match.
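A minimal user-level `.prodconfig` might look like the following (all values are illustrative, not defaults):
```
[common]
user = myusername

[submit]
sites = T3_US_FNALLPC,T2_US_Wisconsin

[manage]
dir = /uscms_data/d1/myusername/condor_backup
defaultredir = root://cmsxrootd.fnal.gov/
preferredsites = T1_US_FNAL,T2_US_MIT
```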
Limitation: if information for Python scripts used directly from this repository is specified in the `.prodconfig` file in location 2 (user job directory), the associated Python script must be run from location 2 in order to pick up the specified information. Current cases:
- log backup directory for `manageJobs.py`
- list of scripts for `linkScripts.py`
This repository works with Python 2.7 or 3+ and any modern version of bash (4+).
The missing mode of `jobSubmitter` uses the Condor python bindings to check the list of running jobs. It will try very hard to find the Condor python bindings, but if they are not available, it will simply skip the check of running jobs.
In contrast, `manageJobs` absolutely depends on the Condor python bindings. It will also try very hard to find them, but if they are not available, it cannot run.
For Python 3 usage, the bindings can be installed by `postInstall.sh` (see Installation).
For more information about global pool sites, see Selecting Sites - CMS Connect Handbook.
A very basic example can be found in the test directory.
Other repositories that use this package include:
- TreeMaker - ntuple production
- SVJProduction - private signal production
- SVJtagger - BDT training
- SimDenoising - simulated data generation
- SonicCMS - ML inference