Merge pull request #183 from jhiemstrawisc/snakemake-long

Enable long Snakemake workflows in HTCondor

agitter authored Oct 15, 2024
2 parents 624cbb4 + 3f3fdc8 commit a26f4d0

Showing 4 changed files with 181 additions and 3 deletions.
26 changes: 24 additions & 2 deletions docker-wrappers/SPRAS/README.md
@@ -68,6 +68,8 @@ git clone https://github.com/Reed-CompBio/spras.git

**Note:** To work with SPRAS in HTCondor, it is recommended that you build an Apptainer image instead of using Docker. See [Converting Docker Images to Apptainer/Singularity Images](#converting-docker-images-to-apptainersingularity-images) for instructions. Importantly, the Apptainer image must be built for the linux/amd64 architecture. Most HTCondor APs will have `apptainer` installed, but they may not have `docker`. If this is the case, you can build the image with Docker on your local machine, push the image to Docker Hub, and then convert it to Apptainer's `sif` format on the AP.

**Note:** It is best practice to make sure that the Snakefile you copy for your workflow is the same version as the Snakefile baked into your workflow's container image. When this workflow runs, the Snakefile you just copied is used during remote execution instead of the Snakefile from the container. As a result, difficult-to-diagnose versioning issues may occur if the version of SPRAS in the remote container doesn't support the Snakefile on your current branch. The safest bet is to build your own image so you always know what's inside it.

There are currently two options for running SPRAS with HTCondor. The first is to submit all SPRAS jobs to a single remote Execution Point (EP). The second
is to use the Snakemake HTCondor executor to parallelize the workflow by submitting each job to its own EP.

@@ -104,13 +106,26 @@ cp ../../Snakefile . && \
cp -r ../../input .
```

**Note:** It is best practice to make sure that the Snakefile you copy for your workflow is the same version as the Snakefile baked into your workflow's container image. When this workflow runs, the Snakefile you just copied is used during remote execution instead of the Snakefile from the container. As a result, difficult-to-diagnose versioning issues may occur if the version of SPRAS in the remote container doesn't support the Snakefile on your current branch. The safest bet is to build your own image so you always know what's inside it.
Instead of editing `spras.sub` to define the workflow, this scenario requires editing the SPRAS profile in `spras_profile/config.yaml`. Make sure you specify the correct container, and change any other config values needed by your workflow (defaults are fine in most cases).

To start the workflow with HTCondor in the CHTC pool, run:
Then, to start the workflow with HTCondor in the CHTC pool, there are two options:

#### Snakemake From Your Own Terminal
The first option is to run Snakemake in a way that ties its execution to your terminal. This is good for testing short workflows and running short jobs. The downside is that closing your terminal causes the process to exit, removing any unfinished jobs. To use this option, invoke Snakemake directly by running:
```bash
snakemake --profile spras_profile
```

#### Long Running Snakemake Jobs (Managed by HTCondor)
The second option is to let HTCondor manage the Snakemake process, which allows the jobs to run as long as needed. Instead of seeing Snakemake output directly in your terminal, you'll be able to see it in a specified log file. To use this option, make sure `snakemake_long.py` is executable (you can run `chmod +x snakemake_long.py` from the AP to make sure it is), and then run:
```bash
./snakemake_long.py --profile spras_profile --htcondor-jobdir <path/to/logging/directory>
```

When run in this mode, all log files for the workflow are placed in the logging directory you provided. In particular, Snakemake's job-progress output is split between `<logdir>/snakemake-long.err` and `<logdir>/snakemake-long.out`. These files also record each rule and the HTCondor job ID submitted for that rule (see the [troubleshooting section](#troubleshooting) for how to use these extra log files).
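If you want a quick overview of which HTCondor job IDs the workflow has submitted, a small shell helper can pull them out of the log. This is only a sketch: the `cluster.proc` ID format (e.g. `1234.0`) and the log path in the example are assumptions about what the Snakemake HTCondor executor writes, so adjust the pattern to match your logs.

```shell
# Sketch: list the unique HTCondor job IDs mentioned in a Snakemake log file.
# Assumes IDs appear in the familiar cluster.proc form (e.g. 1234.0); adjust
# the grep pattern if your executor version logs them differently.
extract_job_ids() {
    grep -oE '[0-9]+\.[0-9]+' "$1" | sort -u
}

# Example: extract_job_ids snakemake-long-logs/snakemake-long.err
```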

### Adjusting Resources

Resource requirements can be adjusted as needed in `spras_profile/config.yaml`, and HTCondor logs for this workflow can be found in `.snakemake/htcondor`.
You can set a different log directory by adding `htcondor-jobdir: /path/to/dir` to the profile's configuration.

@@ -137,6 +152,13 @@ contain useful debugging clues about what may have gone wrong.
the version of SPRAS you want to test, and push the image to your image repository. To use that container in the workflow, change the `container_image` line of
`spras.sub` to point to the new image.

### Troubleshooting
Some errors Snakemake encounters while executing rules boil down to bad luck in a distributed, heterogeneous computational environment, and some can be solved simply by rerunning. If you encounter a Snakemake error, restart the workflow and check whether the same error occurs in the same rule a second time; repeatable, identical failures are more likely to indicate a fundamental issue that requires user intervention to fix.

To investigate issues, start by referring to your logging directory. Each Snakemake rule submitted to HTCondor will log a corresponding HTCondor job ID in the Snakemake standard out/error. You can use this job ID to check the standard out, standard error, and HTCondor job log for that specific rule. In some cases the error will indicate a user-solvable issue, e.g. "input file not found" might point to a typo in some part of your workflow. In other cases, errors might be solved by retrying the workflow, which causes Snakemake to pick up where it left off.
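Rerunning after a transient failure can also be automated. The helper below is a generic sketch (it is not part of SPRAS) that retries a command a fixed number of times before treating the failure as real:

```shell
# Sketch: rerun a command up to N times. Transient failures in a distributed
# environment often clear on a retry, while repeatable failures need a human.
retry() {
    local attempts=$1
    shift
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        echo "attempt $i of $attempts failed: $*" >&2
    done
    return 1
}

# Example: retry 3 snakemake --profile spras_profile
```

Because Snakemake picks up where it left off, each retry only reruns the rules that have not yet completed.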

If the same error persists across multiple consecutive retries and prevents your workflow from completing, user or developer intervention is likely required. If you choose to open a GitHub issue, please include a description of the error(s) and the troubleshooting steps you've already taken.

## Versions:

The versions of this image match the version of the spras package within it.
2 changes: 2 additions & 0 deletions docker-wrappers/SPRAS/example_config.yaml
@@ -149,3 +149,5 @@ analysis:
linkage: 'ward'
# 'euclidean', 'manhattan', 'cosine'
metric: 'euclidean'
evaluation:
include: false
149 changes: 149 additions & 0 deletions docker-wrappers/SPRAS/snakemake_long.py
@@ -0,0 +1,149 @@
#!/usr/bin/env python3

"""
A wrapper script that allows long-term Snakemake workflows to run on HTCondor. This works
by submitting a local universe job responsible for overseeing the terminal session that
runs the actual snakemake executable.
"""

import argparse
import os
import pathlib
import subprocess
import sys
import time

import htcondor

"""
Parse various arguments for the script. Note that this script has two "modes" of operation which
need different arguments. The "top" mode is for submitting the HTCondor wrapper, and the "long" mode
is for running the Snakemake command itself.
"""
def parse_args(isLocal=False):
    parser = argparse.ArgumentParser(description="A tool for long-running Snakemake jobs with HTCondor.")
    if isLocal:
        # We add a special command that allows this singular executable to serve two purposes. The executable
        # is first run by the user with their args to submit the local universe job. Then, the local universe
        # job runs `snakemake_long.py long <user args>` to indicate to the script that it's time to submit the
        # long-running Snakemake process instead of submitting another local universe job.
        parser.add_argument("command", help="Helper command to run", choices=["long"])
    parser.add_argument("--snakefile", help="The Snakefile to run. If omitted, the Snakefile is assumed to be in the current directory.", required=False)
    parser.add_argument("--profile", help="A path to a directory containing the desired Snakemake profile.", required=True)
    # I'd love to change this to "logdir", but using the same name as Snakemake for consistency of feeling between this script
    # and Snakemake proper.
    parser.add_argument("--htcondor-jobdir", help="The directory Snakemake will write logs to. If omitted, a 'snakemake-long-logs' directory will be created in the current directory.", required=False)
    return parser.parse_args()

"""
Given a Snakefile, profile, and HTCondor job directory, submit a local universe job that runs
Snakemake from the context of the submission directory.
"""
def submit_local(snakefile, profile, htcondor_jobdir):
    # Get the location of this script, which also serves as the executable for the condor job.
    script_location = pathlib.Path(__file__).resolve()

    submit_description = htcondor.Submit({
        "executable": script_location,
        # We use the "long" command to indicate to the script that it should run the Snakemake command instead of submitting another job.
        # See comment in parse_args for more information.
        "arguments": f"long --snakefile {snakefile} --profile {profile} --htcondor-jobdir {htcondor_jobdir}",
        "universe": "local",
        "request_disk": "512MB",
        "request_cpus": 1,
        "request_memory": 512,

        # Set up logging
        "log": f"{htcondor_jobdir}/snakemake.log",
        "output": f"{htcondor_jobdir}/snakemake.out",
        "error": f"{htcondor_jobdir}/snakemake.err",

        # Specify `getenv` so that our script uses the appropriate environment
        # when it runs in local universe. This allows the job to access
        # modules we've installed in the submission environment (notably spras).
        "getenv": "true",

        "JobBatchName": f"spras-{time.strftime('%Y%m%d-%H%M%S')}",
    })

    schedd = htcondor.Schedd()
    submit_result = schedd.submit(submit_description)

    print("Snakemake management job was submitted with JobID %d.0. Logs can be found in %s" % (submit_result.cluster(), htcondor_jobdir))

"""
The top level function for the script that handles file creation/validation and triggers submission of the
wrapper job.
"""
def top_main():
    args = parse_args()

    # Check if the snakefile is provided. If not, assume it's in the current directory.
    if args.snakefile is None:
        cwd = os.getcwd()
        args.snakefile = pathlib.Path(cwd) / "Snakefile"
    if not os.path.exists(args.snakefile):
        raise FileNotFoundError(f"Error: The Snakefile {args.snakefile} does not exist.")

    # Make sure the profile directory exists. It's harder to check if it's a valid profile at this level
    # so that will be left to Snakemake.
    if not os.path.exists(args.profile):
        raise FileNotFoundError(f"Error: The profile directory {args.profile} does not exist.")

    # Make sure we have a value for the log directory and that the directory exists.
    if args.htcondor_jobdir is None:
        args.htcondor_jobdir = pathlib.Path(os.getcwd()) / "snakemake-long-logs"
        if not os.path.exists(args.htcondor_jobdir):
            os.makedirs(args.htcondor_jobdir)
    else:
        if not os.path.exists(args.htcondor_jobdir):
            os.makedirs(args.htcondor_jobdir)

    try:
        submit_local(args.snakefile, args.profile, args.htcondor_jobdir)
    except Exception as e:
        print(f"Error: Could not submit local universe job. {e}")
        raise

"""
Command to activate conda environment and run Snakemake. This is run by the local universe job, not the user.
"""
def long_main():
    args = parse_args(True)

    # Note that we need to unset APPTAINER_CACHEDIR in this case but not in the local terminal case because the wrapper
    # HTCondor job has a different environment and populating this value causes Snakemake to fail when it tries to write
    # to spool (a read-only filesystem from the perspective of the EP job).
    command = f"""
    source $(conda info --base)/etc/profile.d/conda.sh && \
    conda activate spras && \
    unset APPTAINER_CACHEDIR && \
    snakemake -s {args.snakefile} --profile {args.profile} --htcondor-jobdir {args.htcondor_jobdir}
    """

    try:
        subprocess.run(command, shell=True, executable='/bin/bash', check=True)
        return 0
    except subprocess.CalledProcessError as e:
        print(f"Error: Command '{e.cmd}' returned non-zero exit status {e.returncode}.")
        raise
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise

def main():
    try:
        if len(sys.argv) > 1 and sys.argv[1] in ["long"]:
            return long_main()
        else:
            top_main()

    except subprocess.CalledProcessError as e:
        print(f"Error: Snakemake failed with return code {e.returncode}.")
        sys.exit(e.returncode)
    except Exception as e:
        print(f"Unexpected error: {e}")
        sys.exit(1)

if __name__ == '__main__':
    main()
7 changes: 6 additions & 1 deletion docker-wrappers/SPRAS/spras_profile/config.yaml
@@ -8,14 +8,19 @@ configfile: example_config.yaml
# Indicate to the plugin that jobs running on various EPs do not share a filesystem with
# each other, or with the AP.
shared-fs-usage: none
# Distributed, heterogeneous computational environments are a wild place where strange things
# can happen. If something goes wrong, try again up to 5 times. After that, we assume there's
# a real error that requires user/admin intervention
retries: 5

# Default resources will apply to all workflow steps. If a single workflow step fails due
# to insufficient resources, it can be re-run with modified values. Snakemake will handle
# picking up where it left off, and won't re-run steps that have already completed.
default-resources:
job_wrapper: "spras.sh"
# If running in CHTC, this only works with apptainer images
container_image: "spras.sif"
# Note requirement for quotes around the image name
container_image: "'spras-v0.2.0.sif'"
universe: "container"
# The value for request_disk should be large enough to accommodate the runtime container
# image, any additional PRM container images, and your input data.