Gracefully kill running tasks before walltime (Slurm) #3674

Open
pdobbelaere opened this issue Nov 4, 2024 · 2 comments

@pdobbelaere

Is your feature request related to a problem? Please describe.

Does Parsl (more specifically, HighThroughputExecutor + SlurmProvider) provide a built-in way to notify running tasks when their job allocation is about to hit walltime? Task runtimes are not always predictable, and an option to gracefully kill a task (close/checkpoint files, prepare for a restart) could prevent losing workflow progress. I know about the drain option for HTEx, but if I understand correctly it does not affect tasks that are already running.

Describe alternatives you've considered

  • In Slurm, you can use the --signal flag to send a signal before walltime; however, I have not found an easy way to propagate that signal to tasks running through the workers.
  • You could wrap bash apps with timeout (e.g., timeout 60m python myscript.py), but that does not really work for tasks started halfway through the job allocation: you don't know how much walltime will be left when any given task starts.
  • Job allocation details are available in the shell environment (with Slurm, anyway), so every app could decide for itself when to shut down (see the sketch after this list). I would argue this is not the responsibility of apps.
  • You could play it safe and always checkpoint periodically. Brute-forcing it should work in most scenarios, but feels somewhat inelegant.
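
As a reference point for the third option, here is a minimal sketch (not code from this issue) of an app arming its own alarm from the Slurm environment; the checkpoint() call and the 120-second margin are purely illustrative placeholders:

import datetime
import os
import signal


def install_walltime_alarm(margin: int = 120) -> None:
    """Arm SIGALRM to fire `margin` seconds before the Slurm allocation ends."""
    end = float(os.environ.get('SLURM_JOB_END_TIME', 0))
    if not end:
        return  # not running inside a Slurm allocation
    remaining = int(end - datetime.datetime.now().timestamp()) - margin

    def handler(signum, frame):
        checkpoint()          # hypothetical: write restart files, close outputs
        raise SystemExit(0)   # stop cleanly before walltime

    signal.signal(signal.SIGALRM, handler)
    signal.alarm(max(remaining, 1))

Every app would have to call this itself, which is exactly the "not the responsibility of apps" objection above.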

Currently, my hacky workaround is to launch a simple background Python script - before starting the process_worker_pool - which sleeps until right before the job allocation ends and then signals any (sub)processes created by workers (see below). This approach seems to work fine, but is bound to fail under some circumstances. There must be a better/cleaner way.

graceful_shutdown.py
"""
Set a shutdown timer and signal everything the workers spawned before walltime hits.
TODO: make EXIT_WINDOW configurable
"""

import datetime
import os
import signal

import psutil


EXIT_SIGNAL = signal.SIGUSR1
EXIT_WINDOW = 30    # seconds
EXIT_CODE = 69


def signal_handler(signum, frame):
    """No-op handler: keeps SIGALRM from terminating the process while sigwait() waits."""
    pass


def signal_handler_exit(signum, frame):
    """Alternative handler that logs the signal and exits (currently unused)."""
    print(f'Received signal {signum} at frame {frame}')
    exit(EXIT_CODE)


def find_processes() -> list[psutil.Process]:
    """Return the task processes spawned by this job's worker pool."""
    pkill = psutil.Process()   # this script
    pmain = pkill.parent()     # the batch script that launched it
    job_id, node_id = [pmain.environ().get(_) for _ in ('SLURM_JOBID', 'SLURM_NODELIST')]
    procs_node = list(psutil.process_iter())

    # only consider procs originating from this job
    procs_job, procs_denied = [], []
    for p in procs_node:
        try:
            if p.environ().get('SLURM_JOBID') == job_id:
                procs_job.append(p)
        except psutil.AccessDenied:
            procs_denied.append(p)

    # process names are truncated to 15 characters, hence 'process_worker_'
    pwork = [p for p in procs_job if p.name() == 'process_worker_']
    pworker = sum([p.children() for p in pwork], [])   # worker processes
    ptasks = sum([p.children() for p in pworker], [])  # task (sub)processes

    print(
        f'Job processes (job_id={job_id}, node_id={node_id}):', *procs_job,
        'Main process :', pmain, 'Kill process:', pkill,
        'Workers:', *pworker, 'Running tasks:', *ptasks,
        sep='\n'
    )

    return ptasks


def main():
    """Sleep until EXIT_WINDOW seconds before walltime, then signal running tasks."""
    time_start = datetime.datetime.fromtimestamp(float(os.environ.get('SLURM_JOB_START_TIME', 0)))
    time_stop = datetime.datetime.fromtimestamp(float(os.environ.get('SLURM_JOB_END_TIME', 0)))
    duration = time_stop - time_start
    print(f'Job allocation (start|stop|duration): {time_start} | {time_stop} | {duration}')

    print('Awaiting the app-ocalypse..')
    # install the no-op handler before arming the alarm, so SIGALRM does not
    # terminate this process; sigwait() then blocks until the alarm fires
    signal.signal(signal.SIGALRM, signal_handler)
    remaining = (time_stop - datetime.datetime.now()).total_seconds()
    signal.alarm(max(int(remaining) - EXIT_WINDOW, 1))
    signal.sigwait([signal.SIGALRM])
    print(f'Received signal {signal.SIGALRM.name} at {datetime.datetime.now()}')

    for p in find_processes():
        print(f'Sending {EXIT_SIGNAL.name} to process {p.pid}..')
        os.kill(p.pid, EXIT_SIGNAL)

    exit(EXIT_CODE)


if __name__ == "__main__":
    main()
job script/logs

The originating Python script controlling Parsl uses some custom code, but all of that is irrelevant. Essentially, we launch bash apps that sleep indefinitely until they catch a signal (a minimal sketch of such an app is shown right below).
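
For context, a bash app along these lines might look as follows (a sketch, not the actual app used in this run; the trap/sleep command is just a stand-in):

from parsl import bash_app


@bash_app
def sleepy_task(stdout='task.out', stderr='task.err'):
    # loop in short sleeps so the USR1 trap runs promptly when the signal arrives
    return "trap 'echo checkpointing; exit 0' USR1; while true; do sleep 5; done"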

parsl.hpc_htex.block-0.1730729332.0925918

#!/bin/bash

#SBATCH --job-name=parsl.hpc_htex.block-0.1730729332.0925918
#SBATCH --output=/kyukon/scratch/gent/vo/000/gvo00003/vsc43633/docteur/2024_10_30_testing_parsl/graceful_exit/runinfo/000/submit_scripts/parsl.hpc_htex.block-0.1730729332.0925918.stdout
#SBATCH --error=/kyukon/scratch/gent/vo/000/gvo00003/vsc43633/docteur/2024_10_30_testing_parsl/graceful_exit/runinfo/000/submit_scripts/parsl.hpc_htex.block-0.1730729332.0925918.stderr
#SBATCH --nodes=1
#SBATCH --time=1
#SBATCH --ntasks-per-node=1

#SBATCH --mem=4g
#SBATCH --cpus-per-task=2

eval "$("$MAMBA_EXE" shell hook --shell bash --prefix "$MAMBA_ROOT_PREFIX" 2> /dev/null)"
micromamba activate main; micromamba info
export PYTHONPATH=$PYTHONPATH:$VSC_DATA_VO_USER
echo "PYTHONPATH = $PYTHONPATH"
python /user/gent/436/vsc43633/DATA/mypackage/parsl/graceful_shutdown.py &
export PARSL_MEMORY_GB=4
export PARSL_CORES=2


export JOBNAME="parsl.hpc_htex.block-0.1730729332.0925918"

set -e
export CORES=$SLURM_CPUS_ON_NODE
export NODES=$SLURM_JOB_NUM_NODES

[[ "1" == "1" ]] && echo "Found cores : $CORES"
[[ "1" == "1" ]] && echo "Found nodes : $NODES"
WORKERCOUNT=1

cat << SLURM_EOF > cmd_$SLURM_JOB_NAME.sh
process_worker_pool.py   -a 157.193.252.90,10.141.10.67,10.143.10.67,172.24.10.67,127.0.0.1 -p 0 -c 1 -m 2 --poll 10 --task_port=54670 --result_port=54326 --cert_dir None --logdir=/kyukon/scratch/gent/vo/000/gvo00003/vsc43633/docteur/2024_10_30_testing_parsl/graceful_exit/runinfo/000/hpc_htex --block_id=0 --hb_period=30  --hb_threshold=120 --drain_period=None --cpu-affinity none  --mpi-launcher=mpiexec --available-accelerators 
SLURM_EOF
chmod a+x cmd_$SLURM_JOB_NAME.sh

srun --ntasks 1 -l  bash cmd_$SLURM_JOB_NAME.sh

[[ "1" == "1" ]] && echo "Done"

parsl.hpc_htex.block-0.1730729332.0925918.stderr

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 50069807 ON node3506.doduo.os CANCELLED AT 2024-11-04T15:09:57 ***
0: slurmstepd: error: *** STEP 50069807.0 ON node3506.doduo.os CANCELLED AT 2024-11-04T15:09:57 ***

parsl.hpc_htex.block-0.1730729332.0925918.stdout

                                           __
          __  ______ ___  ____ _____ ___  / /_  ____ _
         / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
        / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
       / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
      /_/


            environment : /kyukon/data/gent/vo/000/gvo00003/vsc43633/micromamba/envs/main (active)
           env location : /kyukon/data/gent/vo/000/gvo00003/vsc43633/micromamba/envs/main
      user config files : /user/gent/436/vsc43633/.mambarc
 populated config files : /user/gent/436/vsc43633/.condarc
       libmamba version : 1.4.3
     micromamba version : 1.4.3
           curl version : libcurl/7.88.1 OpenSSL/3.1.0 zlib/1.2.13 zstd/1.5.2 libssh2/1.10.0 nghttp2/1.52.0
     libarchive version : libarchive 3.6.2 zlib/1.2.13 bz2lib/1.0.8 libzstd/1.5.2
       virtual packages : __unix=0=0
                          __linux=4.18.0=0
                          __glibc=2.28=0
                          __archspec=1=x86_64
               channels : 
       base environment : /kyukon/data/gent/vo/000/gvo00003/vsc43633/micromamba
               platform : linux-64
PYTHONPATH = :/data/gent/vo/000/gvo00003/vsc43633
Found cores : 2
Found nodes : 1
Job allocation (start|stop|duration): 2024-11-04 15:09:14 | 2024-11-04 15:10:14 | 0:01:00
Awaiting the app-ocalypse..
Received signal SIGALRM at 2024-11-04 15:09:43.442739
Job processes (job_id=50069807, node_id=node3506.doduo.os):
psutil.Process(pid=2289391, name='slurm_script', status='sleeping', started='15:09:14')
psutil.Process(pid=2289408, name='python', status='running', started='15:09:15')
psutil.Process(pid=2289411, name='srun', status='sleeping', started='15:09:15')
psutil.Process(pid=2289412, name='srun', status='sleeping', started='15:09:15')
psutil.Process(pid=2289426, name='bash', status='sleeping', started='15:09:15')
psutil.Process(pid=2289427, name='process_worker_', status='sleeping', started='15:09:15')
psutil.Process(pid=2289439, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289440, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289448, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289449, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289508, name='python', status='sleeping', started='15:09:22')
Main process :
psutil.Process(pid=2289391, name='slurm_script', status='sleeping', started='15:09:14')
Kill process:
psutil.Process(pid=2289408, name='python', status='running', started='15:09:15')
Workers:
psutil.Process(pid=2289439, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289440, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289448, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289449, name='python', status='sleeping', started='15:09:19')
Running tasks:
psutil.Process(pid=2289508, name='python', status='sleeping', started='15:09:22')
Sending SIGUSR1 to process 2289508..
@svandenhaute

You can use

@benclifford
Collaborator

You could play it safe and always checkpoint periodically. Brute-forcing it should work in most scenarios, but feels somewhat inelegant.

This is pretty much the traditional approach that parsl's worker model has had, but in recent times we've been pushing more towards managing the end of things a bit better, mostly with things like the drain time and trying to avoid placing tasks on soon-to-end workers (see also #3323).

Having the worker pool send Unix signals to launched bash apps is probably an interesting thing to implement, triggered either by the external batch system or by knowledge of the environment (drain style).
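
To illustrate what the relay half could look like (a hypothetical sketch, not existing Parsl behaviour; the function names and the choice of SIGUSR1 are assumptions), a pool-side process that received the batch system's warning could pass it on to the worker subtrees with psutil:

import signal

import psutil


def forward_to_tasks(sig: int = signal.SIGUSR1) -> None:
    """Relay `sig` to every descendant of the current process (workers and their tasks)."""
    for child in psutil.Process().children(recursive=True):
        try:
            child.send_signal(sig)
        except psutil.NoSuchProcess:
            pass  # the task finished between listing and signalling


def install_forwarder(warning_signal: int = signal.SIGUSR1) -> None:
    """Re-raise the batch system's warning signal towards running tasks."""
    signal.signal(warning_signal, lambda signum, frame: forward_to_tasks(signum))

Paired with Slurm's --signal option or drain-style knowledge of the allocation, a relay like this would remove the need for the separate timer script above.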
