Gracefully kill running tasks before walltime (Slurm) #3674

Open
pdobbelaere opened this issue Nov 4, 2024 · 2 comments

@pdobbelaere

Is your feature request related to a problem? Please describe.

Does Parsl (more specifically, HighThroughputExecutor + SlurmProvider) provide a built-in way to notify running tasks when their job allocation is about to hit walltime? Task runtimes are not always predictable, and an option to gracefully kill a task (close/checkpoint files, prepare for a restart) could prevent losing workflow progress. I know about the drain option for HTEx, but if I understand correctly it does not affect tasks that are already running.

Describe alternatives you've considered

  • In Slurm, you can use the --signal flag to send a signal before walltime; however, I have not found an easy way to propagate that signal to tasks running through the workers.
  • You could wrap bash apps with timeout (e.g., timeout 60m python myscript.py), but that does not really work for tasks started halfway through the job allocation: you don't know how much walltime will be left when any given task starts.
  • Job allocation details are available in the shell environment (with Slurm, anyway), so every app could decide for itself when to shut down (see the sketch after this list). I would argue this is not the responsibility of apps.
  • You could play it safe and always checkpoint periodically. Brute-forcing it should work in most scenarios, but feels somewhat inelegant.
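
As a reference point for the third option, here is a minimal sketch (not code from this issue) of an app arming its own alarm from the Slurm environment; the checkpoint() call and the 120-second margin are purely illustrative placeholders:

import datetime
import os
import signal


def install_walltime_alarm(margin: int = 120) -> None:
    """Arm SIGALRM to fire `margin` seconds before the Slurm allocation ends."""
    end = float(os.environ.get('SLURM_JOB_END_TIME', 0))
    if not end:
        return  # not running inside a Slurm allocation
    remaining = int(end - datetime.datetime.now().timestamp()) - margin

    def handler(signum, frame):
        checkpoint()          # hypothetical: write restart files, close outputs
        raise SystemExit(0)   # stop cleanly before walltime

    signal.signal(signal.SIGALRM, handler)
    signal.alarm(max(remaining, 1))

Every app would have to call this itself, which is exactly the "not the responsibility of apps" objection above.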

Currently, my hacky workaround is to launch a simple background Python script - before starting the process_worker_pool - which sleeps until right before the job allocation ends and then signals any (sub)processes created by workers (see below). This approach seems to work fine, but is bound to fail under some circumstances. There must be a better/cleaner way.

graceful_shutdown.py
"""
Set a shutdown timer and signal everything the workers spawned before walltime hits.
TODO: make EXIT_WINDOW configurable
"""

import datetime
import os
import signal

import psutil


EXIT_SIGNAL = signal.SIGUSR1
EXIT_WINDOW = 30    # seconds
EXIT_CODE = 69


def signal_handler(signum, frame):
    """No-op handler: keeps SIGALRM from terminating the process while sigwait() waits."""
    pass


def signal_handler_exit(signum, frame):
    """Alternative handler that logs the signal and exits (currently unused)."""
    print(f'Received signal {signum} at frame {frame}')
    exit(EXIT_CODE)


def find_processes() -> list[psutil.Process]:
    """Return the task processes spawned by this job's worker pool."""
    pkill = psutil.Process()   # this script
    pmain = pkill.parent()     # the batch script that launched it
    job_id, node_id = [pmain.environ().get(_) for _ in ('SLURM_JOBID', 'SLURM_NODELIST')]
    procs_node = list(psutil.process_iter())

    # only consider procs originating from this job
    procs_job, procs_denied = [], []
    for p in procs_node:
        try:
            if p.environ().get('SLURM_JOBID') == job_id:
                procs_job.append(p)
        except psutil.AccessDenied:
            procs_denied.append(p)

    # process names are truncated to 15 characters, hence 'process_worker_'
    pwork = [p for p in procs_job if p.name() == 'process_worker_']
    pworker = sum([p.children() for p in pwork], [])   # worker processes
    ptasks = sum([p.children() for p in pworker], [])  # task (sub)processes

    print(
        f'Job processes (job_id={job_id}, node_id={node_id}):', *procs_job,
        'Main process :', pmain, 'Kill process:', pkill,
        'Workers:', *pworker, 'Running tasks:', *ptasks,
        sep='\n'
    )

    return ptasks


def main():
    """Sleep until EXIT_WINDOW seconds before walltime, then signal running tasks."""
    time_start = datetime.datetime.fromtimestamp(float(os.environ.get('SLURM_JOB_START_TIME', 0)))
    time_stop = datetime.datetime.fromtimestamp(float(os.environ.get('SLURM_JOB_END_TIME', 0)))
    duration = time_stop - time_start
    print(f'Job allocation (start|stop|duration): {time_start} | {time_stop} | {duration}')

    print('Awaiting the app-ocalypse..')
    # install the no-op handler before arming the alarm, so SIGALRM does not
    # terminate this process; sigwait() then blocks until the alarm fires
    signal.signal(signal.SIGALRM, signal_handler)
    remaining = (time_stop - datetime.datetime.now()).total_seconds()
    signal.alarm(max(int(remaining) - EXIT_WINDOW, 1))
    signal.sigwait([signal.SIGALRM])
    print(f'Received signal {signal.SIGALRM.name} at {datetime.datetime.now()}')

    for p in find_processes():
        print(f'Sending {EXIT_SIGNAL.name} to process {p.pid}..')
        os.kill(p.pid, EXIT_SIGNAL)

    exit(EXIT_CODE)


if __name__ == "__main__":
    main()
job script/logs

The originating Python script controlling Parsl uses some custom code, but all of that is irrelevant. Essentially, we launch bash apps that sleep indefinitely until they catch a signal (a minimal sketch of such an app is shown right below).
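
For context, a bash app along these lines might look as follows (a sketch, not the actual app used in this run; the trap/sleep command is just a stand-in):

from parsl import bash_app


@bash_app
def sleepy_task(stdout='task.out', stderr='task.err'):
    # loop in short sleeps so the USR1 trap runs promptly when the signal arrives
    return "trap 'echo checkpointing; exit 0' USR1; while true; do sleep 5; done"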

parsl.hpc_htex.block-0.1730729332.0925918

#!/bin/bash

#SBATCH --job-name=parsl.hpc_htex.block-0.1730729332.0925918
#SBATCH --output=/kyukon/scratch/gent/vo/000/gvo00003/vsc43633/docteur/2024_10_30_testing_parsl/graceful_exit/runinfo/000/submit_scripts/parsl.hpc_htex.block-0.1730729332.0925918.stdout
#SBATCH --error=/kyukon/scratch/gent/vo/000/gvo00003/vsc43633/docteur/2024_10_30_testing_parsl/graceful_exit/runinfo/000/submit_scripts/parsl.hpc_htex.block-0.1730729332.0925918.stderr
#SBATCH --nodes=1
#SBATCH --time=1
#SBATCH --ntasks-per-node=1

#SBATCH --mem=4g
#SBATCH --cpus-per-task=2

eval "$("$MAMBA_EXE" shell hook --shell bash --prefix "$MAMBA_ROOT_PREFIX" 2> /dev/null)"
micromamba activate main; micromamba info
export PYTHONPATH=$PYTHONPATH:$VSC_DATA_VO_USER
echo "PYTHONPATH = $PYTHONPATH"
python /user/gent/436/vsc43633/DATA/mypackage/parsl/graceful_shutdown.py &
export PARSL_MEMORY_GB=4
export PARSL_CORES=2


export JOBNAME="parsl.hpc_htex.block-0.1730729332.0925918"

set -e
export CORES=$SLURM_CPUS_ON_NODE
export NODES=$SLURM_JOB_NUM_NODES

[[ "1" == "1" ]] && echo "Found cores : $CORES"
[[ "1" == "1" ]] && echo "Found nodes : $NODES"
WORKERCOUNT=1

cat << SLURM_EOF > cmd_$SLURM_JOB_NAME.sh
process_worker_pool.py   -a 157.193.252.90,10.141.10.67,10.143.10.67,172.24.10.67,127.0.0.1 -p 0 -c 1 -m 2 --poll 10 --task_port=54670 --result_port=54326 --cert_dir None --logdir=/kyukon/scratch/gent/vo/000/gvo00003/vsc43633/docteur/2024_10_30_testing_parsl/graceful_exit/runinfo/000/hpc_htex --block_id=0 --hb_period=30  --hb_threshold=120 --drain_period=None --cpu-affinity none  --mpi-launcher=mpiexec --available-accelerators 
SLURM_EOF
chmod a+x cmd_$SLURM_JOB_NAME.sh

srun --ntasks 1 -l  bash cmd_$SLURM_JOB_NAME.sh

[[ "1" == "1" ]] && echo "Done"

parsl.hpc_htex.block-0.1730729332.0925918.stderr

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 50069807 ON node3506.doduo.os CANCELLED AT 2024-11-04T15:09:57 ***
0: slurmstepd: error: *** STEP 50069807.0 ON node3506.doduo.os CANCELLED AT 2024-11-04T15:09:57 ***

parsl.hpc_htex.block-0.1730729332.0925918.stdout

                                           __
          __  ______ ___  ____ _____ ___  / /_  ____ _
         / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
        / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
       / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
      /_/


            environment : /kyukon/data/gent/vo/000/gvo00003/vsc43633/micromamba/envs/main (active)
           env location : /kyukon/data/gent/vo/000/gvo00003/vsc43633/micromamba/envs/main
      user config files : /user/gent/436/vsc43633/.mambarc
 populated config files : /user/gent/436/vsc43633/.condarc
       libmamba version : 1.4.3
     micromamba version : 1.4.3
           curl version : libcurl/7.88.1 OpenSSL/3.1.0 zlib/1.2.13 zstd/1.5.2 libssh2/1.10.0 nghttp2/1.52.0
     libarchive version : libarchive 3.6.2 zlib/1.2.13 bz2lib/1.0.8 libzstd/1.5.2
       virtual packages : __unix=0=0
                          __linux=4.18.0=0
                          __glibc=2.28=0
                          __archspec=1=x86_64
               channels : 
       base environment : /kyukon/data/gent/vo/000/gvo00003/vsc43633/micromamba
               platform : linux-64
PYTHONPATH = :/data/gent/vo/000/gvo00003/vsc43633
Found cores : 2
Found nodes : 1
Job allocation (start|stop|duration): 2024-11-04 15:09:14 | 2024-11-04 15:10:14 | 0:01:00
Awaiting the app-ocalypse..
Received signal SIGALRM at 2024-11-04 15:09:43.442739
Job processes (job_id=50069807, node_id=node3506.doduo.os):
psutil.Process(pid=2289391, name='slurm_script', status='sleeping', started='15:09:14')
psutil.Process(pid=2289408, name='python', status='running', started='15:09:15')
psutil.Process(pid=2289411, name='srun', status='sleeping', started='15:09:15')
psutil.Process(pid=2289412, name='srun', status='sleeping', started='15:09:15')
psutil.Process(pid=2289426, name='bash', status='sleeping', started='15:09:15')
psutil.Process(pid=2289427, name='process_worker_', status='sleeping', started='15:09:15')
psutil.Process(pid=2289439, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289440, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289448, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289449, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289508, name='python', status='sleeping', started='15:09:22')
Main process :
psutil.Process(pid=2289391, name='slurm_script', status='sleeping', started='15:09:14')
Kill process:
psutil.Process(pid=2289408, name='python', status='running', started='15:09:15')
Workers:
psutil.Process(pid=2289439, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289440, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289448, name='python', status='sleeping', started='15:09:19')
psutil.Process(pid=2289449, name='python', status='sleeping', started='15:09:19')
Running tasks:
psutil.Process(pid=2289508, name='python', status='sleeping', started='15:09:22')
Sending SIGUSR1 to process 2289508..
@svandenhaute

You can use

@benclifford
Collaborator

You could play it safe and always checkpoint periodically. Brute-forcing it should work in most scenarios, but feels somewhat inelegant.

This is pretty much the traditional approach that parsl's worker model has had, but in recent times we've been pushing more towards managing the end of things a bit better, mostly with things like the drain time and trying to avoid placing tasks on soon-to-end workers (see also #3323).

Having the worker pool send Unix signals to launched bash apps is probably an interesting thing to implement, triggered either by the external batch system or by knowledge of the environment (drain style).
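
To illustrate what the relay half could look like (a hypothetical sketch, not existing Parsl behaviour; the function names and the choice of SIGUSR1 are assumptions), a pool-side process that received the batch system's warning could pass it on to the worker subtrees with psutil:

import signal

import psutil


def forward_to_tasks(sig: int = signal.SIGUSR1) -> None:
    """Relay `sig` to every descendant of the current process (workers and their tasks)."""
    for child in psutil.Process().children(recursive=True):
        try:
            child.send_signal(sig)
        except psutil.NoSuchProcess:
            pass  # the task finished between listing and signalling


def install_forwarder(warning_signal: int = signal.SIGUSR1) -> None:
    """Re-raise the batch system's warning signal towards running tasks."""
    signal.signal(warning_signal, lambda signum, frame: forward_to_tasks(signum))

Paired with Slurm's --signal option or drain-style knowledge of the allocation, a relay like this would remove the need for the separate timer script above.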
