rDVC executes DVC pipelines on SLURM clusters.
It expects an init_python_venv.sh script at the root of the DVC repository, which sets up the full environment for execution.
Install rDVC and navigate into a local copy of your DVC repository. Make sure it is synchronised with the remote (i.e. stage, commit and git push your changes). Once everything is set up correctly, you can kick off a remote job with
$ rdvc -S my_param=a
... # wait for the job to finish
$ dvc exp pull origin
The following steps take place on your local machine.
- copy your SSH key to the cluster: ssh-copy-id <SLURM_CLUSTER_SSH>
- run the rdvc init subcommands:
  - (one-off) rdvc init global
  - (one-off) rdvc init remote
  - (once for every project) rdvc init project, in your project directory
It is the user's responsibility to guarantee the remote job's access to any relevant secrets such as GitHub credentials. If in doubt, speak to your infrastructure team.
A (non-exhaustive) list of things to set up for your remote user account:
- Git credentials
  - likely as an SSH key
- private PyPI repository keys
- Experiment tracking credentials, e.g. for Iterative Studio
  - these go in ~/.config/dvc/config
For the demo project, ensure that Python 3.11 is available on the SLURM cluster.
All options are loaded, in order of increasing priority, from:
- ~/.config/rdvc/config.toml
- PROJECT_ROOT/.rdvc/config.toml
- rdvc CLI options (read rdvc --help for details)
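The layering above can be pictured as a sequence of dictionary merges in which later sources override earlier ones. The following is only a minimal sketch of that idea, not rDVC's actual implementation: the function names and the option keys (`partition`, `time`) are hypothetical, and a shallow, key-by-key merge is an assumption.

```python
from pathlib import Path
from typing import Any


def load_toml(path: Path) -> dict[str, Any]:
    """Return the parsed TOML file, or an empty dict if it does not exist."""
    if not path.is_file():
        return {}
    import tomllib  # standard library from Python 3.11 onwards

    with path.open("rb") as fh:
        return tomllib.load(fh)


def merge_layers(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge option layers; later layers take priority over earlier ones."""
    merged: dict[str, Any] = {}
    for layer in layers:
        merged.update(layer)  # shallow merge (an assumption)
    return merged


def resolve_options(project_root: Path, cli_options: dict[str, Any]) -> dict[str, Any]:
    """Combine the three sources in order of increasing priority."""
    return merge_layers(
        load_toml(Path.home() / ".config" / "rdvc" / "config.toml"),
        load_toml(project_root / ".rdvc" / "config.toml"),
        # Only options actually passed on the command line should override.
        {key: value for key, value in cli_options.items() if value is not None},
    )


# Hypothetical option names, for illustration only.
global_cfg = {"partition": "cpu", "time": "01:00:00"}
project_cfg = {"partition": "gpu"}
print(merge_layers(global_cfg, project_cfg, {}))
```

In this sketch a project-level value replaces the global one, while any key the project file leaves out falls through from the global file.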
Remote runs generate a number of files. You can find them, by type, in your (remote) home directory:
- submitted jobs: $HOME/.rdvc/submissions/%Y-%m-%d-%H-%M-%S-%f-git_hash-sbatch_script_hash
- logs: $HOME/.rdvc/logs/slurm-$SLURM_JOB_ID.out
- job working directories: $HOME/.rdvc/workspaces/$SLURM_JOB_ID
- default DVC cache: $HOME/.dvc/cache
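To locate the files of a particular run, it can help to see how the patterns above expand. The helpers below are a hypothetical sketch (none of these function names come from rDVC) that rebuilds the paths from a home directory, a submission time, the two hashes and a SLURM job ID. How rDVC computes the hashes is not covered here, so they are passed in as plain strings.

```python
from datetime import datetime
from pathlib import Path


def submission_dir(home: Path, submitted_at: datetime,
                   git_hash: str, sbatch_script_hash: str) -> Path:
    """Expand the submissions pattern; %f is microseconds, zero-padded to 6 digits."""
    stamp = submitted_at.strftime("%Y-%m-%d-%H-%M-%S-%f")
    return home / ".rdvc" / "submissions" / f"{stamp}-{git_hash}-{sbatch_script_hash}"


def log_file(home: Path, job_id: int) -> Path:
    """Path of the SLURM output log for a given job ID."""
    return home / ".rdvc" / "logs" / f"slurm-{job_id}.out"


def workspace_dir(home: Path, job_id: int) -> Path:
    """Path of the working directory for a given job ID."""
    return home / ".rdvc" / "workspaces" / str(job_id)


# Illustrative values only.
home = Path("/home/alice")
when = datetime(2024, 1, 2, 3, 4, 5, 678)
print(submission_dir(home, when, "abc1234", "def5678"))
print(log_file(home, 4242))
print(workspace_dir(home, 4242))
```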
rDVC works by composing and submitting an sbatch script. It needs to allocate the job to a partition that exists in the cluster. The list (and properties) of available partitions is defined in src/rdvc/slurm/instance.py as the Enum class InstanceTypes. Modify its entries to match your SLURM cluster's configuration. The current version of rDVC does not support setting up instance types via config, so you have to fork rDVC and modify its source code.
You can find out which partitions and node types are available by running sinfo while logged into the SLURM cluster.
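As a rough illustration of what an Enum of instance types might look like, here is a hypothetical sketch. The actual fields and members of InstanceTypes in src/rdvc/slurm/instance.py may differ, so check the source before editing it. The `InstanceSpec` dataclass, the member names, the partition names and the `sbatch_directives` helper are all assumptions made for this example. Only the `#SBATCH` options themselves are standard SLURM options.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class InstanceSpec:
    """Hypothetical description of one partition and the resources to request."""
    partition: str
    cpus: int
    memory_gb: int
    gpus: int = 0


class InstanceTypes(Enum):
    # Replace these entries with the partitions reported by sinfo on your cluster.
    CPU_SMALL = InstanceSpec(partition="cpu", cpus=4, memory_gb=16)
    GPU_LARGE = InstanceSpec(partition="gpu", cpus=16, memory_gb=128, gpus=1)


def sbatch_directives(instance: InstanceTypes) -> list[str]:
    """Render the resource request as #SBATCH header lines."""
    spec = instance.value
    lines = [
        f"#SBATCH --partition={spec.partition}",
        f"#SBATCH --cpus-per-task={spec.cpus}",
        f"#SBATCH --mem={spec.memory_gb}G",
    ]
    if spec.gpus:
        lines.append(f"#SBATCH --gres=gpu:{spec.gpus}")
    return lines


print("\n".join(sbatch_directives(InstanceTypes.GPU_LARGE)))
```

Keeping each partition's properties in one immutable value makes a cluster-specific fork a matter of editing the Enum members only.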