If a Slurm node has 8 GPUs, there is typically a parameter you can set to request only a subset, e.g. 1 or 2. This is very useful for inference with small models, since partitioning them across 8 GPUs is usually unnecessary. While our current code supports requesting a subset, I believe we still reserve the whole node because of https://github.com/Kipok/NeMo-Skills/blob/main/nemo_skills/pipeline/utils.py#L469.
We need to test whether that's indeed the case (i.e. that we get the full node even when requesting only a fraction of its GPUs). If it is, we should try removing that parameter and check whether we can still launch parallel srun jobs on a single node and whether that solves the issue. This might require reading through the Slurm documentation and some experimentation.
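For reference, a minimal sketch of the kind of change being discussed, assuming the linked line unconditionally adds an exclusive-node reservation; the function and constant names below are hypothetical and only illustrate dropping the exclusive flag when a job asks for fewer GPUs than the node has:

```python
# Hypothetical sketch: build srun arguments so that jobs requesting fewer GPUs
# than a full node do NOT reserve the whole node. Names here are illustrative,
# not the actual NeMo-Skills API.

GPUS_PER_NODE = 8  # assumption: 8-GPU nodes, as described in the issue


def build_srun_args(num_gpus: int) -> list[str]:
    """Return srun flags for a job that needs `num_gpus` GPUs on one node."""
    args = [
        "srun",
        "--nodes=1",
        f"--gres=gpu:{num_gpus}",
    ]
    # Only take the node exclusively when the job actually needs all of its GPUs;
    # otherwise Slurm can pack several such jobs onto the same node.
    if num_gpus >= GPUS_PER_NODE:
        args.append("--exclusive")
    return args


if __name__ == "__main__":
    # A 1-GPU inference job should leave the remaining 7 GPUs schedulable.
    print(" ".join(build_srun_args(1)))
    # An 8-GPU job still reserves the full node.
    print(" ".join(build_srun_args(8)))
```

Whether two such sub-node jobs actually land on the same node could then be checked during the experiment, e.g. with `squeue` and `scontrol show node <nodename>`.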