Have a way to use only part of a slurm node #165

Open
Kipok opened this issue Oct 7, 2024 · 0 comments


Kipok commented Oct 7, 2024

On Slurm nodes with 8 GPUs, there is typically a parameter that lets a job request only a subset of them, e.g. 1 or 2. This is very useful for inference with small models, since partitioning them across 8 GPUs is rarely necessary. Our current code supports setting this, but I believe we still reserve the whole node because of https://github.com/Kipok/NeMo-Skills/blob/main/nemo_skills/pipeline/utils.py#L469.
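To make the idea concrete, here is a minimal sketch of how a partial-node request could be turned into Slurm flags. The helper `build_gpu_request_flags` and its arguments are hypothetical and not part of NeMo-Skills. `--nodes`, `--gres=gpu:N` and `--exclusive` are standard `sbatch`/`srun` options, but `--exclusive` is used here only as an example of a setting that reserves a whole node; this issue does not confirm that it is the parameter at the linked line.

```python
def build_gpu_request_flags(
    num_gpus: int, gpus_per_node: int = 8, whole_node: bool = False
) -> list[str]:
    """Build Slurm flags for a single-node job that wants `num_gpus` GPUs.

    Hypothetical helper for illustration only.
    """
    if not 1 <= num_gpus <= gpus_per_node:
        raise ValueError(f"num_gpus must be between 1 and {gpus_per_node}")

    # Ask for one node and only the requested number of GPUs on it.
    flags = ["--nodes=1", f"--gres=gpu:{num_gpus}"]

    # Assumption: a flag such as --exclusive stands in for whatever
    # currently makes us hold the whole node.
    if whole_node:
        flags.append("--exclusive")
    return flags


print(" ".join(build_gpu_request_flags(2)))
```

With `whole_node=False` the request only names 2 GPUs, which is the behavior we would like small-model inference jobs to get.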

We need to test whether that is indeed the case, i.e. that we get the full node even when we request only a fraction of it. If so, we should try removing that parameter and check that (a) we can still launch parallel srun jobs on a single node and (b) the whole-node reservation goes away. This might require reading through the Slurm documentation and experimenting.
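One possible way to run that test is to submit a job that requests a fraction of a node and then inspect what Slurm actually allocated via `scontrol show job <jobid>`. The sketch below parses the allocated TRES field from that output. The helper `parse_allocated_tres` is hypothetical, the sample text is made up for illustration rather than copied from a real job, and the field name (`TRES=` or `AllocTRES=`) is an assumption that may differ between Slurm versions.

```python
import re


def parse_allocated_tres(scontrol_output: str) -> dict[str, str]:
    """Extract the allocated TRES entries from `scontrol show job` text.

    Hypothetical helper; returns an empty dict if no TRES field is found.
    """
    # Assumption: the allocation is reported as TRES=... or AllocTRES=...
    # The \b keeps us from matching the TRES inside ReqTRES=...
    match = re.search(r"\b(?:AllocTRES|TRES)=(\S+)", scontrol_output)
    if match is None:
        return {}

    tres = {}
    for item in match.group(1).split(","):
        key, _, value = item.partition("=")
        tres[key] = value
    return tres


# Made-up sample in the general shape of scontrol output (not a real log).
sample = (
    "   NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=1\n"
    "   TRES=cpu=16,mem=64G,node=1,billing=16,gres/gpu=2\n"
)

tres = parse_allocated_tres(sample)
print(tres.get("gres/gpu"), tres.get("cpu"))
```

Comparing the `cpu` and `gres/gpu` entries against what was requested should show whether the job holds more of the node than it asked for. Whether the GPU count alone reveals a whole-node reservation needs to be confirmed experimentally, so checking whether a second job can be scheduled on the same node is a useful cross-check.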
