many ssh connections to slurm, even when the study is finished #2142

Open
insatomcat opened this issue Sep 17, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@insatomcat
Contributor

Description

When using a slurm launcher, I can see antarest making a lot of SSH connections to slurm (about 10-12 per second).
They start with the first launch of a study and never stop unless I shut down the antarest container.

This leads to two questions:

  • how many SSH connections is antarest supposed to make to slurm? (12/s seems like a lot...)
  • is it a bug that the connections never stop even though the study is finished?

If this is specific to my setup, any advice on how I should debug this?
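
For context, a rough way to count the established SSH connections from the antarest container looks like the sketch below (`SLURM_HOST` is a placeholder for the address of the slurm login node, and it assumes the `ss` utility is available inside the container):

```python
# Sketch: count established TCP connections from this container to port 22
# on the slurm login node. SLURM_HOST is a placeholder address, and the
# `ss` utility is assumed to be available inside the antarest container.
import subprocess

SLURM_HOST = "10.0.0.42"  # placeholder: address of the slurm login node

out = subprocess.run(["ss", "-tn"], capture_output=True, text=True, check=True).stdout
count = sum(
    1
    for line in out.splitlines()
    if line.startswith("ESTAB") and f"{SLURM_HOST}:22" in line
)
print(f"{count} established SSH connections to {SLURM_HOST}")
```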

Thanks.

[image attached]
insatomcat added the bug label on Sep 17, 2024
@MartinBelthle
Contributor

MartinBelthle commented Sep 27, 2024

Hello, the number of SSH connections depends on how many workers you launch the app with.
Currently every worker has (or creates) its own tmp file inside the slurm_workspace, and each one handles its own study launches.
There's a loop in the code, the method _loop inside slurm_launcher.py, in which the worker asks slurm for the state of the running jobs.
The loop runs every 2 seconds and makes only one SSH connection per iteration (I believe), so I don't really know why you have so many connections.

Also, I think the fact that it never stops is a bug, as the method stop() inside the same file is supposed to stop the loop.
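
For illustration, here is a minimal sketch of that kind of polling loop with a stop flag. This is not the actual antarest implementation: the class name and the callback are placeholders, and only the `_loop`/`stop()` naming mirrors the description above.

```python
import threading


class SlurmJobPoller:
    """Minimal sketch of the polling pattern described above.

    Not the actual antarest code: the class name, callback and details
    are assumptions. One SSH round-trip every 2 seconds would be about
    0.5 connection/s per worker, so 10-12/s suggests either many workers
    or several connections per iteration.
    """

    def __init__(self, check_job_states, period_seconds: float = 2.0):
        self._check_job_states = check_job_states  # callable doing the SSH query
        self._period = period_seconds
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _loop(self) -> None:
        # Runs until stop() is called; each iteration is expected to open
        # a single SSH connection to query the state of the running jobs.
        while not self._stop_event.is_set():
            self._check_job_states()
            self._stop_event.wait(self._period)

    def stop(self) -> None:
        # If this is never called once all studies are finished (or the
        # flag is never checked), the loop keeps polling forever, which
        # would match the behaviour reported in this issue.
        self._stop_event.set()
        self._thread.join(timeout=2 * self._period)
```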

@sylvlecl if you have an explanation, feel free to add it

@sylvlecl
Member

sylvlecl commented Oct 9, 2024

Not much more explanation on my side; it would be interesting to dig into this at some point. Neither observation is expected (the number of open sessions, and the fact that they don't stop when there is no more computation running).

In any case, this should be improved when we refactor the launcher service to centralize the scanning of job statuses, but it will be interesting to monitor.

@insatomcat
Contributor Author

I can't reproduce how I found this, but I suspect it has to do with slurm accounting. Can you tell me whether antarest expects the accounting feature to be available on the slurm cluster, and if so, where the relevant code is?
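
For reference, whether accounting is enabled on the cluster itself can be checked with standard slurm tooling, independently of antarest. A sketch (to run on a node that can reach the slurm controller):

```python
# Sketch: check whether slurm accounting storage is configured.
# Uses standard slurm CLI tools (scontrol, sacct); this is not antarest code.
import subprocess

config = subprocess.run(
    ["scontrol", "show", "config"], capture_output=True, text=True, check=True
).stdout
for line in config.splitlines():
    if "AccountingStorage" in line:
        print(line.strip())  # e.g. "AccountingStorageType = accounting_storage/slurmdbd"

# With accounting enabled, sacct returns job records instead of an error.
sacct = subprocess.run(
    ["sacct", "--format=JobID,State,Elapsed", "--noheader"],
    capture_output=True, text=True,
)
print(sacct.stdout or sacct.stderr)
```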
Thanks.
