When using a Slurm launcher, I can see antarest opening a lot of SSH connections to Slurm (about 10-12 per second).
They start with the first launch of a study and never stop unless I shut down the antarest container.
That leads to 2 questions:
How many SSH connections is antarest supposed to open to Slurm? (12 per second seems like a lot...)
Is it a bug that the connections never stop even though the study is done?
If this is specific to my setup, any advice on how I should debug this?
Thanks.
Hello, the number of SSH connections depends on how many workers you launch the app with.
Currently, every worker has or creates its own tmp file inside the slurm_workspace, and each one handles its own study launches.
There is a loop in the code, the _loop method in slurm_launcher.py, that the worker uses to ask Slurm for the state of the running job.
The loop runs every 2 seconds and opens only one SSH connection (I believe), so I don't really know why you are seeing so many connections.
I also think the fact that it never stops is a bug, since the stop() method in the same file is supposed to stop the loop.
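To make the expected behavior concrete, here is a minimal sketch of a polling loop that makes one status check per interval and exits once stop() is called. It is not the code in slurm_launcher.py: the PollingLoop class, the check_state callable, and the interval parameter are hypothetical names used only for illustration.

```python
import threading
import time


class PollingLoop:
    """Hypothetical stand-in for a launcher status loop (not AntaREST's actual code)."""

    def __init__(self, check_state, interval=2.0):
        # check_state is assumed to perform exactly one status query
        # (for example, one SSH call to the cluster).
        self._check_state = check_state
        self._interval = interval
        self._stop_event = threading.Event()
        self._thread = None

    def start(self):
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while not self._stop_event.is_set():
            self._check_state()
            # wait() returns early as soon as stop() sets the event,
            # so the loop does not sleep through a stop request.
            self._stop_event.wait(self._interval)

    def stop(self):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None
```

With a loop of this shape, one worker would make roughly one check every 2 seconds, and no further checks after stop() returns. If the real loop behaves differently (for instance, if stop() is never called when the last job finishes, or if each check opens several connections), that would be a place to start looking.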
I don't have much more of an explanation. It would be interesting to dig into this at some point, since neither observation is expected (the number of open sessions, and the fact that they don't stop when there is no more computation).
In any case, this should improve when we refactor the launcher service to centralize the scanning of job statuses, but it will be interesting to monitor.
I can't reproduce how I found this, but I have a feeling it has to do with Slurm accounting. Can you tell me whether antarest expects the accounting feature to be available on the Slurm cluster and, if so, where the code for this is?
Thanks.