Task resiliency if cancelled before any time has elapsed #889

Open
pagrubel opened this issue Jul 25, 2024 · 2 comments
@pagrubel
Collaborator

A node failed while a job was being submitted, which caused the job and its task to be cancelled automatically. This was a system problem, not a BEE problem. Should we check the elapsed time and, if it is zero, resubmit the task?

Here is the example status of the job:

--------------- --------------- --------- --------------- ---------- ---------- ------------------- ------------------- ---------- 
       15926486 cat-5793969dbb+     kvats           cn213    general CANCELLED+ 2024-07-24T09:30:44 2024-07-24T09:30:44   00:00:00 
 15926486.batch           batch                     cn213             CANCELLED 2024-07-24T09:30:44 2024-07-24T09:30:44   00:00:00 
15926486.extern          extern                     cn213             CANCELLED 2024-07-24T09:30:44 2024-07-24T09:30:44   00:00:00 

FYI, I was able to get this information using the following commands:

export SACCT_FORMAT=jobid%15,jobname%15,user,nodelist,partition,state,start,end,elapsed
sacct -j 15926486
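
The check described above could be sketched roughly as follows. This is only an illustration, not BEE code: the function names are hypothetical, and it assumes `sacct` is queried with `--parsable2 --noheader` so each line is `JobID|State|Elapsed`. In that output the state may read `CANCELLED by <uid>`, so the sketch matches on the prefix.

```python
import subprocess


def elapsed_to_seconds(elapsed: str) -> int:
    """Convert a Slurm elapsed string ([DD-]HH:MM:SS) to seconds."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cancelled_before_start(sacct_text: str) -> bool:
    """Return True if every job step is CANCELLED with zero elapsed time.

    Expects pipe-delimited lines of the form JobID|State|Elapsed.
    """
    rows = [line.split("|") for line in sacct_text.splitlines() if line.strip()]
    if not rows:
        return False  # no accounting data yet, so make no claim
    return all(
        state.startswith("CANCELLED") and elapsed_to_seconds(elapsed) == 0
        for _job_id, state, elapsed in rows
    )


def query_sacct(job_id: str) -> str:
    """Run sacct for one job (hypothetical helper; needs a Slurm installation)."""
    cmd = ["sacct", "-j", job_id, "--format=JobID,State,Elapsed",
           "--parsable2", "--noheader"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

Keeping the parsing separate from the `sacct` call means the decision logic can be tested without a cluster.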
@pagrubel
Collaborator Author

If the state is CANCELLED and no time has elapsed, we should consider resubmitting the job, with a limit on how many times this is attempted.
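
A minimal sketch of the bounded resubmission idea, under stated assumptions: `submit` and `was_cancelled_at_zero` are hypothetical callables standing in for whatever BEE uses to submit a task and to inspect its final state, and the cap of 3 is an arbitrary placeholder. It also assumes the check is only made once the job has reached a terminal state.

```python
from typing import Callable, Tuple

MAX_RESUBMITS = 3  # hypothetical cap; the right value is still to be decided


def run_with_resubmit(
    submit: Callable[[], str],
    was_cancelled_at_zero: Callable[[str], bool],
    max_resubmits: int = MAX_RESUBMITS,
) -> Tuple[str, int]:
    """Submit a task and resubmit it while it is cancelled with zero elapsed time.

    Returns the last job ID and the number of resubmissions performed.
    """
    job_id = submit()
    resubmits = 0
    # Stop as soon as the job was not cancelled at zero or the cap is reached.
    while was_cancelled_at_zero(job_id) and resubmits < max_resubmits:
        job_id = submit()
        resubmits += 1
    return job_id, resubmits
```

If the cap is reached, the last job ID is returned even though that job was also cancelled, so the caller would still need to mark the task as failed.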

@pagrubel
Collaborator Author

pagrubel commented Oct 1, 2024

We are going to talk to the Slurm experts about this.
