Betzy problem when running on 2 nodes with noresm2.5 #594
@mvertens - Hi, Betzy has a minimum of 4 nodes for jobs on the "normal" queue. "devel" jobs can run on 1-4 nodes for a short time (up to 60 min). It seems this requirement is there to encourage moving smaller jobs to Fram, so that these do not fill up the queue on Betzy. See the job types description here:
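For reference, a minimal sketch of a Slurm job header that would request the devel job type on Betzy. The --qos=devel flag and the 1-4 node / 60-minute limits follow the Sigma2 job-type description referred to above; the account shown is the project mentioned later in this thread and is only illustrative.

#!/bin/bash
# Sketch of a Betzy "devel" job header (illustrative values; adapt account and time to your project).
#SBATCH --account=nn2345k        # project account (illustrative)
#SBATCH --qos=devel              # devel job type: 1-4 nodes, up to 60 minutes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128    # Betzy compute nodes have 128 cores
#SBATCH --time=01:00:00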
The NorESM configuration for Betzy currently only sends jobs with 4 or more nodes to the normal queue. The devel queue is marked with a minimum of 1 and a maximum of 4 nodes, and the preproc queue has no restrictions. @mvertens, which queue was used for your job? Even if it was devel (which I think should have worked), I think we should restrict preproc to 1 node as that is the Betzy limit.
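As an illustration (not the prescribed NorESM workflow), the queue a case will be submitted to can be inspected and overridden with standard CIME case commands; the queue names and limits are the Betzy ones described above.

# Run inside the case directory.
./xmlquery JOB_QUEUE                  # show which queue case.submit will target
./xmlchange JOB_QUEUE=devel --force   # force the devel queue (1-4 nodes, up to 60 min on Betzy)
./case.submit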
Thanks Mariana. I have experienced the same error with the MakingWave code during the last week.
@TomasTorsvik @gold2718 @JensBDebernard - I have double checked and the queue is devel. This also worked for me up until around the 15th and suddenly stopped working. I have raised an issue with sigma2. |
Should we take the hint and set up a test suite of smaller tests on Fram? It would just mean firing off and then checking two test runs instead of one. |
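A hedged sketch of what firing off such a small test on Fram could look like with CIME's create_test; the test name, project, and flags here are purely illustrative and not an agreed-upon suite.

# Illustrative only: launch one small smoke test on Fram.
./create_test SMS_D_Ld1.f19_g17.F2000climo --machine fram --project nn2345k
# After the run, the generated cs.status script in the test root summarizes PASS/FAIL.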
I got a response from sigma2 that they have escalated this ticket to their second line support, and they'll follow up shortly. |
Our quota on Fram is quite limited, only 150K CPU hours on nn2345k. We could ask for an increased quota, but it would probably be "non-prioritized" for the current allocation period. |
I have now encountered the same issue when running I compsets and F compsets on 2 nodes. Errors like the following appear in the cesm.log file:
134: [b3296.betzy.sigma2.no:1104993] pml_ucx.c:911 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
134: [b3296:1104993] *** An error occurred in MPI_Send
134: [b3296:1104993] *** reported by process [23299297509376,134]
134: [b3296:1104993] *** on communicator MPI COMMUNICATOR 21 CREATE FROM 20
134: [b3296:1104993] *** MPI_ERR_OTHER: known error not in list
134: [b3296:1104993] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
134: [b3296:1104993] *** and potentially your MPI job)
The solution seems to be to increase the number of nodes to 4, and then everything works. I am writing to sigma2 to raise this issue as well.
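For completeness, a minimal sketch of that 4-node workaround for an existing case using standard CIME commands; the task count assumes Betzy's 128 cores per node and is illustrative, the point being simply to request at least 4 nodes.

# Workaround sketch: ask for at least 4 Betzy nodes (4 x 128 cores) by raising the task count,
# then re-run setup and rebuild before submitting. Exact counts are illustrative.
./xmlchange NTASKS=512
./case.setup --reset
./case.build
./case.submit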