Betzy problem when running on 2 nodes with noresm2.5 #594

Open
mvertens opened this issue Nov 21, 2024 · 7 comments

Comments

@mvertens

I have now encountered the same issue when running I compsets and F compsets on 2 nodes. Errors like the following appear in the cesm.log file:

134: [b3296.betzy.sigma2.no:1104993] pml_ucx.c:911 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
134: [b3296:1104993] *** An error occurred in MPI_Send
134: [b3296:1104993] *** reported by process [23299297509376,134]
134: [b3296:1104993] *** on communicator MPI COMMUNICATOR 21 CREATE FROM 20
134: [b3296:1104993] *** MPI_ERR_OTHER: known error not in list
134: [b3296:1104993] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
134: [b3296:1104993] *** and potentially your MPI job)

A workaround seems to be to increase the number of nodes to 4, after which everything works. I am also writing to sigma2 to raise this issue.
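
For anyone who wants to check how widespread the failure is within a run, here is a minimal sketch that scans a log for the `pml_ucx.c` error lines shown above and reports which task labels and hosts they came from. The function name `summarize_ucx_errors` and the regular expression are illustrative assumptions based only on the line format quoted in this comment; the leading number is assumed to be the task label that prefixes each line.

```python
import re
from collections import Counter

# Pattern for lines shaped like the one quoted above (an assumption based on
# that single example, not on any documented log format):
# 134: [b3296.betzy.sigma2.no:1104993] pml_ucx.c:911 Error: mca_pml_ucx_send_nbr failed: ...
UCX_ERROR = re.compile(
    r"^(?P<task>\d+): \[(?P<host>[^\]:.]+)[^\]]*\] "
    r"pml_ucx\.c:\d+\s+Error: (?P<msg>.*)$"
)


def summarize_ucx_errors(lines):
    """Return (tasks, hosts): the set of task labels that logged a UCX
    error and a Counter of error lines per short host name."""
    tasks = set()
    hosts = Counter()
    for line in lines:
        match = UCX_ERROR.match(line.strip())
        if match:
            tasks.add(int(match.group("task")))
            hosts[match.group("host")] += 1
    return tasks, hosts
```

It could be used as `tasks, hosts = summarize_ucx_errors(open(path))`, where `path` points at the cesm.log file of the failed run. If all errors come from a single host, that would be worth mentioning in the sigma2 ticket.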

@TomasTorsvik
Contributor

TomasTorsvik commented Nov 22, 2024

@mvertens - Hi, Betzy has a minimum node number of 4 for jobs on the "normal" queue. "devel" jobs can run on 1-4 nodes for a short time (up to 60 min). It seems this requirement is there to encourage moving smaller jobs to Fram, so that these do not fill up the queue on Betzy.

See job types description here:
https://documentation.sigma2.no/jobs/job_types/betzy_job_types.html

@gold2718

The NorESM configuration for Betzy currently only sends jobs with 4 or more nodes to the normal queue. The devel queue is marked with a minimum of 1 and a maximum of 4 nodes and the preproc queue has no restrictions.
See <ccs_config>/machines/betzy/config_batch.xml.

@mvertens, which queue was used for your job? Even if it was devel (which I think should have worked), I think we should restrict preproc to 1 node, as that is the Betzy limit.
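
To make the node-count rules described above concrete, here is a minimal sketch of which queues a job of a given size would be eligible for. The `QueueRule` class, the `RULES` list, and `eligible_queues` are hypothetical names for illustration only; they do not mirror the actual contents or schema of config_batch.xml, and the limits are simply the ones stated in this comment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QueueRule:
    name: str
    node_min: int
    node_max: Optional[int]  # None means no upper limit


# Limits as described in this thread (assumed, not read from config_batch.xml):
# devel takes 1-4 nodes, normal takes 4 or more, preproc is unrestricted.
RULES = [
    QueueRule("devel", 1, 4),
    QueueRule("normal", 4, None),
    QueueRule("preproc", 1, None),
]


def eligible_queues(nodes: int, rules: List[QueueRule] = RULES) -> List[str]:
    """Return the names of all queues whose node limits admit `nodes`."""
    if nodes < 1:
        raise ValueError("a job needs at least one node")
    return [
        rule.name
        for rule in rules
        if nodes >= rule.node_min
        and (rule.node_max is None or nodes <= rule.node_max)
    ]
```

With these rules, `eligible_queues(2)` gives `["devel", "preproc"]`, so a 2-node job never reaches the normal queue. The restriction proposed above would correspond to changing the preproc entry to `QueueRule("preproc", 1, 1)`.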

@JensBDebernard
Contributor

Thanks Mariana. I have experienced the same error with the MakingWave code during the last week.
The wave-ocean-ice configuration with data atmosphere was working fine on 2 nodes (devel queue) from the start of November. But around November 13-15 something changed, and now none of these compsets (or perturbations thereof) run with the same PE layout. I will try increasing the number of nodes, although that is more costly in the debugging phase.

@mvertens
Author

@TomasTorsvik @gold2718 @JensBDebernard - I have double-checked and the queue is devel. This also worked for me until around the 15th and then suddenly stopped working. I have raised an issue with sigma2.

@gold2718

Should we take the hint and set up a test suite of smaller tests on Fram? It would just mean firing off and then checking two test runs instead of one.

@mvertens
Author

I got a response from sigma2 that they have escalated this ticket to their second line support, and they'll follow up shortly.

@TomasTorsvik
Contributor

Our quota on Fram is quite limited, only 150K CPU hours on nn2345k. We could ask for an increased quota, but it would probably be "non-prioritized" for the current allocation period.
