Betzy problem when running on 2 nodes with noresm2.5 #594

mvertens · 2024-11-21T18:10:46Z

I have now encountered the same issue when running I compsets and F compsets on 2 nodes. Errors like the following appear in the cesm.log file:

134: [b3296.betzy.sigma2.no:1104993] pml_ucx.c:911 Error: mca_pml_ucx_send_nbr failed: -25, Connection reset by remote peer
134: [b3296:1104993] *** An error occurred in MPI_Send
134: [b3296:1104993] *** reported by process [23299297509376,134]
134: [b3296:1104993] *** on communicator MPI COMMUNICATOR 21 CREATE FROM 20
134: [b3296:1104993] *** MPI_ERR_OTHER: known error not in list
134: [b3296:1104993] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
134: [b3296:1104993] *** and potentially your MPI job)

The solution seems to be to increase the number of nodes to 4 - and then everything works. I am writing to sigma2 to raise this issue as well.
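To see which ranks and hosts report the failure, a cesm.log can be scanned for lines shaped like the excerpt above. This is a minimal sketch, not part of NorESM or CIME: the function name `mpi_error_sources` is hypothetical, and the line format (`<rank>: [<host>:<pid>] <message>`) is assumed from the excerpt in this comment only.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt in this thread:
#   "134: [b3296:1104993] *** An error occurred in MPI_Send"
#   "134: [b3296.betzy.sigma2.no:1104993] pml_ucx.c:911 Error: ..."
LINE_RE = re.compile(
    r"^(?P<rank>\d+): \[(?P<host>[^:\]]+):(?P<pid>\d+)\] (?P<msg>.*)$"
)


def mpi_error_sources(log_lines):
    """Count error lines per (rank, short host name).

    A line counts as an error line if its message starts with "***"
    or contains "Error:" (a heuristic, not an Open MPI guarantee).
    """
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not in the "<rank>: [<host>:<pid>] ..." shape
        msg = match.group("msg")
        if msg.startswith("***") or "Error:" in msg:
            # Drop the domain part so "b3296.betzy.sigma2.no" and "b3296" agree.
            short_host = match.group("host").split(".")[0]
            counts[(int(match.group("rank")), short_host)] += 1
    return counts
```

Calling it as `mpi_error_sources(open("cesm.log"))` gives a per-rank, per-host tally. If every failing rank sits on the same node, that is worth mentioning in the sigma2 ticket.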

TomasTorsvik · 2024-11-22T06:55:01Z

@mvertens - Hi, Betzy has a minimum node number of 4 for jobs on the "normal" queue. "devel" jobs can run on 1-4 nodes for a short time (up to 60 min). It seems this requirement is there to encourage moving smaller jobs to Fram, so that these do not fill up the queue on Betzy.

See job types description here:
https://documentation.sigma2.no/jobs/job_types/betzy_job_types.html

gold2718 · 2024-11-22T08:35:22Z

The NorESM configuration for Betzy currently only sends jobs with 4 or more nodes to the normal queue. The devel queue is marked with a minimum of 1 and a maximum of 4 nodes and the preproc queue has no restrictions.
See <ccs_config>/machines/betzy/config_batch.xml.

@mvertens, which queue was used for your job? Even if it was devel (which I think should have worked), I think we should restrict preproc to 1 node, as that is the Betzy limit.
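The queue limits discussed so far can be summarised in a small sketch. It is illustrative only: the `QueueRule` class and the `eligible_queues` function are hypothetical, and the numbers are taken from the comments in this thread, not read from config_batch.xml. The preproc entry models the 1-node restriction proposed above, not the current configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QueueRule:
    name: str
    node_min: int = 1
    node_max: Optional[int] = None  # None means no upper limit


# Limits as described in this thread (assumption: not copied from config_batch.xml).
QUEUES = [
    QueueRule("normal", node_min=4),
    QueueRule("devel", node_min=1, node_max=4),
    QueueRule("preproc", node_min=1, node_max=1),  # proposed restriction, not current
]


def eligible_queues(num_nodes: int, queues: List[QueueRule] = QUEUES) -> List[str]:
    """Return the names of queues whose node limits allow num_nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return [
        q.name
        for q in queues
        if q.node_min <= num_nodes
        and (q.node_max is None or num_nodes <= q.node_max)
    ]
```

Under these assumptions, `eligible_queues(2)` returns `["devel"]`, so a 2-node job should be accepted by the batch configuration. That is consistent with the failure happening at run time in MPI and not at submission.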

JensBDebernard · 2024-11-22T08:37:58Z

Thanks Mariana. I have experienced the same error with the MakingWave code during the last week.
wave-ocean-ice with data atmos was working fine on 2 nodes (develop queue) from the start of November. But suddenly, around November 13-15th, something changed, so none of these compsets (or perturbations thereof) are running with the same pe-layout. I will try increasing the number of nodes, although it is more costly in the debug phase.

mvertens · 2024-11-22T08:40:14Z

@TomasTorsvik @gold2718 @JensBDebernard - I have double checked and the queue is devel. This also worked for me up until around the 15th and suddenly stopped working. I have raised an issue with sigma2.

gold2718 · 2024-11-22T08:41:07Z

Should we take the hint and set up a test suite of smaller tests on Fram? It would just mean firing off and then checking two test runs instead of one.

mvertens · 2024-11-22T08:50:46Z

I got a response from sigma2 that they have escalated this ticket to their second line support, and they'll follow up shortly.

TomasTorsvik · 2024-11-22T09:20:59Z

Our quota on Fram is quite limited, only 150K CPU hours on nn2345k. We could ask for an increased quota, but it would probably be "non-prioritized" for the current allocation period.

github-project-automation bot added this to NorESM Development Nov 21, 2024

mvertens added the betzy issue label Nov 21, 2024

github-project-automation bot moved this to Todo in NorESM Development Nov 21, 2024

mvertens removed this from NorESM Development Nov 21, 2024
