
Run-time errors associated with HR3 UGWPv1 non-stationary GWD code #136

Closed
mdtoyNOAA opened this issue Dec 8, 2023 · 72 comments · Fixed by #138 or NOAA-EMC/fv3atm#735
Labels
bug Something isn't working

Comments

@mdtoyNOAA
Collaborator

Description

Cold-start, atmosphere-only 'ufs_model.x' runs experience random run-time crashes in module FV3/ccpp/physics/physics/cires_tauamf_data.F90. This occurs at various global resolutions -- so far, C384, C768 and C1152. Often the model runs successfully, but when it crashes, the crash usually happens at the first physics time step.

Steps to Reproduce

I created two "sandbox" directories on Hera:

  1. C768 resolution: /scratch1/BMC/gsd-fv3-dev/NCEPDEV/stmp3/Michael.Toy/RUNDIRS/GFS_v17_test.C768/fcst.125234.sandbox
  2. C1152 resolution: /scratch1/BMC/gsd-fv3-dev/NCEPDEV/stmp3/Michael.Toy/RUNDIRS/GFS_v17_test.C1152/fcst.15481.sandbox

At each resolution, I had runs that succeeded and runs that crashed. The output logs in each directory are:
Crashed runs: run_ufs.out.crash
Successful runs: run_ufs.out.no_crash
In each case, the output of the successful run is in the directory output.no_crash

The model is run by executing: 'sbatch run_ufs.sh'

Additional Context

  • Machine: Hera, Orion, WCOSS2
  • Compiler: Intel
  • Suite Definition File or Scheme: FV3_GFS_v17_p8_ugwpv1

Output

This is error output from an executable compiled in 'debug' mode with the following options:
-check bounds -check uninit -fpe0 -fno-alias -ftrapuv -traceback

4104: forrtl: severe (194): Run-Time Check Failure. The variable 'cires_tauamf_data_mp_tau_amf_interp_$IT2' is being used in '/scratch1/BMC/wrfruc/mtoy/git_local/global-workflow.test/sorc/ufs_model.fd/FV3/ccpp/physics/physics/cires_tauamf_data.F90(144,3)' without being defined
4104: Image PC Routine Line Source
4104: ufs_model.x 0000000008C16589 cires_tauamf_data 144 cires_tauamf_data.F90
4104: ufs_model.x 0000000008A67B08 gfs_phys_time_var 951 GFS_phys_time_vary.fv3.F90
4104: libiomp5.so 00002B31134A6BB3 __kmp_invoke_micr Unknown Unknown
4104: libiomp5.so 00002B3113422903 Unknown Unknown Unknown
4104: libiomp5.so 00002B3113421912 Unknown Unknown Unknown
4104: libiomp5.so 00002B31134A783C Unknown Unknown Unknown
4104: libpthread-2.17.s 00002B3115B8CEA5 Unknown Unknown Unknown
4104: libc-2.17.so 00002B3115E9FB0D clone Unknown Unknown

And here is error output from an executable compiled in 'debug' mode with the following options:
-check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays

708: ==== backtrace (tid: 47964) ====
708: 0 0x000000000004d455 ucs_debug_print_backtrace() ???:0
708: 1 0x00000000036e3a7a cires_tauamf_data_mp_tau_amf_interp_() /scratch1/BMC/wrfruc/mtoy/git_local/global-workflow.test/sorc/ufs_model.fd/FV3/ccpp/physics/physics/cires_tauamf_data.F90:153
708: 2 0x000000000368016a L_gfs_phys_time_vary_mp_gfs_phys_time_vary_timestep_init__865__par_region0_2_0() /scratch1/BMC/wrfruc/mtoy/git_local/global-workflow.test/sorc/ufs_model.fd/FV3/ccpp/physics/physics/GFS_phys_time_vary.fv3.F90:951
708: 3 0x000000000013fbb3 __kmp_invoke_microtask() ???:0
708: 4 0x00000000000bb903 __kmp_invoke_task_func() /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:7813
708: 5 0x00000000000ba912 __kmp_launch_thread() /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:6236
708: 6 0x000000000014083c _INTERNALa0ac8784::__kmp_launch_worker() /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/z_Linux_util.cpp:533
708: 7 0x0000000000007ea5 start_thread() pthread_create.c:0
708: 8 0x00000000000feb0d __clone() ???:0
708: =================================
708: forrtl: severe (174): SIGSEGV, segmentation fault occurred
708: Image PC Routine Line Source
708: ufs_model.x 0000000005D4901A Unknown Unknown Unknown
708: libpthread-2.17.s 00002B0960ECC630 Unknown Unknown Unknown
708: ufs_model.x 00000000036E3A7A cires_tauamf_data 153 cires_tauamf_data.F90
708: ufs_model.x 000000000368016A gfs_phys_time_var 951 GFS_phys_time_vary.fv3.F90
708: libiomp5.so 00002B095E7DEBB3 __kmp_invoke_micr Unknown Unknown
708: libiomp5.so 00002B095E75A903 Unknown Unknown Unknown
708: libiomp5.so 00002B095E759912 Unknown Unknown Unknown
708: libiomp5.so 00002B095E7DF83C Unknown Unknown Unknown
708: libpthread-2.17.s 00002B0960EC4EA5 Unknown Unknown Unknown
708: libc-2.17.so 00002B09611D7B0D clone Unknown Unknown

mdtoyNOAA added the bug (Something isn't working) label on Dec 8, 2023
@SamuelTrahanNOAA
Collaborator

This is happening because you never initialize it2. The error is in cires_tauamf_data.F90, in subroutine tau_amf_interp.

This loop is supposed to initialize it2, but it won't if the .lt. comparison fails for every element of days_limb:

         it1 = 2
         do iday=1, ntau_d2t
            if (fddd .lt. days_limb(iday) ) then
               it2 = iday
               exit
            endif
         enddo

When you get to this line, you're taking the minimum of an uninitialized value and ntau_d2t:

         it2 = min(it2,ntau_d2t)

In debug mode, you'll get the runtime abort. Without debug mode, the result will be unpredictable.
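
For illustration only, a defensive version of that index search which always defines it2 might look like this (a sketch, assuming days_limb is sorted in increasing order; this is not the code in the repository):

         ! Sketch: it2 always gets a value, even when fddd is larger than
         ! every entry of days_limb (falls back to the last time slice).
         it2 = ntau_d2t
         do iday = 1, ntau_d2t
            if (fddd < days_limb(iday)) then
               it2 = iday
               exit
            endif
         enddo
         it1 = max(it2-1, 1)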

@mdtoyNOAA
Collaborator Author

Yes, but ntau_d2t is supposed to be 14 (take my word for it) -- it takes on that value at model initialization in subroutine 'read_tau_amf'. With print statements, we've been able to determine that 'ntau_d2t' is occasionally being corrupted and takes on a large negative value. When that happens, then yes, the loop doesn't execute and it2 never gets initialized. I'm trying to figure out why 'ntau_d2t' becomes corrupted -- I thought adding a 'save' attribute to its declaration on line 9 of cires_tauamf_data.F90 would prevent this, but it doesn't. Somewhere in the program there seems to be an out-of-bounds memory reference that overwrites 'ntau_d2t'.

@SamuelTrahanNOAA
Collaborator

Get rid of it and use size(tau_limb,2) instead.

@mdtoyNOAA
Collaborator Author

I'll try this out. I'll remove the declaration of ntau_d1y and ntau_d2t from the global part of the module and make them local variables within each subroutine setting ntau_d1y = size(tau_limb,1) and ntau_d2t = size(tau_limb,2). I'll report back whether this works.

If this prevents crashes, that will be great. But I'd like to know why the original coding would not work, i.e., with ntau_d1y and ntau_d2t declared globally in the module with 'save' statements.
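
For reference, a minimal sketch of that change inside tau_amf_interp (the array names come from the module; the local declarations are the only additions):

      ! Sketch: derive the dimensions locally from the allocated array
      ! instead of relying on the module-level ntau_d1y / ntau_d2t.
      integer :: ntau_d1y, ntau_d2t

      ntau_d1y = size(tau_limb, 1)
      ntau_d2t = size(tau_limb, 2)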

@mdtoyNOAA
Collaborator Author

Using size(tau_limb,2) [and size(tau_limb,1) as a substitute for 'ntau_d1y'] did not solve the problem. The crashes still occur with the error:
4132: forrtl: severe (194): Run-Time Check Failure. The variable 'cires_tauamf_data_mp_tau_amf_interp_$IT2' is being used in '/scratch1/BMC/wrfruc/mtoy/git_local/global-workflow.test/sorc/ufs_model.fd/FV3/ccpp/physics/physics/cires_tauamf_data.F90(143,3)' without being defined
4132: Image PC Routine Line Source
4132: ufs_model.x 0000000008C167D8 cires_tauamf_data 143 cires_tauamf_data.F90
4132: ufs_model.x 0000000008A67B08 gfs_phys_time_var 951 GFS_phys_time_vary.fv3.F90
4132: libiomp5.so 00002AEF66756BB3 __kmp_invoke_micr Unknown Unknown
4132: libiomp5.so 00002AEF666D2903 Unknown Unknown Unknown
4132: libiomp5.so 00002AEF666D1912 Unknown Unknown Unknown
4132: libiomp5.so 00002AEF6675783C Unknown Unknown Unknown
4132: libpthread-2.17.s 00002AEF68E3CEA5 Unknown Unknown Unknown
4132: libc-2.17.so 00002AEF6914FB0D clone Unknown Unknown

Note that since I've changed the code around a bit, line 143 of cires_tauamf_data.F90 corresponds to line 138 of the code in the repo. One theory is that the memory backing the array 'tau_limb' is being overwritten somewhere else in the model.

Again, I want to mention that eventually the model will run successfully with successive executions. These crashes occur randomly, but always on the same line of code.

@SamuelTrahanNOAA
Collaborator

What happens if you initialize it2?

I'd expect random failures from an uninitialized variable.

@mdtoyNOAA
Collaborator Author

What happens if you initialize it2?

I'd expect random failures from an uninitialized variable.

I'll check and get back with you.

@mdtoyNOAA
Collaborator Author

@SamuelTrahanNOAA When I initialized 'it2' the model still crashed on random processors. (Again, some runs experienced no crashes.) The data being processed by every processor (and thread) are identical, since these variables depend only on the model time.

The crashes in this experiment are occurring at a different line of code due to a division by zero:
3876: [h15c35:303754:0:303955] Caught signal 8 (Floating point exception: floating-point divide by zero)
.
.
.
3876: forrtl: error (75): floating point exception
3876: Image PC Routine Line Source
3876: ufs_model.x 0000000010F73F5B Unknown Unknown Unknown
3876: libpthread-2.17.s 00002B14738C9630 Unknown Unknown Unknown
3876: ufs_model.x 0000000008C16F81 cires_tauamf_data 156 cires_tauamf_data.F90
3876: ufs_model.x 0000000008A67B08 gfs_phys_time_var 951 GFS_phys_time_vary.fv3.F90
3876: libiomp5.so 00002B14711DBBB3 __kmp_invoke_micr Unknown Unknown

Here is the cires_tauamf_data.F90 code in the vicinity of line 156:
   138    it1 = 2
   139    it2 = 1     ! <-- line added to initialize 'it2', as suggested
   140    do iday=1, ntau_d2t
   141       if (fddd .lt. days_limb(iday) ) then
   142          it2 = iday
   143          exit
   144       endif
   145    enddo
   146
   147    it2 = min(it2,ntau_d2t)
   148    it1 = max(it2-1,1)
   149    if (it2 > ntau_d2t ) then
   150       print *, ' Error in time-interpolation for tau_amf_interp '
   151       print *, ' it1, it2, ntau_d2t ', it1, it2, ntau_d2t
   152       print *, ' Error in time-interpolation see cires_tauamf_data.F90 '
   153       stop
   154    endif
   155
   156    w2 = (fddd-days_limb(it1))/(days_limb(it2)-days_limb(it1))
   157    w1 = 1.0-w2

The only way for a division by zero to occur on line 156 is for 'it2' to be equal to 'it1'. There is no way for this to happen if the array 'days_limb' on line 141 is not corrupted. 'days_limb' is supposed to have the values (-15, 15, 45, 75, 105, 135, 165, 195, 225, 255, 285, 315, 345, 375). So 'days_limb(1)=-15' on the first iteration of the loop. 'fddd', which is the day of the year, is never less than 1, so the if statement on line 141 should always be .false. on the first iteration. The earliest it could be .true. is the second iteration, in which case it2 would take the value 2, giving it1 .ne. it2.

Again, I think the arrays 'days_limb' and 'tau_limb' are becoming corrupted by the time subroutine 'tau_amf_interp' comes back into scope. I just haven't found a way to detect it, because when I put in print statements, doing so seems to prevent the corruption -- something to do with the shifting of memory references.
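
To make that argument concrete, here is a small standalone check (a sketch with the expected days_limb values hard-coded; it is not part of the model) showing that the index search never yields it1 == it2 for a day of the year between 1 and 366:

   program check_days_limb
      implicit none
      real :: days_limb(14) = (/ -15., 15., 45., 75., 105., 135., 165., &
                                 195., 225., 255., 285., 315., 345., 375. /)
      integer :: it1, it2, iday, d
      real    :: fddd
      do d = 1, 366
         fddd = real(d)
         it2 = 1
         do iday = 1, 14
            if (fddd < days_limb(iday)) then
               it2 = iday
               exit
            endif
         enddo
         it2 = min(it2, 14)
         it1 = max(it2-1, 1)
         if (it1 == it2) print *, 'zero denominator possible at day', d
      enddo
      print *, 'done: no day between 1 and 366 gives it1 == it2'
   end program check_days_limb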

@mdtoyNOAA
Collaborator Author

For what it's worth, I've observed that there have never been any crashes in subroutine 'cires_indx_ugwp' (in the cires_tauamf_data module) associated with possible corruption of the array 'ugwp_taulat', which comes into scope after the call to subroutine 'read_tau_amf', where it is defined. One difference between subroutines 'cires_indx_ugwp' and 'tau_amf_interp' (where we see corruption of the similar variables 'days_limb' and 'tau_limb') is that the former is called only once, at initialization, from the stack "fcst_initialize -> atmos_model_init -> ccpp_physics_init -> FV3_GFS_v17_p8_ugwpv1_time_vary_init_cap", whereas the latter is called from "fcst_run_phase_1 -> update_atmos_radiation_physics -> ccpp_physic_timestep_init -> FV3_GFS_v17_p8_ugwpv1_time_vary_tsinit_cap". Perhaps in the latter stack, the memory behind 'days_limb' and 'tau_limb' gets stepped on somehow. Hoping this offers some sort of clue.

@SamuelTrahanNOAA
Collaborator

Have you tried running with the head of ufs-community develop?

I see no ufs.configure in your directory, so I know that directory used an older version.

@mdtoyNOAA
Collaborator Author

Have you tried running with the head of ufs-community develop?

I see no ufs.configure in your directory, so I know that directory used an older version.

I don't believe I have. Can you be specific about which repo you're referring to? I assume you mean 'ufs-community/ufs-weather-model'. Which directory will 'ufs.configure' be in?

@SamuelTrahanNOAA
Collaborator

git clone --recursive https://github.com/ufs-community/ufs-weather-model

In your run directory, files with "nems" in their name should have "ufs" in their name instead.

@mdtoyNOAA
Collaborator Author

I cloned the latest ufs-weather-model and global-workflow and experienced the same crash as before. Log file is on Hera at:
/scratch1/BMC/wrfruc/mtoy/git_local/global-workflow.Dec_2023/FV3GFSrun/GFS_v17_test.C1152/logs/2015010100/gfsfcst.log

@SamuelTrahanNOAA
Collaborator

When I run with the GNU compiler, I get a different error:

4144: --------------------------------------------------------------------------
4144: A process failed to create a queue pair. This usually means either
4144: the device has run out of queue pairs (too many connections) or
4144: there are insufficient resources available to allocate a queue pair
4144: (out of memory). The latter can happen if either 1) insufficient
4144: memory is available, or 2) no more physical memory can be registered
4144: with the device.
4144:
4144: For more information on memory registration see the Open MPI FAQs at:
4144: http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
4144:
4144: Local host:             h16c24
4144: Local device:           mlx5_0
4144: Queue pair type:        Reliable connected (RC)
4144: --------------------------------------------------------------------------

@mdtoyNOAA
Collaborator Author

This is something I suspected early on -- that it is a memory issue. What Fortran code modifications can fix this problem?

@SamuelTrahanNOAA
Collaborator

This is something I suspected early on -- that it is a memory issue. What Fortran code modifications can fix this problem?

I don't think it's a memory issue. A C768 will run with far fewer MPI ranks; you're exhausting some other resource. You're pushing the limits of what the machine was designed for when you run 6000+ MPI ranks with 40 ranks per node. Using ESMF threading doesn't change that: the ranks are still allocated and holding resources even though ESMF ignores most of them.

This combination has worked so far. I'm backing out one change at a time:

  1. Use the latest code
  2. Compile with -DAPP=ATM
  3. Lots of bug fixes
  4. Disable ESMF threading
  5. Reduce the number of MPI ranks

Doing all five of those gets the C768 to run with the Intel compiler. Running with the GNU compiler also requires fixing a syntax error in your input.nml.

This:

ccpp_suite = FV3_GFS_v17_p8_ugwpv1

Should be

ccpp_suite = "FV3_GFS_v17_p8_ugwpv1"

@SamuelTrahanNOAA
Collaborator

Also, I'm certain that there is heap corruption. I reorganized the code so that some of the arrays were not accessible outside the module. One of them was still allocated with the appropriate size, but the code segfaulted when accessing it.

@mdtoyNOAA
Collaborator Author

@SamuelTrahanNOAA It looks like you're getting close to a solution. I wanted to pass on a comment that Judy Henderson had about the array read in subroutine 'read_tau_amf':
"Arrays should be read by the master MPI task only and then broadcast to the other MPI tasks. This probably won't cause trouble but can be inefficient when there are a large number of tasks." Is this something we should look at changing?

@SamuelTrahanNOAA
Collaborator

It looks like you're getting close to a solution. I wanted to pass on a comment that Judy Henderson had about the array read in subroutine 'read_tau_amf':
"Arrays should be read by the master MPI task only and then broadcast to the other MPI tasks. This probably won't cause trouble but can be inefficient when there are a large number of tasks." Is this something we should look at changing?

You would have to move the read calls out of CCPP and put them in the model.

@mdtoyNOAA
Collaborator Author

It looks like you're getting close to a solution. I wanted to pass on a comment that Judy Henderson had about the array read in subroutine 'read_tau_amf':
"Arrays should be read by the master MPI task only and then broadcast to the other MPI tasks. This probably won't cause trouble but can be inefficient when there are a large number of tasks." Is this something we should look at changing?

You would have to move the read calls out of CCPP and put them in the model.

Okay, never mind. I guess it doesn't hurt as this read call happens only once on model initialization.

@mdtoyNOAA
Collaborator Author

After the dust has settled, would you recommend our changing the 'cires_tauamf_data.F90' code as follows:

@@ -6,9 +6,9 @@ module cires_tauamf_data
 !...........................................................................................
     implicit none

-    integer :: ntau_d1y, ntau_d2t
-    real(kind=kind_phys), allocatable :: ugwp_taulat(:)
-    real(kind=kind_phys), allocatable :: tau_limb(:,:), days_limb(:)
+    integer, save :: ntau_d1y, ntau_d2t
+    real(kind=kind_phys), allocatable, save :: ugwp_taulat(:)
+    real(kind=kind_phys), allocatable, save :: tau_limb(:,:), days_limb(:)
     logical :: flag_alloctau = .false.
     character(len=255):: ugwp_taufile = 'ugwp_limb_tau.nc'

Basically, we would be adding the 'save' attribute to the variable declarations so that their values are retained between subroutine calls.

@SamuelTrahanNOAA
Collaborator

You don't need to change that. Module-level data is implicitly "save," so adding "save" has no effect.
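
As a small standalone illustration of that point (not model code), a module variable keeps its value between calls even without an explicit save attribute:

   module demo_mod
      implicit none
      integer :: counter = 0     ! module data: implicitly saved
   contains
      subroutine bump()
         counter = counter + 1
         print *, 'counter =', counter
      end subroutine bump
   end module demo_mod

   program demo
      use demo_mod
      implicit none
      call bump()   ! prints 1
      call bump()   ! prints 2 -- the value persisted with no explicit save
   end program demo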

I'm able to run with the head of ufs-weather-model develop. I've narrowed it down to:

  1. Use the latest code
  2. Disable ESMF threading
  3. Fewer MPI ranks.

I don't need any bug fixes to run the C768 case.

@yangfanglin
Collaborator

Sam, can you explain a bit more why ESMF threading is causing trouble in this case?

In the past, before ESMF threading became available, we always ran the model at C768 with 4 threads (10 tasks per node) on Hera. Is ESMF threading forcing the model to use 40 tasks per node on Hera? (@junwang-noaa)

Thanks.

@SamuelTrahanNOAA
Collaborator

Sam, can you explain a bit more why ESMF threading is causing trouble in this case?

I'm not certain ESMF threading is causing this to fail. I don't see how ESMF threading could cause memory heap corruption. All I know for certain is that there's memory heap corruption.

@SamuelTrahanNOAA
Collaborator

Is the ESMF threading forcing the model to use 40 tasks per node on Hera ?

Your job card requests 40 tasks per node. ESMF threading works by fully committing the node, and then using only some of the MPI ranks. With 4 ESMF threads, you have 40 MPI ranks per node, but only 10 PETs per node. Thirty ranks per node are allocated and inactive.

I don't see how that could corrupt the memory heap.

@mdtoyNOAA
Collaborator Author

You don't need to change that. Module-level data is implicitly "save," so adding "save" has no effect.

I'm able to run with the head of ufs-weather-model develop. I've narrowed it down to:

  1. Use the latest code
  2. Disable ESMF threading
  3. Fewer MPI ranks.

I don't need any bug fixes to run the C768 case.

So, would you say that the UGWPv1 module 'cires_tauamf_data' is not the cause of the memory heap corruption? If not, why did the model always crash within the module? Do you know if the corruption happens when running the 'FV3_GFS_v17_p8' suite instead of the 'FV3_GFS_v17_p8_ugwpv1' suite?

@SamuelTrahanNOAA
Collaborator

So, would you say that the UGWPv1 module 'cires_tauamf_data' is not the cause of the memory heap corruption? If not, why did the model always crash within the module? Do you know if the corruption happens when running the 'FV3_GFS_v17_p8' suite instead of the 'FV3_GFS_v17_p8_ugwpv1' suite?

There are some problems with your error detection and array bounds... but no out-of-bounds writes. You have to write out of bounds to corrupt memory.

https://github.com/SamuelTrahanNOAA/ufs-weather-model/tree/gwd-clm-lake-fixes

That's a branch I'm working on with some bug fixes. Mostly it's for the CLM lake, but I tried some changes to your code: lots of extra paranoia and error detection, to prevent and detect problems.

Somehow, an array has an invalid pointer eventually. I'm not sure why yet.

@mdtoyNOAA
Collaborator Author

mdtoyNOAA commented Dec 14, 2023 via email

@SamuelTrahanNOAA
Collaborator

It looks like ESMF disables threads unless you ask ESMF to handle the threading for you. When I do a 4000 rank case with ESMF threads, it is twice as fast as without ESMF threads. That is consistent with what you'd expect when ESMF disables threading.

Does anyone know how to tell ESMF to get out of the way and let the program dictate the threading?

What we're seeing is two bugs:

  1. When ESMF threading is not in use, ESMF disables OpenMP threads.
  2. When threading is in use, the model fails in cires_tauamf_data.F90.

In other words: ESMF threading was a red herring, and something else is causing the failure.

@SamuelTrahanNOAA
Collaborator

[Restating this in its own comment for clarity.]

ESMF threading is not causing the problem. It turns out that ESMF disables threading unless you tell ESMF to handle the threading in ufs.configure, so my "threaded" cases without ESMF managed threading were not actually threaded: ESMF had reduced the number of threads to 1.

Threading itself is causing the problem. That is consistent with all of my tests.

@lisa-bengtsson
Collaborator

If it segfaults when threading is used, how did it pass the RT's and the ORT tests? Are these random crashes new from a known PR?

@SamuelTrahanNOAA
Collaborator

If it segfaults when threading is used, how did it pass the RT's and the ORT tests? Are these random crashes new from a known PR?

Perhaps it only happens when you use a large number of ranks? The regression tests are quite small.

@junwang-noaa

It looks like ESMF disables threads unless you ask ESMF to handle the threading for you. When I do a 4000 rank case with ESMF threads, it is twice as fast as without ESMF threads. That is consistent with what you'd expect when ESMF disables threading.

Does anyone know how to tell ESMF to get out of the way and let the program dictate the threading?

What we're seeing is two bugs:

  1. When ESMF threading is not in use, ESMF disables OpenMP threads.
  2. When threading is in use, the model fails in cires_tauamf_data.F90.

In other words: ESMF threading was a red herring, and something else is causing the failure.

@SamuelTrahanNOAA Can you explain how you arrived at "When ESMF threading is not in use, ESMF disables OpenMP threads"?

To disable ESMF managed threading, you need to:

In ufs.configure:

  1. Set globalResourceControl: false
  2. Remove the component threading settings ($COMP_omp_num_threads)

In the job card, set up threading in the traditional way: specify OMP_NUM_THREADS, ntasks-per-node, etc.
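
A minimal sketch of such a setup (node, task, and PET counts here are illustrative assumptions, not taken from the run directories in this issue):

   ufs.configure (ESMF managed threading off; no ATM_omp_num_threads entry):

     globalResourceControl:  false
     ATM_model:              fv3
     ATM_petlist_bounds:     0 1499

   job card (traditional threading, e.g. 10 ranks x 4 threads per 40-core node):

     #SBATCH --nodes=150
     #SBATCH --ntasks-per-node=10
     #SBATCH --cpus-per-task=4
     export OMP_NUM_THREADS=4
     srun ./ufs_model.x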

@SamuelTrahanNOAA
Collaborator

I did not do this:

  • globalResourceControl: false in ufs.configure

But I did do this:

  • remove component threading settings: $COMP_omp_num_threads

Apparently, if you don't specify the number of threads, then that number is 1.

@SamuelTrahanNOAA
Collaborator

Can someone tell me how to do a mix of thread counts within one component? This is not working:

/scratch1/BMC/zrtrr/Samuel.Trahan/lakes/esmf-mix/

I want:

  • 5400 ranks with 1 thread per rank for compute
  • 300 ranks with 2 threads per rank for the write groups

I get this error from many ranks:

FATAL from PE 1124: mpp_domains_define.inc: not all the pe_end are in the pelist

Here are the relevant lines in the files:

job.sh:

#SBATCH --nodes=150
#SBATCH --ntasks=6000
#SBATCH --cpus-per-task=1
srun "$exe"

ufs.configure:

ATM_model:                      fv3
ATM_petlist_bounds:             0 5399
ATM_omp_num_threads:            1
ATM_petlist_bounds:             5400 5999
ATM_omp_num_threads:            2

model_configure:

quilting:                .true.
quilting_restart:        .true.
write_groups:            2
write_tasks_per_group:   150

input.nml:

  layout = 30,30

@junwang-noaa

junwang-noaa commented Dec 14, 2023

@SamuelTrahanNOAA That capability is on our list; it is not available yet. Also, it depends on the platform: if you are running with a large number of cores (>6000) on Slingshot 10, which has known issues with the MPI_ALLTOALL call, there are problems at large core counts with ESMF managed threading. We suggest moving to Slingshot 11 or a UCX network, which does not have these issues. Traditional threading only needs to specify the MPI tasks, which alleviates the problem.

@SamuelTrahanNOAA
Collaborator

We can rule out ESMF threads. Now that I know how to turn them off correctly, I ran one case: 4000 ranks with 2 threads. The test failed in the expected spot in cires_tauamf_data.F90. It's always reassuring when causality holds: something that happens before the physics init should not corrupt the heap of something that happens after the physics init. Lo and behold, with proper testing, we have ruled out causality violations as a cause of the crash.

@SamuelTrahanNOAA
Collaborator

Removing all OpenMP directives from GFS_phys_time_vary.fv3.F90 allows the model to run. This is with 4000 MPI ranks and 2 threads per rank.

I believe this is the problem:

The C-based libraries are not thread-safe. C-based libraries are those that depend on the C library, which currently include all language interfaces except for the Java interface

We're calling a non-thread-safe library in a threaded region. There are multiple NetCDF calls going concurrently.

I'm setting up another test where I remove the OpenMP directives around only the calls to read data in GFS_phys_time_vary.fv3.F90. If our luck holds, that will be enough to fix the problem.
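
For anyone following along, here is the shape of that kind of change as a self-contained sketch (the subroutine and file names below are placeholders, not the actual calls in GFS_phys_time_vary.fv3.F90): the file-reading calls run serially on one thread, because the NetCDF C library underneath them is not thread-safe, while compute-only initialization stays threaded.

   program serial_read_sketch
      use omp_lib
      implicit none
      integer :: i
      real    :: work(1000)

      ! File reads: executed one after another, outside any parallel region,
      ! so only one thread is ever inside the (non-thread-safe) I/O library.
      call read_input_file('file_a.nc')
      call read_input_file('file_b.nc')

      ! Compute-only initialization: safe to keep threaded.
      !$omp parallel do
      do i = 1, size(work)
         work(i) = sqrt(real(i))
      end do
      !$omp end parallel do
      print *, 'sum =', sum(work)

   contains

      subroutine read_input_file(fname)
         character(len=*), intent(in) :: fname
         ! Placeholder for a NetCDF read (nf90_open / nf90_get_var / nf90_close);
         ! the point is only that it is never called from inside a parallel region.
         print *, 'reading ', trim(fname), ' on thread ', omp_get_thread_num()
      end subroutine read_input_file

   end program serial_read_sketch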

@SamuelTrahanNOAA
Collaborator

With a bug fix, a big threaded test (4000 rank, 2 thread per rank) passed four out of four tries. The fix is to disable threading in NetCDF-reading sections of GFS_phys_time_vary.fv3.F90.

It is in this branch:

https://github.com/SamuelTrahanNOAA/ufs-weather-model/tree/init-concurrency-bug

Which would be:

git clone --recursive --branch init-concurrency-bug https://github.com/SamuelTrahanNOAA/ufs-weather-model

The only change is in FV3/ccpp/physics/physics/GFS_phys_time_vary.fv3.F90

@mdtoyNOAA
Collaborator Author

As another data point, I should mention that I was also experiencing the crashes with 1 thread.

Can you provide a test case for that?

@SamuelTrahanNOAA I was unable to reproduce the crash at C384 with 1 thread. In case you're interested, the test case is on Hera at:
/scratch1/BMC/gsd-fv3-dev/NCEPDEV/stmp3/Michael.Toy/RUNDIRS/GFS_v17_test.C384/fcst.187316.sandbox
Case launched with: sbatch run_ufs.sh

@mdtoyNOAA
Collaborator Author

With a bug fix, a big threaded test (4000 rank, 2 thread per rank) passed four out of four tries. The fix is to disable threading in NetCDF-reading sections of GFS_phys_time_vary.fv3.F90.

It is in this branch:

https://github.com/SamuelTrahanNOAA/ufs-weather-model/tree/init-concurrency-bug

Which would be:

git clone --recursive --branch init-concurrency-bug https://github.com/SamuelTrahanNOAA/ufs-weather-model

The only change is in FV3/ccpp/physics/physics/GFS_phys_time_vary.fv3.F90

Thank you @SamuelTrahanNOAA I'll test it out.

@SamuelTrahanNOAA
Collaborator

I was unable to reproduce the crash at C384 with 1 thread. In case you're interested, the test case is on Hera at:

In my limited tests, keeping the number of nodes fixed, 2 threads is faster than 1 or 4. This was with 200 nodes. Scaling may vary with node count.

@SamuelTrahanNOAA
Collaborator

If it segfaults when threading is used, how did it pass the RT's and the ORT tests? Are these random crashes new from a known PR?

I have a better answer to this now. It's simple probability.

For each rank, you have some probability of getting a failure. Two NetCDF calls have to overlap in the wrong way. The more ranks you have, the higher the probability is that at least one will fail.
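
As a rough illustration with assumed numbers: if each rank independently hits the bad overlap with probability 1 in 10,000, a 100-rank regression test fails only about 1% of the time, while a 6000-rank run fails with probability 1 - (1 - 0.0001)^6000, roughly 45%.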

@SamuelTrahanNOAA
Collaborator

I've opened pull requests to fix this:

The only code changes are in ccpp-physics; the other PRs are submodule hash updates, .gitmodules changes, and regression test logs.

@lisa-bengtsson
Collaborator

Nice detective work Sam. @grantfirl it would be nice if the bugfix PR could have some priority since it will be needed for e.g. HR3.

@grantfirl
Collaborator

Nice detective work Sam. @grantfirl it would be nice if the bugfix PR could have some priority since it will be needed for e.g. HR3.

@lisa-bengtsson OK, since the fix doesn't require new baselines, it should be able to be combined with another PR and go in quickly. If combination is not possible, I'll suggest that it be the first CCPP PR in the queue after #119, which is being tested for final merge now.

@mdtoyNOAA
Collaborator Author

My 14-cycle C1152 test on Hera passed with the new code, so all looks good. @SamuelTrahanNOAA Thank you very much for figuring out the fix. Also thank you to Judy Henderson, Tom Henderson and George Vanderberghe for their testing and for discovering many clues that helped track down the problem.

@yangfanglin
Collaborator

@mdtoyNOAA did you notice any extra runtime cost after removing the threading?

@mdtoyNOAA
Collaborator Author

@mdtoyNOAA did you notice any extra runtime cost after removing the threading?

I did not do any timing tests, but anyway, I didn't change anything in my settings, so I was still specifying 4 threads (and I assume they're still ESMF threads). I'll admit that my eyes glazed over during the discussion of ESMF threads, etc. @SamuelTrahanNOAA Is the final outcome that nothing has changed with the threading?

@yangfanglin
Collaborator

Sam removed OpenMP in the section of code

@SamuelTrahanNOAA
Collaborator

Sam removed OpenMP in the section of code

I only removed OpenMP from the code that was reading data. Other OpenMP directives in the file are still there. Any computation in that file is still parallelized.

A faster way to read those files is to do it at the model level.

Is the final outcome that nothing has changed with the threading?

You should not use four threads. Your configuration doesn't scale well with threading on Hera. Use 2 threads instead, and increase the number of PETs. You'll get a faster simulation with the same number of nodes (or even fewer).

@mdtoyNOAA
Collaborator Author

Sam removed OpenMP in the section of code

I only removed OpenMP from the code that was reading data. Other OpenMP directives in the file are still there. Any computation in that file is still parallelized.

A faster way to read those files is to do it at the model level.

Is the final outcome that nothing has changed with the threading?

You should not use four threads. Your configuration doesn't scale well with threading on Hera. Use 2 threads instead, and increase the number of PETs. You'll get a faster simulation with the same number of nodes (or even fewer).

Sounds good. Thanks.

@mdtoyNOAA
Collaborator Author

Quick question: do we need to make a change to 'GFS_phys_time_vary.scm.F90' similar to the one @SamuelTrahanNOAA made to 'GFS_phys_time_vary.fv3.F90'? I'm guessing not, because I didn't see corresponding OMP directives in the 'scm' module.

@grantfirl
Collaborator

Quick question: do we need to make a change to 'GFS_phys_time_vary.scm.F90' similar to the one @SamuelTrahanNOAA made to 'GFS_phys_time_vary.fv3.F90'? I'm guessing not, because I didn't see corresponding OMP directives in the 'scm' module.

No for the reason you mentioned.

@SamuelTrahanNOAA
Collaborator

Quick question: do we need to make a change to 'GFS_phys_time_vary.scm.F90' similar to the one @SamuelTrahanNOAA made to 'GFS_phys_time_vary.fv3.F90'? I'm guessing not, because I didn't see corresponding OMP directives in the 'scm' module.

NetCDF is not thread-safe. Any code that may call NetCDF from two threads at a time needs to be fixed. It doesn't matter if it's CCPP or a graphics package.

If the code does not use threading, then there is no worry.

There are other forms of threading besides OpenMP, but we don't use them in NWP.
