Run-time errors associated with HR3 UGWPv1 non-stationary GWD code #136
This is happening because you never initialize it2. The error is in cires_tauamf_data.F90, in subroutine tau_amf_interp. This loop is supposed to initialize it2, but it won't if the if condition inside it is never true:
do iday=1, ntau_d2t
if (fddd .lt. days_limb(iday) ) then
it2 = iday
exit
endif
enddo

When you get to this line, you're taking the minimum of an uninitialized value and ntau_d2t:

it2 = min(it2,ntau_d2t)

In debug mode, you'll get the runtime abort. Without debug mode, the result will be unpredictable. |
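As an aside, here is a minimal, self-contained sketch of the fix being implied (not the repository code; the driver value fddd = 400.0 is invented so that the early exit never triggers): give it2 a default before the search loop, so the later min() is always defined.

program it2_default_sketch
  implicit none
  integer, parameter :: ntau_d2t = 14
  real :: days_limb(ntau_d2t) = (/ -15., 15., 45., 75., 105., 135., 165., &
                                   195., 225., 255., 285., 315., 345., 375. /)
  real    :: fddd
  integer :: iday, it2

  fddd = 400.0        ! invented value chosen so the early exit never happens
  it2  = ntau_d2t     ! default; without this line, it2 could be left undefined
  do iday = 1, ntau_d2t
     if (fddd < days_limb(iday)) then
        it2 = iday
        exit
     endif
  enddo
  it2 = min(it2, ntau_d2t)   ! now always well-defined
  print *, 'it2 =', it2
end program it2_default_sketch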
Yes, but ntau_d2t is supposed to be 14 (take my word for it) -- it takes on that value at model initialization in subroutine 'read_tau_amf'. With print statements, we've been able to determine that 'ntau_d2t' is occasionally being corrupted and takes on a large negative value. When that happens, then yes, the loop doesn't execute and it2 never gets initialized. I'm trying to figure out why 'ntau_d2t' becomes corrupted -- I thought adding a 'save' statement to its declaration on line 9 of cires_tauamf_data.F90 would prevent this, but it doesn't. Somewhere in the program there seems to be an out-of-bounds memory write that overwrites 'ntau_d2t'. |
Get rid of it and use size(tau_limb,2) instead. |
I'll try this out. I'll remove the declaration of ntau_d1y and ntau_d2t from the global part of the module and make them local variables within each subroutine, setting ntau_d1y = size(tau_limb,1) and ntau_d2t = size(tau_limb,2). I'll report back whether this works. If this prevents crashes, that will be great. But I'd like to know why the original coding would not work, i.e., with ntau_d1y and ntau_d2t declared globally in the module with 'save' statements. |
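As a rough sketch of the change being described here (a stripped-down stand-in module, not the actual cires_tauamf_data.F90), the counts become local values derived from the allocated array instead of module-level integers:

module tauamf_local_sizes_sketch
  implicit none
  real, allocatable :: tau_limb(:,:), days_limb(:)
contains
  subroutine tau_amf_interp_sketch(fddd, it2)
    real,    intent(in)  :: fddd
    integer, intent(out) :: it2
    ! Sizes are recomputed locally from the array itself rather than kept in
    ! module-level integers (ntau_d1y/ntau_d2t) that could be clobbered elsewhere.
    integer :: ntau_d1y, ntau_d2t, iday
    ntau_d1y = size(tau_limb, 1)
    ntau_d2t = size(tau_limb, 2)
    it2 = ntau_d2t
    do iday = 1, ntau_d2t
       if (fddd < days_limb(iday)) then
          it2 = iday
          exit
       endif
    enddo
    it2 = min(it2, ntau_d2t)
  end subroutine tau_amf_interp_sketch
end module tauamf_local_sizes_sketch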
Using size(tau_limb,2) [and size(tau_limb,1) as a substitute for 'ntau_d1y'] did not solve the problem. The crashes still occur with the error: Note that since I've changed the code around a bit, line 143 of cires_tauamf_data.F90 corresponds to line 138 of the code in the repo. A theory, perhaps, is that the memory holding the array 'tau_limb' is being overwritten somewhere else in the model. Again, I want to mention that eventually the model will run successfully with successive executions. These crashes occur randomly, but always on the same line of code. |
What happens if you initialize it2? I'd expect random failures from an uninitialized variable. |
I'll check and get back with you. |
@SamuelTrahanNOAA When I initialized 'it2', the model still crashed on random processors. (Again, there were some runs that experienced no crashes.) The data being processed by every processor (and thread) are identical, as these variables are only related to the model time. The crashes in this experiment occur at a different line of code, due to a division by zero: Here is the cires_tauamf_data.F90 code in the vicinity of line 156: The only way for a division by zero to occur in line 156 is for 'it2' to be equal to 'it1'. There is no way for this to happen if the array 'days_limb' in line 141 is not corrupted. 'days_limb' is supposed to have the values (-15, 15, 45, 75, 105, 135, 165, 195, 225, 255, 285, 315, 345, 375). So 'days_limb(1)=-15' on the first execution of the loop. 'fddd', which is the day of the year, is never less than 1, so the if statement on line 141 should always be .false. on the first pass through the loop. The soonest it could be .true. is the second pass, and it2 would then have the value 2, which would lead to it1 .ne. it2. Again, I think the arrays 'days_limb' and 'tau_limb' are becoming corrupted when subroutine 'tau_amf_interp' comes back into scope. I just haven't found a way to detect it, because when I put in print statements, doing so seems to prevent the corruption -- something to do with the shifting of memory references. |
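Schematically (this is not the repository code; w1, w2, and the sample values are invented for illustration), a linear time-interpolation weight of this general shape is exactly where it2 == it1 would produce a division by zero:

program interp_weight_sketch
  implicit none
  real :: days_limb(3) = (/ -15., 15., 45. /)
  real :: fddd, w1, w2
  integer :: it1, it2

  fddd = 20.0   ! invented day-of-year value
  it1  = 2
  it2  = 3
  ! A weight of this shape is only well-defined while it2 /= it1;
  ! if both indices collapse to the same value, the denominator is zero.
  w2 = (fddd - days_limb(it1)) / (days_limb(it2) - days_limb(it1))
  w1 = 1.0 - w2
  print *, 'w1 =', w1, ' w2 =', w2
end program interp_weight_sketch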
For what it's worth, I've observed that there have never been any crashes in the subroutine 'cires_indx_ugwp' (in the cires_tauamf_data module) associated with possible corruption of the array 'ugwp_taulat', which comes into scope after the calling of subroutine 'read_tau_amf', where it is defined. One difference between subroutines 'cires_indx_ugwp' and 'tau_amf_interp' (where we get corruption of the similar variables 'days_limb' and 'tau_limb') is that the former subroutine is called only once on initialization from the stack associated with "fcst_initialize -> atmos_model_init -> ccpp_physics_init -> FV3_GFS_v17_p8_ugwpv1_time_vary_init_cap", versus the stack that calls the latter, i.e., "fcst_run_phase_1 -> update_atmos_radiation_physics -> ccpp_physics_timestep_init -> FV3_GFS_v17_p8_ugwpv1_time_vary_tsinit_cap". Perhaps in the latter stack, the 'days_limb' and 'tau_limb' memory gets stepped on somehow. Hoping this offers some sort of clue. |
Have you tried running with the head of ufs-community develop? I see no ufs.configure in your directory, so I know that directory used an older version. |
I don't believe I have. Can you be specific about which repo you're referring to? I assume you mean 'ufs-community/ufs-weather-model'. Which directory will 'ufs.configure' be in? |
git clone --recursive https://github.com/ufs-community/ufs-weather-model

In your run directory, files with "nems" in their name should have "ufs" in their name instead. |
I cloned the latest ufs-weather-model and global-workflow and experienced the same crash as before. Log file is on Hera at: |
When I run with the GNU compiler, I get a different error:
|
This is something I suspected early on -- that it is a memory issue. What Fortran code modifications can fix this problem? |
I don't think it's a memory issue. A C768 will run with far fewer MPI ranks. You're exhausting some other resource. You're pushing the limits of what the machine was designed for when you run 6000+ MPI ranks with 40 ranks per node. Using ESMF threading doesn't change that; the ranks are still there, allocating resources. ESMF ignores most of them, but they're still there, using resources. This combination has worked so far. I'm backing out one change at a time:
Doing all five of those gets the C768 to run with the Intel compiler. Running with the GNU compiler also requires fixing a syntax error in your input.nml. This:
Should be
|
Also, I'm certain that there is heap corruption. I reorganized the code so that some of the arrays were not accessible outside the module. One of them was still allocated with the appropriate size, but the code segfaulted when accessing it. |
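For reference, a generic way to express "not accessible outside the module" in Fortran, sketched with made-up names rather than the actual reorganization in that branch:

module hidden_data_sketch
  implicit none
  private                        ! nothing is visible outside the module by default
  public :: init_tau_data        ! only the procedures are exported

  real, allocatable :: tau_limb(:,:)   ! module data no other program unit can touch
contains
  subroutine init_tau_data(n1, n2)
    integer, intent(in) :: n1, n2
    if (.not. allocated(tau_limb)) allocate(tau_limb(n1, n2))
    tau_limb = 0.0
  end subroutine init_tau_data
end module hidden_data_sketch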
@SamuelTrahanNOAA It looks like you're getting close to a solution. I wanted to pass on a comment that Judy Henderson had about the array read in subroutine 'read_tau_amf': |
You would have to move the read calls out of CCPP and put them in the model. |
Okay, never mind. I guess it doesn't hurt as this read call happens only once on model initialization. |
After the dust has settled, would you recommend our changing the 'cires_tauamf_data.F90' code as follows:

@@ -6,9 +6,9 @@ module cires_tauamf_data
Basically, we would be adding 'save' statements to the variable declarations so that their values can be retrieved on each subroutine call. |
You don't need to change that. Module-level data is implicitly "save," so adding "save" has no effect. I'm able to run with the head of ufs-weather-model develop. I've narrowed it down to:
I don't need any bug fixes to run the C768 case. |
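A small illustration of that Fortran rule (a generic example, not the physics code): a module variable keeps its value between calls whether or not 'save' is written, because module data implicitly has the save attribute.

module counter_mod
  implicit none
  integer :: ncalls = 0    ! module variable: implicitly saved; writing "save" adds nothing
contains
  subroutine bump()
    ncalls = ncalls + 1
  end subroutine bump
end module counter_mod

program show_implicit_save
  use counter_mod
  implicit none
  call bump()
  call bump()
  print *, 'ncalls =', ncalls   ! prints 2: the value persisted between calls
end program show_implicit_save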
Sam, can you explain a bit more why ESMF threading is causing trouble in this case? In the past, before ESMF threading became available, we had always been running the model at C768 with 4 threads (10 tasks per node) on Hera. Is ESMF threading forcing the model to use 40 tasks per node on Hera? (@junwang-noaa) Thanks. |
I'm not certain ESMF threading is causing this to fail. I don't see how ESMF threading could cause memory heap corruption. All I know for certain is that there's memory heap corruption. |
Your job card requests 40 tasks per node. ESMF threading works by fully committing the node, and then using only some of the MPI ranks. With 4 ESMF threads, you have 40 MPI ranks per node, but only 10 PETs per node. Thirty ranks per node are allocated and inactive. I don't see how that could corrupt the memory heap. |
So, would you say that the UGWPv1 module 'cires_tauamf_data' is not the cause of the memory heap corruption? If not, why did the model always crash within the module? Do you know if the corruption happens when running the 'FV3_GFS_v17_p8' suite instead of the 'FV3_GFS_v17_p8_ugwpv1' suite? |
There are some problems with your error detection and array bounds... but no out-of-bounds writes. You have to write out of bounds to corrupt memory. https://github.com/SamuelTrahanNOAA/ufs-weather-model/tree/gwd-clm-lake-fixes That's a branch I'm working on with some bug fixes. Mostly it's for the clm lake, but I tried some changes to your code. Lots of extra paranoia and error detection, to prevent problems and detect problems. Somehow, an array has an invalid pointer eventually. I'm not sure why yet. |
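As a generic illustration of that kind of defensive error detection (invented names; errmsg/errflg mimic the usual CCPP error-reporting convention, and this is not the actual branch contents):

module guarded_access_sketch
  implicit none
  real, allocatable :: days_limb(:)
contains
  subroutine check_days_limb(n_expected, errmsg, errflg)
    integer,          intent(in)  :: n_expected
    character(len=*), intent(out) :: errmsg
    integer,          intent(out) :: errflg
    errmsg = ''
    errflg = 0
    ! Refuse to touch the array if it is unallocated or the wrong size,
    ! instead of reading through a possibly invalid descriptor.
    if (.not. allocated(days_limb)) then
       errmsg = 'days_limb is not allocated'
       errflg = 1
    else if (size(days_limb) /= n_expected) then
       write(errmsg,'(a,i0,a,i0)') 'days_limb has size ', size(days_limb), &
                                   ', expected ', n_expected
       errflg = 1
    end if
  end subroutine check_days_limb
end module guarded_access_sketch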
As another data point, I should mention that I was also experiencing the crashes with 1 thread.
|
It looks like ESMF disables threads unless you ask ESMF to handle the threading for you. When I do a 4000 rank case with ESMF threads, it is twice as fast as without ESMF threads. That is consistent with what you'd expect when ESMF disables threading. Does anyone know how to tell ESMF to get out of the way and let the program dictate the threading? What we're seeing is two bugs:
In other words: ESMF threading was a red herring, and something else is causing the failure. |
[Restating this in its own comment for clarity.] ESMF threading is not causing the problem. It turns out that ESMF disables threading unless you tell ESMF to handle the threading in ufs.configure. My "threaded" cases without ESMF threading were not actually threaded, because ESMF reduced the number of threads to 1. Threading itself is causing the problem. That is what has been consistent with my tests. |
If it segfaults when threading is used, how did it pass the RTs and the ORT tests? Are these random crashes new, from a known PR? |
Perhaps it only happens when you use a large number of ranks? The regression tests are quite small. |
@SamuelTrahanNOAA Can you explain how you get to this "When ESMF threading is not in use, ESMF disables OpenMP threads"? To disable ESMF managed threading, you need to:
|
I did not do this:
But I did do this:
Apparently, if you don't specify the number of threads, then that number is 1. |
Can someone tell me how to do a mix of thread counts within one component? This is not working:

/scratch1/BMC/zrtrr/Samuel.Trahan/lakes/esmf-mix/

I want:
I get this error from many ranks:
Here are the relevant lines in the files:

job.sh:

#SBATCH --nodes=150
#SBATCH --ntasks=6000
#SBATCH --cpus-per-task=1
srun "$exe"

ufs.configure:

model_configure:

quilting: .true.
quilting_restart: .true.
write_groups: 2
write_tasks_per_group: 150

input.nml:

layout = 30,30 |
@SamuelTrahanNOAA that capability is on our list; it is not available yet. Also, depending on the platform: if you are running with a large number of cores (>6000) on slingshot 10, which has known issues with the MPI_ALL2ALL call, there are problems with ESMF managed threading at that scale. We suggest moving to slingshot 11 or a ucx network, which does not have these issues. Traditional threading only requires specifying the MPI tasks, which alleviates the problem. |
We can rule out ESMF threads. Now that I know how to turn them off correctly, I ran one case: 4000 ranks with 2 threads. The test failed in the expected spot in cires_tauamf_data.F90. It's always reassuring when causality holds. Something that happens before the physics init should not corrupt the heap of something that happens after the physics init. Lo and behold, with proper testing, we have ruled out causality violations as a cause of the crash. |
Removing all OpenMP directives from GFS_phys_time_vary.fv3.F90 allows the model to run. This is with 4000 MPI ranks and 2 threads per rank. I believe this is the problem:
We're calling a non-thread-safe library in a threaded region. There are multiple NetCDF calls going concurrently. I'm setting up another test where I remove the OpenMP directives around only the calls to read data in GFS_phys_time_vary.fv3.F90. If our luck holds, that will be enough to fix the problem. |
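A generic before/after sketch of the pattern being described (made-up reader routines, not the actual GFS_phys_time_vary.fv3.F90 code): concurrent OpenMP sections that each call into NetCDF are collapsed into plain sequential calls.

program serial_reads_sketch
  implicit none
  ! Before (schematic): something like
  !   !$OMP PARALLEL SECTIONS
  !   !$OMP SECTION
  !        call read_file_a()
  !   !$OMP SECTION
  !        call read_file_b()
  !   !$OMP END PARALLEL SECTIONS
  ! lets two threads be inside the non-thread-safe NetCDF library at once.
  ! After: the same reads, done one at a time on a single thread.
  call read_file_a()
  call read_file_b()
contains
  subroutine read_file_a()
    print *, 'reading file A (stand-in for a NetCDF read)'
  end subroutine read_file_a
  subroutine read_file_b()
    print *, 'reading file B (stand-in for a NetCDF read)'
  end subroutine read_file_b
end program serial_reads_sketch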
With a bug fix, a big threaded test (4000 rank, 2 thread per rank) passed four out of four tries. The fix is to disable threading in NetCDF-reading sections of GFS_phys_time_vary.fv3.F90. It is in this branch: https://github.com/SamuelTrahanNOAA/ufs-weather-model/tree/init-concurrency-bug Which would be: git clone --recursive --branch init-concurrency-bug https://github.com/SamuelTrahanNOAA/ufs-weather-model The only change is in FV3/ccpp/physics/physics/GFS_phys_time_vary.fv3.F90 |
@SamuelTrahanNOAA I was unable to reproduce the crash at C384 with 1 thread. In case you're interested, the test case is on Hera at: |
Thank you @SamuelTrahanNOAA I'll test it out. |
In my limited tests, keeping the number of nodes fixed, 2 threads is faster than 1 or 4. This was with 200 nodes. Scaling may vary with node count. |
I have a better answer to this now. It's simple probability. For each rank, you have some probability of getting a failure. Two NetCDF calls have to overlap in the wrong way. The more ranks you have, the higher the probability is that at least one will fail. |
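To put an illustrative number on that (the per-rank probability here is invented): if each rank independently has a chance p of hitting the race on a given read, the chance that at least one of N ranks fails is 1 - (1 - p)^N. With p = 1e-4 and N = 4000 ranks, that is 1 - (1 - 1e-4)^4000, which is roughly 1 - e^(-0.4), or about 0.33: one run in three crashes somewhere even though any individual rank almost never does.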
I've opened pull requests to fix this:
The only changes are in ccpp-physics. Others are submodule hash updates, .gitmodules, and regression test logs. |
Nice detective work Sam. @grantfirl it would be nice if the bugfix PR could have some priority since it will be needed for e.g. HR3. |
@lisa-bengtsson OK, since the fix doesn't require new baselines, it should be able to be combined with another PR and go in quickly. If combination is not possible, I'll suggest that it be the first CCPP PR in the queue after #119 that is being tested for final merge now. |
My 14-cycle C1152 test on Hera passed with the new code. So all looks good. @SamuelTrahanNOAA Thank you very much for figuring out the fix. Also thank you to Judy Henderson, Tom Henderson and George Vanderberghe for their testing and discovery of many clues that helped get to the problem. |
@mdtoyNOAA did you notice any extra runtime cost after removing the threading? |
I did not do any timing tests, but anyway, I didn't change anything in my settings, so I was still specifying 4 threads (and I assume they're still ESMF threads). I'll admit that my eyes glazed over during the discussion of ESMF threads, etc. @SamuelTrahanNOAA Is the final outcome that nothing has changed with the threading? |
Sam removed OpenMP in the section of code that reads the data files. |
I only removed OpenMP from the code that was reading data. Other OpenMP directives in the file are still there. Any computation in that file is still parallelized. A faster way to read those files is to do it at the model level.
You should not use four threads. Your configuration doesn't scale well with threading on Hera. Use 2 threads instead, and increase the number of PETs. You'll get a faster simulation with the same number of nodes (or even fewer). |
Sounds good. Thanks. |
Quick question: Do we need to make a change to 'GFS_phys_time_vary.scm.F90' similar to the one @SamuelTrahanNOAA made to 'GFS_phys_time_vary.fv3.F90'. I'm guessing not because I didn't see corresponding OMP lines of code in the 'scm' module. |
No for the reason you mentioned. |
NetCDF is not thread-safe. Any code that may call NetCDF from two threads at a time needs to be fixed. It doesn't matter if it's CCPP or a graphics package. If the code does not use threading, then there is no worry. There are other forms of threading besides OpenMP, but we don't use them in NWP. |
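For code that has to keep its threading but still touches NetCDF, one generic option (a sketch of a standard OpenMP pattern, not what the pull requests above do, which simply remove the directives around the reads) is to serialize just the library calls with a named critical section:

program critical_guard_sketch
  use omp_lib, only: omp_get_thread_num
  implicit none
  integer :: i

!$omp parallel do private(i)
  do i = 1, 4
     ! ... any per-file computation can stay parallel ...
!$omp critical (netcdf_io)
     ! Stand-in for a NetCDF open/read/close: the named critical section
     ! lets only one thread at a time inside the non-thread-safe library.
     print *, 'thread', omp_get_thread_num(), 'reading file', i
!$omp end critical (netcdf_io)
  end do
!$omp end parallel do
end program critical_guard_sketch

The critical section keeps any surrounding computation parallel while forcing the I/O through one thread at a time.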
Description
Cold-start, atmosphere-only 'ufs_model.x' runs experience random run-time crashes in module FV3/ccpp/physics/physics/cires_tauamf_data.F90. This occurs at various global resolutions -- so far, C384, C768 and C1152. Often the model runs successfully, but when crashes do occur, they usually happen at the first physics time step.
Steps to Reproduce
I created two "sandbox" directories on Hera:
At each resolution, I had runs that were successful and those that crashed. The output logs in each are:
Crashed runs: run_ufs.out.crash
Non-crashed runs: run_ufs.out.no_crash
In each case, the output of the successful run is in the directory: output.no_crash
The model is run by executing: 'sbatch run_ufs.sh'
Additional Context
Output
This is error output from an executable compiled in 'debug' mode with the following options:
-check bounds -check uninit -fpe0 -fno-alias -ftrapuv -traceback
4104: forrtl: severe (194): Run-Time Check Failure. The variable 'cires_tauamf_data_mp_tau_amf_interp_$IT2' is being used in '/scratch1/BMC/wrfruc/mtoy/git_local/global-workflow.test/sorc/ufs_model.fd/FV3/ccpp/physics/physics/cires_tauamf_data.F90(144,3)' without being defined
4104: Image PC Routine Line Source
4104: ufs_model.x 0000000008C16589 cires_tauamf_data 144 cires_tauamf_data.F90
4104: ufs_model.x 0000000008A67B08 gfs_phys_time_var 951 GFS_phys_time_vary.fv3.F90
4104: libiomp5.so 00002B31134A6BB3 __kmp_invoke_micr Unknown Unknown
4104: libiomp5.so 00002B3113422903 Unknown Unknown Unknown
4104: libiomp5.so 00002B3113421912 Unknown Unknown Unknown
4104: libiomp5.so 00002B31134A783C Unknown Unknown Unknown
4104: libpthread-2.17.s 00002B3115B8CEA5 Unknown Unknown Unknown
4104: libc-2.17.so 00002B3115E9FB0D clone Unknown Unknown
And here is error output from an executable compiled in 'debug' mode with the following options:
-check -check noarg_temp_created -check nopointer -warn -warn noerrors -fp-stack-check -fstack-protector-all -fpe0 -debug -ftrapuv -init=snan,arrays
708: ==== backtrace (tid: 47964) ====
708: 0 0x000000000004d455 ucs_debug_print_backtrace() ???:0
708: 1 0x00000000036e3a7a cires_tauamf_data_mp_tau_amf_interp_() /scratch1/BMC/wrfruc/mtoy/git_local/global-workflow.test/sorc/ufs_model.fd/FV3/ccpp/physics/physics/cires_tauamf_data.F90:153
708: 2 0x000000000368016a L_gfs_phys_time_vary_mp_gfs_phys_time_vary_timestep_init__865__par_region0_2_0() /scratch1/BMC/wrfruc/mtoy/git_local/global-workflow.test/sorc/ufs_model.fd/FV3/ccpp/physics/physics/GFS_phys_time_vary.fv3.F90:951
708: 3 0x000000000013fbb3 __kmp_invoke_microtask() ???:0
708: 4 0x00000000000bb903 __kmp_invoke_task_func() /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:7813
708: 5 0x00000000000ba912 __kmp_launch_thread() /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:6236
708: 6 0x000000000014083c _INTERNALa0ac8784::__kmp_launch_worker() /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/z_Linux_util.cpp:533
708: 7 0x0000000000007ea5 start_thread() pthread_create.c:0
708: 8 0x00000000000feb0d __clone() ???:0
708: =================================
708: forrtl: severe (174): SIGSEGV, segmentation fault occurred
708: Image PC Routine Line Source
708: ufs_model.x 0000000005D4901A Unknown Unknown Unknown
708: libpthread-2.17.s 00002B0960ECC630 Unknown Unknown Unknown
708: ufs_model.x 00000000036E3A7A cires_tauamf_data 153 cires_tauamf_data.F90
708: ufs_model.x 000000000368016A gfs_phys_time_var 951 GFS_phys_time_vary.fv3.F90
708: libiomp5.so 00002B095E7DEBB3 __kmp_invoke_micr Unknown Unknown
708: libiomp5.so 00002B095E75A903 Unknown Unknown Unknown
708: libiomp5.so 00002B095E759912 Unknown Unknown Unknown
708: libiomp5.so 00002B095E7DF83C Unknown Unknown Unknown
708: libpthread-2.17.s 00002B0960EC4EA5 Unknown Unknown Unknown
708: libc-2.17.so 00002B09611D7B0D clone Unknown Unknown