Unable to run C768L127 ufs_model.x on Orion #2134
Comments
Revert xml back to that which is generated by g-w develop. Log file: [...]
A C384L127 gdasfcst was successfully run on Orion using the same g-w installation ([...]). The g-w generated xml specified [...]. The logfile for this run is [...]. C768L127 ufs_model.x [...] for C768. However, when I manually changed the xml to read [...]
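For context, rocoto encodes a task's node, tasks-per-node, and thread request in a `<nodes>` element, which is the kind of line being hand-edited here. A hypothetical check of the workflow xml (the file name and every count below are illustrative only, not the values actually tried):

```sh
# Show the resource request rocoto hands to the scheduler for the forecast
# task. The xml file name and the counts shown are hypothetical examples.
grep -n "<nodes>" gdas_eval_satwind_JEDI.xml
#   e.g.  <nodes>232:ppn=40:tpp=1</nodes>   <- nodes : MPI ranks/node : threads
```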
@RussTreadon-NOAA The lack of threads in the xml file is because we are using ESMF managed threading, so even though you do not see threads in the xml or the srun command, you should see in ufs.configure that you are using threads for the atm component. I'm trying a C768 test case on Orion for one of the HR* dates to see if I can run successfully or not. (I previously could.)
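As a quick way to confirm ESMF-managed threading is in effect, the per-component thread counts can be inspected in the run directory's ufs.configure (a sketch; the exact key names are recalled from memory, so check the generated file):

```sh
# Inspect the ATM component's PET range and thread count in ufs.configure.
# Key names (ATM_petlist_bounds, ATM_omp_num_threads) and the values shown
# are assumptions -- verify against your generated file.
grep -E "ATM_(petlist_bounds|omp_num_threads)" ufs.configure
#   ATM_petlist_bounds:    0 2687     <- example output only
#   ATM_omp_num_threads:   4
```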
Thank you @aerorahul and @JessicaMeixner-NOAA for your comments. @junwang-noaa explained ESMF thread management in an email thread. I misunderstood @WalterKolczynski-NOAA's reply as meaning that I needed to edit the xml. Good to know the xml generator is correct. I'll stand by to see if @JessicaMeixner-NOAA can successfully run C768L127.
@RussTreadon-NOAA my C768 test case also failed, but in a different way. The log file is here: /work2/noaa/marine/jmeixner/HR3/develop/devt01/COMROOT/devt01/logs/2019120300/gfsfcst.log
Bummer, strike 2. Hmm, we'll need to poke around and figure out what's going on.
The S2S (/work2/noaa/marine/jmeixner/HR3/develop/s2s01/COMROOT/s2s01/logs/2019120300/gfsfcst.log) and ATM (/work2/noaa/marine/jmeixner/HR3/develop/atm01/COMROOT/atm01/logs/2019120300/gfsfcst.log) runs are 20 minutes in... there might have been a configuration issue or something off when adding waves that I'll need to look into more. If these runs complete, then I'm wondering if it's a different IC issue and/or a difference in memory/configuration between the gdasfcst and the gfsfcst that's causing your issues @RussTreadon-NOAA.
Great news! Thank you @JessicaMeixner-NOAA for making this test. Now I have something against which to compare. I'll do so and hopefully find a mistake on my part.
@RussTreadon-NOAA let me know if you want me to run with a different IC or other configuration update if that'd be helpful. Everything's built and ready to go, so it should be easy to make more runs if needed.
Thank you @JessicaMeixner-NOAA for your kind offer. Your successful run opens several threads for me to explore. Let me check 'em out. If they don't resolve my issue, I may come back to you, but I'm reluctant to take up any more of your valuable time... especially if the problem with my test is me.
My gdas forecast is a [...]
I have another piece of the puzzle. I am cycling ATM-only C384/C192 on Orion using g-w develop @ 8c11eeb (Nov 23). My fcst and efcs steps are occasionally failing but will succeed upon rerun. See this sample log directory: [...]. In that directory, you can see that 2 out of 40 efcs tasks failed: enkfgdasefcs23 failed twice and enkfgdasefcs27 failed once. For efcs23, the first failure looks like: [...] with many other backtrace lines. The second failure has a slightly different message but likely the same root cause: [...]. The third attempt in this case was successful. @RussTreadon-NOAA Have you attempted to rerun any of your failed cases with no changes?
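For reference, rerunning a failed task with no changes is done with rocoto's rewind/boot pair (a sketch; the expdir path, xml/db file names, and the cycle are assumed from the experiment layout described in this issue):

```sh
# Clear the failed state of the gdasfcst task for the 2021080100 cycle,
# then resubmit it unchanged. EXPDIR and the file names are hypothetical.
EXPDIR=/path/to/expdir
rocotorewind -w $EXPDIR/gdas_eval_satwind_JEDI.xml \
             -d $EXPDIR/gdas_eval_satwind_JEDI.db \
             -c 202108010000 -t gdasfcst
rocotoboot   -w $EXPDIR/gdas_eval_satwind_JEDI.xml \
             -d $EXPDIR/gdas_eval_satwind_JEDI.db \
             -c 202108010000 -t gdasfcst
```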
@CatherineThomas-NOAA, thank you for sharing your experiences on Orion. Yes, I have rerun the C768L127 2021080100 gdasfcst case several times without changes. It fails in the same way. I was able to successfully run gdasfcst at C384L127 for a different case, 2021081418.
The 2021080100 gdasfcst is trying to warm start [...]. Can we warm start [...]?
@RussTreadon-NOAA We haven't done such a test yet. It looks to me like you are running atm-only. I'd suggest using debug mode to see if there is an issue with the new physics package (adding "-DDEBUG=ON" as a compile option). @yangfanglin @mdtoyNOAA FYI.
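A minimal sketch of such a debug rebuild (module/environment setup is omitted, and the app/suite selections are illustrative; -DAPP, -DCCPP_SUITES, and -DDEBUG are standard ufs-weather-model CMake options):

```sh
# Rebuild ufs-weather-model with debug checks enabled so problems in the
# new physics package surface as traps rather than silent corruption.
# Assumes the usual module environment for Orion is already loaded.
cd ufs-weather-model
mkdir -p build && cd build
cmake .. -DAPP=ATM \
         -DCCPP_SUITES=FV3_GFS_v17_p8_ugwpv1 \
         -DDEBUG=ON
make -j8
```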
Thank you @junwang-noaa for the suggestion.
While it may be possible to get past the [...], I don't think we can run [...]. Let me discuss with the DAD team the best path forward. We were using operational files for gdas-validation since operations is the control against which we are validating JEDI-based DA. We need to ensure that the JEDI-based h(x) and solver are comparable to and consistent with the GSI-based h(x) and solver.
A rebuild of [...]
This is consistent with side conversations on this issue. Looks like I'll need to go through config and/or other files to get everything set properly for [...]. Using g-w [...]
The following steps got the 2021080100 gdasfcst running: remove [...] (remaining steps elided).
The gdasfcst is currently running and likely will not finish within the specified 1 hour wall clock time. I am not proficient with chgres_cube.sh, so I'm not 100% confident I correctly configured everything. If we want the first cycle of gdas-validation to be 2021080100, we could [...]
These background files will likely be noisy since we are cold starting the FV3_GFS_v17_p8_ugwpv1 from FV3_GFS_v16 restarts. |
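For context on the chgres_cube step mentioned above, the tool is driven by a fort.41 namelist. A heavily hedged sketch of a v16-restart-to-C768 cold-start conversion follows (variable names are recalled from UFS_UTILS and may be incomplete for restart input; all paths and values are placeholders, so verify against the chgres_cube documentation):

```sh
# Write the chgres_cube control namelist (fort.41) and run the converter.
# NOTE: this is not a complete namelist for input_type="restart"; the
# per-tile atm/sfc input file lists are omitted. All paths are placeholders.
cat > fort.41 <<'EOF'
&config
  mosaic_file_target_grid = "/path/to/fix/C768/C768_mosaic.nc"
  fix_dir_target_grid     = "/path/to/fix/C768"
  orog_dir_target_grid    = "/path/to/fix/C768"
  vcoord_file_target_grid = "/path/to/fix/global_hyblev.l128.txt"
  data_dir_input_grid     = "/path/to/gfsv16/RESTART"
  input_type              = "restart"
  convert_atm             = .true.
  convert_sfc             = .true.
  cycle_year              = 2021
  cycle_mon               = 8
  cycle_day               = 1
  cycle_hour              = 0
/
EOF
srun ./chgres_cube
```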
Closing this issue since the 2021080100 gdasfcst runs.
@RussTreadon-NOAA Thanks for confirming the cold start run is working. So far we have an IAU coupled test set up as follows: 1) run a cold start test with "CCPP_SUITE=FV3_GFS_v17_p8_ugwpv1"; 2) set up a warm start run using the restart files created in 1), use the IAU files for atmosphere and ocean, and start the run. The test is running at this time.
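In g-w terms, step 1) amounts to selecting the suite and a cold start in the experiment config before the first forecast (a sketch; which config file carries these variables, and the EXP_WARM_START name, are assumptions that may differ by g-w version):

```sh
# In the experiment config (e.g. config.base or config.fcst -- location
# assumed), pick the v17 suite and force a cold start for the first cycle.
export CCPP_SUITE="FV3_GFS_v17_p8_ugwpv1"
export EXP_WARM_START=".false."   # variable name assumed from g-w convention
```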
Thank you @junwang-noaa . Good to know about these tests. Atmospheric gdas-validation has not yet gotten to IAU. |
FYI, interesting behavior for the cold start 2021073118 gdasfcst. When [...]
The log file for the failed run is [...]
@RussTreadon-NOAA, FYI. @mdtoyNOAA is aware of this error in another case and is debugging it.
Thank you @junwang-noaa for letting me know that this is a known problem and is being actively investigated. |
FYI, I submitted Issue #136 in the ufs-community/ccpp-physics repository to hopefully solve this bug. |
What is wrong?
Various attempts to run ufs_model.x at C768L127 fail on Orion.

What should have happened?
ufs_model.x should be able to run on Orion at C768L127.

What machines are impacted?
Orion
Steps to reproduce
Additional information
Log files for various failed attempts are available in
/work2/noaa/da/rtreadon/gdas-validation/comrot/gdas_eval_satwind_JEDI/logs/2021080100/
Do you have a proposed solution?
I am currently tinkering with various task, ppn, and thread counts. No successful runs yet.
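A sketch of what that tinkering looks like in the experiment's config.resources (the npe_*/nth_* names follow the g-w convention of this era and are assumed; the values are illustrative, not known-good settings):

```sh
# Hypothetical forecast-job resource overrides in config.resources.
# Variable names assumed from g-w convention; values are examples only.
export npe_fcst=2688        # total MPI tasks for ufs_model.x
export npe_node_fcst=20     # MPI ranks per Orion node
export nth_fcst=2           # OpenMP threads per rank (ESMF-managed)
```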