
Unable to run C768L127 ufs_model.x on Orion #2134

Closed
RussTreadon-NOAA opened this issue Dec 7, 2023 · 27 comments

Comments

@RussTreadon-NOAA
Contributor

What is wrong?

Various attempts to run ufs_model.x at C768L127 fail on Orion.

What should have happened?

ufs_model.x should be able to run on Orion at C768L127.

What machines are impacted?

Orion

Steps to reproduce

  1. clone g-w develop
  2. execute checkout and build_all
  3. create EXPDIR and populate COMROT for C768L127
  4. submit gdasfcst (see the command sketch below)
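
For reference, a hedged sketch of these steps, assuming the usual g-w develop entry points of that period (script names and setup_expt.py options are illustrative and may differ from the exact invocation used here):

# clone and build
git clone https://github.com/NOAA-EMC/global-workflow
cd global-workflow/sorc
./checkout.sh        # fetch component repositories
./build_all.sh       # build ufs_model.x and the other executables
./link_workflow.sh   # link fix files and executables

# create the EXPDIR/COMROT for a C768L127 cycled experiment and generate the rocoto XML
# (option names vary by g-w version; check ./setup_expt.py --help)
cd ../workflow
./setup_expt.py gfs cycled --pslot mytest --resdet 768 --idate 2021080100 --edate 2021080106
./setup_xml.py /path/to/expdir/mytest

# advance the workflow; rocotorun submits gdasfcst once its dependencies are satisfied
rocotorun -w /path/to/expdir/mytest/mytest.xml -d /path/to/expdir/mytest/mytest.db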

Additional information

Log files for various failed attempts are available in /work2/noaa/da/rtreadon/gdas-validation/comrot/gdas_eval_satwind_JEDI/logs/2021080100/

Do you have a proposed solution?

I am currently tinkering with various task, ppn, and thread counts. No successful runs yet.

@RussTreadon-NOAA RussTreadon-NOAA added the bug and triage labels on Dec 7, 2023
@RussTreadon-NOAA
Contributor Author

Reverted the xml back to the one generated by g-w develop setup_xml.py with the g-w config* files. gdasfcst failed with a segmentation fault.

log file: /work2/noaa/da/rtreadon/gdas-validation/comrot/gdas_eval_satwind_JEDI/logs/2021080100/gdasfcst.log
run directory: /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_JEDI/fcst.234587

@RussTreadon-NOAA
Contributor Author

C384L127 ufs_model.x

A C384L127 gdasfcst was successfully run on Orion using the same g-w installation (/work2/noaa/da/rtreadon/gdas-validation/global-workflow) as the failed C768L127 run.

The g-w generated xml specified <nodes>32:ppn=40:tpp=1</nodes> for gdasfcst. The srun command used to run ufs_model.x was

+ exglobal_forecast.sh[149]: /bin/cp -p /work2/noaa/da/rtreadon/gdas-validation/global-workflow/exec/ufs_model.x /work2/noaa/stmp/rtreadon/RUNDIRS/pratmos/fcst.381480/
+ exglobal_forecast.sh[150]: srun -l --export=ALL -n 1280 /work2/noaa/stmp/rtreadon/RUNDIRS/pratmos/fcst.381480/ufs_model.x
   0:
   0:
   0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
   0:      PROGRAM ufs       HAS BEGUN. COMPILED       0.00     ORG: np23
   0:      STARTING DATE-TIME  DEC 07,2023  18:17:20.320  341  THU   2460286

The logfile for this run is /work2/noaa/stmp/rtreadon/comrot/pratmos/logs/2021081418/gdasfcst.log. The run directory is /work2/noaa/stmp/rtreadon/RUNDIRS/pratmos/fcst.381480.

@WalterKolczynski-NOAA WalterKolczynski-NOAA removed the triage label on Dec 7, 2023
@RussTreadon-NOAA
Contributor Author

C768L127 ufs_model.x

To change the number of threads, you should change the value of nthreads_fv3 and nthreads_fv3_gfs for the appropriate resolution in config.ufs. Those are the values passed to ESMF for threading. You'll need to regenerate the XML afterwards, since changing the threads will change the footprint.
Walter

config.ufs specifies

        export nthreads_fv3=4
        export nthreads_fv3_gfs=4

for C768. However, when setup_xml.py runs, it generates an xml file with <nodes>70:ppn=40:tpp=1</nodes>, and g-w runs ufs_model.x as

+ exglobal_forecast.sh[149]: /bin/cp -p /work2/noaa/da/rtreadon/gdas-validation/global-workflow/exec/ufs_model.x /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_JEDI/fcst.168575/
+ exglobal_forecast.sh[150]: srun -l --export=ALL -n 2800 /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_JEDI/fcst.168575/ufs_model.x
   0:
   0:
   0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
   0:      PROGRAM ufs       HAS BEGUN. COMPILED       0.00     ORG: np23
   0:      STARTING DATE-TIME  DEC 07,2023  18:39:38.843  341  THU   2460286

ufs_model.x eventually aborts with

   0:  ==============
   0:  final results
   0:  ==============
   0:  dbgx --fixratio: F F F F
   4:  ncells=           5
   4:  nlives=          12
   4:  nthresh=   18.0000000000000
 928: [Orion-02-72:187237:0:187237] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
 304: [Orion-02-56:304230:0:304404] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
1096: [Orion-03-04:340057:2:340211] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

I manually changed the xml to read <nodes>280:ppn=10:tpp=4</nodes> and resubmitted gdasfcst. ufs_model.x aborted with

1368: FATAL from PE   410: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic
1368:
1360:
1360: FATAL from PE   408: mpp_domains_define.inc: At least one pe in pelist is not used by any tile in the mosaic
1360:
1890: Image              PC                Routine            Line        Source
1890: ufs_model.x        0000000005CB2AF7  Unknown               Unknown  Unknown
1890: ufs_model.x        0000000005886739  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
1890: ufs_model.x        00000000058B99DF  mpp_domains_mod_m        1272  mpp_domains_define.inc
1890: ufs_model.x        0000000002FC6892  fv_mp_mod_mp_doma         644  fv_mp_mod.F90
1890: ufs_model.x        0000000002C24109  fv_control_mod_mp         666  fv_control.F90
1890: ufs_model.x        0000000002BB1D28  atmosphere_mod_mp         352  atmosphere.F90
1890: ufs_model.x        0000000002AB07C7  atmos_model_mod_m         565  atmos_model.F90
1890: ufs_model.x        000000000292F049  module_fcst_grid_         787  module_fcst_grid_comp.F90

@aerorahul
Contributor

(quoting @RussTreadon-NOAA's comment above in full)
@RussTreadon-NOAA
This is so because ESMF manages the threading in the forecast model. Even though tpp=1 is seen in the XML, the calculations that went into computing nodes and ppn account for the number of threads the executable is expected to run with.
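
As a rough illustration of that bookkeeping, using the numbers from the C768 logs above (arithmetic sketch only, assuming a 700-rank forecast decomposition and 40 cores per Orion node):

# ESMF-managed threading: the srun task count already folds the threads in
#   forecast MPI ranks x nthreads_fv3 = tasks handed to srun
#   700 x 4 = 2800                      (matches "srun ... -n 2800" above)
# and the node request is that total divided by the cores per node
#   2800 / 40 = 70                      -> <nodes>70:ppn=40:tpp=1</nodes>
echo "$((700 * 4)) tasks on $((2800 / 40)) nodes"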

@JessicaMeixner-NOAA
Contributor

The lack of threads in the xml file is because we are using ESMF-managed threading, so even though you do not see threads in the xml or the srun command, you should see in ufs.configure that you are using threads for the atm component.

I'm trying a C768 test case on orion for one of the HR* dates to see if I can run successfully or not. (I previously could).
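
One quick way to verify the ESMF-managed threading from a run directory (a sketch; the exact key names should be checked against the ufs.configure actually staged by the forecast job):

# the ATM section of ufs.configure should carry the thread count even though the
# XML and srun line show tpp=1
grep -E "ATM_(petlist_bounds|omp_num_threads)" ufs.configure
# expected output along the lines of:
#   ATM_petlist_bounds:   0 2799
#   ATM_omp_num_threads:  4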

@RussTreadon-NOAA
Contributor Author

Thank you @aerorahul and @JessicaMeixner-NOAA for your comments. @junwang-noaa explained ESMF thread management in an email thread. I misunderstood @WalterKolczynski-NOAA's reply as meaning that I needed to edit the xml. Good to know the xml generator is correct. I'll stand by to see if @JessicaMeixner-NOAA can successfully run C768L127 ufs_model.x on Orion.

@JessicaMeixner-NOAA
Contributor

@RussTreadon-NOAA my C768 test case also failed, but in a different way. The log file is here: /work2/noaa/marine/jmeixner/HR3/develop/devt01/COMROOT/devt01/logs/2019120300/gfsfcst.log
I ran APP=S2SW out of habit; I just checked and saw that you are running ATM, so I'll try that next. I wanted to let you know that I am also having issues with C768 on Orion. I can't say exactly when I last ran this exact case before today, but I have run it successfully on Orion before, so this is not expected.

@RussTreadon-NOAA
Contributor Author

Bummer, strike 2. Hmm, we'll need to poke around and figure out what's going on.

@JessicaMeixner-NOAA
Contributor

The S2S (/work2/noaa/marine/jmeixner/HR3/develop/s2s01/COMROOT/s2s01/logs/2019120300/gfsfcst.log) and ATM (/work2/noaa/marine/jmeixner/HR3/develop/atm01/COMROOT/atm01/logs/2019120300/gfsfcst.log) runs are 20 minutes in and still going. There might have been a configuration issue or something off when I added waves that I'll need to look into more. If these runs complete, then I'm wondering if it's a different IC issue and/or a difference in memory/configuration between the gdasfcst and the gfsfcst that's causing your issues @RussTreadon-NOAA.

@RussTreadon-NOAA
Contributor Author

Great news! Thank you @JessicaMeixner-NOAA for running this test. Now I have something against which to compare. I'll do so and hopefully find a mistake on my part.

@JessicaMeixner-NOAA
Contributor

@RussTreadon-NOAA let me know if you want me to run with a different IC or other configuration update if that'd be helpful. Everything's built and ready to go, so it should be easy to make more runs if needed.

@RussTreadon-NOAA
Contributor Author

Thank you @JessicaMeixner-NOAA for your kind offer. Your successful run opens several threads for me to explore. Let me check 'em out. If they don't resolve my issue, I may come back to you but I'm reluctant to take up any more of your valuable time ... especially if the problem with my test is me.

@RussTreadon-NOAA
Contributor Author

My gdas forecast is a warm_start=.true. run with MODE=cycled, whereas @JessicaMeixner-NOAA's successful gfs forecast is a cold start (warm_start=.false.) with MODE=forecast-only. This shouldn't matter, but it is a difference. Let me dig more.
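
A quick way to confirm that difference from the two run directories (a sketch; input.nml is the model namelist staged in each RUNDIR, and warm_start sits in its &fv_core_nml block):

# cycled gdasfcst run directory: expect warm_start = .true.
grep -i "warm_start" /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_JEDI/fcst.*/input.nml
# forecast-only gfsfcst run directory: expect warm_start = .false.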

@CatherineThomas-NOAA
Contributor

CatherineThomas-NOAA commented Dec 8, 2023

I have another piece of the puzzle. I am cycling ATM-only C384/C192 on Orion using g-w develop @ 8c11eeb (Nov 23). My fcst and efcs steps are occasionally failing but will succeed upon rerun. See this sample log directory: /work2/noaa/stmp/cthomas/ROTDIRS/v17spread2/logs/2023040518

In that directory, you can see that 2 out of 40 efcs tasks failed: enkfgdasefcs23 failed twice and enkfgdasefcs27 failed once. For efcs23, the first failure looks like:

+ exglobal_forecast.sh[150]: srun -l --export=ALL -n 240 /work2/noaa/stmp/cthomas/RUNDIRS/v17spread2/efcs.45395/mem045/ufs_model.x
 84: [Orion-25-40:271681:0:271681] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2abfd8dd1009)
 99: [Orion-25-40:271696:0:271696] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b5e40c2e009)
100: [Orion-25-40:271697:0:271697] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2b8cc6604009)
...
  9: [Orion-25-38:51616:0:51616]     address.c:1052 Assertion `*addr_version == UCP_OBJECT_VERSION_V2' failed: addr version 9
 39: [Orion-25-38:51647:0:51647]     address.c:1052 Assertion `*addr_version == UCP_OBJECT_VERSION_V2' failed: addr version 9
 35: [Orion-25-38:51643:0:51643]     address.c:1052 Assertion `*addr_version == UCP_OBJECT_VERSION_V2' failed: addr version 9
 34: [Orion-25-38:51642:0:51642]     address.c:1052 Assertion `*addr_version == UCP_OBJECT_VERSION_V2' failed: addr version 9
 26: [Orion-25-38:51634:0:51634]     address.c:1052 Assertion `*addr_version == UCP_OBJECT_VERSION_V2' failed: addr version 9
....
146: forrtl: error (76): Abort trap signal
146: Image              PC                Routine            Line        Source
146: ufs_model.x        0000000005CBEF2B  Unknown               Unknown  Unknown
146: libpthread-2.17.s  00002B79DD3115D0  Unknown               Unknown  Unknown
146: libc-2.17.so       00002B79DD76C2C7  gsignal               Unknown  Unknown
146: libc-2.17.so       00002B79DD76D9B8  abort                 Unknown  Unknown
146: libucs.so.0.0.0    00002B7B329C7C65  ucs_fatal_error_m     Unknown  Unknown
146: libucs.so.0.0.0    00002B7B329C7DF9  Unknown               Unknown  Unknown
146: libucp.so.0.0.0    00002B7B32509FCE  Unknown               Unknown  Unknown
146: libucp.so.0.0.0    00002B7B3250B7D3  ucp_address_unpac     Unknown  Unknown
146: libucp.so.0.0.0    00002B7B324BF4B2  ucp_ep_create         Unknown  Unknown
146: libmlx-fi.so       00002B7B3224E8DB  Unknown               Unknown  Unknown
146: libmpi.so.12.0.0   00002B79DC18FB75  Unknown               Unknown  Unknown
146: libmpi.so.12.0.0   00002B79DBCDD9DD  Unknown               Unknown  Unknown
146: libmpi.so.12.0.0   00002B79DBFF6D1D  PMPI_Init_thread      Unknown  Unknown
146: ufs_model.x        0000000000A36FEF  Unknown               Unknown  Unknown
146: ufs_model.x        00000000009522DF  Unknown               Unknown  Unknown
146: ufs_model.x        00000000009B9895  Unknown               Unknown  Unknown
146: ufs_model.x        0000000000C2DDB0  Unknown               Unknown  Unknown
146: ufs_model.x        0000000000AF3EBB  Unknown               Unknown  Unknown
146: ufs_model.x        0000000000AF65A8  Unknown               Unknown  Unknown
146: ufs_model.x        0000000000435E74  MAIN__                     94  UFS.F90
146: ufs_model.x        0000000000435B92  Unknown               Unknown  Unknown
146: libc-2.17.so       00002B79DD758495  __libc_start_main     Unknown  Unknown
146: ufs_model.x        0000000000435AA9  Unknown               Unknown  Unknown
...

along with a lot of other backtrace output. The second failure has a slightly different message but likely the same root cause:

+ exglobal_forecast.sh[150]: srun -l --export=ALL -n 240 /work2/noaa/stmp/cthomas/RUNDIRS/v17spread2/efcs.52132/mem045/ufs_model.x
 83: [1702008643.111281] [Orion-25-40:272473:0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
103: [1702008643.111281] [Orion-25-40:272493:0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
119: [1702008643.111269] [Orion-25-40:272509:0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
 81: [1702008643.111440] [Orion-25-40:272471:0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
 82: [1702008643.111337] [Orion-25-40:272472:0]         select.c:438  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
 82: Error in system call pthread_mutex_destroy: Device or resource busy
 82:     ../../src/mpi/init/init_thread_cs.c:66
 82: Abort(1090703) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
 82: MPIR_Init_thread(143)........:
 82: MPID_Init(1310)..............:
 82: MPIDI_OFI_mpi_init_hook(1974): OFI get address vector map failed

The third attempt in this case was successful.

@RussTreadon-NOAA Have you attempted to rerun any of your failed cases with no changes?
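
For reference, a no-change rerun of a single failed task is typically forced through rocoto along these lines (a sketch; the XML/database names and cycle are illustrative):

# rewind the failed task for that cycle, then boot it (or let the next rocotorun resubmit it)
rocotorewind -w v17spread2.xml -d v17spread2.db -c 202304051800 -t enkfgdasefcs23
rocotoboot   -w v17spread2.xml -d v17spread2.db -c 202304051800 -t enkfgdasefcs23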

@RussTreadon-NOAA
Contributor Author

@CatherineThomas-NOAA , thank you for sharing your experiences on Orion. Yes, I have rerun the C768L127 2021080100 gdasfcst case several times without changes. It fails in the same way. I was able to successfully run gdasfcst at C384L127 for a different case, 2021081418.

@RussTreadon-NOAA
Contributor Author

The 2021080100 gdasfcst is trying to warm start ufs_model.x from the operational C768L127 RESTART and atminc.nc.

config.ufs sets CCPP_SUITE="FV3_GFS_v17_p8_ugwpv1".

Can we warm start ufs_model.x from operational files using CCPP_SUITE=FV3_GFS_v17_p8_ugwpv1? Should I instead recompile and run the model with CCPP_SUITES="FV3_GFS_v16"?

@junwang-noaa
Contributor

@RussTreadon-NOAA We haven't done such a test yet. It looks to me like you are running atm-only. I'd suggest using debug mode to see if there is an issue with the new physics package (adding "-DDEBUG=ON" as a compile option).

@yangfanglin @mdtoyNOAA FYI.
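
One way to pass that flag when rebuilding only the model (a sketch; the ufs-weather-model top-level build.sh reads CMAKE_FLAGS/CCPP_SUITES, but the g-w wrapper scripts of the time may expose a different switch for this):

# rebuild ufs_model.x in debug mode inside the g-w source tree
cd global-workflow/sorc/ufs_model.fd
CMAKE_FLAGS="-DAPP=ATM -DDEBUG=ON" CCPP_SUITES="FV3_GFS_v17_p8_ugwpv1" ./build.sh
# then re-link the executable into global-workflow/exec (e.g. via link_workflow.sh)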

@RussTreadon-NOAA
Contributor Author

Thank you @junwang-noaa for the suggestion. ufs_model.x was rebuilt with -DDEBUG=ON. Traceback below:

   0:  in radiation_clouds_prop=           8 F           4 F F           2           1
   0:  in radiation_clouds_prop=           8 F           4 F F           2           1
   0:  in radiation_clouds_prop=           8 F           4 F F           2           1
 544: forrtl: severe (408): fort: (3): Subscript #1 of the array SNOW_LEVEL_ICE has value -9998 which is less than the lower bound of -2
 544:
 544: forrtl: severe (408): fort: (3): Subscript #1 of the array SNOW_LEVEL_ICE has value -9998 which is less than the lower bound of -2
 544:
 544: forrtl: severe (408): fort: (3): Subscript #1 of the array SNOW_LEVEL_ICE has value -66665 which is less than the lower bound of -2
 544:
 544: *** longjmp causes uninitialized stack frame ***: /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_JEDI/fcst.343314/ufs_model.x terminated

While it may be possible to get past the SNOW_LEVEL_ICE problem, I expect more problems to arise.

I don't think we can run ufs_model.x with CCPP_SUITE=FV3_GFS_v17_p8_ugwpv1 from operational C768L127 restarts for 2021080100. I base this on conversations with others in EMC. The current develop of the model in g-w cannot run from operational GFS restarts without significant preprocessing.

Let me discuss with the DAD team the best path forward. We were using operational files for gdas-validation since operations is the control against which we are validating JEDI-based DA. We need to ensure that the JEDI-based h(x) and solver are comparable to and consistent with the GSI-based h(x) and solver.

@RussTreadon-NOAA
Contributor Author

A rebuild of ufs_model.x with FV3_GFS_v16 included in the list of suites confirms that this alone is not sufficient. The C768L127 2021080100 gdasfcst failed with

   0: NOTE from PE     0: No CA_sgs restarts - cold starting CA
  80: An error occurred in ccpp_physics_init
  80: An error occured in fv_sat_adj_init: Logic error: fv_sat_adj_init is called but do_sat_adj is set to false
  80:
  80: FATAL from PE    20: Call to CCPP physics_init step failed
  80:
 120: An error occurred in ccpp_physics_init
 120: An error occured in fv_sat_adj_init: Logic error: fv_sat_adj_init is called but do_sat_adj is set to false
 120:
 120: FATAL from PE    30: Call to CCPP physics_init step failed
 120:

This is consistent with side conversations on this issue. It looks like I'll need to go through config and/or other files to get everything set properly for ufs_model.x to run in this mode. We don't maintain (and probably don't want to maintain) this mode in g-w develop. g-w develop supports GFS v17 and beyond.

Using g-w dev/gfs16 won't work since this branch doesn't include JEDI g-w updates.

@RussTreadon-NOAA RussTreadon-NOAA removed the bug label on Dec 8, 2023
@RussTreadon-NOAA
Contributor Author

Removed the bug label since the problem I encountered is not a bug. I am trying to do something which g-w develop does not support out of the box.

The following steps got the 2021080100 gdasfcst running

  • chgres_cube operational RESTART/20210801.00 to create cold start input tiles
  • cold start 2021080100 gdasfcst using existing g-w setup from cold start input

The gdasfcst is currently running and likely will not finish within the specified 1 hour wall clock time.

I am not proficient with chgres_cube.sh, so I'm not 100% confident I correctly configured everything.

If we want the first cycle of gdas-validation to be 2021080100, we could (see the sketch below)

  • pull operational C768L127 RESTART for 2021073118 from HPSS
  • chgres_cube the 2021073118 RESTART to create cold start input tiles
  • cold start 2021073118 gdasfcst to generate the background files needed by the 2021080100 DA cycle

These background files will likely be noisy since we are cold starting the FV3_GFS_v17_p8_ugwpv1 suite from FV3_GFS_v16 restarts.
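
A rough sketch of the first two steps (the HPSS archive path, tarball name, and chgres_cube settings are placeholders, not verified values):

# 1. pull the operational C768L127 restart set for 2021073118 from HPSS
htar -xvf /NCEPPROD/hpssprod/runhistory/<path-to-2021073118-archive>/<restart-tarball>.tar

# 2. run chgres_cube (UFS_UTILS) on the extracted RESTART tiles to produce the
#    cold-start inputs (gfs_ctrl.nc, gfs_data.tile*.nc, sfc_data.tile*.nc) on the
#    C768 grid; the namelist options for reading GFSv16 restarts (input_type, etc.)
#    are version dependent, so consult the UFS_UTILS chgres_cube documentation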

@RussTreadon-NOAA
Contributor Author

Closing this issue since the 2021080100 gdasfcst runs.

@junwang-noaa
Contributor

@RussTreadon-NOAA Thanks for confirming the cold start run is working. So far we have an IAU coupled test set up as follows: 1) run a cold start test with CCPP_SUITE=FV3_GFS_v17_p8_ugwpv1; 2) set up a warm start run using the restart files created in 1), use the IAU files for the atmosphere and ocean, and start the run. The test is running at this time.

@RussTreadon-NOAA
Contributor Author

Thank you @junwang-noaa . Good to know about these tests. Atmospheric gdas-validation has not yet gotten to IAU.

@RussTreadon-NOAA
Contributor Author

FYI, interesting behavior for the cold start 2021073118 gdasfcst.

When ufs_model.x is compiled with -DDEBUG=ON, the cold start 2021073118 gdasfcst aborts with

 136:  enter get_nggps_ic is= 641 ie= 704 js=  97 je= 144 isd= 638 ied= 707 jsd=  94 jed= 147
 136: [Orion-16-03:22592:0:22797] Caught signal 8 (Floating point exception: floating-point invalid operation)
 136: ==== backtrace (tid:  22797) ====
 136:  0 0x000000000940f102 cires_tauamf_data_mp_tau_amf_interp_()  /work2/noaa/da/rtreadon/gdas-validation/global-workflow/sorc/ufs_model.fd/FV3/ccpp/physics/physics/cires_tauamf_data.F90:138
 136:  1 0x00000000091cd980 L_gfs_phys_time_vary_mp_gfs_phys_time_vary_timestep_init__846__par_region0_2_39()  /work2/noaa/da/rtreadon/gdas-validation/global-workflow/sorc/ufs_model.fd/FV3/ccpp/physics/physics/GFS_phys_time_vary.fv3.F90:963
 136:  2 0x000000000013fbb3 __kmp_invoke_microtask()  ???:0
 136:  3 0x00000000000bb903 __kmp_invoke_task_func()  /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:7813
 136:  4 0x00000000000ba912 __kmp_launch_thread()  /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:6236
 136:  5 0x000000000014083c _INTERNALa0ac8784::__kmp_launch_worker()  /nfs/site/proj/openmp/promo/20211013/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/z_Linux_util.cpp:533
 136:  6 0x0000000000007dd5 start_thread()  pthread_create.c:0
 136:  7 0x00000000000fe02d __clone()  ???:0
 136: =================================
 136: forrtl: error (75): floating point exception
 136: Image              PC                Routine            Line        Source
 136: ufs_model.x        0000000011EE3A3B  Unknown               Unknown  Unknown
 136: libpthread-2.17.s  00002AB1D2CC85D0  Unknown               Unknown  Unknown
 136: ufs_model.x        000000000940F102  cires_tauamf_data         138  cires_tauamf_data.F90
 136: ufs_model.x        00000000091CD980  gfs_phys_time_var         963  GFS_phys_time_vary.fv3.F90
 136: libiomp5.so        00002AB1D020DBB3  __kmp_invoke_micr     Unknown  Unknown
 136: libiomp5.so        00002AB1D0189903  Unknown               Unknown  Unknown
 136: libiomp5.so        00002AB1D0188912  Unknown               Unknown  Unknown
 136: libiomp5.so        00002AB1D020E83C  Unknown               Unknown  Unknown
 136: libpthread-2.17.s  00002AB1D2CC0DD5  Unknown               Unknown  Unknown
 136: libc-2.17.so       00002AB1D31EB02D  clone                 Unknown  Unknown

The log file for the failed run is /work2/noaa/da/rtreadon/gdas-validation/comrot/gdas_eval_satwind_JEDI/logs/2021073118/gdasfcst.log.1. The failed job ran in /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_JEDI/fcst.32437

ufs_model.x was recompiled without -DDEBUG=ON and the 2021073118 gdasfcst rerun. The optimized build successfully completed in 644.529317 seconds. The log file for the successful run is /work2/noaa/da/rtreadon/gdas-validation/comrot/gdas_eval_satwind_JEDI/logs/2021073118/gdasfcst.log.0. The successful job ran in /work/noaa/stmp/rtreadon/RUNDIRS/gdas_eval_satwind_JEDI/fcst.3748.

@junwang-noaa
Contributor

@RussTreadon-NOAA, FYI. @mdtoyNOAA is aware of this error in another case and is debugging it.

@RussTreadon-NOAA
Contributor Author

@RussTreadon-NOAA, FYI. @mdtoyNOAA is aware of this error in another case and is debugging it.

Thank you @junwang-noaa for letting me know that this is a known problem and is being actively investigated.

@mdtoyNOAA
Contributor

FYI, I submitted Issue #136 in the ufs-community/ccpp-physics repository to hopefully solve this bug.
