Replies: 4 comments
-
Hi @erinaj16 , In general, we encourage users to run on
Best,
-
Hi @erinaj16 , @christopherwharrop-noaa kindly provided a detailed explanation of the likely problem, along with some references. I have included it below, edited somewhat for brevity.

I think our main question is what you mean in this case by "substantially different" results? Chris indicated that: (1) users can't expect the same results on different partitions that don't have the same hardware, and (2) the results are probably "the same" when statistically verified against truth (e.g., using METplus verification techniques or comparing to an ensemble). However, it is possible that this is a bug, so it would be useful to know more about how you are performing your comparisons.

Chris' detailed explanation: NCAR developed an ensemble method to check whether two forecasts generated on different hardware/compilers were "the same." With this method, an ensemble is created by running the code with various selections of compilers/hardware. Forecasts from code ported to new architectures are typically validated by checking whether they are consistent with that ensemble or not. If a difference is statistically outside the ensemble, then a red flag is raised. Otherwise, it is considered "correct." Here are a few resources on the topic:
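To make the idea concrete, here is a minimal sketch of the kind of ensemble consistency check described above. This is not NCAR's actual implementation; the function name `within_ensemble`, the 3-sigma threshold, and the toy random fields are all illustrative assumptions.

```python
import numpy as np

def within_ensemble(new_forecast, ensemble, n_sigma=3.0):
    """Return a boolean mask that is True at grid points where the
    new forecast lies inside the ensemble spread (mean +/- n_sigma*std).

    `ensemble` has shape (member, ...); `new_forecast` has shape (...).
    """
    mean = ensemble.mean(axis=0)
    std = ensemble.std(axis=0, ddof=1)
    # Guard against division by zero where the ensemble is degenerate.
    z = np.abs(new_forecast - mean) / np.maximum(std, 1e-12)
    return z <= n_sigma

# Toy example: 20 "members" of random noise standing in for forecasts
# produced with different compiler/hardware combinations.
rng = np.random.default_rng(0)
ensemble = rng.normal(0.0, 1.0, size=(20, 50, 50))
candidate = rng.normal(0.0, 1.0, size=(50, 50))
ok = within_ensemble(candidate, ensemble)
```

A ported forecast whose departures stay inside the mask everywhere (or nearly everywhere) would be considered consistent with the ensemble; a coherent region of `False` values is the "red flag" case.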
Best,
-
Thank you @gspetro-NOAA and @christopherwharrop-noaa for your insight! Here is the difference in zonal wind (U) at ~250 hPa after 3h, 9h, and 99h of model integration. (I just subtracted the xjet forecast minus the vjet forecast, though I can look into the further verification techniques that you suggested.)

The differences at 3h seem to be beyond what I would expect from bitwise differences. For example, near the Bahamas/eastern FL coast, where there is a hurricane present, the difference seems more based on the dynamics of the region. By 99h, the differences are large globally (exceeding 1 m/s over large portions of the globe, and up to a ~30 m/s difference). This result is changing the location and magnitude of several jet features in my forecast.

Note: I am using 4DIAU (with the same increment files between partitions) with a 6h window, meaning that the forecast should still be incrementally adjusted towards the analysis for the 3h difference plot but should be outside the adjustment window for the 9h and 99h plots.
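The pointwise comparison described above can also be reduced to a few summary numbers per lead time, which makes growth over forecast hours easier to track than eyeballing maps. A minimal sketch with synthetic stand-in arrays (the values below are illustrative, not actual model output, which would normally be read from the forecast files with a library such as netCDF4 or xarray):

```python
import numpy as np

# Hypothetical stand-ins for U at ~250 hPa from the two runs.
u_xjet = np.array([[10.0, 12.5], [30.0, -5.0]])
u_vjet = np.array([[10.0, 12.0], [29.0, -5.5]])

diff = u_xjet - u_vjet            # pointwise difference in m/s
max_abs = np.abs(diff).max()      # largest-magnitude difference anywhere
rms = np.sqrt(np.mean(diff**2))   # one summary number per lead time
```

Plotting `max_abs` and `rms` against lead time (3h, 9h, 99h, ...) would show whether the divergence grows the way chaotic error growth typically does.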
-
At first glance, none of these results look alarming to me. Differences, whether they are caused by bugs or by variations in platform, grow over time. I do not think it is useful to compare 99h output; all bets are off after that much computation. I know those differences look pretty big, but I'm not convinced they're wrong, particularly since they are just point-wise differences in U. I'm not saying there isn't a problem; I'm just saying there isn't enough evidence to draw that conclusion yet.

You could try building the code with debug options, turning off all optimization (for both systems) and features like FMA and vectorization, just to see what that does; it might help you get closer to bitwise reproducibility. But, in the long run, bitwise reproducibility across platforms is not a realistic expectation, and other methods of comparison (see references for approaches) should be used.
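As an illustration of the kind of debug build suggested above, here is a sketch of Intel (classic compiler) flags that disable optimization, FMA, and vectorization. This is an assumption about what such a build might look like, not the UFS project's documented configuration; the exact flags and where to set them depend on the build system and compiler version.

```shell
# Illustrative Intel compiler flags for a floating-point-consistent
# debug build: no optimization (-O0), debug symbols (-g), consistent
# FP semantics (-fp-model consistent), no fused multiply-add
# (-no-fma), no auto-vectorization (-no-vec).
FFLAGS_DEBUG="-O0 -g -fp-model consistent -no-fma -no-vec"
echo "FFLAGS_DEBUG=${FFLAGS_DEBUG}"
```

Building both the vjet/sjet and xjet executables with the same set of flags like these, then rerunning the 3h comparison, would help separate compiler-optimization effects from genuine bugs.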
-
I have been trying to run a C384 UFS forecast on Jet, but I have been running into a reproducibility issue when I run it on different partitions. I am running the forecast as a single-threaded task with an 8x8 layout and 2 write groups with 6 write tasks per group. When I run the forecast on the vjet or sjet partitions, I get the same forecast. However, when I run the same forecast with the exact same setup on the xjet partition, I get substantially different results. The forecast is reproducible on the same partition (e.g., if I reran the forecast on xjet using the same inputs, I would get the same result as the original xjet run), just not between most partitions.

Does anyone else have experience with this type of issue? Is the issue with how I am setting up the layout, or with how the architecture is set up across partitions on Jet? Alternatively, is there a compiler option (using Intel) I should be using to ensure reproducibility across partitions? Thank you for any help you can give!
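For reference, the MPI task count implied by the setup above can be worked out as follows. This is a minimal sketch assuming the standard FV3 six-tile cubed-sphere decomposition (the variable names are illustrative, not actual UFS namelist keys):

```python
# FV3's cubed-sphere grid has 6 tiles; the layout applies per tile.
tiles = 6
layout_x, layout_y = 8, 8
write_groups = 2
write_tasks_per_group = 6

compute_tasks = layout_x * layout_y * tiles          # 8*8*6 = 384
write_tasks = write_groups * write_tasks_per_group   # 2*6 = 12
total_tasks = compute_tasks + write_tasks            # 396
print(total_tasks)
```

Since the same task count and layout are used on every partition here, the decomposition itself is identical between runs, which points toward hardware/compiler differences rather than the layout as the source of the divergence.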