Replies: 4 comments
-
Hi @erinaj16 , In general, we encourage users to run on
Best,
-
Hi @erinaj16 , @christopherwharrop-noaa kindly provided a detailed explanation of the likely problem, along with some references. I have included it below, edited somewhat for brevity.

I think our main question is what you mean in this case by "substantially different" results? Chris indicated that: (1) users can't expect the same results on different partitions that don't have the same hardware, and (2) the results are probably "the same" when statistically verified against truth (e.g., using METplus verification techniques or comparing to an ensemble). However, it is possible that this is a bug, so it would be useful to know more about how you are performing your comparisons.

Chris' detailed explanation: NCAR developed an ensemble method to check whether two forecasts generated on different hardware/compilers were "the same." With this method, an ensemble is created by running the code with various selections of compilers/hardware. Forecasts from code ported to new architectures are typically validated by checking whether they are consistent with that ensemble or not. If a difference is statistically outside the ensemble, then a red flag is raised. Otherwise, it is considered "correct." Here are a few resources on the topic:
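To make the idea concrete, here is a minimal sketch of the kind of ensemble consistency check described above. This is not NCAR's actual implementation; the function name `within_ensemble`, the 3-sigma threshold, and the toy random fields are all illustrative assumptions.

```python
import numpy as np

def within_ensemble(new_forecast, ensemble, n_sigma=3.0):
    """Return a boolean mask that is True at grid points where the
    new forecast lies inside the ensemble spread (mean +/- n_sigma*std).

    `ensemble` has shape (member, ...); `new_forecast` has shape (...).
    """
    mean = ensemble.mean(axis=0)
    std = ensemble.std(axis=0, ddof=1)
    # Guard against division by zero where the ensemble is degenerate.
    z = np.abs(new_forecast - mean) / np.maximum(std, 1e-12)
    return z <= n_sigma

# Toy example: 20 "members" of random noise standing in for forecasts
# produced with different compiler/hardware combinations.
rng = np.random.default_rng(0)
ensemble = rng.normal(0.0, 1.0, size=(20, 50, 50))
candidate = rng.normal(0.0, 1.0, size=(50, 50))
ok = within_ensemble(candidate, ensemble)
```

A ported forecast whose departures stay inside the mask everywhere (or nearly everywhere) would be considered consistent with the ensemble; a coherent region of `False` values is the "red flag" case.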
Best,
-
Thank you @gspetro-NOAA and @christopherwharrop-noaa for your insight! Here is the difference in zonal wind (U) at ~250 hPa after 3h, 9h, and 99h of model integration. (I just subtracted the xjet forecast minus the vjet forecast, though I can look into the further verification techniques that you suggested.)

The differences at 3h seem to be beyond what I would expect from bitwise differences. For example, near the Bahamas/eastern FL coast, where there is a hurricane present, the difference seems more based on the dynamics of the region. By 99h, the differences are large globally (exceeding 1 m/s over large portions of the globe, and up to a ~30 m/s difference). This result is changing the location and magnitude of several jet features in my forecast.

Note: I am using 4DIAU (with the same increment files between partitions) with a 6h window, meaning that the forecast should still be incrementally adjusted towards the analysis for the 3h difference plot but should be outside the adjustment window for the 9h and 99h plots.
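The pointwise comparison described above can also be reduced to a few summary numbers per lead time, which makes growth over forecast hours easier to track than eyeballing maps. A minimal sketch with synthetic stand-in arrays (the values below are illustrative, not actual model output, which would normally be read from the forecast files with a library such as netCDF4 or xarray):

```python
import numpy as np

# Hypothetical stand-ins for U at ~250 hPa from the two runs.
u_xjet = np.array([[10.0, 12.5], [30.0, -5.0]])
u_vjet = np.array([[10.0, 12.0], [29.0, -5.5]])

diff = u_xjet - u_vjet            # pointwise difference in m/s
max_abs = np.abs(diff).max()      # largest-magnitude difference anywhere
rms = np.sqrt(np.mean(diff**2))   # one summary number per lead time
```

Plotting `max_abs` and `rms` against lead time (3h, 9h, 99h, ...) would show whether the divergence grows the way chaotic error growth typically does.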
-
At first glance, none of these results look alarming to me. Differences, whether they are caused by bugs or by variations in platform, grow over time. I do not think it is useful to compare 99h output; all bets are off after that much computation. I know those differences look pretty big, but I'm not convinced they're wrong, particularly since they are just point-wise differences in U. I'm not saying there isn't a problem; I'm just saying there isn't enough evidence to draw that conclusion yet.

You could try building the code with debug options, turning off all optimization (for both systems) and features like FMA and vectorization, just to see what that does; it might help you get closer to bitwise reproducibility. But, in the long run, bitwise reproducibility across platforms is not a realistic expectation, and other methods of comparison (see references for approaches) should be used.
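As an illustration of the kind of debug build suggested above, here is a sketch of Intel (classic compiler) flags that disable optimization, FMA, and vectorization. This is an assumption about what such a build might look like, not the UFS project's documented configuration; the exact flags and where to set them depend on the build system and compiler version.

```shell
# Illustrative Intel compiler flags for a floating-point-consistent
# debug build: no optimization (-O0), debug symbols (-g), consistent
# FP semantics (-fp-model consistent), no fused multiply-add
# (-no-fma), no auto-vectorization (-no-vec).
FFLAGS_DEBUG="-O0 -g -fp-model consistent -no-fma -no-vec"
echo "FFLAGS_DEBUG=${FFLAGS_DEBUG}"
```

Building both the vjet/sjet and xjet executables with the same set of flags like these, then rerunning the 3h comparison, would help separate compiler-optimization effects from genuine bugs.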
-
I have been trying to run a C384 UFS forecast on Jet, but I have been running into a reproducibility issue when I run it on different partitions. I am running the forecast as a single-threaded task with an 8x8 layout and 2 write groups with 6 write tasks per group. When I run the forecast on the vjet or sjet partitions, I get the same forecast. However, when I run the same forecast with the exact same setup on the xjet partition, I get substantially different results. The forecast is reproducible on the same partition (e.g., if I reran the forecast on xjet using the same inputs, I would get the same result as the original xjet run), just not between most partitions.

Does anyone else have experience with this type of issue? Is the issue with how I am setting up the layout, or with how the architecture is set up across partitions on Jet? Alternatively, is there a compiler option (using Intel) I should be using to ensure reproducibility across partitions? Thank you for any help you can give!
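For reference, the MPI task count implied by the setup above can be worked out as follows. This is a minimal sketch assuming the standard FV3 six-tile cubed-sphere decomposition (the variable names are illustrative, not actual UFS namelist keys):

```python
# FV3's cubed-sphere grid has 6 tiles; the layout applies per tile.
tiles = 6
layout_x, layout_y = 8, 8
write_groups = 2
write_tasks_per_group = 6

compute_tasks = layout_x * layout_y * tiles          # 8*8*6 = 384
write_tasks = write_groups * write_tasks_per_group   # 2*6 = 12
total_tasks = compute_tasks + write_tasks            # 396
print(total_tasks)
```

Since the same task count and layout are used on every partition here, the decomposition itself is identical between runs, which points toward hardware/compiler differences rather than the layout as the source of the divergence.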