WIP: Fix chunk size when compression is soft-disabled #905

Open · wants to merge 1 commit into base: develop

Conversation

@pgrete (Collaborator) commented Jul 11, 2023

PR Summary

In #899 I disabled the deflate filter when compression was soft disabled.
I missed that setting the chunk size was also tied to this logic, so soft-disabled compression resulted in a chunk size of {1,1,1,1,1,1}, which tanked IO write performance.
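
For illustration, a minimal sketch of the intended logic (function and variable names are made up, not the actual Parthenon code): the chunk dimensions follow the block shape regardless of whether the deflate filter is attached.

// Sketch only: chunk to the full block in all cases; previously the dims stayed
// at {1,1,1,1,1,1} when compression was soft-disabled, which tanked writes.
#include <hdf5.h>
#include <array>

hid_t MakeDatasetCreatePlist(const std::array<hsize_t, 6> &block_dims,
                             bool compression_enabled, int compression_level) {
  hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(plist, 6, block_dims.data());
  if (compression_enabled && compression_level > 0) {
    H5Pset_deflate(plist, static_cast<unsigned>(compression_level));
  }
  return plist;
}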

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

@Yurlungur (Collaborator) left a comment:

One thing to note: these chunk sizes are all global, which means they're the wrong shape for, e.g., face-centered fields. When @lroberts36's new topological elements stuff is fully threaded through the I/O, we will need to make chunking per-field.

@pgrete pgrete changed the title Fix chunk size when compression is soft-disabled WIP: Fix chunk size when compression is soft-disabled Jul 11, 2023
@pgrete (Collaborator Author) commented Jul 11, 2023

Back to WIP given the open questions around chunking.

@brtnfld your input would be appreciated.

The key hdf5 question is: What's the best chunking size based on experience?

By default we set the chunking to {1,1,1,nx3, nx2, nx1} which (IIRC) was originally motivated by being able to compress a full block.
Now with block sizes of 128^3 and larger (or even 64^3) this results in chunks of >1M and I've read that this is typically discouraged.
Would it be worth adding a target chunk size and then setting the chunk dims accordingly (based on contiguous pieces of data)?
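
For illustration only, a sketch of what such a target-based heuristic could look like (the function name and the 1 MiB default target are assumptions, not anything Parthenon does today):

// Grow the chunk from the fastest-varying dimension outward until a byte
// budget is reached; dims = (n_blocks, n_vars, nx3, nx2, nx1).
#include <hdf5.h>
#include <algorithm>
#include <array>

std::array<hsize_t, 5> ChunkDimsForTarget(const std::array<hsize_t, 5> &dims,
                                          size_t elem_bytes,
                                          size_t target_bytes = 1 << 20) {
  std::array<hsize_t, 5> chunk = {1, 1, 1, 1, 1};
  size_t bytes = elem_bytes;
  for (int d = 4; d >= 0; --d) {
    hsize_t take = std::min<hsize_t>(dims[d], std::max<size_t>(1, target_bytes / bytes));
    chunk[d] = take;
    bytes *= take;
    // Stop once the budget is used up or the chunk no longer spans contiguous data.
    if (bytes >= target_bytes || take < dims[d]) break;
  }
  return chunk;
}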

Another performance question is: What's the best (practice) set of parameters with respect to the interplay of

  • chunk size
  • cb_buffer_size
  • stripe size (on lustre)
  • stripe count (on lustre)
  • number of aggregators

under the assumption that we typically write >1GB per rank (given the current amounts of GPU memory available and that we use one rank per GPU).

@brtnfld (Contributor) commented Jul 14, 2023

Unfortunately, there is no set standard for optimal chunk size. I suggest starting with what is convenient and tuning that. I'm assuming you are using collective IO? If so, in terms of IO, the number of chunks should not matter for I/O to disk because HDF5 will combine those chunk writes to a single MPI write, and the performance of that write could be improved by using more aggregators. Many chunks could hurt the performance due to the metadata increase so you can increase the metadata cache at the expense of some memory increase.
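
For reference, a minimal sketch of the metadata-cache suggestion, applied to the file access property list (the 16/64 MiB values are placeholders, not a recommendation from this thread):

#include <hdf5.h>

void BumpMetadataCache(hid_t fapl) {
  H5AC_cache_config_t cfg;
  cfg.version = H5AC__CURR_CACHE_CONFIG_VERSION;  // must be set before the get
  H5Pget_mdc_config(fapl, &cfg);                  // start from current settings
  cfg.set_initial_size = 1;                       // TRUE
  cfg.initial_size = 16 * 1024 * 1024;
  cfg.max_size = 64 * 1024 * 1024;
  H5Pset_mdc_config(fapl, &cfg);
}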

You also might set your alignments (H5Pset_alignment) to a multiple or equal to the lustre stripe size and the chunk size in multiples of the stripe size, if possible.
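
A minimal sketch of that alignment suggestion (the 1 MiB threshold and 16M stripe size are assumed values):

#include <hdf5.h>

void AlignToStripe(hid_t fapl) {
  // Align every object of at least 1 MiB to the Lustre stripe size (16 MiB here).
  H5Pset_alignment(fapl, 1024 * 1024, 16 * 1024 * 1024);
}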

For Frontier specifically, OLCF has said, in the past, that the file system is optimized for file-per-process. And we have found that using the default progressive file layout was not well suited (diplomatically speaking) for a single shared file. I've been mainly setting the Lustre parameters that OLCF suggested:

512 GB+ lfs setstripe -c 8 -p capacity -S 16M
1 TB+ lfs setstripe -c 16 -p capacity -S 16M
8 TB+ lfs setstripe -c 64 -p capacity -S 16M
16 TB+ lfs setstripe -c 128 -p capacity -S 16M

You should also investigate using the new subfiling HDF5 feature, allowing you to use the node-local storage on Frontier. I've been using subfiling to node-local storage and the h5fuse.sh tool from the source to create a single HDF5 on the global filesystem. We have yet to test parallel compression with subfiling, but we plan on doing that shortly. But it should work with the most recent changes made to HDF5.
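
For reference, a sketch of opening files through the subfiling VFD, assuming HDF5 >= 1.14 built with the subfiling feature enabled (NULL selects the default configuration; per-subfile stripe settings can also be passed via H5FD_subfiling_config_t):

#include <hdf5.h>
#include <mpi.h>

hid_t MakeSubfilingFapl(MPI_Comm comm, MPI_Info info) {
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_mpi_params(fapl, comm, info);  // MPI params are required for this VFD
  H5Pset_fapl_subfiling(fapl, NULL);    // NULL -> default subfiling config
  return fapl;
}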

If you have further issues, it would be good to get the Darshan logs.

@BenWibking (Collaborator) commented:

Is this the best reference for the subfiling feature: https://github.com/HDFGroup/hdf5doc/blob/master/RFCs/HDF5_Library/VFD_Subfiling/user_guide/HDF5_Subfiling_VFD_User_s_Guide.pdf?

Is it necessary to combine the files in order to read them? Or should reading work as long as the reader also uses HDF5 built with subfiling?

@pgrete (Collaborator Author) commented Jul 17, 2023

Thanks for the input. This is very helpful.

Unfortunately, there is no set standard for optimal chunk size. I suggest starting with what is convenient and tuning that. I'm assuming you are using collective IO? If so, in terms of IO, the number of chunks should not matter for I/O to disk because HDF5 will combine those chunk writes to a single MPI write, and the performance of that write could be improved by using more aggregators. Many chunks could hurt the performance due to the metadata increase so you can increase the metadata cache at the expense of some memory increase.

I wasn't aware that chunking won't have an impact (except for the metadata). That's good to know and removes one parameter that would need to be optimized.
Given that (for now) all our data (except for particles) is dominated by "blocks", what's your recommendation here in terms of chunks?
Blocks contain contiguous data and sizes range from 16^3 to 256^3, each containing an arbitrary number of variables (where the data for each variable is contiguous).
In other words, our output format has dimensions (n_blocks, n_vars, nx3, nx2, nx1), where data is contiguous for the last three dimensions (which is why we used a chunk size of (nx3, nx2, nx1) until now), with n_blocks easily reaching thousands and n_vars several tens.

You also might set your alignments (H5Pset_alignment) to a multiple or equal to the lustre stripe size and the chunk size in multiples of the stripe size, if possible.

I'll give this a try.

For Frontier specifically, OLCF has said, in the past, that the file system is optimized for file-per-process. And we have found that using the default progressive file layout was not well suited (diplomatically speaking) for a single shared file. I've been mainly setting the Lustre parameters that OLCF suggested:

Interesting, that's the first machine I've come across for some time that recommends 'file-per-process'.

512 GB+ lfs setstripe -c 8 -p capacity -S 16M 1 TB+ lfs setstripe -c 16 -p capacity -S 16M 8 TB+ lfs setstripe -c 64 -p capacity -S 16M 16 TB+ lfs setstripe -c 128 -p capacity -S 16M

You should also investigate using the new subfiling HDF5 feature, allowing you to use the node-local storage on Frontier. I've been using subfiling to node-local storage and the h5fuse.sh tool from the source to create a single HDF5 on the global filesystem. We have yet to test parallel compression with subfiling, but we plan on doing that shortly. But it should work with the most recent changes made to HDF5.

At the moment I'm less concerned about compression and more concerned about total IO performance, so this looks worth a try.
Did I read correctly that it was only introduced with 1.14.0?
How do you use it on Frontier given that the available modules are somewhat older?

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  cray-hdf5-parallel:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        cray-hdf5-parallel/1.12.1.5
        cray-hdf5-parallel/1.12.2.1
        cray-hdf5-parallel/1.12.2.3


Are you using a custom compiled version (and if yes, do I need to take special care when compiling one myself)?

If you have further issues, it would be good to get the Darshan logs.

I attached a log file from run with 256 processes (GPUs) writing data to a directory with 128 OSTs and a buffer size of 16M (no chunking or compression): pgrete_athenaPK_nochunk_id1377719-101627_7-13-12712-15700819957037524677_1.darshan.pdf

@brtnfld (Contributor) commented Jul 17, 2023

Is this the best reference for the subfiling feature: https://github.com/HDFGroup/hdf5doc/blob/master/RFCs/HDF5_Library/VFD_Subfiling/user_guide/HDF5_Subfiling_VFD_User_s_Guide.pdf?

Yes.

Is it necessary to combine the files in order to read them? Or should reading work as long as the reader also uses HDF5 built with subfiling?

No, you can read from the subfiles without combining them into an HDF5 file if you use the subfiling file driver.

@brtnfld (Contributor) commented Jul 17, 2023

Thanks for the input. This is very helpful.

Unfortunately, there is no set standard for optimal chunk size. I suggest starting with what is convenient and tuning that. I'm assuming you are using collective IO? If so, in terms of IO, the number of chunks should not matter for I/O to disk because HDF5 will combine those chunk writes to a single MPI write, and the performance of that write could be improved by using more aggregators. Many chunks could hurt the performance due to the metadata increase so you can increase the metadata cache at the expense of some memory increase.

I wasn't aware the chunking won't have an impact (except for the Metadata). That's good to know and removes one parameter that'd need to be optimized. Given that (for now) all our data (except for particles) is dominated by "blocks", what's your recommendation here in terms of chunks? Blocks contain contiguous data data and sizes range from 16^3 to 256^3 each containing an arbitrary number of variables (where the data for each variable is contiguous). In other words, our output format has dimension (n_blocks, n_vars, nx3, nx2, nx1) where data is contiguous for the last thee dimension (which is why we used a chunk size of (nx3,nx2,nx1) until now with n_blocks easily reaching thousands and n_vars several tens.

Is each rank writing a hyperslab section of nx3,nx2,nx1?

You also might set your alignments (H5Pset_alignment) to a multiple or equal to the lustre stripe size and the chunk size in multiples of the stripe size, if possible.

I'll give this a try.

For Frontier specifically, OLCF has said, in the past, that the file system is optimized for file-per-process. And we have found that using the default progressive file layout was not well suited (diplomatically speaking) for a single shared file. I've been mainly setting the Lustre parameters that OLCF suggested:

Interesting, that's the first machine I've come across for some time that recommends 'file-per-process'.

512 GB+ lfs setstripe -c 8 -p capacity -S 16M 1 TB+ lfs setstripe -c 16 -p capacity -S 16M 8 TB+ lfs setstripe -c 64 -p capacity -S 16M 16 TB+ lfs setstripe -c 128 -p capacity -S 16M
You should also investigate using the new subfiling HDF5 feature, allowing you to use the node-local storage on Frontier. I've been using subfiling to node-local storage and the h5fuse.sh tool from the source to create a single HDF5 on the global filesystem. We have yet to test parallel compression with subfiling, but we plan on doing that shortly. But it should work with the most recent changes made to HDF5.

At the moment I'm less concerned about compression and more concerned about total IO performance, so this looks worth a try. Did I read correctly that it was only introduced with 1.14.0? How do you use in on Frontier given that the available modules are somewhat older

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  cray-hdf5-parallel:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        cray-hdf5-parallel/1.12.1.5
        cray-hdf5-parallel/1.12.2.1
        cray-hdf5-parallel/1.12.2.3

Are you using a custom compiled version (and if yes, do I need to take special care when compiling one myself)?

Nothing special needs to be done when building HDF5. For subfiling, you need to enable it at build time.

BTW, I'm reviewing and answering HPC/subfiling questions at:
https://forum.hdfgroup.org/t/using-the-subfiling-h5fuse-sh-tool-join-scot-breitenfeld-on-call-the-doctor-tuesday-7-18/11329

If you have further issues, it would be good to get the Darshan logs.

I attached a log file from run with 256 processes (GPUs) writing data to a directory with 128 OSTs and a buffer size of 16M (no chunking or compression): pgrete_athenaPK_nochunk_id1377719-101627_7-13-12712-15700819957037524677_1.darshan.pdf

@BenWibking (Collaborator) commented:

At the moment I'm less concerned about compression and more concerned about total IO performance, so this looks worth a try. Did I read correctly that it was only introduced with 1.14.0? How do you use in on Frontier given that the available modules are somewhat older

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  cray-hdf5-parallel:
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        cray-hdf5-parallel/1.12.1.5
        cray-hdf5-parallel/1.12.2.1
        cray-hdf5-parallel/1.12.2.3

Are you using a custom compiled version (and if yes, do I need to take special care when compiling one myself)?

Nothing special needs to be done when building HDF5. For subfiling, you need to enable it at build time.

The documentation says you have to add -DHDF5_ENABLE_SUBFILING_VFD=ON when building HDF5 1.14+ to enable subfiling. So presumably this means we can't use the prebuilt modules on Frontier.

@brtnfld (Contributor) commented Jul 18, 2023

Right, you won't be able to use the installed modules. Looking at the Darshan report, many small writes will kill performance, and I'm not sure where they are coming from. When you install 1.14, can you get another Darshan report? It looks like you are doing independent IO, so you could try collective and see if that is any better.

@Yurlungur (Collaborator) commented:

@pgrete perhaps it is Params killing I/O performance, as those are attributes, which go down a different code path

@brtnfld (Contributor) commented Jul 18, 2023

Attributes get written to the heap, so they should be written as something other than small writes. It might be that we had issues with collective metadata in 1.12.

@pgrete (Collaborator Author) commented Jul 21, 2023

A couple of additional notes:
a) I saw that there's a recording of your subfiling session. I put it on my watch list as I couldn't attend in person.
b) We are definitely doing collective IO (to be absolutely sure, I even force-enabled it; see the sketch at the end of this comment). All the small writes (as far as I can tell at the moment) come from me piping stdout to disk (for my own records). I can certainly run a job without that.
c) Writing all the header/params/attributes is not the bottleneck. With the new profiling regions it turns out that >99% of the time is spent in this function

void HDF5WriteND(hid_t location, const std::string &name, const T *data, int rank,
calling 3 HDF5 library functions; see, e.g., (at a smaller scale):

TOP-DOWN TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
=================== 
|-> 1.16e+02 sec 87.7% 0.0% 0.0% 0.8% 2.16e-01 2 MakeOutputs [region]
|   |-> 6.58e+01 sec 49.9% 0.0% 0.3% 0.3% 0.00e+00 2 PHDF5::WriteOutputFileRealPrec [region]
|   |   |-> 6.55e+01 sec 49.7% 0.0% 0.0% 0.6% 0.00e+00 2 write all variable data [region]
|   |   |   |-> 6.51e+01 sec 49.4% 0.0% 0.3% 0.1% 0.00e+00 2 write variable loop [region]
|   |   |       |-> 6.47e+01 sec 49.1% 0.0% 0.4% 100.0% 0.00e+00 2 write variable data [region]
|   |   |       |-> 3.54e-01 sec 0.3% 0.0% 8.1% 100.0% 0.00e+00 2 fill host output buffer [region]
|   |-> 4.88e+01 sec 37.0% 0.0% 0.2% 0.6% 0.00e+00 2 PHDF5::WriteOutputFileSinglePrec [region]
|   |   |-> 4.85e+01 sec 36.8% 0.0% 0.0% 0.4% 0.00e+00 2 write all variable data [region]
|   |   |   |-> 4.83e+01 sec 36.6% 0.0% 0.2% 0.1% 0.00e+00 4 write variable loop [region]
|   |   |       |-> 4.79e+01 sec 36.3% 0.0% 0.2% 100.0% 0.00e+00 4 write variable data [region]
|   |   |       |-> 3.67e-01 sec 0.3% 0.0% 6.5% 100.0% 0.00e+00 4 fill host output buffer [region]
|-> 1.55e+01 sec 11.7% 87.2% 0.0% 0.2% 2.61e+02 1 Driver_Main [region]
|   |-> 1.53e+01 sec 11.6% 88.4% 1.0% 3.1% 2.65e+02 104 MultiStage_Step [region]

Note that the 2 in the line | | | |-> 6.47e+01 sec 49.1% 0.0% 0.4% 100.0% 0.00e+00 2 write variable data [region] means that this region is only called twice but accounts for 49.1% of the total runtime in the given example. Don't take those numbers as a reference; they are just an example to indicate where the issue is.
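
For reference, the force-enabling mentioned in point b) looks roughly like this (a sketch; the actual Parthenon call site may differ):

#include <hdf5.h>

hid_t MakeCollectiveXferPlist() {
  hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);  // request collective I/O for H5Dwrite
  return xfer;
}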

@pgrete (Collaborator Author) commented Jul 24, 2023

Here is a more recent Darshan report from a larger job: darshan.pdf
2048 compute ranks writing files of 3.1T or 4.6T with 128 OSTs (and writing ranks), without chunking or compression, and a stripe size of 16M.
Aggregate write performance is ~140 GB/s (so it again boils down to ~1.1 GB/s per writing rank).
As you can see, IO walltime is also insignificant compared to compute walltime, so it's not that big of an issue, but for other use cases the solution is likely subfiling (especially with larger output sizes, as the number of OSTs should not be increased much further according to OLCF support).

@pgrete (Collaborator Author) commented Jul 25, 2023

One more data point: Based on the assumption that there are too few ranks for too much data when using collective buffering, I disabled it via export MPICH_MPIIO_HINTS="*:romio_cb_write=disable".
Based on the walltime reported by Parthenon, this decreased the time spent in IO by 2-3x (with the caveat that this doesn't really align with my interpretation of the Darshan report; see the new one here: darshan_no_cbwrite.pdf).

@brtnfld what's your take on this? Does this line up with your expectations? Is there any harm in disabling collective buffering for writes? Does this behavior point at something else?

@brtnfld (Contributor) commented Jul 27, 2023

I would expect that disabling romio_cb_write would hurt performance, but there is no harm in doing so if it helps the I/O performance.

@roblatham00 commented:

Hey @brtnfld tagged me on this discussion. I've also seen cray's HDF5 with oddly high amounts of small independent writes, despite me asking for (and confirming via property list inspection) collective I/O.

I'd like to take a closer look at your darshan log: can you grant 'robl' permissions to read it on Frontier? If you chmod o+r your darshan logs, I can tell you a bit more about what's going on here.

@pgrete (Collaborator Author) commented Jul 28, 2023

Thanks for taking a closer look @roblatham00.
I adjusted the read permissions:

  • with collective writes: /lustre/orion/darshan/frontier/2023/7/23/pgrete_athenaPK_nochunk_id1383704-40294_7-23-83080-10520148136350124180_1.darshan
  • without collective writes: /lustre/orion/darshan/frontier/2023/7/25/pgrete_athenaPK_nochunk_id1385124-118009_7-25-47993-18003036132098865802_2.darshan

@roblatham00 commented:

There must be more going on in these two runs than just disabling MPI-IO collectives, right? I see quite different command lines

Here's a new-ish darshan feature called "heatmaps" which show time of accesses, intensity of accesses, and which ranks are involved. They can be a bit dense.

The fast run only makes one MPI-IO call --
(heatmap screenshot omitted)

the slow run makes several MPI-IO calls:
(heatmap screenshot omitted)

Other things that stand out in this darshan log:

  • both configurations would benefit from collective metadata
  • even the fast case is getting slammed by many many tiny writes:
Access Size | Count
-- | --
20 | 884736
66 | 688128
19 | 688128
179 | 294912

Unfortunately these darshan logs do not contain Lustre data or HDF5 module information. I guess OLCF did not enable those features.

You mentioned that I/O time is not your main concern right now, and I agree. Since I'm thinking about I/O tuning, I'll add this here for reference:

These access patterns suggest something else is going on here. There are a few cray-mpich tuning knobs that are supposed to work really well for Lustre. If I had seen more sensible access patterns but we still wanted to improve performance, here's my 2023 list of tuning parameters specific to cray-mpich and Lustre (a sketch of passing such hints through HDF5 follows the list):

  • cray_cb_write_lock_mode -- "2" enables lockahead
  • cray_cb_nodes_multiplier -- "2" or more, but highly pattern dependent. Instead of one single MPI-IO process per aggregator node, run a few more (the idea being you can get a bit more network activity between the node and storage).
  • striping_factor -- progressive file layout is a sucker's game. If you forget to increase the stripe count, PFL will prevent you from writing a terabyte of data through one OST, but that's about it. If you know you're going to use parallel I/O, I would lfs setstripe to -1 and get as many OSTs involved as possible.
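
Here is the promised sketch of feeding such hints to HDF5 through an MPI_Info object on the file access property list; the hint values (and setting striping_factor this way, which only takes effect at file creation) are illustrative:

#include <hdf5.h>
#include <mpi.h>

hid_t MakeTunedFapl(MPI_Comm comm) {
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "cray_cb_write_lock_mode", "2");   // lockahead
  MPI_Info_set(info, "cray_cb_nodes_multiplier", "2");  // more aggregators per node
  MPI_Info_set(info, "striping_factor", "256");         // large stripe count

  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, info);
  MPI_Info_free(&info);
  return fapl;
}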

@roblatham00 commented:

Oh, now I see a bit more about why the darshan logs have been hard to interpret and why so many tiny writes show up.... there's some kind of compiler step going on? The XDMF files also have a strange access pattern, but that accounts for only a few seconds. I see 1600+ files opened collectively and a whole bunch of files opened individually, so there is a lot going on.

If I look only at .phdf and .rhdf5 files, the accesses look more like what I'd expect to see.

now there is one strange thing: let's take a closer look at slow-mode parthenon.restart.final.rhdf:

  • POSIX_F_MAX_WRITE_TIME 2.187628
  • MPIIO_F_MAX_WRITE_TIME 101.936236

some process spent 100 seconds in collective writes, but only 2 seconds of that time was writing. The remaining 99 seconds? That suggests either high amounts of load imbalance (some process entered the collective say 90 seconds after everyone else did) or the two-phase aggregation is taking a lot longer than expected.

With collective I/O disabled, there isn't a whole lot the MPI-IO layer does, and the darshan timings reflect this:

  • POSIX_F_MAX_WRITE_TIME 13.467156
  • MPIIO_F_MAX_WRITE_TIME 14.358698

OK, so now we're back to the original question: why are the collectives so slow? We're all new to progressive file layout but these 4 TiB files should have some seriously large lustre stripe counts. Scot's suggested stripe counts seem like a good start but there are 1350 lustre server processes running on 450 storage nodes (three lustre servers per node).

Now I've convinced myself we really do want to use those tuning parameters I mentioned in my last post after all, particularly cray_cb_write_lock_mode and cray_cb_nodes_multiplier. I also don't know how cb_nodes -- the number of I/O proxies -- is selected in the presence of progressive file layout. One more reason to just set the stripe count to something enormous.

@pgrete (Collaborator Author) commented Sep 29, 2023

Thanks for the detailed analysis.
I now got to (or was forced to) look at this in more detail again, as we got burned at larger scale again (spending ~15 minutes writing a 2.7TB file across ~2000 nodes).

It looks like hdf5 chunking is the key piece that kills performance for us.
In the given output the raw data is 17088x10x128x128x128 with chunk size originally set to 1x1x128x128x128 (as 128x128x128 is the block size containing information of a single component of a variable vector with 10 components).

I was able to run a smaller interactive job (~500 nodes) and got the following results:

  • baseline (hdf5 chunking enabled, 128 OST, block size 16M, MPIIO collective writes disabled): 253 seconds
> <average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
> =================== 
> |   |-> 2.68e+02 sec 74.5% 0.0% 0.8% 0.1% 0.00e+00 1 PHDF5::WriteOutputFileRealPrec [region]
> |   |   |-> 2.53e+02 sec 70.6% 0.0% 0.0% 0.1% 0.00e+00 1 write all variable data [region]         <- for reference this is where the 2.7TB of data are being written. Everything else is (our) metadata/attributes/properties
> |   |   |   |-> 2.53e+02 sec 70.5% 0.0% 0.0% 0.0% 0.00e+00 1 write variable loop [region]
> |   |   |       |-> 2.53e+02 sec 70.4% 0.0% 0.0% 100.0% 0.00e+00 1 write variable data [region]
> |   |   |-> 9.53e+00 sec 2.7% 0.0% 32.6% 0.0% 0.00e+00 1 write Attributes [region]
> |   |   |   |-> 9.43e+00 sec 2.6% 0.0% 34.0% 100.0% 0.00e+00 1 write Info [region]
> |   |   |-> 4.02e+00 sec 1.1% 0.0% 81.0% 100.0% 0.00e+00 1 write block metadata [region]
> |   |   |-> 5.81e-01 sec 0.2% 0.0% 0.0% 100.0% 0.00e+00 1 write mesh coords [region]
  • hdf5 chunking disabled, 128 OST, block size 16M, MPIIO collective writes disabled: 8.87 seconds
> <average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
> =================== 
> |   |-> 2.34e+01 sec 30.8% 0.0% 9.2% 1.1% 0.00e+00 1 PHDF5::WriteOutputFileRealPrec [region]
> |   |   |-> 9.49e+00 sec 12.5% 0.0% 34.2% 0.0% 0.00e+00 1 write Attributes [region]
> |   |   |   |-> 9.49e+00 sec 12.5% 0.0% 34.2% 100.0% 0.00e+00 1 write Info [region]
> |   |   |-> 8.87e+00 sec 11.7% 0.0% 0.0% 2.3% 0.00e+00 1 write all variable data [region]
> |   |   |   |-> 8.67e+00 sec 11.4% 0.0% 0.1% 0.4% 0.00e+00 1 write variable loop [region]
> |   |   |       |-> 8.44e+00 sec 11.1% 0.0% 0.4% 100.0% 0.00e+00 1 write variable data [region]
> |   |   |       |-> 1.99e-01 sec 0.3% 0.0% 9.5% 100.0% 0.00e+00 1 fill host output buffer [region]
> |   |   |-> 4.16e+00 sec 5.5% 0.0% 74.2% 100.0% 0.00e+00 1 write block metadata [region]
> |   |   |-> 5.87e-01 sec 0.8% 0.0% 0.0% 100.0% 0.00e+00 1 write mesh coords [region]
  • hdf5 chunking disabled, 128 OST, block size 16M, MPIIO collective writes enabled: 23.4 seconds (so collective buffering is slower than without)
> <average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
> =================== 
> |   |-> 3.67e+01 sec 41.8% 0.0% 6.2% 0.7% 0.00e+00 1 PHDF5::WriteOutputFileRealPrec [region]
> |   |   |-> 2.34e+01 sec 26.6% 0.0% 0.0% 0.9% 0.00e+00 1 write all variable data [region]
> |   |   |   |-> 2.32e+01 sec 26.4% 0.0% 0.0% 0.1% 0.00e+00 1 write variable loop [region]
> |   |   |       |-> 2.30e+01 sec 26.1% 0.0% 0.2% 100.0% 0.00e+00 1 write variable data [region]
> |   |   |       |-> 1.99e-01 sec 0.2% 0.0% 8.8% 100.0% 0.00e+00 1 fill host output buffer [region]
> |   |   |-> 9.28e+00 sec 10.6% 0.0% 34.6% 0.0% 0.00e+00 1 write Attributes [region]
> |   |   |   |-> 9.28e+00 sec 10.6% 0.0% 34.6% 100.0% 0.00e+00 1 write Info [region]
> |   |   |-> 3.30e+00 sec 3.8% 0.0% 95.4% 100.0% 0.00e+00 1 write block metadata [region]
> |   |   |-> 4.29e-01 sec 0.5% 0.0% 0.0% 100.0% 0.00e+00 1 write mesh coords [region]
  • hdf5 chunking disabled, 256 OST, block size 8M, MPIIO collective writes disabled: 5.83 seconds (so increasing number of OST improves write performance further)
> |   |-> 2.05e+01 sec 30.5% 0.0% 13.0% 1.3% 0.00e+00 1 PHDF5::WriteOutputFileRealPrec [region]
> |   |   |-> 9.79e+00 sec 14.6% 0.0% 34.1% 0.0% 0.00e+00 1 write Attributes [region]
> |   |   |   |-> 9.79e+00 sec 14.6% 0.0% 34.1% 100.0% 0.00e+00 1 write Info [region]
> |   |   |-> 5.83e+00 sec 8.7% 0.0% 0.0% 3.5% 0.00e+00 1 write all variable data [region]
> |   |   |   |-> 5.63e+00 sec 8.4% 0.0% 0.1% 0.6% 0.00e+00 1 write variable loop [region]
> |   |   |       |-> 5.40e+00 sec 8.0% 0.0% 0.5% 100.0% 0.00e+00 1 write variable data [region]
> |   |   |       |-> 2.00e-01 sec 0.3% 0.0% 8.9% 100.0% 0.00e+00 1 fill host output buffer [region]
> |   |   |-> 4.27e+00 sec 6.3% 0.0% 76.9% 100.0% 0.00e+00 1 write block metadata [region]
> |   |   |-> 3.33e-01 sec 0.5% 0.0% 0.0% 100.0% 0.00e+00 1 write mesh coords [region]

Any idea why chunking has such a dramatic impact on performance?
Ideally, we'd like to keep chunking in order to also use compression eventually.

@roblatham00 commented:

Chunked reads at scale are performance-broken in HDF5 ( see HDFGroup/hdf5#2658 -- independent reads sneak in despite me asking HDF5 in every way I know how to behave collectively) but I don't know of a similar scaling problem for writes off the top of my head..

Could you once again grant me permissions to read these darshan logs:

/lustre/orion/darshan/frontier/2023/9/29/pgrete_athenaPK_id1451288-50146_9-29-22978-18340123443611133723_2.darshan
/lustre/orion/darshan/frontier/2023/9/29/pgrete_athenaPK_id1451288-50353_9-29-23543-6893089715697044285_2.darshan
/lustre/orion/darshan/frontier/2023/9/29/pgrete_athenaPK_id1451288-50805_9-29-24266-4252811279667951939_1.darshan
/lustre/orion/darshan/frontier/2023/9/29/pgrete_athenaPK_id1451288-50987_9-29-24473-12871652074578960594_3.darshan

@pgrete (Collaborator Author) commented Sep 29, 2023

Chunked reads at scale are performance-broken in HDF5 ( see HDFGroup/hdf5#2658 -- independent reads sneak in despite me asking HDF5 in every way I know how to behave collectively) but I don't know of a similar scaling problem for writes off the top of my head..

I noticed that reading the non-chunked data was faster than the chunked data, though the difference was not too dramatic/prohibitive.

Could you once again grant me permissions to read these darshan logs:

/lustre/orion/darshan/frontier/2023/9/29/pgrete_athenaPK_id1451288-50146_9-29-22978-18340123443611133723_2.darshan
/lustre/orion/darshan/frontier/2023/9/29/pgrete_athenaPK_id1451288-50353_9-29-23543-6893089715697044285_2.darshan
/lustre/orion/darshan/frontier/2023/9/29/pgrete_athenaPK_id1451288-50805_9-29-24266-4252811279667951939_1.darshan
/lustre/orion/darshan/frontier/2023/9/29/pgrete_athenaPK_id1451288-50987_9-29-24473-12871652074578960594_3.darshan

Done. The files are in order with the numbers I reported above.

@roblatham00 commented:

Some observations: these darshan logs don't have HDF5 or Lustre information but still tell a tale. There are very many files in the darshan log but I'll focus on the biggest (parthenon.restart.final.rhdf)

Case 1: baseline (hdf5 chunking enabled, 128 OST, block size 16M, MPIIO collective writes disabled): 253 seconds

  • I see ten million writes of size 512k.
  • I also see an equal amount of bytes read as written
  • There is an MPI-IO optimization called "data sieving" where two things happen: 1 -- each MPI-IO client reads 512 bytes, modifies that buffer, then writes it out; 2 -- each MPI-IO client locks the file to prevent concurrent updates.
  • the slowest POSIX write is 6 seconds while the slowest MPI-IO write (which wraps those POSIX calls) is 252 seconds. you disabled collective i/o so this time is all waiting for locks.

I'm surprised this only took 200-some seconds.

Case 2: hdf5 chunking disabled, 128 OST, block size 16M, MPIIO collective writes disabled: 8.87 seconds

  • we still have a small amount of read-modify-write happening, but the total bytes read is 100000x less than total bytes written
  • most common access size is 4k (12000 times)
  • slowest POSIX write is 8 seconds while slowest MPI-IO write is also about 8 seconds. so MPI-IO calls are dispatched directly to POSIX calls. no noncontiguous optimizations happening here

Case 3: "Ok, I turned on collective I/O and got 3x worse performance" hdf5 chunking disabled, 128 OST, block size 16M, MPIIO collective writes enabled: 23.4 seconds (so collective buffering is slower than without)

  • why are there operations on a file in the 'cbwriteoff' directory if this is the "collective i/o enabled" case?
  • still a little bit of reads here.
  • lots of 16 MiB writes, which is what I'd expect from collective buffering

I guess this is what I was saying over the summer, and confirmed by your 128 OST vs 256 OST experiment: just not enough aggregators lighting up the storage system. You have 500 nodes and 4272 processes, but this MPI-IO driver will only use 128 of them to talk to the file system. You don't want all 4272 processes blasting away and tripping over themselves, but you probably do want 1 or 2 per node. This is where the "cb_lock_mode" and "cb_nodes_multiplier" hints are supposed to help. Or crank up the stripe count to 1000.

Have you run a "chunking on / collective on" configuration ?

@pgrete (Collaborator Author) commented Oct 2, 2023

Thanks for the detailed analysis.
I got another interactive job and did some further testing (though today everything seemed to be a lot slower).
Darshan logfiles should be world readable (/lustre/orion/darshan/frontier/2023/10/2/pgrete_athenaPK_id1453348-10*) and are again in order (slightly related: the difference in the number of reads you noticed above is related to a change in the input file. I switched to reading one that did not include chunking as it increased startup/read performance -- which is in agreement with the other issue you posted).

For the new tests (I now also collected the MPI-IO config as reported when writing the file):

  • hdf5 chunking enabled, 128 OST, block size 16M, MPIIO collective writes enabled: 1180 seconds
|   |-> 1.18e+03 sec 96.7% 0.0% 0.2% 0.0% 0.00e+00 1 PHDF5::WriteOutputFileRealPrec [region]
|   |   |-> 1.16e+03 sec 95.5% 0.0% 0.0% 0.0% 0.00e+00 1 write all variable data [region]
|   |   |   |-> 1.16e+03 sec 95.5% 0.0% 0.0% 0.0% 0.00e+00 1 write variable loop [region]
|   |   |       |-> 1.16e+03 sec 95.5% 0.0% 0.0% 100.0% 0.00e+00 1 write variable data [region]
|   |   |-> 1.02e+01 sec 0.8% 0.0% 34.7% 0.0% 0.00e+00 1 write Attributes [region]
|   |   |   |-> 1.02e+01 sec 0.8% 0.0% 34.8% 100.0% 0.00e+00 1 write Info [region]
|   |   |-> 3.64e+00 sec 0.3% 0.0% 94.4% 100.0% 0.00e+00 1 write block metadata [region]

PE 0: MPIIO hints for parthenon.restart.final.rhdf:
          romio_cb_pfr             = disable
          romio_cb_fr_types        = aar
          cb_align                 = 2
          cb_buffer_size           = 16777216
          romio_cb_fr_alignment    = 1
          romio_cb_ds_threshold    = 0
          romio_cb_alltoall        = automatic
          romio_cb_read            = automatic
          romio_cb_write           = automatic
          romio_no_indep_rw        = false
          romio_ds_write           = automatic
          ind_wr_buffer_size       = 524288
          romio_ds_read            = disable
          ind_rd_buffer_size       = 4194304
          direct_io                = false
          striping_factor          = 128
          striping_unit            = 16777216
          romio_lustre_start_iodevice = -1
          aggregator_placement_stride = -1
          abort_on_rw_error        = disable
          cb_config_list           = *:*
          cb_nodes                 = 128
          cray_fileoff_based_aggr  = false
          romio_filesystem_type    = CRAY ADIO:

  • hdf5 chunking disabled, 256 OST, block size 8M, MPIIO collective writes enabled, cray_cb_write_lock_mode=2, cray_cb_nodes_multiplier=2: 536 seconds
|   |-> 5.51e+02 sec 94.7% 0.0% 0.4% 0.0% 0.00e+00 1 PHDF5::WriteOutputFileRealPrec [region]
|   |   |-> 5.36e+02 sec 92.2% 0.0% 0.0% 0.0% 0.00e+00 1 write all variable data [region]
|   |   |   |-> 5.36e+02 sec 92.1% 0.0% 0.0% 0.0% 0.00e+00 1 write variable loop [region]
|   |   |       |-> 5.36e+02 sec 92.1% 0.0% 0.0% 100.0% 0.00e+00 1 write variable data [region]
|   |   |-> 1.04e+01 sec 1.8% 0.0% 35.4% 0.0% 0.00e+00 1 write Attributes [region]
|   |   |   |-> 1.04e+01 sec 1.8% 0.0% 35.4% 100.0% 0.00e+00 1 write Info [region]
|   |   |-> 3.81e+00 sec 0.7% 0.0% 99.4% 100.0% 0.00e+00 1 write block metadata [region]

PE 0: MPIIO hints for parthenon.restart.final.rhdf:
          romio_cb_pfr             = disable
          romio_cb_fr_types        = aar
          cb_align                 = 2
          cb_buffer_size           = 16777216
          romio_cb_fr_alignment    = 1
          romio_cb_ds_threshold    = 0
          romio_cb_alltoall        = automatic
          romio_cb_read            = automatic
          romio_cb_write           = automatic
          romio_no_indep_rw        = false
          romio_ds_write           = automatic
          ind_wr_buffer_size       = 524288
          romio_ds_read            = disable
          ind_rd_buffer_size       = 4194304
          direct_io                = false
          striping_factor          = 256
          striping_unit            = 8388608
          romio_lustre_start_iodevice = -1
          aggregator_placement_stride = -1
          abort_on_rw_error        = disable
          cb_config_list           = *:*
          cray_cb_nodes_multiplier = 2
          cray_cb_write_lock_mode  = 2
          cray_fileoff_based_aggr  = false
          romio_filesystem_type    = CRAY ADIO:
          cb_nodes                 = 512

  • hdf5 chunking disabled, 256 OST, block size 8M, MPIIO collective writes disabled, cray_cb_write_lock_mode=2, cray_cb_nodes_multiplier=2: 286 seconds
|   |-> 2.86e+02 sec 91.2% 0.0% 0.8% 0.1% 0.00e+00 1 PHDF5::WriteOutputFileRealPrec [region]
|   |   |-> 2.70e+02 sec 86.1% 0.0% 0.0% 0.1% 0.00e+00 1 write all variable data [region]
|   |   |   |-> 2.70e+02 sec 86.0% 0.0% 0.0% 0.0% 0.00e+00 1 write variable loop [region]
|   |   |       |-> 2.70e+02 sec 86.0% 0.0% 0.0% 100.0% 0.00e+00 1 write variable data [region]
|   |   |-> 1.02e+01 sec 3.3% 0.0% 35.4% 0.0% 0.00e+00 1 write Attributes [region]
|   |   |   |-> 1.02e+01 sec 3.3% 0.0% 35.4% 100.0% 0.00e+00 1 write Info [region]
|   |   |-> 4.90e+00 sec 1.6% 0.0% 96.8% 100.0% 0.00e+00 1 write block metadata [region]


PE 0: MPIIO hints for parthenon.restart.final.rhdf:
          romio_cb_pfr             = disable
          romio_cb_fr_types        = aar
          cb_align                 = 2
          cb_buffer_size           = 16777216
          romio_cb_fr_alignment    = 1
          romio_cb_ds_threshold    = 0
          romio_cb_alltoall        = automatic
          romio_cb_read            = automatic
          romio_cb_write           = disable
          romio_no_indep_rw        = false
          romio_ds_write           = automatic
          ind_wr_buffer_size       = 524288
          romio_ds_read            = disable
          ind_rd_buffer_size       = 4194304
          direct_io                = false
          striping_factor          = 256
          striping_unit            = 8388608
          romio_lustre_start_iodevice = -1
          aggregator_placement_stride = -1
          abort_on_rw_error        = disable
          cb_config_list           = *:*
          cray_cb_nodes_multiplier = 2
          cray_cb_write_lock_mode  = 2
          cray_fileoff_based_aggr  = false
          romio_filesystem_type    = CRAY ADIO:
          cb_nodes                 = 512
  • hdf5 chunking disabled, 256 OST, block size 8M, MPIIO collective writes disabled, cray_cb_write_lock_mode=unset, cray_cb_nodes_multiplier=unset: 274 seconds
|   |-> 2.90e+02 sec 91.3% 0.0% 0.7% 0.1% 0.00e+00 1 PHDF5::WriteOutputFileRealPrec [region]
|   |   |-> 2.74e+02 sec 86.4% 0.0% 0.0% 0.1% 0.00e+00 1 write all variable data [region]
|   |   |   |-> 2.74e+02 sec 86.3% 0.0% 0.0% 0.0% 0.00e+00 1 write variable loop [region]
|   |   |       |-> 2.74e+02 sec 86.3% 0.0% 0.0% 100.0% 0.00e+00 1 write variable data [region]
|   |   |-> 1.03e+01 sec 3.2% 0.0% 36.1% 0.0% 0.00e+00 1 write Attributes [region]
|   |   |   |-> 1.03e+01 sec 3.2% 0.0% 36.1% 100.0% 0.00e+00 1 write Info [region]
|   |   |-> 4.80e+00 sec 1.5% 0.0% 76.3% 100.0% 0.00e+00 1 write block metadata [region]
|   |   |-> 3.52e-01 sec 0.1% 0.0% 0.1% 100.0% 0.00e+00 1 write mesh coords [region]

PE 0: MPIIO hints for parthenon.restart.final.rhdf:
          romio_cb_pfr             = disable
          romio_cb_fr_types        = aar
          cb_align                 = 2
          cb_buffer_size           = 16777216
          romio_cb_fr_alignment    = 1
          romio_cb_ds_threshold    = 0
          romio_cb_alltoall        = automatic
          romio_cb_read            = automatic
          romio_cb_write           = disable
          romio_no_indep_rw        = false
          romio_ds_write           = automatic
          ind_wr_buffer_size       = 524288
          romio_ds_read            = disable
          ind_rd_buffer_size       = 4194304
          direct_io                = false
          striping_factor          = 256
          striping_unit            = 8388608
          romio_lustre_start_iodevice = -1
          aggregator_placement_stride = -1
          abort_on_rw_error        = disable
          cb_config_list           = *:*
          cb_nodes                 = 256
          cray_fileoff_based_aggr  = false
          romio_filesystem_type    = CRAY ADIO:

The latter case is the identical setup to the last case of the previous set.
I don't get why it took almost 5 minutes this time, when it only took <6 seconds last time...

@roblatham00 commented:

First off, sorry for ghosting for the last couple weeks: was getting a workshop presentation together.

Let's start with "collective + chunking" -- the configuration that should be pretty fast: base-stripe128-bs16-cbwriteon/parthenon.restart.final.rhdf is the file of interest, right? the slowest writer took 1163.095682 seconds at the MPI-IO level and only 7.043302 seconds at the POSIX level. something is indeed weird there....

The Darshan log shows what we'd expect: a lot of MPI-IO requests getting transformed into nicer Lustre writes.

The log doesn't confirm how many OSTs are involved. I'll have to trust you that it really is 128.

There are a lot of 512 byte independent reads at the MPI-IO level: that looks a lot like HDF5 updating the file format. I don't know how easy it would be to add a few property lists to your HDF5 reader and writer, but if you can add calls to H5Pset_all_coll_metadata_ops and H5Pset_coll_metadata_write I bet that would reduce the number of operations and might improve performance.
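
A sketch of that collective-metadata suggestion, applied to the file access property list used to open or create the file:

#include <hdf5.h>

void EnableCollectiveMetadata(hid_t fapl) {
  H5Pset_all_coll_metadata_ops(fapl, 1);  // collective metadata reads
  H5Pset_coll_metadata_write(fapl, 1);    // collective metadata writes
}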

I'd also like to see the output of some MPI-IO diagnostics: if you look at intro_mpi (https://cpe.ext.hpe.com/docs/mpt/mpich/intro_mpi.html#mpi-io-environment-variables) there are a lot of values we can play with but to start I'd like to see what a run with these environment variables set says:

MPICH_MPIIO_STATS=1
MPICH_MPIIO_TIMERS=1
MPICH_MPIIO_TIME_WAITS=1
MPICH_MPIIO_AGGREGATOR_PLACEMENT_DISPLAY=1
MPICH_MPIIO_HINTS_DISPLAY=1

this is going to be a fair bit of output. stash it somewhere on frontier if that's an easier way to get me the data.

@roblatham00 commented Oct 18, 2023

The darshan heatmap is not usually a great tool when lots of files are involved, but in this case the time is so dominated by the MPI-IO time that we see weirdness clearly:

MPI-IO heatmap:

(screenshot omitted)

POSIX heatmap:

(screenshot omitted)

I've seen something like this on Polaris, a Slingshot-10 system, and it is not supposed to behave like this on SS-11. I went back and forth with Cray for a few days before we figured out that the "on demand" connection establishing in SS-10 was taking "forever" (47 seconds in my case).

Despite the man page insisting "this is not beneficial on Slingshot-11", could you also try setting MPICH_OFI_STARTUP_CONNECT=1 in your environment?

@roblatham00 commented:

Wow, look at this POSIX heatmap from the second (108095) case (hdf5 chunking disabled, 256 OST, block size 8M, MPIIO collective writes enabled, cray_cb_write_lock_mode=2 cray_cb_nodes_multiplier=2: 536 second)

POSIX heatmap

(screenshot omitted)

it might be hard to read but I see almost no parallelism at the POSIX level.

Can you tell me a bit about how the code is writing HDF5? For example, how many datasets are you creating/writing? How are those datasets decomposed across the processes?

@pgrete (Collaborator Author) commented Nov 27, 2023

Following up on this thread: After I got sidetracked with proposal writing, I should have more cycles to look at this issue again in more detail.
Coincidentally, I am basically able to reproduce the issue on LUMI (same node architecture and also Lustre filesystem).

To answer your questions: The setup in question really only writes a single dataset with dimensions num_blocks x num_variables x num_cells_z x num_cells_y x num_cells_x.
A large-ish example would be 17088x10x128x128x128 i.e., 17088 blocks (which are spread over ranks) each containing 10 variables with 128^3 double precision floats.
Each process first copies the data from device memory to a contiguous buffer in host memory and writes that data with a single call to the hdf5 lib (the latter is where all the time is being spent).
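
A rough sketch of that single write call, assuming each rank owns a contiguous run of blocks in the (n_blocks, n_vars, nx3, nx2, nx1) dataset; names and offset handling are illustrative, not the actual Parthenon code:

#include <hdf5.h>

void WriteLocalBlocks(hid_t dset, hid_t collective_xfer, const double *host_buf,
                      hsize_t first_block, hsize_t n_local_blocks, hsize_t n_vars,
                      hsize_t nx3, hsize_t nx2, hsize_t nx1) {
  hsize_t start[5] = {first_block, 0, 0, 0, 0};
  hsize_t count[5] = {n_local_blocks, n_vars, nx3, nx2, nx1};

  hid_t filespace = H5Dget_space(dset);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
  hid_t memspace = H5Screate_simple(5, count, NULL);  // contiguous host buffer

  H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, collective_xfer, host_buf);

  H5Sclose(memspace);
  H5Sclose(filespace);
}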

I'll try to get some more detailed data with the output vars you mention above over the next days and also start experimenting with subfiling and/or other output formats.
For the project on Frontier we can live with the current output times at the scale we're running, but for the other project on LUMI (and future projects) this is a roadblock.

@roblatham00 commented:

Thanks for your persistence. I'm trying to determine if this is a defect in Cray-MPICH or if there are some tuning parameters we should adjust (the defaults are getting a bit long in the tooth)

@pgrete (Collaborator Author) commented Nov 28, 2023

Alright, I got some more data -- though at a smaller scale but with the same characteristics, i.e., 64 nodes with 512 ranks and each rank handling 8 blocks with 9 variables with 128^3 cells, e.g.,

   DATASET "cons" {
      DATATYPE  H5T_IEEE_F64LE
      DATASPACE  SIMPLE { ( 4096, 9, 128, 128, 128 ) / ( 4096, 9, 128, 128, 128 ) }
      DATA

Again I tested

  • default config (no additional env or filesystem vars)
  • setting stripe count to 64 and size to 8M
  • additionally disabling collective buffering

Also I tested using the default cray-hdf5-parallel and a custom compiled 1.14.3 (no noticeable performance difference at first sight).
Darshan files are located here:

-r--r--r-- 1 pgrete darshan 665K Nov 28 10:50 /lustre/orion/darshan/frontier/2023/11/28/pgrete_athenaPK_id1511253-90250_11-28-38768-10958730481004483118_5.darshan
-r--r--r-- 1 pgrete darshan 684K Nov 28 10:56 /lustre/orion/darshan/frontier/2023/11/28/pgrete_athenaPK_id1511253-90401_11-28-39336-9906740915630473226_1.darshan
-r--r--r-- 1 pgrete darshan 731K Nov 28 10:59 /lustre/orion/darshan/frontier/2023/11/28/pgrete_athenaPK_id1511253-90538_11-28-39534-295903521802282093_1.darshan
-r--r--r-- 1 pgrete darshan 663K Nov 28 11:14 /lustre/orion/darshan/frontier/2023/11/28/pgrete_athenaPK_id1511253-90734_11-28-40152-5254559177170393290_4.darshan
-r--r--r-- 1 pgrete darshan 680K Nov 28 11:22 /lustre/orion/darshan/frontier/2023/11/28/pgrete_athenaPK_id1511253-90900_11-28-40931-5846135772140955634_1.darshan
-r--r--r-- 1 pgrete darshan 730K Nov 28 11:24 /lustre/orion/darshan/frontier/2023/11/28/pgrete_athenaPK_id1511253-91028_11-28-41050-10408317184376518657_1.darshan

(the first three being the three variations above with cray-hdf5 and the last three the ones with hdf5 1.14.3).

And here are the outputs containing the additional logs from MPICH_MPIIO

ath.out.default.extra-logs.txt
ath.out.stripe64_buf8M.extra-logs.txt
ath.out.stripe64_buf8M_cbwriteNO.extra-logs.txt
ath.out.defaults-hdf1.14.3.extra-logs.txt
ath.out.stripe64_buf8M_cbwriteNO-hdf1.14.3.extra-logs.txt
ath.out.stripe64_buf8M-hdf1.14.3.extra-logs.txt

@roblatham00 commented:

I appreciate the extra logs but they did not contain any of the MPI-IO level logging I was expecting. Is there a separate log for error output?

@roblatham00 commented:

If you need a stripe count of 64 for some reason, then by all means disable collective I/O if you're getting good enough performance out of it. I do think a stripe count of 300 or more would get you better performance in all cases.
