WIP: Fix chunk size when compression is soft-disabled #905
base: develop
Conversation
One thing to note: these chunk sizes are all global, which means they're the wrong shape for, e.g., face-centered fields. When @lroberts36's new topological elements work is fully threaded through the I/O, we will need to make chunking per-field.
Back to WIP given the open questions around chunking. @brtnfld your input would be appreciated. The key HDF5 question is: what's the best chunk size based on experience? By default we set the chunking to {1,1,1,nx3,nx2,nx1}, which (IIRC) was originally motivated by being able to compress a full block. Another performance question is: what's the best (practice) set of parameters with respect to the interplay of
under the assumption that we typically write >1 GB per rank (given the current amounts of GPU memory available and that we use one rank per GPU).
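For reference, a minimal sketch of the per-block chunk layout described above, assuming a 6-D dataset shaped roughly as {n_blocks, n_vars, 1, nx3, nx2, nx1}; the names and shape are illustrative, not the actual Parthenon code:

```cpp
// Hypothetical sketch: each chunk covers one full block ({1,1,1,nx3,nx2,nx1}),
// so a whole block can be compressed as a unit. Shape and names are assumed.
#include <hdf5.h>
#include <array>

hid_t MakeBlockChunkedDataset(hid_t file_id, hsize_t n_blocks, hsize_t n_vars,
                              hsize_t nx3, hsize_t nx2, hsize_t nx1) {
  const std::array<hsize_t, 6> dims = {n_blocks, n_vars, 1, nx3, nx2, nx1};
  const std::array<hsize_t, 6> chunk = {1, 1, 1, nx3, nx2, nx1};

  hid_t space = H5Screate_simple(6, dims.data(), nullptr);
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 6, chunk.data());  // one chunk per block

  hid_t dset = H5Dcreate2(file_id, "data", H5T_NATIVE_DOUBLE, space,
                          H5P_DEFAULT, dcpl, H5P_DEFAULT);
  H5Pclose(dcpl);
  H5Sclose(space);
  return dset;
}
```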
Unfortunately, there is no set standard for an optimal chunk size. I suggest starting with what is convenient and tuning from there. I'm assuming you are using collective I/O? If so, the number of chunks should not matter for I/O to disk, because HDF5 will combine those chunk writes into a single MPI write, and the performance of that write could be improved by using more aggregators. Many chunks could hurt performance due to the metadata increase, so you can increase the metadata cache at the expense of some memory. You also might set your alignment (H5Pset_alignment) to a multiple of (or equal to) the Lustre stripe size, and the chunk size to a multiple of the stripe size, if possible.

For Frontier specifically, OLCF has said in the past that the file system is optimized for file-per-process, and we have found that the default progressive file layout was not well suited (diplomatically speaking) for a single shared file. I've been mainly setting the Lustre parameters that OLCF suggested for files of 512 GB+: lfs setstripe -c 8 -p capacity -S 16M

You should also investigate the new HDF5 subfiling feature, which allows you to use the node-local storage on Frontier. I've been using subfiling to node-local storage and the h5fuse.sh tool from the source tree to create a single HDF5 file on the global filesystem. We have yet to test parallel compression with subfiling, but we plan on doing that shortly; it should work with the most recent changes made to HDF5. If you have further issues, it would be good to get the Darshan logs.
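A minimal sketch of the alignment advice, assuming a 16 MiB Lustre stripe (to match the "lfs setstripe -S 16M" above); the threshold and values are illustrative, not tuned recommendations:

```cpp
// Hedged sketch: align HDF5 file allocations to the Lustre stripe size so
// large objects start on stripe boundaries. Values are assumptions.
#include <hdf5.h>
#include <mpi.h>

hid_t OpenAlignedFile(const char *name, MPI_Comm comm, MPI_Info info) {
  const hsize_t stripe = 16 * 1024 * 1024;  // 16 MiB, matching "-S 16M"
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, info);       // parallel (MPI-IO) access
  H5Pset_alignment(fapl, stripe, stripe);   // align allocations >= 16 MiB to stripe boundaries
  hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  H5Pclose(fapl);
  return file;
}
```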
Is this the best reference for the subfiling feature: https://github.com/HDFGroup/hdf5doc/blob/master/RFCs/HDF5_Library/VFD_Subfiling/user_guide/HDF5_Subfiling_VFD_User_s_Guide.pdf? Is it necessary to combine the files in order to read them, or should reading work as long as the reader also uses an HDF5 built with subfiling?
Thanks for the input. This is very helpful.
I wasn't aware that chunking won't have an impact (except for the metadata). That's good to know and removes one parameter that would need to be optimized.
I'll give this a try.
Interesting, that's the first machine I've come across for some time that recommends 'file-per-process'.
At the moment I'm less concerned about compression and more concerned about total IO performance, so this looks worth a try.
Are you using a custom compiled version (and if yes, do I need to take special care when compiling one myself)?
I attached a log file from a run with 256 processes (GPUs) writing data to a directory with 128 OSTs and a buffer size of 16M (no chunking or compression): pgrete_athenaPK_nochunk_id1377719-101627_7-13-12712-15700819957037524677_1.darshan.pdf
Yes.
No, you can read from the subfiles without combining them into an HDF5 file if you use the subfiling file driver.
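If it helps, here is a hedged sketch of what reading through the subfiling driver might look like; it assumes HDF5 >= 1.14 built with the subfiling VFD (and MPI initialized with MPI_THREAD_MULTIPLE), using the default configuration via a null config pointer:

```cpp
// Sketch only: open subfiled output directly through the subfiling VFD,
// without first merging the subfiles. Assumes a subfiling-enabled HDF5 build;
// the H5FDsubfiling.h include may be redundant depending on the build.
#include <hdf5.h>
#include <H5FDsubfiling.h>

hid_t OpenSubfiled(const char *name) {
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_subfiling(fapl, nullptr);  // nullptr -> default subfiling configuration
  hid_t file = H5Fopen(name, H5F_ACC_RDONLY, fapl);
  H5Pclose(fapl);
  return file;
}
```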
Is each rank writing a hyperslab section of nx3,nx2,nx1?
Nothing special needs to be done when building HDF5. For subfiling, you need to enable it at build time. BTW, I'm reviewing and answering HPC/subfiling questions at:
The documentation says you have to add
Right, you won't be able to use the installed modules. Looking at the Darshan report, many small writes will kill performance, and I'm not sure where they are coming from. When you install 1.14, can you get another Darshan report? It looks like you are doing independent I/O, so you could try collective and see if that is any better.
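For the "try collective" suggestion, a minimal sketch of requesting collective raw-data transfers on the write path; the dataset and dataspace handles are placeholders:

```cpp
// Sketch: switch the dataset write from independent to collective I/O via the
// transfer property list. Handles and buffer are illustrative placeholders.
#include <hdf5.h>

void WriteCollective(hid_t dset, hid_t memspace, hid_t filespace,
                     const double *buf) {
  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  // collective instead of independent
  H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
  H5Pclose(dxpl);
}
```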
@pgrete perhaps it is Params killing I/O performance, as those are attributes, which go down a different code path.
Attributes get written to the heap, so they should not show up as small writes. It might be that we had issues with collective metadata in 1.12.
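Related to the collective-metadata remark, a hedged sketch of the HDF5 >= 1.10 property-list calls that request collective metadata operations (whether they help here is an open question):

```cpp
// Sketch: ask HDF5 to perform metadata reads/writes collectively so individual
// ranks don't issue their own small metadata I/O. Applied to the file access
// property list before H5Fcreate/H5Fopen.
#include <hdf5.h>

void EnableCollectiveMetadata(hid_t fapl) {
  H5Pset_all_coll_metadata_ops(fapl, true);  // collective metadata reads
  H5Pset_coll_metadata_write(fapl, true);    // collective metadata writes
}
```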
A couple of additional notes: parthenon/src/outputs/parthenon_hdf5.hpp, line 144 in 94caccd
Note that the
Here's a more recent Darshan report from a larger job: darshan.pdf
One more data point: based on the assumption that there are too few ranks for too much data when using collective buffering, I disabled it via
@brtnfld what's your take on this? Does this line up with your expectations? Is there any harm in disabling collective buffering for writes? Does this behavior point at something else?
I would expect that disabling romio_cb_write would hurt performance, but there is no harm in doing so if it helps your I/O performance.
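For completeness, a sketch of how the romio_cb_write hint can be passed through the MPI-IO file access property list; the hint name comes from the discussion above, the rest is illustrative:

```cpp
// Sketch: disable write-side collective buffering via a ROMIO hint handed to
// the HDF5 MPI-IO driver. File name and communicator are placeholders.
#include <hdf5.h>
#include <mpi.h>

hid_t CreateWithHints(const char *name, MPI_Comm comm) {
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "romio_cb_write", "disable");  // turn off collective buffering for writes

  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, info);
  hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  H5Pclose(fapl);
  MPI_Info_free(&info);
  return file;
}
```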
Hey, @brtnfld tagged me on this discussion. I've also seen Cray's HDF5 produce oddly high numbers of small independent writes, despite me asking for (and confirming via property-list inspection) collective I/O. I'd like to take a closer look at your Darshan log: can you grant 'robl' permission to read it on Frontier? If you
Thanks for taking a closer look, @roblatham00.
Oh, now I see a bit more about why the Darshan logs have been hard to interpret and why so many tiny writes show up... there's some kind of compiler step going on? The XDMF files also have a strange access pattern, but that accounts for only a few seconds. I see 1600+ files opened collectively and a whole bunch of files opened individually, so there is a lot going on. If I look only at .phdf and .rhdf5 files, the accesses look more like what I'd expect to see. Now there is one strange thing: let's take a closer look at slow-mode
Some process spent 100 seconds in collective writes, but only 2 seconds of that time was writing. The remaining 99 seconds? That suggests either a high amount of load imbalance (some process entered the collective, say, 90 seconds after everyone else did) or the two-phase aggregation taking a lot longer than expected. With collective I/O disabled, there isn't a whole lot the MPI-IO layer does, and the Darshan timings reflect this:
OK, so now we're back to the original question: why are the collectives so slow? We're all new to progressive file layout, but these 4 TiB files should have some seriously large Lustre stripe counts. Scot's suggested stripe counts seem like a good start, but there are 1350 Lustre server processes running on 450 storage nodes (three Lustre servers per node). Now I've convinced myself we really do want to use those tuning parameters I mentioned in my last post after all, particularly
Thanks for the detailed analysis. It looks like HDF5 chunking is the key piece that kills performance for us. I was able to run a smaller interactive job (~500 nodes) and got the following results:
Any idea why chunking has such a dramatic impact on performance?
Chunked reads at scale are performance-broken in HDF5 (see HDFGroup/hdf5#2658 -- independent reads sneak in despite me asking HDF5 in every way I know how to behave collectively), but I don't know of a similar scaling problem for writes off the top of my head. Could you once again grant me permissions to read these darshan logs:
I noticed that reading the non-chunked data was faster than the chunked data, though the difference was not too dramatic/prohibitive.
Done. The files are in order with the numbers I reported above.
Some observations: these darshan logs don't have HDF5 or Lustre information but still tell a tale. There are very many files in the darshan log, but I'll focus on the biggest.
Case 1: baseline (hdf5 chunking enabled, 128 OST, block size 16M, MPIIO collective writes disabled): 253 seconds
I'm surprised this only took 200-some seconds.
Case 2: hdf5 chunking disabled, 128 OST, block size 16M, MPIIO collective writes disabled: 8.87 seconds
Case 3: "OK, I turned on collective I/O and got 3x worse performance" -- hdf5 chunking disabled, 128 OST, block size 16M, MPIIO collective writes enabled: 23.4 seconds (so collective buffering is slower than without)
I guess this is what I was saying over the summer, and confirmed by your 128 OST vs 256 OST experiment: just not enough aggregators lighting up the storage system. You have 500 nodes and 4272 processes, but this MPI-IO driver will only use 128 of them to talk to the file system. You don't want all 4272 processes blasting away and tripping over themselves, but you probably do want 1 or 2 per node. This is where the "cb_lock_mode" and "cb_nodes_multiplier" hints are supposed to help. Or crank up the stripe count to 1000. Have you run a "chunking on / collective on" configuration?
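A hedged sketch of passing the aggregator-related hints, using the Cray MPICH hint names and values that appear later in this thread (cray_cb_write_lock_mode=2, cray_cb_nodes_multiplier=2); treat the values as examples to experiment with, not recommendations:

```cpp
// Sketch: build an MPI_Info with the Cray collective-buffering hints discussed
// in this thread, to be passed to H5Pset_fapl_mpio. Values are examples only.
#include <mpi.h>

MPI_Info MakeCrayCbHints() {
  MPI_Info info;
  MPI_Info_create(&info);
  MPI_Info_set(info, "cray_cb_write_lock_mode", "2");   // lock mode used in the tests below
  MPI_Info_set(info, "cray_cb_nodes_multiplier", "2");  // more aggregator nodes per OST
  return info;
}
```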
Thanks for the detailed analysis. For the new tests (I now also collected the MPI-IO config as reported when writing the file):
The latter case is the identical setup to the last case of the previous set.
First off, sorry for ghosting for the last couple of weeks: I was getting a workshop presentation together. Let's start with "collective + chunking" -- the configuration that should be pretty fast: base-stripe128-bs16-cbwriteon/parthenon.restart.final.rhdf is the file of interest, right? The slowest writer took 1163.095682 seconds at the MPI-IO level and only 7.043302 seconds at the POSIX level, so something is indeed weird there. The Darshan log shows what we'd expect: a lot of MPI-IO requests getting transformed into nicer Lustre writes. The log doesn't confirm how many OSTs are involved, so I'll have to trust you that it really is 128. There are a lot of 512-byte independent reads at the MPI-IO level: that looks a lot like HDF5 updating the file format. I don't know how easy it would be to add a few property lists to your HDF5 reader and writer, but if you can, add calls to
I'd also like to see the output of some MPI-IO diagnostics: if you look at
This is going to be a fair bit of output; stash it somewhere on Frontier if that's an easier way to get me the data.
The darshan heatmap is not usually a great tool when lots of files are involved, but in this case the time is so dominated by the MPI-IO time that we see the weirdness clearly:
MPI-IO heatmap:
POSIX heatmap:
I've seen something like this on Polaris, a Slingshot-10 system, and it is not supposed to behave like this on SS-11. I went back and forth with Cray for a few days before we figured out that the "on demand" connection establishment in SS-10 was taking "forever" (47 seconds in my case). Despite the man page insisting "this is not beneficial on Slingshot-11", could you also try setting
Wow, look at this POSIX heatmap from the second (108095) case (hdf5 chunking disabled, 256 OST, block size 8M, MPIIO collective writes enabled, cray_cb_write_lock_mode=2, cray_cb_nodes_multiplier=2: 536 seconds).
POSIX heatmap:
It might be hard to read, but I see almost no parallelism at the POSIX level. Can you tell me a bit about how the code is writing HDF5? For example, how many datasets are you creating/writing? How are those datasets decomposed across the processes?
Following up on this thread: after getting sidetracked with proposal writing, I should now have more cycles to look at this issue again in more detail. To answer your questions: the setup in question really only writes a single dataset with dimensions num_blocks x num_variables x num_cells_z x num_cells_y x num_cells_x. I'll try to get some more detailed data with the output vars you mention above over the next days and also start experimenting with subfiling and/or other output formats.
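To make the decomposition concrete, a hypothetical sketch of each rank selecting the hyperslab for its contiguous range of blocks in that single 5-D dataset; all names are illustrative, not the actual Parthenon output code:

```cpp
// Sketch: one global dataset shaped (blocks, variables, z, y, x); each rank
// selects the hyperslab covering its own blocks before the (collective) write.
#include <hdf5.h>
#include <array>

void SelectMyBlocks(hid_t filespace, hsize_t first_block, hsize_t n_local_blocks,
                    hsize_t n_vars, hsize_t nz, hsize_t ny, hsize_t nx) {
  const std::array<hsize_t, 5> start = {first_block, 0, 0, 0, 0};
  const std::array<hsize_t, 5> count = {n_local_blocks, n_vars, nz, ny, nx};
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start.data(), nullptr,
                      count.data(), nullptr);
}
```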
Thanks for your persistence. I'm trying to determine if this is a defect in Cray MPICH or if there are some tuning parameters we should adjust (the defaults are getting a bit long in the tooth).
Alright, I got some more data -- though at smaller scale but with the same characteristics, i.e., 64 nodes with 512 ranks and each rank handling 8 blocks with 9 variables of 128^3 cells, e.g.,
Again I tested
Also I tested using the default cray-hdf5-parallel and a custom-compiled 1.14.3 (no noticeable performance difference at first sight).
(The first three being the three variations above with cray-hdf5 and the last three the ones with hdf5 1.14.3.) And here are the outputs containing the additional logs from ath.out.default.extra-logs.txt
I appreciate the extra logs, but they did not contain any of the MPI-IO level logging I was expecting. Is there a separate log for error output?
If you need a stripe count of 64 for some reason, then by all means disable collective I/O if you're getting good enough performance out of it. I do think a stripe count of 300 or more would get you better performance in all cases.
PR Summary
In #899 I disabled the deflate filter when compression was soft-disabled.
I missed that setting the chunk size was also tied to this logic, so soft-disabled compression resulted in a chunk size of {1,1,1,1,1,1}, which tanked I/O write performance.
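A minimal sketch of the intended behavior (not the actual Parthenon code): keep the per-block chunk layout regardless of the compression setting, and only attach the deflate filter when compression is actually enabled.

```cpp
// Hypothetical sketch of the fix's intent: chunk dimensions are chosen
// independently of compression, so soft-disabling compression no longer
// collapses the chunk size to {1,1,1,1,1,1}.
#include <hdf5.h>

hid_t MakeDcpl(bool use_compression, unsigned level, int ndim,
               const hsize_t *chunk_dims) {
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, ndim, chunk_dims);  // always chunk by block
  if (use_compression) {
    H5Pset_deflate(dcpl, level);         // deflate only when compression is on
  }
  return dcpl;
}
```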
PR Checklist