-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Coalesced Buffer Communication #1192
base: develop
Are you sure you want to change the base?
Coalesced Buffer Communication #1192
Conversation
TODO:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These small comments are probably just a distraction right now but figured I would record them. Nothing stands out to me as an issue right now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG(reat)TM! I only had some small queries, I didn't detect any issues in the logic.
do_coalesced_comms{ | ||
pin->GetOrAddBoolean("parthenon/mesh", "do_coalesced_comms", true)} { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think default to true is reasonable, we already have some evidence that your solution produces improved performance at least outside of AthenaPK-style workflows
The macOS CI failed with this error:
Edit: nvm, it fails here too: https://github.com/parthenon-hpc-lab/parthenon/actions/runs/12014756878/job/33491277357?pr=1192 |
Co-authored-by: Ben Ryan <brryan@lanl.gov>
Co-authored-by: Ben Ryan <brryan@lanl.gov>
@BenWibking: Yeah, I think this is a problem with |
I'll take a look at this, I don't understand how |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really impressive that such a big feature is encapsulated in only a 1200 line diff. Nice work. Really excited to try this out. Testing it in riot on a few cores on my laptop now.
.. code:: | ||
|
||
parthenon/mesh/do_coalesced_comms = true | ||
|
||
curently by default this is set to ``true``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we think this works for all downstreams including kharma, artemis and riot I am in favor of the default being true. If there's some doubt, we should maybe change the default to false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I tend to lean toward default false
until there is more downstream testing. To make sure it is passing regression tests, it needs to be set to true
for now though (or we would have to change all the parameter input). There is some discussion of this above though where @brryan suggested we keep true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am fine with it being default true
. But I would also be fine modifying all the tests to set it to true manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in principle also happy with default true (assuming that all downstream codes work/perform as expected as others already noted).
- Currently, there is a ``Compare`` method in ``CoalescedBuffer`` that is just for | ||
debugging. It should compare the received coalesced messages to the variable-boundary buffer | ||
messages, but using it requires some hacks in the code to send both types of buffers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the hacks? Might be worth saying what to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I removed CoalescedBuffer::Compare
at some point, so this note isn't so useful (and the hacks are a bit hard to quickly describe here). As a result, I just removed this point from the doc.
struct uid_set_hash { | ||
std::size_t operator()(const std::set<Uid_t> &in) const { | ||
std::size_t lhs{0}; | ||
for (const auto &uid : in) { | ||
std::size_t rhs = std::hash<Uid_t>()(uid); | ||
lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2); | ||
} | ||
return lhs; | ||
} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is like the third version of hash
we've implemented lol. Is there any way to share some code between hashers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, maybe "Unify hash functions" can be added as a good first PR issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to admit I did not understand everything going on in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do appreciate all the comments in this file (in combination with the doc above), but I also have to second that I'll trust the regression (and downstream) testing that the logic in here works as planned.
Building on clang I get these warnings which would be nice to remove by being careful about print statements [ 55%] Building CXX object parthenon/src/CMakeFiles/parthenon.dir/bvals/bvals.cpp.o
/home/jonahm/programming/riot/external/parthenon/src/bvals/comms/bnd_id.cpp: In member function ‘void parthenon::BndId::PrintInfo(const string&)’:
/home/jonahm/programming/riot/external/parthenon/src/bvals/comms/bnd_id.cpp:63:12: warning: format ‘%i’ expects argument of type ‘int’, but argument 8 has type ‘size_t’ {aka ‘long unsigned int’} [-Wformat=]
63 | "%i, buffer size = %i, buf_allocated = %i) [rank = %i]\n",
| ~^
| |
| int
| %li
64 | start.c_str(), Variable<Real>::GetLabel(var_id()).c_str(), send_gid(),
65 | recv_gid(), start_idx(), size(), coalesced_buf.size(), buf.size(), buf_allocated,
| ~~~~~~~~~~~~~~~~~~~~
| |
| size_t {aka long unsigned int}
/home/jonahm/programming/riot/external/parthenon/src/bvals/comms/bnd_id.cpp:63:30: warning: format ‘%i’ expects argument of type ‘int’, but argument 9 has type ‘size_t’ {aka ‘long unsigned int’} [-Wformat=]
63 | "%i, buffer size = %i, buf_allocated = %i) [rank = %i]\n",
| ~^
| |
| int
| %li
64 | start.c_str(), Variable<Real>::GetLabel(var_id()).c_str(), send_gid(),
65 | recv_gid(), start_idx(), size(), coalesced_buf.size(), buf.size(), buf_allocated,
| ~~~~~~~~~~
| |
| size_t {aka long unsigned int}
Should be easy enough to just do what the compiler says and change |
So naively running it in riot on the triple problem, coalesced comms hangs forever with sparsity on. Thoughts @lroberts36 ? If that's not an easy fix, I'm ok merging this but we need to change the default to |
Looks like it still hangs in riot if I turn off sparsity. Suggesting it maybe has to do with how riot uses mesh data? |
@Yurlungur: hm, let me take a look at it to see if there is some easy fix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good and I think I conceptually followed what's going on.
I got some additional comments (and the view of view alloc might need to be addressed).
I also plan to do some downstream testing this week (ideally tomorrow knowing your schedule) and then approve if things work as expected.
.. code:: | ||
|
||
parthenon/mesh/do_coalesced_comms = true | ||
|
||
curently by default this is set to ``true``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in principle also happy with default true (assuming that all downstream codes work/perform as expected as others already noted).
same_to_same = pmb->gid == nb.gid && nb.offsets.IsCell(); | ||
lcoord_trans = nb.lcoord_trans; | ||
if (!allocated) return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can/should we go past this point now?
Was this related to the bug with the buffer size being 0 on first pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previously, I just bailed here because there was no point in doing the extra index range calculations. As you note, this is what caused the buffer size 0 on the first pass bug. Removing it shouldn't impact any behavior in pre-existing code and I doubt it had any noticeable performance impact.
auto &bids = GetBndIdsOnDevice(vars, &comb_size); | ||
Kokkos::parallel_for( | ||
PARTHENON_AUTO_LABEL, | ||
Kokkos::TeamPolicy<>(parthenon::DevExecSpace(), bids.size(), Kokkos::AUTO), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bids
is potentially a low number (lower than the number of compute unites per device), isn't it?
Just so that we keep this in mind wrt the kernels' efficiency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this could definitely be somewhere we don't produce enough teams.
// Unpack into per combined buffer information | ||
int idx{nglobal}; | ||
|
||
for (int p = 0; p < npartitions; ++p) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this piece of code but a more general question/comment. Do we still support arbitary partitions during runtime (so except for the "all blocks" and the "default split based on pack_size
")?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR only allows for coalesced communication on the default partition (i.e. the one determined by pack_size
). That is because there is a single coalesced message (plus a message for sparse info) for each MeshData
. The contents of these messages need to be determined and sent to neighbor ranks during remeshing and we only do this for the default partitions. I am guessing we could add communication information for other partitions if we really wanted to at the same time, but there are probably some subtle gotchas in there.
Of course, I don't think this prevents you from setting up a different partitioning that you don't communicate on. Also, I think the issues with this PR we are seeing in Riot are related to an issue with the "all blocks" partition (i.e. it doesn't get a partition id assigned). It is probably necessary to generalize partition ids to include the all blocks partition (as well as other possible partitions maybe).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do appreciate all the comments in this file (in combination with the doc above), but I also have to second that I'll trust the regression (and downstream) testing that the logic in here works as planned.
I now got the chance to do some downstream testing. Also looks like there's a small performance penalty using the coalesced buffers even for many blocks per device. |
Based on this, and based on the fact that even after modifying riot to look more like the examples, I can't get it to cycle with coalesced comms on, I think we should change the default to disabling coalesced comms. |
PR Summary
Coalesced buffer communication, see the included docs for a description. This came from the combined buffer communication work we did for the TACC Hackathon.
PR Checklist