parthenon-hpc-lab · lroberts36 · Oct 17, 2024 · Oct 17, 2024 · Oct 17, 2024 · Oct 18, 2024
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,7 @@
 ## Current develop
 
 ### Added (new features/APIs/variables/...)
+- [[PR 1192]](https://github.com/parthenon-hpc-lab/parthenon/pull/1192) Coalesced buffer communication
 - [[PR 1103]](https://github.com/parthenon-hpc-lab/parthenon/pull/1103) Add sparsity to vector wave equation test
 - [[PR 1185]](https://github.com/parthenon-hpc-lab/parthenon/pull/1185) Bugfix to particle defragmentation
 - [[PR 1184]](https://github.com/parthenon-hpc-lab/parthenon/pull/1184) Fix swarm block neighbor indexing in 1D, 2D

diff --git a/doc/sphinx/src/boundary_communication.rst b/doc/sphinx/src/boundary_communication.rst
@@ -476,3 +476,102 @@ For backwards compatibility, we keep the aliases
 - ``ReceiveFluxCorrections`` = ``ReceiveBoundBufs<BoundaryType::flxcor_recv>`` 
 - ``SetFluxCorrections`` = ``SetBoundBufs<BoundaryType::flxcor_recv>``
 
+Coalesced MPI Communication
+---------------------------
+
+As is described above, a one-dimensional buffer is packed and unpacked for each communicated 
+field on each pair of blocks that share a unique topological element (below we refer to this
+as a variable-boundary buffer). For codes with larger numbers of variables and/or in 
+simulations run with smaller block sizes, this can result in a large total number of buffers
+and importantly a large number of buffers that need to be communicated across MPI ranks. The
+latter fact can have significant performance implications, as each ``CommBuffer<T>::Send()``
+call for these non-local buffers corresponds to an ``MPI_Isend``. Generally, these messages
+contain a small amount of data, which results in a low effective MPI bandwidth. Additionally,
+MPI implementations seem to have a hard time dealing with the large number of messages
+required. In some cases, this can result in poor scaling behavior for Parthenon. 
+
+To get around this, we introduce a second level of buffers for communicating across ranks.
+For each ``MeshData`` object on a given MPI rank, coalesced buffers equal in size to all
+MPI non-local variable-boundary buffers are created for each other MPI rank that ``MeshData``
+communicates to. These coalesced buffers are then filled from the single variable-boundary
+buffers, a *single* MPI send is called per MPI rank pair, and the receiving ranks unpack the 
+coalesced buffer into the single variable-boundary buffers. This can drastically reduce the 
+number of MPI sends and increase the total amount of data sent per message, thereby
+increasing the effective bandwidth. Further, in cases where Parthenon is running on GPUs but
+GPUDirect MPI is not available, this can also minimize the number of DtoH and HtoD copies
+during communication. 
+
+To use coalesced communication, your input must include: 
+
+.. code::
+
+   parthenon/mesh/do_coalesced_comms = true
+
+Currently, this is set to ``true`` by default.
+
+Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~
+
+The coalesced send and receive buffers for each rank are stored in ``Mesh::pcoalesced_comms``,
+which is a ``std::shared_ptr`` to a ``CoalescedComms`` object. To do coalesced communication 
+two pieces are required: 1) an initialization step telling all ranks what coalesced buffer
+messages they can expect and 2) a mechanism for packing, sending and unpacking the coalesced 
+buffers during each boundary communication step.
+
+For the first piece, after every remesh during ``BuildBoundaryBuffers``, each non-local
+variable-boundary buffer is registered with ``pcoalesced_comms``. Once all these buffers are
+registered, ``CoalescedComms::ResolveAndSendSendBuffers()`` is called, which determines all
+the coalesced buffers that are going to be sent from a given rank to every other rank, packs
+information about each of the coalesced buffers into MPI messages, and sends them to the other
+ranks so that the receiving ranks know how to interpret the messages they receive from a given
+rank. ``CoalescedComms::ReceiveBufferInfo()`` is then called to receive this information from
+other ranks. This process basically just packs ``BndId`` objects, which contain the information
+necessary to identify a variable-boundary communication channel and the amount of data that 
+is communicated across that channel, and then unpacks them on the receiving end and finds the
+correct variable-boundary buffers. These routines are called once per rank (rather than per
+``MeshData``). 
+
+For the second piece, variable-boundary buffers are first filled as normal in ``SendBoundBufs``,
+but the states of the ``CommBuffer`` objects are updated without actually calling the associated
+``MPI_Isend``. Then ``CoalescedComms::PackAndSend(MeshData<Real> *pmd, BoundaryType b_type)``
+is called, which for each rank pair associated with ``pmd`` packs the variable-boundary buffers
+into the coalesced buffer, packs a second message containing the sparse allocation status of
+each variable-boundary buffer, sends these two messages, and then marks the associated
+variable-boundary buffers as stale since their data is no longer required. On the receiving side,
+``ReceiveBoundBufs`` receives these messages, sets the corresponding variable-boundary 
+buffers to the correct ``received`` or ``received_null`` state, and then unpacks the data
+into the buffers. Note that the messages received here do not necessarily correspond to the
+``MeshData`` that is passed to the associated ``ReceiveBoundBufs`` call, so all
+variable-boundary buffers associated with a given receiving ``MeshData`` must still be checked
+for being in a received state. Once they are all in a received state, setting of boundaries,
+prolongation, etc. can proceed normally. 
+
+Some notes:
+
+- Internally, ``CoalescedComms`` contains maps from MPI rank and ``BoundaryType`` (e.g. regular
+  communication, flux correction) to ``CoalescedBuffersRank`` objects for sending and receiving
+  rank pairs. Each ``CoalescedBuffersRank`` object in turn maps the partition id of the sending
+  ``MeshData`` (which also doubles as the MPI tag for the messages) to a ``CoalescedBuffer`` object.
+- ``CoalescedBuffersRank`` is where the post-remesh initialization routines are actually
+  implemented. This can either correspond to the send or receive side.
+- ``CoalescedBuffer`` corresponds to each coalesced buffer and is where 
+  the packing, sending, receiving, and unpacking details for coalesced boundary communication 
+  are implemented. This object internally owns the ``CommunicationBuffer<BufArray1D<Real>>``
+  that is used for sending and receiving the coalesced data (as well as the communication buffer
+  used for communicating allocation status).
+- Because Parthenon allows communication on ``MeshData`` objects that contain a subset of the 
+  ``MetaData::FillGhost`` fields in a simulation, we need to be able to interpret coalesced
+  messages that contain a subset of fields. Most of what is needed for this is implemented
+  in ``GetBndIdsOnDevice``.
+- Currently, there is a ``Compare`` method in ``CoalescedBuffer`` that is just for 
+  debugging. It should compare the received coalesced messages to the variable-boundary buffer 
+  messages, but using it requires some hacks in the code to send both types of buffers.
+- The coalesced buffers are sparse aware and approximately allocate the amount of space required
+  to store the *allocated* fields. This means the size of the buffers can change dynamically 
+  between steps. Currently, we allocate twice as much memory as is required to store the allocated
+  variable-boundary buffers whenever their total size becomes larger than the current size of the
+  coalesced buffer in an attempt to balance the number of allocations and memory consumption. Since
+  the receiving end does not *a priori* know the size of the coalesced messages it is going to
+  receive, we first check the size of the incoming MPI message, reallocate the coalesced receive
+  buffer if necessary, and then actually post the ``MPI_Irecv``. Note that this prevents
+  pre-posting the ``MPI_Irecv``.
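
The packing scheme and growth policy described in the documentation above can be sketched in a few lines of stand-alone C++. This is only an illustration under stated assumptions, not the Parthenon implementation: it uses no MPI and no Kokkos, `std::vector<double>` stands in for the device buffers, and the names `Channel`, `CoalescedSketch`, `GrownCapacity`, `Pack`, and `Unpack` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one variable-boundary buffer. In Parthenon the
// offset and size would correspond to BndId::start_idx() and BndId::size().
struct Channel {
  std::size_t start_idx;    // offset of this channel inside the coalesced buffer
  std::vector<double> data; // payload of the small per-channel buffer
};

// Growth policy from the text: when the required size exceeds the current
// capacity, allocate twice the required size; otherwise keep the allocation.
constexpr std::size_t GrownCapacity(std::size_t current, std::size_t required) {
  return required > current ? 2 * required : current;
}

struct CoalescedSketch {
  std::vector<double> buf; // storage; buf.size() plays the role of capacity
  std::size_t used = 0;    // number of entries that would actually be sent

  // Assign contiguous offsets and copy every channel into the single buffer,
  // so that one message per rank pair can replace one message per channel.
  void Pack(std::vector<Channel> &channels) {
    std::size_t required = 0;
    for (auto &c : channels) {
      c.start_idx = required;
      required += c.data.size();
    }
    buf.resize(GrownCapacity(buf.size(), required));
    for (const auto &c : channels)
      for (std::size_t i = 0; i < c.data.size(); ++i)
        buf[c.start_idx + i] = c.data[i];
    used = required;
  }

  // Receiving side: copy each slice back into its per-channel buffer. The
  // receiver must already know every start_idx and size, which is the role
  // of the post-remesh exchange of BndId objects.
  void Unpack(std::vector<Channel> &channels) const {
    for (auto &c : channels) {
      assert(c.start_idx + c.data.size() <= used);
      for (std::size_t i = 0; i < c.data.size(); ++i)
        c.data[i] = buf[c.start_idx + i];
    }
  }
};
```

In this sketch, three channels of sizes 2, 1, and 3 end up at offsets 0, 2, and 3 of a buffer with capacity 12 (twice the 6 entries in use), and a later step that needs 12 entries or fewer reuses the same allocation.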
diff --git a/example/fine_advection/advection_driver.cpp b/example/fine_advection/advection_driver.cpp
@@ -95,9 +95,6 @@ TaskCollection AdvectionDriver::MakeTaskCollection(BlockList_t &blocks, const in
     auto &mc1 = pmesh->mesh_data.Add(stage_name[stage], mbase);
     auto &mdudt = pmesh->mesh_data.Add("dUdt", mbase);
 
-    auto start_send = tl.AddTask(none, parthenon::StartReceiveBoundaryBuffers, mc1);
-    auto start_flxcor = tl.AddTask(none, parthenon::StartReceiveFluxCorrections, mc0);
-
     // Make a sparse variable pack descriptors that can be used to build packs
     // including some subset of the fields in this example. This will be passed
     // to the Stokes update routines, so that they can internally create variable
@@ -146,9 +143,8 @@ TaskCollection AdvectionDriver::MakeTaskCollection(BlockList_t &blocks, const in
       }
     }
 
-    auto set_flx = parthenon::AddFluxCorrectionTasks(
-        start_flxcor | flx | flx_fine | vf_dep, tl, mc0, pmesh->multilevel);
-
+    auto set_flx = parthenon::AddFluxCorrectionTasks(flx | flx_fine | vf_dep, tl, mc0,
+                                                     pmesh->multilevel);
     auto update = set_flx;
     if (do_regular_advection) {
       update = AddUpdateTasks(set_flx, tl, parthenon::CellLevel::same, TT::Cell, beta, dt,
@@ -170,7 +166,7 @@ TaskCollection AdvectionDriver::MakeTaskCollection(BlockList_t &blocks, const in
     }
 
     auto boundaries = parthenon::AddBoundaryExchangeTasks(
-        update | update_vec | update_fine | start_send, tl, mc1, pmesh->multilevel);
+        update | update_vec | update_fine, tl, mc1, pmesh->multilevel);
 
     auto fill_derived =
         tl.AddTask(boundaries, parthenon::Update::FillDerived<MeshData<Real>>, mc1.get());

diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
@@ -101,9 +101,13 @@ add_library(parthenon
   bvals/comms/bvals_in_one.hpp
   bvals/comms/bvals_utils.hpp
   bvals/comms/build_boundary_buffers.cpp
+  bvals/comms/bnd_id.cpp
+  bvals/comms/bnd_id.hpp
   bvals/comms/bnd_info.cpp
   bvals/comms/bnd_info.hpp
   bvals/comms/boundary_communication.cpp
+  bvals/comms/coalesced_buffers.cpp
+  bvals/comms/coalesced_buffers.hpp
   bvals/comms/tag_map.cpp
   bvals/comms/tag_map.hpp
 

diff --git a/src/basic_types.hpp b/src/basic_types.hpp
@@ -77,6 +77,36 @@ enum class BoundaryType : int {
   gmg_prolongate_recv
 };
 
+inline constexpr bool IsSender(BoundaryType btype) {
+  if (btype == BoundaryType::flxcor_recv) return false;
+  if (btype == BoundaryType::gmg_restrict_recv) return false;
+  if (btype == BoundaryType::gmg_prolongate_recv) return false;
+  return true;
+}
+
+inline constexpr bool IsReceiver(BoundaryType btype) {
+  if (btype == BoundaryType::flxcor_send) return false;
+  if (btype == BoundaryType::gmg_restrict_send) return false;
+  if (btype == BoundaryType::gmg_prolongate_send) return false;
+  return true;
+}
+
+inline constexpr BoundaryType GetAssociatedReceiver(BoundaryType btype) {
+  if (btype == BoundaryType::flxcor_send) return BoundaryType::flxcor_recv;
+  if (btype == BoundaryType::gmg_restrict_send) return BoundaryType::gmg_restrict_recv;
+  if (btype == BoundaryType::gmg_prolongate_send)
+    return BoundaryType::gmg_prolongate_recv;
+  return btype;
+}
+
+inline constexpr BoundaryType GetAssociatedSender(BoundaryType btype) {
+  if (btype == BoundaryType::flxcor_recv) return BoundaryType::flxcor_send;
+  if (btype == BoundaryType::gmg_restrict_recv) return BoundaryType::gmg_restrict_send;
+  if (btype == BoundaryType::gmg_prolongate_recv)
+    return BoundaryType::gmg_prolongate_send;
+  return btype;
+}
+
 enum class GridType : int { none, leaf, two_level_composite, single_level_with_internal };
 struct GridIdentifier {
   GridType type = GridType::none;
@@ -102,20 +132,6 @@ inline bool operator<(const GridIdentifier &lhs, const GridIdentifier &rhs) {
   return lhs.logical_level < rhs.logical_level;
 }
 
-constexpr bool IsSender(BoundaryType btype) {
-  if (btype == BoundaryType::flxcor_recv) return false;
-  if (btype == BoundaryType::gmg_restrict_recv) return false;
-  if (btype == BoundaryType::gmg_prolongate_recv) return false;
-  return true;
-}
-
-constexpr bool IsReceiver(BoundaryType btype) {
-  if (btype == BoundaryType::flxcor_send) return false;
-  if (btype == BoundaryType::gmg_restrict_send) return false;
-  if (btype == BoundaryType::gmg_prolongate_send) return false;
-  return true;
-}
-
 // Enumeration for accessing a field on different locations of the grid:
 // CC = cell center of (i, j, k)
 // F1 = x-normal face at (i - 1/2, j, k)
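
A short sketch of how the sender/receiver helpers added in this hunk behave, written so that it compiles on its own as C++17. The enum is an abridged stand-in: the full `BoundaryType` enumerator list is not shown in this diff, so the `any` enumerator and the ordering used here are assumptions. The helper bodies are copied from the hunk above.

```cpp
// Abridged stand-in for parthenon::BoundaryType; the enumerator list and its
// order are assumptions made so that this example is self-contained.
enum class BoundaryType : int {
  any,
  flxcor_send,
  flxcor_recv,
  gmg_restrict_send,
  gmg_restrict_recv,
  gmg_prolongate_send,
  gmg_prolongate_recv
};

// Helper bodies as they appear in the diff.
inline constexpr bool IsSender(BoundaryType btype) {
  if (btype == BoundaryType::flxcor_recv) return false;
  if (btype == BoundaryType::gmg_restrict_recv) return false;
  if (btype == BoundaryType::gmg_prolongate_recv) return false;
  return true;
}

inline constexpr bool IsReceiver(BoundaryType btype) {
  if (btype == BoundaryType::flxcor_send) return false;
  if (btype == BoundaryType::gmg_restrict_send) return false;
  if (btype == BoundaryType::gmg_prolongate_send) return false;
  return true;
}

inline constexpr BoundaryType GetAssociatedReceiver(BoundaryType btype) {
  if (btype == BoundaryType::flxcor_send) return BoundaryType::flxcor_recv;
  if (btype == BoundaryType::gmg_restrict_send) return BoundaryType::gmg_restrict_recv;
  if (btype == BoundaryType::gmg_prolongate_send)
    return BoundaryType::gmg_prolongate_recv;
  return btype;
}

inline constexpr BoundaryType GetAssociatedSender(BoundaryType btype) {
  if (btype == BoundaryType::flxcor_recv) return BoundaryType::flxcor_send;
  if (btype == BoundaryType::gmg_restrict_recv) return BoundaryType::gmg_restrict_send;
  if (btype == BoundaryType::gmg_prolongate_recv)
    return BoundaryType::gmg_prolongate_send;
  return btype;
}

// Types without a separate send/recv variant count as both sender and
// receiver and map to themselves.
static_assert(IsSender(BoundaryType::any) && IsReceiver(BoundaryType::any));
static_assert(GetAssociatedReceiver(BoundaryType::any) == BoundaryType::any);

// Split types form send/recv pairs, and the two mappings invert each other.
static_assert(!IsSender(BoundaryType::flxcor_recv));
static_assert(GetAssociatedReceiver(BoundaryType::flxcor_send) ==
              BoundaryType::flxcor_recv);
static_assert(GetAssociatedSender(GetAssociatedReceiver(
                  BoundaryType::gmg_restrict_send)) == BoundaryType::gmg_restrict_send);
```

Because all four helpers are `constexpr`, the pairing can be checked at compile time, as the `static_assert` lines do here.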

diff --git a/src/bvals/comms/bnd_id.cpp b/src/bvals/comms/bnd_id.cpp
@@ -0,0 +1,69 @@
+//========================================================================================
+// Parthenon performance portable AMR framework
+// Copyright(C) 2024 The Parthenon collaboration
+// Licensed under the 3-clause BSD License, see LICENSE file for details
+//========================================================================================
+// (C) (or copyright) 2020-2024. Triad National Security, LLC. All rights reserved.
+//
+// This program was produced under U.S. Government contract 89233218CNA000001 for Los
+// Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC
+// for the U.S. Department of Energy/National Nuclear Security Administration. All rights
+// in the program are reserved by Triad National Security, LLC, and the U.S. Department
+// of Energy/National Nuclear Security Administration. The Government is granted for
+// itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide
+// license in this material to reproduce, prepare derivative works, distribute copies to
+// the public, perform publicly and display publicly, and to permit others to do so.
+//========================================================================================
+
+#include <algorithm>
+#include <cstdio>
+#include <iostream> // debug
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "basic_types.hpp"
+#include "bvals/comms/bnd_id.hpp"
+#include "bvals/comms/bvals_utils.hpp"
+#include "bvals/neighbor_block.hpp"
+#include "config.hpp"
+#include "globals.hpp"
+#include "interface/state_descriptor.hpp"
+#include "interface/variable.hpp"
+#include "kokkos_abstraction.hpp"
+#include "mesh/domain.hpp"
+#include "mesh/mesh.hpp"
+#include "mesh/mesh_refinement.hpp"
+#include "mesh/meshblock.hpp"
+#include "prolong_restrict/prolong_restrict.hpp"
+#include "utils/error_checking.hpp"
+
+namespace parthenon {
+
+BndId BndId::GetSend(MeshBlock *pmb, const NeighborBlock &nb,
+                     std::shared_ptr<Variable<Real>> v, BoundaryType b_type,
+                     int partition, int start_idx) {
+  auto [send_gid, recv_gid, vlabel, loc, extra_id] = SendKey(pmb, nb, v, b_type);
+  BndId out;
+  out.send_gid() = send_gid;
+  out.recv_gid() = recv_gid;
+  out.loc_idx() = loc;
+  out.var_id() = v->GetUniqueID();
+  out.extra_id() = extra_id;
+  out.rank_send() = Globals::my_rank;
+  out.rank_recv() = nb.rank;
+  out.partition() = partition;
+  out.size() = BndInfo::GetSendBndInfo(pmb, nb, v, nullptr).size();
+  out.start_idx() = start_idx;
+  return out;
+}
+
+void BndId::PrintInfo(const std::string &start) {
+  printf("%s var %s (%i -> %i) starting at %i with size %i (Total combined buffer size = "
+         "%i, buffer size = %i, buf_allocated = %i) [rank = %i]\n",
+         start.c_str(), Variable<Real>::GetLabel(var_id()).c_str(), send_gid(),
+         recv_gid(), start_idx(), size(), static_cast<int>(coalesced_buf.size()),
+         static_cast<int>(buf.size()), buf_allocated, Globals::my_rank);
+}
+
+} // namespace parthenon
diff --git a/src/bvals/comms/bnd_id.hpp b/src/bvals/comms/bnd_id.hpp
@@ -0,0 +1,111 @@
+//========================================================================================
+// Parthenon performance portable AMR framework
+// Copyright(C) 2024 The Parthenon collaboration
+// Licensed under the 3-clause BSD License, see LICENSE file for details
+//========================================================================================
+// (C) (or copyright) 2020-2024. Triad National Security, LLC. All rights reserved.
+//
+// This program was produced under U.S. Government contract 89233218CNA000001 for Los
+// Alamos National Laboratory (LANL), which is operated by Triad National Security, LLC
+// for the U.S. Department of Energy/National Nuclear Security Administration. All rights
+// in the program are reserved by Triad National Security, LLC, and the U.S. Department
+// of Energy/National Nuclear Security Administration. The Government is granted for
+// itself and others acting on its behalf a nonexclusive, paid-up, irrevocable worldwide
+// license in this material to reproduce, prepare derivative works, distribute copies to
+// the public, perform publicly and display publicly, and to permit others to do so.
+//========================================================================================
+
+#ifndef BVALS_COMMS_BND_ID_HPP_
+#define BVALS_COMMS_BND_ID_HPP_
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "basic_types.hpp"
+#include "bvals/neighbor_block.hpp"
+#include "coordinates/coordinates.hpp"
+#include "interface/variable_state.hpp"
+#include "mesh/domain.hpp"
+#include "mesh/forest/logical_coordinate_transformation.hpp"
+#include "utils/communication_buffer.hpp"
+#include "utils/indexer.hpp"
+#include "utils/object_pool.hpp"
+
+namespace parthenon {
+
+template <typename T>
+class Variable;
+
+// Provides the information necessary for identifying a unique variable-boundary
+// buffer, identifying the coalesced buffer it is associated with, and its
+// position within the coalesced buffer.
+struct BndId {
+  constexpr static std::size_t NDAT = 10;
+  int data[NDAT];
+
+  // Information for identifying the buffer with a communication
+  // channel, variable, and the ranks it is communicated across
+  KOKKOS_FORCEINLINE_FUNCTION
+  int &send_gid() { return data[0]; }
+  KOKKOS_FORCEINLINE_FUNCTION
+  int &recv_gid() { return data[1]; }
+  KOKKOS_FORCEINLINE_FUNCTION
+  int &loc_idx() { return data[2]; }
+  KOKKOS_FORCEINLINE_FUNCTION
+  int &var_id() { return data[3]; }
+  KOKKOS_FORCEINLINE_FUNCTION
+  int &extra_id() { return data[4]; }
+  KOKKOS_FORCEINLINE_FUNCTION
+  int &rank_send() { return data[5]; }
+  KOKKOS_FORCEINLINE_FUNCTION
+  int &rank_recv() { return data[6]; }
+  BoundaryType bound_type;
+
+  // MeshData partition id of the *sender*
+  // not set by constructors and only necessary for coalesced comms
+  KOKKOS_FORCEINLINE_FUNCTION
+  int &partition() { return data[7]; }
+  KOKKOS_FORCEINLINE_FUNCTION
+  int &size() { return data[8]; }
+  KOKKOS_FORCEINLINE_FUNCTION
+  int &start_idx() { return data[9]; }
+
+  bool buf_allocated;
+  buf_pool_t<Real>::weak_t buf;   // comm buffer from pool
+  BufArray1D<Real> coalesced_buf; // Combined buffer
+
+  void PrintInfo(const std::string &start);
+
+  KOKKOS_DEFAULTED_FUNCTION
+  BndId() = default;
+  KOKKOS_DEFAULTED_FUNCTION
+  BndId(const BndId &) = default;
+
+  explicit BndId(const int *const data_in) {
+    for (int i = 0; i < NDAT; ++i) {
+      data[i] = data_in[i];
+    }
+  }
+
+  void Serialize(int *data_out) {
+    for (int i = 0; i < NDAT; ++i) {
+      data_out[i] = data[i];
+    }
+  }
+
+  bool SameBVChannel(const BndId &other) {
+    // Don't want to compare start_idx, so -1
+    for (int i = 0; i < NDAT - 1; ++i) {
+      if (data[i] != other.data[i]) return false;
+    }
+    return true;
+  }
+
+  static BndId GetSend(MeshBlock *pmb, const NeighborBlock &nb,
+                       std::shared_ptr<Variable<Real>> v, BoundaryType b_type,
+                       int partition, int start_idx);
+};
+} // namespace parthenon
+
+#endif // BVALS_COMMS_BND_ID_HPP_
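
To illustrate how the flat `int data[NDAT]` layout of `BndId` supports sending identifiers between ranks, here is a stripped-down, host-only sketch of the serialize, rebuild, and compare cycle. `BndIdSketch` and `SameAfterChanging` are hypothetical names; the Kokkos annotations, the buffer members, and `bound_type` are omitted, and the integer values are arbitrary. It assumes C++17.

```cpp
#include <cstddef>

// Simplified stand-in for BndId: only the integer payload and the routines
// that operate on it are kept.
struct BndIdSketch {
  static constexpr std::size_t NDAT = 10;
  int data[NDAT] = {};

  constexpr BndIdSketch() = default;

  // Rebuild an id from a flat int message (receiving side).
  constexpr explicit BndIdSketch(const int *data_in) {
    for (std::size_t i = 0; i < NDAT; ++i) data[i] = data_in[i];
  }

  // Write the id into a flat int message (sending side).
  constexpr void Serialize(int *data_out) const {
    for (std::size_t i = 0; i < NDAT; ++i) data_out[i] = data[i];
  }

  // Compare every entry except the last one, which holds start_idx.
  constexpr bool SameBVChannel(const BndIdSketch &other) const {
    for (std::size_t i = 0; i < NDAT - 1; ++i) {
      if (data[i] != other.data[i]) return false;
    }
    return true;
  }
};

// Serialize an id, rebuild it, perturb one entry of the copy, and report
// whether the two ids are still considered the same channel.
constexpr bool SameAfterChanging(std::size_t idx) {
  const int raw[BndIdSketch::NDAT] = {3, 7, 1, 42, 0, 0, 1, 5, 128, 256};
  BndIdSketch sent(raw);
  int message[BndIdSketch::NDAT] = {};
  sent.Serialize(message);
  BndIdSketch received(message);
  received.data[idx] += 1;
  return sent.SameBVChannel(received);
}

// Index 9 is start_idx, which is ignored; index 8 is size, which is compared.
static_assert(SameAfterChanging(9));
static_assert(!SameAfterChanging(8));
```

Note that, as written in the header, `SameBVChannel` skips only `start_idx`, so two ids that differ in `partition` or `size` are treated as different channels.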