Primitives & BFS performance improvements #4751

Open
seunghwak wants to merge 158 commits into branch-24.12
Conversation

@seunghwak (Contributor) commented Nov 12, 2024

This PR includes multiple updates to cut peak memory usage in graph creation and improve performance of BFS on scale-free graphs.

  • Add a bitmap for vertices with non-zero local degree in the hypersparse region; this information can be used to quickly filter out locally zero-degree vertices, which do not need to be processed, in many code paths.
  • Store (global) degree offsets for vertices in the hypersparse region; this information can be used to quickly identify the vertices with a certain global degree (e.g. for global degree 1 vertices, we can skip inter-GPU reduction as we know each vertex has only one neighbor).
  • Skip kernel invocations when computing edge counts if the vertex list is empty.
  • Add asynchronous functions to compute edge counts. This helps prevent unnecessary serialization when multiple such computations can run concurrently.
  • Replace rmm::exec_policy with rmm::exec_policy_nosync in multiple places; the former enforces stream synchronization at the end, the latter does not (see the first sketch following this list).
  • Enforce cache line alignment in NCCL communication in multiple places; NCCL communication performance is significantly affected by cache line alignment, often making a difference of 30-40% or more (see the second sketch following this list).
  • For primitives working on a subset of vertices, broadcast the vertex list using a bitmap if the vertex frontier is large. If the vertex frontier is small (and vertex_t is 8 bytes while the local vertex partition range fits into 4 bytes), send vertex offsets instead of vertices to cut communication volume.
  • Merge multiple host scalar communication function calls into a single call.
  • Increase multi-stream concurrency in detail::extract_transform_e & detail::per_v_transform_reduce_e.
  • Multiple optimizations in the template specialization (for update_major == true && reduce_op == any && key type is vertex && working on a subset of vertices) of detail::per_v_transform_reduce_e. These include pre-processing vertices with non-zero local degrees (so such vertices do not need to be processed on multiple GPUs), pre-filtering of zero local degree vertices, allreduce communication to reduce shuffle communication volumes, special treatment of global degree 1 vertices, and so on.
  • Multiple optimizations & specializations in detail::fill_edge_minor_property when working on a subset of vertices. These include kernel fusion, a specialization for bitmap properties with direct broadcast to the property buffer, special treatment of vertex partition boundaries, and so on.
  • Multiple optimizations & specializations in transform_reduce_v_frontier_outgoing_e (especially for reduce_op::any) to cut communication volumes and to filter out (key, value) pairs that will not contribute to the final results.
  • Multiple low-level optimizations in direction-optimizing BFS, including approximations in deciding between bottom-up and top-down traversal.
  • Multiple optimizations to cut peak memory usage in graph creation.
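
The two sketches below are illustrative only; names, signatures, and constants are assumptions, not code taken from this PR. First, the rmm::exec_policy to rmm::exec_policy_nosync swap: the former synchronizes the stream when the Thrust algorithm returns, while the nosync variant only enqueues the work, so several such calls can be issued back to back and synchronized once when the results are actually needed.

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/fill.h>

// Hypothetical example: zero a device buffer without forcing a stream synchronization.
void fill_buffer(rmm::device_uvector<int>& buf, rmm::cuda_stream_view stream)
{
  // rmm::exec_policy(stream) would synchronize the stream when thrust::fill returns;
  // rmm::exec_policy_nosync(stream) only enqueues the kernel, leaving synchronization
  // to the caller once the result is actually needed.
  thrust::fill(rmm::exec_policy_nosync(stream), buf.begin(), buf.end(), int{0});
}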

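Second, the cache line alignment idea: a hypothetical helper that rounds per-rank element counts up to a whole number of cache lines, so that consecutive per-rank segments of a contiguous NCCL send/receive buffer all start on aligned offsets. The 128-byte constant and the helper name are assumptions, not taken from this PR.

#include <cstddef>

// Assumed GPU cache line size in bytes for illustration; the value used by the PR may differ.
constexpr std::size_t cache_line_size = 128;

// Round an element count up to a cache line multiple, so that a segment of T starting at an
// aligned base address also ends on a cache line boundary (and the next segment starts on one).
template <typename T>
std::size_t round_up_to_cache_line(std::size_t num_elements)
{
  static_assert(cache_line_size % sizeof(T) == 0, "element size must evenly divide a cache line");
  std::size_t const elements_per_line = cache_line_size / sizeof(T);
  return ((num_elements + elements_per_line - 1) / elements_per_line) * elements_per_line;
}
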
@seunghwak seunghwak self-assigned this Nov 19, 2024
@seunghwak seunghwak added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Nov 19, 2024
@seunghwak seunghwak added this to the 24.12 milestone Nov 19, 2024
@seunghwak seunghwak changed the title from "[WIP] Primitives & BFS performance improvements" to "Primitives & BFS performance improvements" Nov 19, 2024
@ChuckHastings (Collaborator) left a comment

Couple of minor comments.

@@ -369,7 +485,7 @@ create_graph_from_partitioned_edgelist(
   if (edge_partition_edgelist_edge_ids) { element_size += sizeof(edge_id_t); }
   if (edge_partition_edgelist_edge_types) { element_size += sizeof(edge_type_t); }
   auto constexpr mem_frugal_ratio =
-    0.25;  // if the expected temporary buffer size exceeds the mem_frugal_ratio of the
+    0.05;  // if the expected temporary buffer size exceeds the mem_frugal_ratio of the
Collaborator

Is this something we should look at updating universally, or is this still a tunable we should continue to evaluate?

Contributor Author

We may tune this individually, but I think we may eventually create a separate file (or a few) storing every tunable parameter.
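
For illustration only, such a centralized tunables header might look like the following sketch; the file path, namespace, and comment are assumptions, and only the 0.05 value comes from the diff above.

// Hypothetical header collecting tunable parameters in one place,
// e.g. cpp/src/utilities/tuning_parameters.hpp (path is an assumption).
#pragma once

namespace cugraph {
namespace detail {

// Graph-creation threshold: if the expected temporary buffer size exceeds this fraction of
// the relevant memory budget, take the memory-frugal code path (value from the diff above;
// the exact denominator is not shown in this excerpt).
inline constexpr double mem_frugal_ratio = 0.05;

}  // namespace detail
}  // namespace cugraph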

@@ -0,0 +1,738 @@
+/*
+ * Copyright (c) 2021-2024, NVIDIA CORPORATION.
Collaborator

Should this just be 2024?

Contributor Author

Yes, and I am wondering whether we should keep this file or just delete it. We already have BFS tests, so this is a bit redundant.

Labels: CMake, cuGraph, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)
2 participants