Primitives & BFS performance improvements #4751

Open
seunghwak wants to merge 158 commits into branch-24.12
Conversation

@seunghwak (Contributor) commented Nov 12, 2024

This PR includes multiple updates to cut peak memory usage in graph creation and improve performance of BFS on scale-free graphs.

  • Add a bitmap for vertices with non-zero local degree in the hypersparse region; this information can be used to quickly filter out locally zero-degree vertices, which do not need to be processed, in many code paths.
  • Store (global) degree offsets for vertices in the hypersparse region; this information can be used to quickly identify the vertices with a certain global degree (e.g. for global degree 1 vertices, we can skip inter-GPU reduction as we know each vertex has only one neighbor).
  • Skip kernel invocations when computing edge counts if the vertex list is empty.
  • Add asynchronous functions to compute edge counts. This helps prevent unnecessary serialization when multiple such computations can run concurrently.
  • Replace rmm::exec_policy with rmm::exec_policy_nosync in multiple places; the former enforces stream synchronization at the end, the latter does not (see the first sketch following this list).
  • Enforce cache line alignment in NCCL communication in multiple places; NCCL communication performance is significantly affected by cache line alignment, often making a difference of 30-40% or more (see the second sketch following this list).
  • For primitives working on a subset of vertices, broadcast the vertex list using a bitmap if the vertex frontier is large. If the vertex frontier is small (and vertex_t is 8 bytes while the local vertex partition range fits into 4 bytes), send vertex offsets instead of vertices to cut communication volume.
  • Merge multiple host scalar communication function calls into a single call.
  • Increase multi-stream concurrency in detail::extract_transform_e & detail::per_v_transform_reduce_e.
  • Multiple optimizations in the template specialization (for update_major == true && reduce_op == any && key type is vertex && working on a subset of vertices) of detail::per_v_transform_reduce_e. These include pre-processing vertices with non-zero local degrees (so such vertices do not need to be processed on multiple GPUs), pre-filtering of zero local degree vertices, allreduce communication to reduce shuffle communication volumes, special treatment of global degree 1 vertices, and so on.
  • Multiple optimizations & specializations in detail::fill_edge_minor_property when working on a subset of vertices. These include kernel fusion, a specialization for bitmap properties with direct broadcast to the property buffer, special treatment of vertex partition boundaries, and so on.
  • Multiple optimizations & specializations in transform_reduce_v_frontier_outgoing_e (especially for reduce_op::any) to cut communication volumes and to filter out (key, value) pairs that will not contribute to the final results.
  • Multiple low-level optimizations in direction-optimizing BFS, including approximations in deciding between bottom-up and top-down traversal.
  • Multiple optimizations to cut peak memory usage in graph creation.
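
The two sketches below are illustrative only; names, signatures, and constants are assumptions, not code taken from this PR. First, the rmm::exec_policy to rmm::exec_policy_nosync swap: the former synchronizes the stream when the Thrust algorithm returns, while the nosync variant only enqueues the work, so several such calls can be issued back to back and synchronized once when the results are actually needed.

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>
#include <thrust/fill.h>

// Hypothetical example: zero a device buffer without forcing a stream synchronization.
void fill_buffer(rmm::device_uvector<int>& buf, rmm::cuda_stream_view stream)
{
  // rmm::exec_policy(stream) would synchronize the stream when thrust::fill returns;
  // rmm::exec_policy_nosync(stream) only enqueues the kernel, leaving synchronization
  // to the caller once the result is actually needed.
  thrust::fill(rmm::exec_policy_nosync(stream), buf.begin(), buf.end(), int{0});
}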

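Second, the cache line alignment idea: a hypothetical helper that rounds per-rank element counts up to a whole number of cache lines, so that consecutive per-rank segments of a contiguous NCCL send/receive buffer all start on aligned offsets. The 128-byte constant and the helper name are assumptions, not taken from this PR.

#include <cstddef>

// Assumed GPU cache line size in bytes for illustration; the value used by the PR may differ.
constexpr std::size_t cache_line_size = 128;

// Round an element count up to a cache line multiple, so that a segment of T starting at an
// aligned base address also ends on a cache line boundary (and the next segment starts on one).
template <typename T>
std::size_t round_up_to_cache_line(std::size_t num_elements)
{
  static_assert(cache_line_size % sizeof(T) == 0, "element size must evenly divide a cache line");
  std::size_t const elements_per_line = cache_line_size / sizeof(T);
  return ((num_elements + elements_per_line - 1) / elements_per_line) * elements_per_line;
}
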
@seunghwak seunghwak self-assigned this Nov 19, 2024
@seunghwak seunghwak added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Nov 19, 2024
@seunghwak seunghwak added this to the 24.12 milestone Nov 19, 2024
@seunghwak seunghwak changed the title from "[WIP] Primitives & BFS performance improvements" to "Primitives & BFS performance improvements" Nov 19, 2024
@ChuckHastings (Collaborator) left a comment

Couple of minor comments.

@@ -369,7 +485,7 @@ create_graph_from_partitioned_edgelist(
   if (edge_partition_edgelist_edge_ids) { element_size += sizeof(edge_id_t); }
   if (edge_partition_edgelist_edge_types) { element_size += sizeof(edge_type_t); }
   auto constexpr mem_frugal_ratio =
-    0.25;  // if the expected temporary buffer size exceeds the mem_frugal_ratio of the
+    0.05;  // if the expected temporary buffer size exceeds the mem_frugal_ratio of the
Collaborator

Is this something we should look at updating universally, or is this still a tunable we should continue to evaluate?

Contributor Author

We may tune this individually, but I think we may eventually create a separate file (or a few) storing every tunable parameter.
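
For illustration only, such a centralized tunables header might look like the following sketch; the file path, namespace, and comment are assumptions, and only the 0.05 value comes from the diff above.

// Hypothetical header collecting tunable parameters in one place,
// e.g. cpp/src/utilities/tuning_parameters.hpp (path is an assumption).
#pragma once

namespace cugraph {
namespace detail {

// Graph-creation threshold: if the expected temporary buffer size exceeds this fraction of
// the relevant memory budget, take the memory-frugal code path (value from the diff above;
// the exact denominator is not shown in this excerpt).
inline constexpr double mem_frugal_ratio = 0.05;

}  // namespace detail
}  // namespace cugraph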

@@ -0,0 +1,738 @@
+/*
+ * Copyright (c) 2021-2024, NVIDIA CORPORATION.
Collaborator

Should this just be 2024?

Contributor Author

Yes, and I am wondering whether we should keep this file or just delete it. We already have BFS tests, so this is a bit redundant.

Labels: CMake, cuGraph, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)
2 participants