This software is intended to run experiments that overlap communication and computation in order to collect performance data. It runs a toy model where GPU kernels are overlapped with MPI GPU communication. It provides a synchronous mode (where the host timeline is blocked until the communication has finished) and an asynchronous mode, where communication is split into Start and Wait methods wrapping the calls to the kernel launch.
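The two modes can be pictured with the following minimal sketch. This is illustrative only and does not reproduce the actual classes in this repository: the compute kernel and the halo_update_* function names are hypothetical, and passing device pointers directly to MPI assumes a CUDA-aware MPI (cf. the G2G=1 setting below).

#include <mpi.h>
#include <cuda_runtime.h>

// Stand-in for the real stencil kernel (hypothetical).
__global__ void compute(double* field, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) field[i] *= 2.0;
}

// Synchronous mode: the host blocks on the halo exchange,
// then launches the kernel.
void halo_update_sync(double* d_field, int count, int peer, MPI_Comm comm) {
    MPI_Sendrecv_replace(d_field, count, MPI_DOUBLE, peer, 0, peer, 0,
                         comm, MPI_STATUS_IGNORE);
    compute<<<(count + 255) / 256, 256>>>(d_field, count);
    cudaDeviceSynchronize();
}

// Asynchronous mode: Start posts the exchange, the kernel launch is
// wrapped in between, and Wait blocks only once the data is needed.
void halo_update_async(double* d_send, double* d_recv, double* d_field,
                       int count, int peer, MPI_Comm comm) {
    MPI_Request reqs[2];
    MPI_Irecv(d_recv, count, MPI_DOUBLE, peer, 0, comm, &reqs[0]); // Start
    MPI_Isend(d_send, count, MPI_DOUBLE, peer, 0, comm, &reqs[1]);
    compute<<<(count + 255) / 256, 256>>>(d_field, count); // overlaps comm
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);              // Wait
    cudaDeviceSynchronize();
}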
The following picture shows an example flow of the code for a simple communication pattern.

Building and running the code requires:
- CMake
- CUDA
- MPI
- Boost (optional; required only for the Boost timers)
Building is straightforward. Create a build directory and run CMake:
mkdir build
cd build
cmake ..
The build can be configured with several CMake options:
- -DENABLE_BOOST_TIMER=ON enables the Boost timers used to measure performance. This requires Boost.
- -DENABLE_MPI_TIMER=ON enables the MPI timers.
- -DVERBOSE=1 prints extra information when running the code.
- -DMPI_VENDOR allows setting an MPI vendor. Typical values are unknown, mvapich2, or openmpi. Setting this variable simply increases the debug output.
A typical build can be achieved with:
mkdir build
cd build
cmake .. -DVERBOSE=1 -DENABLE_MPI_TIMER=1
Build scripts for CSCS machines are provided with the code. Simply run
./build_MACHINE.sh
to get a build for your machine.
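For example, assuming the machine names match those of the run scripts below, a build on kesch would be:
./build_kesch.sh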
Simply call
src/comm_overlap_benchmark
to run the code. The code can be run in several configurations:
comm_overlap_benchmark [--ie isize] [--je jsize] [--ke ksize] \
[--sync] [--nocomm] [--nocomp] \
[--nh nhaloupdates] [-n nrepetitions] [--inorder]
- --ie NN: domain size in x. Default: 128
- --je NN: domain size in y. Default: 128
- --ke NN: domain size in z. Default: 60
- --sync: enable synchronous communication.
- --nocomm: disable communication.
- --nocomp: disable computation.
- --nh NN: number of halo updates. Default: 2
- -n NN: number of benchmark repetitions. Default: 5000
- --inorder: enable in-order halo exchanges.
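For example, an asynchronous (default) run on a 256x256x60 domain with 1000 repetitions, using only the flags documented above:
src/comm_overlap_benchmark --ie 256 --je 256 --ke 60 -n 1000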
Running without any MPI communication (e.g. to test the performance of the CUDA kernel alone):
src/comm_overlap_benchmark --nocomm
Profiling the kernel with CUDA:
nvprof --analysis-metrics -o benchmark.prof ./src/comm_overlap_benchmark --nocomm --nrep=1
Run scripts for CSCS machines are provided with the code. Simply run
./run_MACHINE.sh
for your machine. The scripts accept arguments such as the number of MPI ranks (jobs=$MPI_RANKS) and the GDR setup (G2G=1):
jobs=8 G2G=1 ./run_kesch.sh
jobs=16 G2G=1 ./run_kesch.sh
jobs=32 G2G=1 ./run_kesch.sh
Code needs cleanup...