From 588f4cdad527d041a37e3307832f41fb510efe8c Mon Sep 17 00:00:00 2001
From: GabTux
Date: Tue, 23 Apr 2024 09:52:24 +0000
Subject: [PATCH] deploy: 4b964020d67b435dae7ebac7b8f5ecea1f421c58

---
 index.html | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/index.html b/index.html
index 325a556..d140e0e 100644
--- a/index.html
+++ b/index.html
@@ -51,7 +51,28 @@

PPQSort

-

Suite for PPQSort (Parallel Pattern QuickSort)

Overview

This repository offers a test and benchmark suite for PPQSort (Parallel Pattern QuickSort). These benchmarks were used to evaluate the performance of PPQSort against other parallel sorting algorithms. PPQSort is a parallel Quicksort algorithm that draws inspiration from the sequential pdqsort algorithm.

For more information about PPQSort, please refer to the PPQSort repository.

+

PPQSort (Parallel Pattern QuickSort)

Parallel Pattern Quicksort (PPQSort) is an efficient implementation of the parallel quicksort algorithm, written using only C++20 features and no third-party libraries (such as Intel TBB). PPQSort draws inspiration from pdqsort, BlockQuicksort, and cpp11sort, and adds some further optimizations.
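BlockQuicksort, one of the inspirations named above, is known for partitioning without hard-to-predict branches. As a rough illustration of that idea only (this is not PPQSort's code, and it is simpler than BlockQuicksort's block-based scheme), the following minimal sketch partitions a vector with a Lomuto-style loop whose body contains no if-statement. The function name partition_branchless is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch, not PPQSort's implementation.
// Moves every element smaller than `pivot` to the front of `v` and
// returns how many such elements there are.
inline std::size_t partition_branchless(std::vector<int>& v, int pivot) {
    std::size_t store = 0;  // invariant: v[0, store) < pivot, v[store, i) >= pivot
    for (std::size_t i = 0; i < v.size(); ++i) {
        const int x = v[i];
        // Swap v[i] and v[store] unconditionally...
        v[i] = v[store];
        v[store] = x;
        // ...then advance `store` by 0 or 1 instead of using an if-statement.
        store += static_cast<std::size_t>(x < pivot);
    }
    assert(store <= v.size());
    return store;
}
```

Because the comparison result is used as a number rather than as a jump condition, compilers can usually emit this loop without a conditional branch, at the cost of a few extra writes per element.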

Integration

PPQSort is a header-only implementation. All the files needed are in the include directory.

Add to existing CMake project using CPM.cmake:

include(cmake/CPM.cmake)
CPMAddPackage(
  NAME PPQSort
  GITHUB_REPOSITORY GabTux/PPQSort
  VERSION 1.0.3 # change this to the latest commit or release tag
)
target_link_libraries(YOUR_TARGET PPQSort::PPQSort)

Alternatively, use FetchContent, or just check out the repository and add the include directory to the compiler's include path.

Usage

PPQSort has an API similar to std::sort. Use the ppqsort::execution::<policy> policies to specify how the sort should run.

// Run in parallel
ppqsort::sort(ppqsort::execution::par, input.begin(), input.end());

// Specify the number of threads
ppqsort::sort(ppqsort::execution::par, input.begin(), input.end(), 16);

// Provide a custom comparator
ppqsort::sort(ppqsort::execution::par, input.begin(), input.end(), cmp);

// Force the branchless variant
ppqsort::sort(ppqsort::execution::par_force_branchless, input_str.begin(), input_str.end(), cmp);
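To try the comparator call above end to end, a minimal sketch might look as follows. It assumes the header is reachable as <ppqsort.h> and falls back to std::sort when it is not, so the snippet compiles either way. The helper name sort_descending and the HAVE_PPQSORT macro are illustrative and not part of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Use PPQSort when its header is on the include path (assumption: <ppqsort.h>).
#if __has_include(<ppqsort.h>)
#include <ppqsort.h>
#define HAVE_PPQSORT 1
#else
#define HAVE_PPQSORT 0
#endif

// Hypothetical helper: sorts `input` in descending order with a custom comparator.
inline void sort_descending(std::vector<int>& input) {
    const auto cmp = std::greater<int>{};
#if HAVE_PPQSORT
    // Same call shape as the comparator example above.
    ppqsort::sort(ppqsort::execution::par, input.begin(), input.end(), cmp);
#else
    // Sequential fallback so the sketch still builds without the library.
    std::sort(input.begin(), input.end(), cmp);
#endif
    // Debug-only sanity check of the result.
    assert(std::is_sorted(input.begin(), input.end(), cmp));
}
```

Calling sort_descending on a vector holding 3, 1, 4, 1, 5 leaves it as 5, 4, 3, 1, 1 with either backend.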

By default, PPQSort uses C++ threads. If you prefer, you can link it with OpenMP, and it will use OpenMP as the parallel backend. You can still force the C++ threads backend even when linked with OpenMP by defining FORCE_CPP before including the header:

#define FORCE_CPP
#include <ppqsort.h>
// ... rest of the code ...

Benchmark

We compared PPQSort with various parallel sorts. The benchmarks show that PPQSort is one of the fastest parallel sorting algorithms across various input data and different machines.

| Name | Algorithm | Memory usage | External dependencies | Highlight |
|---|---|---|---|---|
| PPQSort | Quicksort | in-place | None | Parallel pattern quicksort algorithm |
| GCC BQS | Quicksort | in-place | OpenMP | Allocates threads proportionally to subtask sizes |
| cpp11sort | Quicksort | in-place | None | Header-only, C++11 compliant |
| oneTBB parallel_sort | Quicksort | out-of-place | oneTBB | Splits input into small tasks |
| poolSTL sort | Quicksort | in-place | None | Header-only, C++17 compliant |
| Boost block_indirect_sort | Merging algorithm | out-of-place | Boost | Upper-bounded small memory usage |
| AQsort | Quicksort | in-place | OpenMP | Allows the sorting of multiple datasets at once |
| MPQsort | Quicksort | in-place | OpenMP | Multiway Parallel Quicksort |
| IPS4o | Samplesort | in-place | oneTBB | Divides data into buckets and sorts them recursively |

Running on ARM cluster

  • Fujitsu A64FX CPU
  • NUMA architecture, 48 cores (4 CPUs x 12 cores)

Results for INT with an input size of 2e9 (2 billion) elements:

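| Algorithm | Random | Ascending | Descending | Rotated | OrganPipe | Heap | Total | Rank |
|---|---|---|---|---|---|---|---|---|
| PPQSort C++ | 5.84s | 1.84s | 4.55s | 1.38s | 2.96s | 5.58s | 22.15s | 1 |
| GCC BQS | 13.72s | 4.18s | 19.11s | 49.89s | 8.24s | 13.78s | 108.92s | 6 |
| oneTBB | 43.66s | 0.09s | 8.62s | 13.84s | 8.12s | 43.9s | 118.23s | 9 |
| poolSTL | 34.63s | 5.61s | 7.23s | 14.78s | 7.81s | 46.88s | 116.94s | 7 |
| MPQsort | 13.35s | 5.74s | 5.77s | 4.67s | 7.71s | 12.87s | 50.11s | 5 |
| cpp11sort | 9.58s | 2.47s | 2.66s | 5.47s | 3.42s | 9.9s | 33.5s | 3 |
| AQsort | 24.72s | 3.66s | 23.14s | 21.83s | 22.6s | 25.31s | 121.26s | 8 |
| Boost | 8.2s | 3.0s | 4.26s | 13.96s | 6.97s | 7.92s | 44.31s | 4 |
| IPS4o | 4.8s | 0.19s | 5.97s | 5.21s | 5.59s | 4.91s | 26.67s | 2 |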
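The column names in the results refer to input patterns. The benchmark suite's own generators are not shown here, so the following sketch only illustrates one plausible reading of each name. All function names and pattern definitions below are assumptions and may differ from what the benchmarks actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical generators; the real benchmark definitions may differ.

inline std::vector<int> make_ascending(std::size_t n) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);  // 0, 1, ..., n-1
    return v;
}

inline std::vector<int> make_descending(std::size_t n) {
    std::vector<int> v = make_ascending(n);
    std::reverse(v.begin(), v.end());  // n-1, ..., 1, 0
    return v;
}

inline std::vector<int> make_random(std::size_t n, unsigned seed = 42) {
    std::vector<int> v = make_ascending(n);
    std::mt19937 gen(seed);
    std::shuffle(v.begin(), v.end(), gen);  // random permutation of 0..n-1
    return v;
}

// Sorted input rotated left by one position: 1, 2, ..., n-1, 0.
inline std::vector<int> make_rotated(std::size_t n) {
    std::vector<int> v = make_ascending(n);
    if (n > 1) std::rotate(v.begin(), v.begin() + 1, v.end());
    return v;
}

// Values rise towards the middle and then fall again, e.g. 0, 1, 2, 2, 1, 0.
inline std::vector<int> make_organ_pipe(std::size_t n) {
    std::vector<int> v(n);
    for (std::size_t i = 0; i < n; ++i)
        v[i] = static_cast<int>(std::min(i, n - 1 - i));
    return v;
}

// Random permutation rearranged so that it satisfies the max-heap property.
inline std::vector<int> make_heap_input(std::size_t n, unsigned seed = 42) {
    std::vector<int> v = make_random(n, seed);
    std::make_heap(v.begin(), v.end());
    assert(std::is_heap(v.begin(), v.end()));
    return v;
}
```

Patterns like these are commonly used because they stress different parts of a quicksort: already ordered runs, adversarial pivot choices, and inputs with no exploitable structure.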

Summary

Extended benchmarks (detailed in a forthcoming paper) show that IPS4o (https://github.com/ips4o) often surpasses PPQSort in raw speed. However, IPS4o relies on the external library oneTBB (https://github.com/oneapi-src/oneTBB), which introduces integration complexities. PPQSort is a compelling alternative due to its:

  • Competitive Speed: Delivers performance comparable to IPS4o on most machines.
  • Hardware Agnostic: Maintains strong performance across various hardware, potentially surpassing IPS4o on specific systems, especially ARM platforms.
  • Dependency-Free: No external libraries are required, simplifying integration.

For applications demanding a fast, dependency-free parallel sorting solution, PPQSort is an excellent choice.

Running Tests and Benchmarks

Bash scripts are provided for building and running specific components:

$ scripts/build.sh all
...
$ scripts/run.sh standalone
...

Note that, by default, the benchmark's CMake file downloads sparse matrices (around 26 GB).

Implementation

A detailed research paper exploring PPQSort's design, implementation, and performance evaluation will be available soon.