diff --git a/index.html b/index.html
index 325a556..d140e0e 100644
--- a/index.html
+++ b/index.html
@@ -51,7 +51,28 @@
         <h1>
           PPQSort
         </h1>
-<p><a name="md__r_e_a_d_m_e"></a> <a href="https://github.com/GabTux/ppqsort_suite/actions/workflows/standalone.yml"><img class="m-image" src="https://github.com/GabTux/ppqsort_suite/actions/workflows/standalone.yml/badge.svg" alt="Image" /></a> <a href="https://github.com/GabTux/ppqsort_suite/actions/workflows/install.yml"><img class="m-image" src="https://github.com/GabTux/ppqsort_suite/actions/workflows/install.yml/badge.svg" alt="Image" /></a> <a href="https://github.com/GabTux/ppqsort_suite/actions/workflows/tests.yml"><img class="m-image" src="https://github.com/GabTux/ppqsort_suite/actions/workflows/tests.yml/badge.svg" alt="Image" /></a> <a href="https://codecov.io/gh/GabTux/ppqsort_suite"><img class="m-image" src="https://codecov.io/gh/GabTux/ppqsort_suite/graph/badge.svg?token=K7UVUZ4N1N" alt="Image" /></a></p><section id="autotoc_md0"><h2><a href="#autotoc_md0">Suite for PPQSort (Parallel Pattern QuickSort)</a></h2><section id="autotoc_md1"><h3><a href="#autotoc_md1">Overview</a></h3><p>This repository offers a test and benchmark suite for PPQSort (Parallel Pattern QuickSort). These benchmarks were used to evaluate the performance of PPQSort against other parallel sorting algorithms. PPQSort is a parallel Quicksort algorithm which draws inspiration from sequential <a href="https://github.com/orlp/pdqsort">pdqsort</a> algorithm.</p><p>For more information about PPQSort, please refer to the <a href="https://github.com/GabTux/ppqsort">PPQSort repository</a>.</p></section></section>
+<p><a name="md__r_e_a_d_m_e"></a> <a href="https://github.com/GabTux/ppqsort_suite/actions/workflows/standalone.yml"><img class="m-image" src="https://github.com/GabTux/ppqsort_suite/actions/workflows/standalone.yml/badge.svg" alt="Image" /></a> <a href="https://github.com/GabTux/ppqsort_suite/actions/workflows/install.yml"><img class="m-image" src="https://github.com/GabTux/ppqsort_suite/actions/workflows/install.yml/badge.svg" alt="Image" /></a> <a href="https://github.com/GabTux/ppqsort_suite/actions/workflows/tests.yml"><img class="m-image" src="https://github.com/GabTux/ppqsort_suite/actions/workflows/tests.yml/badge.svg" alt="Image" /></a> <a href="https://codecov.io/github/GabTux/PPQSort"><img class="m-image" src="https://codecov.io/github/GabTux/PPQSort/graph/badge.svg?token=K7UVUZ4N1N" alt="Image" /></a></p><section id="autotoc_md0"><h2><a href="#autotoc_md0">PPQSort (Parallel Pattern QuickSort)</a></h2><p>Parallel Pattern Quicksort (PPQSort) is an <strong>efficient implementation of the parallel quicksort algorithm</strong>, written using <strong>only</strong> C++20 features, without third-party libraries (such as Intel TBB). PPQSort draws inspiration from <a href="https://github.com/orlp/pdqsort">pdqsort</a>, <a href="https://github.com/weissan/BlockQuicksort">BlockQuicksort</a> and <a href="https://gitlab.com/daniel.langr/cpp11sort">cpp11sort</a> and adds some further optimizations.</p><ul><li><strong>Focus on ease of use:</strong> Uses only C++20 features, header-only implementation, user-friendly API.</li><li><strong>Comprehensive test suite:</strong> Ensures correctness and robustness through extensive testing.</li><li><strong>Benchmarks show great performance:</strong> Achieves impressive sorting times on various machines.</li></ul></section><section id="autotoc_md1"><h2><a href="#autotoc_md1">Integration</a></h2><p>PPQSort is a header-only implementation. 
All the files needed are in the include directory.</p><p>Add it to an existing CMake project using <a href="https://github.com/cpm-cmake/CPM.cmake">CPM.cmake</a>:</p><pre class="m-code">include(cmake/CPM.cmake)
+CPMAddPackage(
+NAME PPQSort
+GITHUB_REPOSITORY GabTux/PPQSort
+VERSION 1.0.3 # change this to the latest commit or release tag
+)
+target_link_libraries(YOUR_TARGET PPQSort::PPQSort)</pre><p>Alternatively, use FetchContent, or just check out the repository and add the include directory to the compiler's include path.</p></section><section id="autotoc_md2"><h2><a href="#autotoc_md2">Usage</a></h2><p>PPQSort has an API similar to std::sort. Use the <code><a href="namespaceppqsort_1_1execution.html" class="m-doc">ppqsort::<wbr />execution</a>::&lt;policy&gt;</code> policies to specify how the sort should run.</p><pre class="m-code"><span class="c1">// run parallel</span>
+<span class="n">ppqsort</span><span class="o">::</span><span class="n">sort</span><span class="p">(</span><span class="n">ppqsort</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">input</span><span class="p">.</span><span class="n">end</span><span class="p">());</span>
+
+<span class="c1">// Specify number of threads</span>
+<span class="n">ppqsort</span><span class="o">::</span><span class="n">sort</span><span class="p">(</span><span class="n">ppqsort</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">input</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"> </span><span class="mi">16</span><span class="p">);</span>
+
+<span class="c1">// provide custom comparator</span>
+<span class="n">ppqsort</span><span class="o">::</span><span class="n">sort</span><span class="p">(</span><span class="n">ppqsort</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par</span><span class="p">,</span><span class="w"> </span><span class="n">input</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">input</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"> </span><span class="n">cmp</span><span class="p">);</span>
+
+<span class="c1">// force branchless variant</span>
+<span class="n">ppqsort</span><span class="o">::</span><span class="n">sort</span><span class="p">(</span><span class="n">ppqsort</span><span class="o">::</span><span class="n">execution</span><span class="o">::</span><span class="n">par_force_branchless</span><span class="p">,</span><span class="w"> </span><span class="n">input_str</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span><span class="w"> </span><span class="n">input_str</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span><span class="w"> </span><span class="n">cmp</span><span class="p">);</span></pre><p>PPQSort uses C++ threads by default. If you link it with OpenMP, it uses OpenMP as the parallel backend instead. You can still force the C++ threads backend even when linked with OpenMP:</p><pre class="m-code"><span class="cp">#define FORCE_CPP</span>
+<span class="cp">#include</span><span class="w"> </span><span class="cpf">&lt;ppqsort.h&gt;</span>
+<span class="c1">// ... rest of the code ...</span></pre></section><section id="autotoc_md3"><h2><a href="#autotoc_md3">Benchmark</a></h2><p>We compared PPQSort with various parallel sorts. The benchmarks show that PPQSort is one of the fastest parallel sorting algorithms across various input data and different machines.</p><table class="m-table"><thead><tr><th>Name</th><th>Algorithm</th><th>Memory usage</th><th>External dependencies</th><th>Highlight</th></tr></thead><tbody><tr><td>PPQSort</td><td>Quicksort</td><td>in-place</td><td><strong>None</strong></td><td>Parallel pattern quicksort algorithm</td></tr><tr><td><a href="https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_design.html">GCC BQS</a></td><td>Quicksort</td><td>in-place</td><td>OpenMP</td><td>Allocates threads proportionally to subtask sizes</td></tr><tr><td><a href="https://gitlab.com/daniel.langr/cpp11sort">cpp11sort</a></td><td>Quicksort</td><td>in-place</td><td><strong>None</strong></td><td>Header-only, C++11 compliant</td></tr><tr><td>oneTBB <a href="https://spec.oneapi.io/versions/latest/elements/oneTBB/source/algorithms/functions/parallel_sort_func.html#parallel-sort">parallel_<wbr />sort</a></td><td>Quicksort</td><td>out-of-place</td><td>oneTBB</td><td>Splits input to small tasks</td></tr><tr><td><a href="https://github.com/alugowski/poolSTL">poolSTL</a> sort</td><td>Quicksort</td><td>in-place</td><td><strong>None</strong></td><td>Header-only, C++17 compliant</td></tr><tr><td>Boost <a href="https://www.boost.org/doc/libs/develop/libs/sort/doc/html/sort/parallel.html#sort.parallel.block_indirect_sort">block_<wbr />indirect_<wbr />sort</a></td><td>Merging algorithm</td><td>out-of-place</td><td>Boost</td><td>Upper bounded small memory usage</td></tr><tr><td><a href="https://github.com/DanielLangr/AQsort">AQsort</a></td><td>Quicksort</td><td>in-place</td><td>OpenMP</td><td>Allows the sorting of multiple datasets at once</td></tr><tr><td><a 
href="https://github.com/voronond/MPQsort/">MPQsort</a></td><td>Quicksort</td><td>in-place</td><td>OpenMP</td><td>Multiway Parallel Quicksort</td></tr><tr><td><a href="https://github.com/ips4o/ips4o">IPS4o</a></td><td>Samplesort</td><td>in-place</td><td>oneTBB</td><td>Divides data into buckets and sorts them recursively</td></tr></tbody></table><section id="autotoc_md4"><h3><a href="#autotoc_md4">Running on ARM cluster</a></h3><ul><li>Fujitsu A64FX CPU</li><li>NUMA architecture, 48 cores (4 CPUs x 12 cores)</li></ul><p>Results for <strong>INT</strong> with an input size of <strong>2e9</strong> (2 billion):</p><img class="m-image" src="https://github.com/GabTux/PPQSort/assets/24779546/95741ffe-d710-4360-afac-fa5dce3c50c1" alt="Image" /><table class="m-table"><thead><tr><th>Algorithm</th><th>Random</th><th>Ascending</th><th>Descending</th><th>Rotated</th><th>OrganPipe</th><th>Heap</th><th>Total</th><th>Rank</th></tr></thead><tbody><tr><td>PPQSort C++</td><td>5.84s</td><td>1.84s</td><td>4.55s</td><td><strong>1.38s</strong></td><td><strong>2.96s</strong></td><td>5.58s</td><td><strong>22.15s</strong></td><td><strong>1</strong></td></tr><tr><td>GCC 
BQS</td><td>13.72s</td><td>4.18s</td><td>19.11s</td><td>49.89s</td><td>8.24s</td><td>13.78s</td><td>108.92s</td><td>6</td></tr><tr><td>oneTBB</td><td>43.66s</td><td><strong>0.09s</strong></td><td>8.62s</td><td>13.84s</td><td>8.12s</td><td>43.9s</td><td>118.23s</td><td>9</td></tr><tr><td>poolSTL</td><td>34.63s</td><td>5.61s</td><td>7.23s</td><td>14.78s</td><td>7.81s</td><td>46.88s</td><td>116.94s</td><td>7</td></tr><tr><td>MPQsort</td><td>13.35s</td><td>5.74s</td><td>5.77s</td><td>4.67s</td><td>7.71s</td><td>12.87s</td><td>50.11s</td><td>5</td></tr><tr><td>cpp11sort</td><td>9.58s</td><td>2.47s</td><td><strong>2.66s</strong></td><td>5.47s</td><td>3.42s</td><td>9.9s</td><td>33.5s</td><td>3</td></tr><tr><td>AQsort</td><td>24.72s</td><td>3.66s</td><td>23.14s</td><td>21.83s</td><td>22.6s</td><td>25.31s</td><td>121.26s</td><td>8</td></tr><tr><td>Boost</td><td>8.2s</td><td>3.0s</td><td>4.26s</td><td>13.96s</td><td>6.97s</td><td>7.92s</td><td>44.31s</td><td>4</td></tr><tr><td>IPS<sup>4</sup>o</td><td><strong>4.8s</strong></td><td>0.19s</td><td>5.97s</td><td>5.21s</td><td>5.59s</td><td><strong>4.91s</strong></td><td>26.67s</td><td>2</td></tr></tbody></table></section><section id="autotoc_md5"><h3><a href="#autotoc_md5">Summary</a></h3><p>Extended benchmarks (detailed in a forthcoming paper) show that IPS4o (<a href="https://github.com/ips4o">https:/<wbr />/<wbr />github.com/<wbr />ips4o</a>) often surpasses PPQSort in raw speed. However, IPS4o relies on the external library oneTBB (<a href="https://github.com/oneapi-src/oneTBB">https:/<wbr />/<wbr />github.com/<wbr />oneapi-src/<wbr />oneTBB</a>), which complicates integration. 
PPQSort steps up as a compelling alternative due to its:</p><ul><li><strong>Competitive Speed:</strong> Delivers performance comparable to IPS4o on most machines.</li><li><strong>Hardware Agnostic:</strong> Maintains strong performance across various hardware, potentially surpassing IPS4o on specific systems, especially ARM platforms.</li><li><strong>Dependency-Free:</strong> No external libraries are required, simplifying integration. For applications demanding a fast, dependency-free parallel sorting solution, PPQSort is an excellent choice.</li></ul></section></section><section id="autotoc_md6"><h2><a href="#autotoc_md6">Running Tests and Benchmarks</a></h2><p>Bash script for running or building specific components:</p><pre class="m-code">$<span class="w"> </span>scripts/build.sh<span class="w"> </span>all
+...
+$<span class="w"> </span>scripts/run.sh<span class="w"> </span>standalone
+...</pre><p>Note that the benchmark&#x27;s CMake file will by default download sparse matrices (around 26GB).</p></section><section id="autotoc_md7"><h2><a href="#autotoc_md7">Implementation</a></h2><p>A detailed research paper exploring PPQSort&#x27;s design, implementation, and performance evaluation will be available soon.</p></section>
       </div>
     </div>
   </div>
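
Review note on the Usage section above: since the README says the API mirrors std::sort, a self-contained example may help readers try it without committing to the dependency. The following is only a sketch. The helper names `sort_maybe_parallel` and `sorted_by_length` are hypothetical and not part of PPQSort; the only PPQSort call used is the `ppqsort::sort(ppqsort::execution::par, first, last, cmp)` form shown in the README. When `ppqsort.h` is not on the include path, the sketch falls back to `std::sort`.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Use PPQSort only when its header is reachable; otherwise fall back.
#if __has_include(<ppqsort.h>)
#include <ppqsort.h>
#define HAVE_PPQSORT 1
#else
#define HAVE_PPQSORT 0
#endif

// Hypothetical helper (not part of PPQSort): sorts with the parallel policy
// from the README when the header is available, else with std::sort.
template <typename It, typename Cmp = std::less<>>
void sort_maybe_parallel(It first, It last, Cmp cmp = Cmp{}) {
#if HAVE_PPQSORT
    ppqsort::sort(ppqsort::execution::par, first, last, cmp);
#else
    std::sort(first, last, cmp);
#endif
}

// Hypothetical example: order words by length, ties broken alphabetically.
std::vector<std::string> sorted_by_length(std::vector<std::string> words) {
    auto by_length = [](const std::string& a, const std::string& b) {
        return a.size() != b.size() ? a.size() < b.size() : a < b;
    };
    sort_maybe_parallel(words.begin(), words.end(), by_length);
    assert(std::is_sorted(words.begin(), words.end(), by_length));
    return words;
}
```

Because the comparator defines a total order here, the result is the same whichever backend runs, which makes the fallback safe to use in tests.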
