Improved Memory Operations #174

ashvardanian · 2024-09-27T22:18:08Z

This update brings many performance optimizations ahead of the next wave of breaking major releases, which will add new functionality and support a wider range of CPUs. Time to get excited 🥳

Faster `memcpy` and `memset`

On Intel Sapphire Rapids:

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- memcpy<aligned>                          19.7128 GB/s       3404322.4 ns          0 errors in       7344 iterations                     
- sz_copy_serial<aligned>                  11.7727 GB/s       5700374.0 ns          0 errors in       4388 iterations                     
- sz_copy_avx512<aligned>                  20.0675 GB/s       3344156.1 ns          0 errors in       7476 iterations                     
- sz_copy_avx2<aligned>                    11.4429 GB/s       5864690.5 ns          0 errors in       4264 iterations                     
- memcpy<unaligned>                        19.4694 GB/s       3446883.2 ns          0 errors in       7256 iterations                     
- sz_copy_serial<unaligned>                11.6158 GB/s       5777373.4 ns          0 errors in       4328 iterations                     
- sz_copy_avx512<unaligned>                20.3848 GB/s       3292099.3 ns          0 errors in       7596 iterations                     
- sz_copy_avx2<unaligned>                  11.2894 GB/s       5944407.9 ns          0 errors in       4208 iterations                     
- memset                                   27.9879 GB/s       2397785.1 ns          0 errors in      10428 iterations                     
- sz_fill_serial                           28.0284 GB/s       2394315.1 ns          0 errors in      10444 iterations                     
- sz_fill_avx512                           28.9894 GB/s       2314942.1 ns          0 errors in      10800 iterations                     
- sz_fill_avx2                             27.7442 GB/s       2418845.8 ns          0 errors in      10336 iterations

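On AWS Graviton 4, there is still room for improvement: in the run below, `memset` reaches 66.9 GB/s, while the fastest StringZilla fill, `sz_fill_neon`, reaches 44.6 GB/s.
Non-temporal stores on large payloads are one potential source of gains.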

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- memcpy<aligned>                          28.4008 GB/s       2362924.1 ns          0 errors in      10584 iterations                     
- sz_copy_serial<aligned>                  23.0014 GB/s       2917600.0 ns          0 errors in       8572 iterations                     
- sz_copy_sve<aligned>                     27.5536 GB/s       2435573.1 ns          0 errors in      10268 iterations                     
- sz_copy_neon<aligned>                    21.1320 GB/s       3175702.1 ns          0 errors in       7876 iterations                     
- memcpy<unaligned>                        26.9551 GB/s       2489652.6 ns          0 errors in      10044 iterations                     
- sz_copy_serial<unaligned>                22.6073 GB/s       2968456.4 ns          0 errors in       8424 iterations                     
- sz_copy_sve<unaligned>                   25.6073 GB/s       2620692.7 ns          0 errors in       9540 iterations                     
- sz_copy_neon<unaligned>                  20.8439 GB/s       3219593.9 ns          0 errors in       7768 iterations                     
- memset                                   66.9055 GB/s       1003039.9 ns          0 errors in      24928 iterations                     
- sz_fill_serial                           44.1775 GB/s       1519072.9 ns          0 errors in      16460 iterations                     
- sz_fill_sve                              34.5010 GB/s       1945126.1 ns          0 errors in      12856 iterations                     
- sz_fill_neon                             44.5696 GB/s       1505708.6 ns          0 errors in      16604 iterations

256-byte Look-Up Table Transform

On Intel Sapphire Rapids:

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- str::transform<lookup>                    3.8070 GB/s      17627743.2 ns          0 errors in       1420 iterations                     
- str::transform<increment>                23.9881 GB/s       2797588.7 ns          0 errors in       8940 iterations                     
- sz_look_up_transform_serial               3.6020 GB/s      18630895.7 ns          0 errors in       1344 iterations                     
- sz_look_up_transform_avx512              21.1733 GB/s       3169507.5 ns          0 errors in       7888 iterations                     
- sz_look_up_transform_avx2                 8.3881 GB/s       8000528.7 ns          0 errors in       3128 iterations

On AWS Graviton 4:

$ build_release/stringzilla_bench_memory leipzig1M.txt 
StringZilla. Starting memory benchmarks.
Parsed the dataset with:
- 8388608 words of mean length ~ 5.12 bytes
- 262144 lines of mean length ~ 128.64 bytes
Benchmarking on entire dataset:
- str::transform<lookup>                    2.6494 GB/s      25329887.2 ns          0 errors in        988 iterations                     
- str::transform<increment>                23.7150 GB/s       2829809.9 ns          0 errors in       8836 iterations                     
- sz_look_up_transform_serial               2.6069 GB/s      25742844.6 ns          0 errors in        972 iterations                     
- sz_look_up_transform_neon                 8.4908 GB/s       7903721.1 ns          0 errors in       3164 iterations

On the Leipzig1M dataset, LibC vs SZ:
- ~128-byte lines, aligned: 2.3 vs 2.6 GB/s
- ~128-byte lines, unaligned: 2.34 vs 2.53 GB/s
- ~5-byte tokens, aligned: 0.1 vs 0.1 GB/s
- ~5-byte tokens, unaligned: 0.1 vs 0.1 GB/s
- ~124 MB, aligned: 19.6 vs 20.3 GB/s
- ~124 MB, unaligned: 19.6 vs 20.3 GB/s

This commit accelerates `sz_fill_avx2` and `sz_copy_avx2` by avoiding unaligned writes. It also adds `sz_equal_avx2`, which speeds up validating large files with matching checksums, and a placeholder for `sz_order_avx2`, discouraging further optimizations. A C++ API with a matching argument order was added to mimic `std::memcpy`, `std::memset`, and `std::memmove`. The matching `test_memory_utilities` tests were extended.
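
The idea of avoiding unaligned writes can be sketched in portable C. This is only an illustration, not the code from this PR: the helper name `fill_aligned_sketch` is hypothetical, and 8-byte words stand in for the 32-byte AVX2 registers the real kernels would use. The head is written byte by byte until the pointer reaches an aligned address, the body uses wide stores at aligned addresses only, and the tail handles the leftovers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of StringZilla: fills `length` bytes at `target`
 * with `value`, issuing the wide stores only at 8-byte-aligned addresses. */
static void fill_aligned_sketch(unsigned char *target, size_t length, unsigned char value) {
    assert(target != NULL || length == 0);

    /* Head: byte-wise stores until the pointer reaches an 8-byte boundary. */
    while (length && ((uintptr_t)target & 7u)) {
        *target++ = value;
        --length;
    }

    /* Body: broadcast the byte into a 64-bit word and store it at aligned addresses. */
    uint64_t word = 0x0101010101010101ull * value;
    for (; length >= 8; target += 8, length -= 8) memcpy(target, &word, 8);

    /* Tail: the remaining 0..7 bytes. */
    for (; length; --length) *target++ = value;
}
```

Calling `fill_aligned_sketch(buffer + 3, 41, 0xAB)` on a byte buffer touches exactly bytes 3 through 43, whatever the alignment of `buffer + 3`.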

The new `sz_look_up_transform` API implements a 256-byte lookup table transform in serial code and AVX-512, and can significantly accelerate text and image processing. The AVX-512 implementation reaches 18 GB/s on an Intel Sapphire Rapids CPU, while the serial code stays around 3 GB/s for large files.
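
For readers unfamiliar with the operation, a serial look-up table transform can be sketched as follows. The function names `lut_transform_sketch` and `make_ascii_upper_lut` are hypothetical, chosen for illustration; this is not StringZilla's kernel or its exact signature. Every input byte is replaced by the table entry at that byte's value, so the serial version performs one dependent load per byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical serial sketch, not StringZilla's actual kernel:
 * maps every input byte through the 256-entry table `lut`. */
static void lut_transform_sketch(unsigned char const *source, size_t length,
                                 unsigned char const lut[256], unsigned char *target) {
    assert(lut != NULL);
    for (size_t i = 0; i < length; ++i) target[i] = lut[source[i]];
}

/* Builds a table that upper-cases ASCII letters and leaves all other bytes unchanged. */
static void make_ascii_upper_lut(unsigned char lut[256]) {
    for (int byte = 0; byte < 256; ++byte)
        lut[byte] = (unsigned char)((byte >= 'a' && byte <= 'z') ? byte - 32 : byte);
}
```

The same table-driven loop covers other byte-wise mappings, such as inverting image channels or remapping character sets, by swapping the table contents.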

ashvardanian and others added 15 commits September 27, 2024 16:47

Docs: Advanced contribution examples

d04993d

Docs: Weaknesses of LibC

172bf93

Fix: Missing, but documented partition(':')

432fb3d

Closes #172 Co-authored-by: Takuya Hashimoto <toge@users.noreply.github.com>

Docs: Names of charset helpers

a0e9be7

Docs: Avoid AppleClang

b5fcc62

Merge branch 'main-dev' of https://github.com/ashvardanian/StringZilla …

ee6f754

…into main-dev

Fix: Invoking wrong view constructor

97cf753

Make: -mfloat-abi=softfp for NEON

224a3a0

Make: Target arch=armv8.2-a+simd for NEON

97535bc

Fix: rfind_charset_avx2 compile-time dispatch

e4e138c

Merge branch 'main-dev' of https://github.com/ashvardanian/StringZilla …

a265d3b

…into main-dev

Improve: SZ-specific breakpoints

36df73d

Make: Lighter debugging in VS Code

5d522cf

Previously SZ would build too many targets for each debugging session.

Improve: Extend fuzzy testing

5388ab4

ashvardanian changed the title ~~Extending C++ API~~ StringZilla v4 Oct 1, 2024

ashvardanian changed the title ~~StringZilla v4~~ StringZilla v4 🦖🦖🦖🦖 Oct 1, 2024

ashvardanian added 8 commits October 1, 2024 22:28

Fix: Head and tail slicing in AVX-512

6d326d9

Improve: Extend drafts & exclude from bench

a383e9e

Add: STL-compatible memory APIs

69060ac

Improve: Using more registers for small moves

696797d

In AVX-512, similar to GLibC, we should use the register space to load more data simultaneously, avoiding loops and data dependencies between iterations.
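
A scaled-down sketch of that idea, assuming 64-bit words in place of AVX-512 registers: for any length between 8 and 16 bytes, two possibly overlapping loads cover the whole input, and both finish before the first store. The helper name `move_8_to_16_sketch` is hypothetical and this is not the code from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: moves 8..16 bytes without a loop.
 * The two 8-byte windows [0, 8) and [length - 8, length) cover every byte. */
static void move_8_to_16_sketch(unsigned char *target, unsigned char const *source, size_t length) {
    assert(length >= 8 && length <= 16);
    uint64_t head, tail;

    /* Both loads complete before any store, so overlapping ranges are handled
     * the way `memmove` would handle them. */
    memcpy(&head, source, 8);
    memcpy(&tail, source + length - 8, 8);

    memcpy(target, &head, 8);
    memcpy(target + length - 8, &tail, 8);
}
```

With wider registers the same pattern extends to larger size classes, and because there is no loop, no iteration waits on the result of a previous one.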

Docs: Describe memory algorithms

e2f8cc7

Add: SVE kernels

02b9d68

Merge branch 'main-dev' of https://github.com/ashvardanian/StringZilla …

bba72a6

…into main-dev

ashvardanian changed the title ~~StringZilla v4 🦖🦖🦖🦖~~ Improved Memory Operations Oct 11, 2024

ashvardanian added 4 commits October 11, 2024 21:47

Add: sz_look_up_transform_avx2

014bcf2

Fix: Missing NEON kernels & casts

26a0fea

Fix: Missing std::aligned_alloc on Win32

423ad99

ashvardanian added 4 commits October 12, 2024 17:33

Fix: NEON cast

11272e5

Fix: Revert AVX-512 override

4d8ac78

Fix: Type casting warnings & free

82146b0

Fix: Explicit simsimd_size_t cast in SVE

165986f

ashvardanian force-pushed the main-dev branch from 6dacbb2 to 165986f Compare October 12, 2024 18:28

ashvardanian added 3 commits October 12, 2024 19:40

Fix: sz_move_neon order

be6c93b

Add: sz_look_up_transform_neon

3898481

Fix: Missing size_t SVE overload

1baa3a9

ashvardanian force-pushed the main-dev branch from 45dd093 to 1baa3a9 Compare October 12, 2024 21:00

Fix: -mfloat-abi=hard for Clang

1db702a

ashvardanian force-pushed the main-dev branch 3 times, most recently from 3d20005 to fb06b66 Compare October 12, 2024 21:27

ashvardanian added 3 commits October 12, 2024 16:31

Make: Override MACOSX_DEPLOYMENT_TARGET

5c1426f

Fix: Reorder tests

78937f9

Make: Hard-code GitHub CI OS versions

c0c1dcb

ashvardanian merged commit 87fae70 into main Oct 12, 2024
11 of 13 checks passed
