
Performance on Different Systems #3

Open · ilan-gold opened this issue Sep 26, 2024 · 8 comments

@ilan-gold (Owner)

The current benchmark on my Mac differs wildly from the one on Linux... not much more to say. A lot of users are on Mac, so it would be great to understand this.

@LDeakin (Collaborator) commented Sep 28, 2024

Below are some benchmarks on my system. The memory usage of zarrs-python is curious.

Read all

[image]

Chunk by chunk

[image]

| Image | Concurrency | zarrs (rust) time (s) | tensorstore (python) time (s) | zarr (python) time (s) | zarrs (python) time (s) | zarrs (rust) memory (GB) | tensorstore (python) memory (GB) | zarr (python) memory (GB) | zarrs (python) memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| data/benchmark.zarr | 1 | 28.98 | 53.38 | 88.01 | 59.04 | 0.03 | 0.10 | 0.10 | 0.12 |
| data/benchmark.zarr | 2 | 14.77 | 30.28 | 75.53 | 46.94 | 0.03 | 0.31 | 0.31 | 8.72 |
| data/benchmark.zarr | 4 | 8.19 | 23.61 | 73.30 | 47.39 | 0.03 | 0.31 | 0.31 | 8.72 |
| data/benchmark.zarr | 8 | 4.35 | 22.59 | 70.71 | 49.45 | 0.03 | 0.31 | 0.32 | 8.72 |
| data/benchmark.zarr | 16 | 2.82 | 21.37 | 62.97 | 48.89 | 0.03 | 0.34 | 0.32 | 8.72 |
| data/benchmark.zarr | 32 | 2.48 | 19.34 | 58.23 | 47.42 | 0.03 | 0.34 | 0.32 | 8.72 |
| data/benchmark_compress.zarr | 1 | 22.88 | 47.06 | 101.05 | 51.15 | 0.03 | 0.10 | 0.13 | 0.12 |
| data/benchmark_compress.zarr | 2 | 12.57 | 28.03 | 94.76 | 38.57 | 0.03 | 0.32 | 0.34 | 8.72 |
| data/benchmark_compress.zarr | 4 | 7.03 | 23.53 | 95.64 | 38.65 | 0.03 | 0.32 | 0.34 | 8.71 |
| data/benchmark_compress.zarr | 8 | 3.92 | 21.15 | 84.48 | 37.77 | 0.03 | 0.32 | 0.34 | 8.72 |
| data/benchmark_compress.zarr | 16 | 2.26 | 19.33 | 77.08 | 39.26 | 0.04 | 0.34 | 0.34 | 8.72 |
| data/benchmark_compress.zarr | 32 | 2.05 | 17.30 | 70.61 | 38.28 | 0.04 | 0.35 | 0.35 | 8.71 |
| data/benchmark_compress_shard.zarr | 1 | 2.17 | 2.73 | 33.60 | 3.37 | 0.37 | 0.60 | 0.89 | 0.68 |
| data/benchmark_compress_shard.zarr | 2 | 1.62 | 2.26 | 28.78 | 3.67 | 0.70 | 0.90 | 1.40 | 8.81 |
| data/benchmark_compress_shard.zarr | 4 | 1.39 | 2.04 | 28.45 | 3.71 | 1.30 | 1.07 | 2.43 | 8.80 |
| data/benchmark_compress_shard.zarr | 8 | 1.35 | 1.93 | 27.81 | 3.60 | 2.36 | 1.43 | 4.72 | 8.81 |
| data/benchmark_compress_shard.zarr | 16 | 1.44 | 2.69 | 27.68 | 3.42 | 4.43 | 1.74 | 9.27 | 8.80 |
| data/benchmark_compress_shard.zarr | 32 | 2.07 | 2.20 | 31.37 | 3.41 | 6.66 | 2.94 | 18.41 | 8.81 |

@ilan-gold (Owner, Author)

@LDeakin that's quite in line with what I saw on my Mac before the security shutdown. The read-all results made sense intuitively (Rust plus a bit of overhead); the chunk-by-chunk results were tougher to pin down, so it's good to see they're reproducible. Thanks so much for this. I think the memory usage and flat performance are explained by the fact that I'm hoovering up all available threads by default. I think there's an issue for making this configurable? I'm not sure what the best way is, though: an environment variable or an API. An API is tough because the current Rust + Python bridge doesn't have a public one.
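As a rough sketch of the environment-variable route, assuming the Rust side keeps using a global Rayon pool (the variable name `ZARRS_PYTHON_NUM_THREADS` and the setup function are made up for illustration, not current code):

```rust
use std::env;

/// Hypothetical one-time setup: cap the global Rayon pool from an environment
/// variable instead of defaulting to every available core.
fn init_thread_pool() -> Result<(), rayon::ThreadPoolBuildError> {
    let num_threads = env::var("ZARRS_PYTHON_NUM_THREADS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(0); // 0 lets Rayon fall back to its default (all logical cores)

    rayon::ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .build_global()
}
```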

@ilan-gold (Owner, Author)

@LDeakin I'm looking into the parallelism a bit on my end. We are basically following their directions to a T, at least on the Rust side, as far as performance is concerned: https://pyo3.rs/v0.22.2/parallelism
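For anyone following along, the core pattern on that page is releasing the GIL around the heavy Rust work, roughly like this (a sketch only, with a made-up `decode_chunk` standing in for the real decode path, not the actual zarrs-python code):

```rust
use pyo3::prelude::*;

// Stand-in for the real Rust-side decode; purely illustrative.
fn decode_chunk(encoded: &[u8]) -> Vec<u8> {
    encoded.to_vec()
}

/// Release the GIL while the decode runs, per https://pyo3.rs/v0.22.2/parallelism,
/// so other Python threads can keep issuing requests in the meantime.
#[pyfunction]
fn retrieve_chunk(py: Python<'_>, encoded: Vec<u8>) -> PyResult<Vec<u8>> {
    Ok(py.allow_threads(|| decode_chunk(&encoded)))
}
```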

Something that pops out to me: any thoughts on why the sharding might be so performant across the board for the compiled stuff?

@ilan-gold (Owner, Author)

Re: the above link, it could also be some overhead from async plus careless holding of the GIL, as Phil pointed out. Maybe we could release the GIL and allow "true" Python-level threading as the example shows? That doesn't really account for the concurrent_chunks=1 difference though... so I'd guess our overhead is in the extraction of the Python types. I read that declaring types ahead of time can be a boost, so it might be worth trying that.
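To make the "declaring types ahead of time" idea concrete, a small illustrative example (not zarrs-python code; `sum_untyped`/`sum_typed` are made up, and whether this actually moves the needle here is untested): pyo3 can convert arguments during argument handling when the signature names a concrete Rust type, instead of extracting from an untyped object inside the function body.

```rust
use pyo3::prelude::*;

// Conversion happens inside the body, via extract(), on every call.
#[pyfunction]
fn sum_untyped(obj: &Bound<'_, PyAny>) -> PyResult<u64> {
    let values: Vec<u64> = obj.extract()?;
    Ok(values.iter().sum())
}

// The concrete type is declared up front; pyo3 converts during argument
// handling and the body only sees plain Rust data.
#[pyfunction]
fn sum_typed(values: Vec<u64>) -> u64 {
    values.iter().sum()
}
```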

@ilan-gold (Owner, Author) commented Oct 2, 2024

Although that then doesn't account for the sharding, which might not be working at all... I don't think I accounted for it, so it's possible my code is erroring silently, unless the code I have somehow handles it.

@LDeakin (Collaborator) commented Oct 2, 2024

Something that pops out to me: any thoughts on why the sharding might be so performant across the board for the compiled stuff?

There are two areas where parallelism can be applied, in codecs and across chunks. Both zarrs and tensorstore use all available cores (where possible/efficient) by default. That chunk-by-chunk benchmark limits the number of chunks decoded concurrently, but still uses all available threads for decoding.

Sharding is an example of a codec extremely well-suited to parallel encoding/decoding. That benchmark has many "inner chunks" per "shard" (chunk), so the cores are getting well utilised by the compiled implementations even if only decoding 1 chunk at a time. I'd assume zarr-python sharding is single-threaded.
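Roughly, the inner-chunk parallelism looks like this (a pure-Rayon sketch; `decode_inner_chunk` and the in-memory shard layout are hypothetical, not the zarrs implementation):

```rust
use rayon::prelude::*;

// Hypothetical stand-in for decoding one inner chunk of a shard.
fn decode_inner_chunk(encoded: &[u8]) -> Vec<f32> {
    encoded.iter().map(|&b| f32::from(b)).collect()
}

/// Even when only one shard (chunk) is retrieved at a time, its many inner
/// chunks can be decoded across all cores.
fn decode_shard(inner_chunks: &[Vec<u8>]) -> Vec<Vec<f32>> {
    inner_chunks
        .par_iter()
        .map(|chunk| decode_inner_chunk(chunk))
        .collect()
}
```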

Relevant documentation for zarrs:

If parallelism is external to zarrs (e.g. multiple concurrent Array::retrieve_ ops), it would be preferable to reduce the concurrent target to avoid potential thrashing. This can be done for individual retrieve/store operations with CodecOptions or by setting the global configuration.
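Concretely, the two knobs described there look roughly like this (the method names below are assumptions drawn from the quoted documentation; check the zarrs API docs for the version in use):

```rust
// Sketch only: method names are assumed from the documentation quote above
// and may not match the zarrs version in use.
use zarrs::array::codec::CodecOptions;
use zarrs::config::global_config_mut;

fn limit_internal_parallelism() {
    // Global configuration: reduce the per-operation codec concurrency target
    // when parallelism is applied externally across chunks.
    global_config_mut().set_codec_concurrent_target(4);

    // Per-operation: build CodecOptions and pass them to the *_opt variants of
    // the retrieve/store methods, e.g. Array::retrieve_chunk_opt.
    let mut options = CodecOptions::default();
    options.set_concurrent_target(4);
}
```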

@LDeakin (Collaborator) commented Oct 2, 2024

Also looking at that benchmark again, do you have an idea of where the large allocation (8.7GB) is occurring in zarrs-python when concurrent chunks > 1?

@LDeakin (Collaborator) commented Oct 2, 2024

it would be preferable to reduce the concurrent target to avoid potential thrashing

Quoting myself... but thrashing is not the right term here. That does not really happen with Rayon work stealing. It is more just a suboptimal work distribution. Defaults might be perfectly okay!
