
Performance on Different Systems #3

Open · ilan-gold opened this issue Sep 26, 2024 · 8 comments

@ilan-gold (Owner)

The current benchmark on my Mac differs wildly from the one on Linux... not much more to say. A lot of users are on Mac, so it would be great to understand this.

@LDeakin (Collaborator) commented Sep 28, 2024

Below are some benchmarks on my system. The memory usage of zarrs-python is curious.

Read all

[image]

Chunk by chunk

[image]

| Image | Concurrency | zarrs (rust) time (s) | tensorstore (python) time (s) | zarr (python) time (s) | zarrs (python) time (s) | zarrs (rust) memory (GB) | tensorstore (python) memory (GB) | zarr (python) memory (GB) | zarrs (python) memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| data/benchmark.zarr | 1 | 28.98 | 53.38 | 88.01 | 59.04 | 0.03 | 0.10 | 0.10 | 0.12 |
| data/benchmark.zarr | 2 | 14.77 | 30.28 | 75.53 | 46.94 | 0.03 | 0.31 | 0.31 | 8.72 |
| data/benchmark.zarr | 4 | 8.19 | 23.61 | 73.30 | 47.39 | 0.03 | 0.31 | 0.31 | 8.72 |
| data/benchmark.zarr | 8 | 4.35 | 22.59 | 70.71 | 49.45 | 0.03 | 0.31 | 0.32 | 8.72 |
| data/benchmark.zarr | 16 | 2.82 | 21.37 | 62.97 | 48.89 | 0.03 | 0.34 | 0.32 | 8.72 |
| data/benchmark.zarr | 32 | 2.48 | 19.34 | 58.23 | 47.42 | 0.03 | 0.34 | 0.32 | 8.72 |
| data/benchmark_compress.zarr | 1 | 22.88 | 47.06 | 101.05 | 51.15 | 0.03 | 0.10 | 0.13 | 0.12 |
| data/benchmark_compress.zarr | 2 | 12.57 | 28.03 | 94.76 | 38.57 | 0.03 | 0.32 | 0.34 | 8.72 |
| data/benchmark_compress.zarr | 4 | 7.03 | 23.53 | 95.64 | 38.65 | 0.03 | 0.32 | 0.34 | 8.71 |
| data/benchmark_compress.zarr | 8 | 3.92 | 21.15 | 84.48 | 37.77 | 0.03 | 0.32 | 0.34 | 8.72 |
| data/benchmark_compress.zarr | 16 | 2.26 | 19.33 | 77.08 | 39.26 | 0.04 | 0.34 | 0.34 | 8.72 |
| data/benchmark_compress.zarr | 32 | 2.05 | 17.30 | 70.61 | 38.28 | 0.04 | 0.35 | 0.35 | 8.71 |
| data/benchmark_compress_shard.zarr | 1 | 2.17 | 2.73 | 33.60 | 3.37 | 0.37 | 0.60 | 0.89 | 0.68 |
| data/benchmark_compress_shard.zarr | 2 | 1.62 | 2.26 | 28.78 | 3.67 | 0.70 | 0.90 | 1.40 | 8.81 |
| data/benchmark_compress_shard.zarr | 4 | 1.39 | 2.04 | 28.45 | 3.71 | 1.30 | 1.07 | 2.43 | 8.80 |
| data/benchmark_compress_shard.zarr | 8 | 1.35 | 1.93 | 27.81 | 3.60 | 2.36 | 1.43 | 4.72 | 8.81 |
| data/benchmark_compress_shard.zarr | 16 | 1.44 | 2.69 | 27.68 | 3.42 | 4.43 | 1.74 | 9.27 | 8.80 |
| data/benchmark_compress_shard.zarr | 32 | 2.07 | 2.20 | 31.37 | 3.41 | 6.66 | 2.94 | 18.41 | 8.81 |

@ilan-gold (Owner, Author)

@LDeakin that's quite in line with what I saw on my Mac before the security shutdown. The read-all results made sense intuitively (Rust plus a bit of overhead); the chunk-by-chunk results were tougher to pin down, so it's good to see they're reproducible. Thanks so much for this. I think the memory usage and flat performance are explained by the fact that I'm hoovering up all available threads by default. I think there's an issue for making this configurable? I'm not sure what the best way is, though: an environment variable or an API. An API is tough because the current Rust + Python bridge doesn't have a public one.
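As a rough sketch of the environment-variable route, assuming the Rust side keeps using a global Rayon pool (the variable name `ZARRS_PYTHON_NUM_THREADS` and the setup function are made up for illustration, not current code):

```rust
use std::env;

/// Hypothetical one-time setup: cap the global Rayon pool from an environment
/// variable instead of defaulting to every available core.
fn init_thread_pool() -> Result<(), rayon::ThreadPoolBuildError> {
    let num_threads = env::var("ZARRS_PYTHON_NUM_THREADS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(0); // 0 lets Rayon fall back to its default (all logical cores)

    rayon::ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .build_global()
}
```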

@ilan-gold (Owner, Author)

@LDeakin I'm looking into the parallelism a bit on my end. We are basically following their directions to a T, at least on the Rust side, as far as performance is concerned: https://pyo3.rs/v0.22.2/parallelism
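For anyone following along, the core pattern on that page is releasing the GIL around the heavy Rust work, roughly like this (a sketch only, with a made-up `decode_chunk` standing in for the real decode path, not the actual zarrs-python code):

```rust
use pyo3::prelude::*;

// Stand-in for the real Rust-side decode; purely illustrative.
fn decode_chunk(encoded: &[u8]) -> Vec<u8> {
    encoded.to_vec()
}

/// Release the GIL while the decode runs, per https://pyo3.rs/v0.22.2/parallelism,
/// so other Python threads can keep issuing requests in the meantime.
#[pyfunction]
fn retrieve_chunk(py: Python<'_>, encoded: Vec<u8>) -> PyResult<Vec<u8>> {
    Ok(py.allow_threads(|| decode_chunk(&encoded)))
}
```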

Something that pops out to me: any thoughts on why the sharding might be so performant across the board for the compiled stuff?

@ilan-gold (Owner, Author)

Re: the above link, it could also be some overhead from async plus careless holding of the GIL, as Phil pointed out. Maybe we could release the GIL and allow "true" Python-level threading as the example shows? That doesn't really account for the concurrent_chunks=1 difference though... so I'd guess our overhead is in the extraction of the Python types. I read that declaring types ahead of time can be a boost, so it might be worth trying that.
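To make the "declaring types ahead of time" idea concrete, a small illustrative example (not zarrs-python code; `sum_untyped`/`sum_typed` are made up, and whether this actually moves the needle here is untested): pyo3 can convert arguments during argument handling when the signature names a concrete Rust type, instead of extracting from an untyped object inside the function body.

```rust
use pyo3::prelude::*;

// Conversion happens inside the body, via extract(), on every call.
#[pyfunction]
fn sum_untyped(obj: &Bound<'_, PyAny>) -> PyResult<u64> {
    let values: Vec<u64> = obj.extract()?;
    Ok(values.iter().sum())
}

// The concrete type is declared up front; pyo3 converts during argument
// handling and the body only sees plain Rust data.
#[pyfunction]
fn sum_typed(values: Vec<u64>) -> u64 {
    values.iter().sum()
}
```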

@ilan-gold (Owner, Author) commented Oct 2, 2024

Although that then doesn't account for the sharding, which might not be working at all... I don't think I accounted for it, so it's possible my code is erroring silently, unless the code I have somehow handles it.

@LDeakin (Collaborator) commented Oct 2, 2024

Something that pops out to me: any thoughts on why the sharding might be so performant across the board for the compiled stuff?

There are two areas where parallelism can be applied, in codecs and across chunks. Both zarrs and tensorstore use all available cores (where possible/efficient) by default. That chunk-by-chunk benchmark limits the number of chunks decoded concurrently, but still uses all available threads for decoding.

Sharding is an example of a codec extremely well-suited to parallel encoding/decoding. That benchmark has many "inner chunks" per "shard" (chunk), so the cores are getting well utilised by the compiled implementations even if only decoding 1 chunk at a time. I'd assume zarr-python sharding is single-threaded.
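Roughly, the inner-chunk parallelism looks like this (a pure-Rayon sketch; `decode_inner_chunk` and the in-memory shard layout are hypothetical, not the zarrs implementation):

```rust
use rayon::prelude::*;

// Hypothetical stand-in for decoding one inner chunk of a shard.
fn decode_inner_chunk(encoded: &[u8]) -> Vec<f32> {
    encoded.iter().map(|&b| f32::from(b)).collect()
}

/// Even when only one shard (chunk) is retrieved at a time, its many inner
/// chunks can be decoded across all cores.
fn decode_shard(inner_chunks: &[Vec<u8>]) -> Vec<Vec<f32>> {
    inner_chunks
        .par_iter()
        .map(|chunk| decode_inner_chunk(chunk))
        .collect()
}
```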

Relevant documentation for zarrs:

If parallelism is external to zarrs (e.g. multiple concurrent Array::retrieve_ ops), it would be preferable to reduce the concurrent target to avoid potential thrashing. This can be done for individual retrieve/store operations with CodecOptions or by setting the global configuration.
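Concretely, the two knobs described there look roughly like this (the method names below are assumptions drawn from the quoted documentation; check the zarrs API docs for the version in use):

```rust
// Sketch only: method names are assumed from the documentation quote above
// and may not match the zarrs version in use.
use zarrs::array::codec::CodecOptions;
use zarrs::config::global_config_mut;

fn limit_internal_parallelism() {
    // Global configuration: reduce the per-operation codec concurrency target
    // when parallelism is applied externally across chunks.
    global_config_mut().set_codec_concurrent_target(4);

    // Per-operation: build CodecOptions and pass them to the *_opt variants of
    // the retrieve/store methods, e.g. Array::retrieve_chunk_opt.
    let mut options = CodecOptions::default();
    options.set_concurrent_target(4);
}
```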

@LDeakin (Collaborator) commented Oct 2, 2024

Also looking at that benchmark again, do you have an idea of where the large allocation (8.7GB) is occurring in zarrs-python when concurrent chunks > 1?

@LDeakin (Collaborator) commented Oct 2, 2024

it would be preferable to reduce the concurrent target to avoid potential thrashing

Quoting myself... but thrashing is not the right term here. That does not really happen with Rayon work stealing. It is more just a suboptimal work distribution. Defaults might be perfectly okay!
