Performance on Different Systems #3
Below are some benchmarks on my system.

[Benchmark table comparing memory usage of "Read all" vs "Chunk by chunk" did not survive extraction.]
@LDeakin that's quite in line with what I saw pre-security-shutdown on my Mac. The read-all made sense intuitively (Rust plus a bit of overhead); the chunk-by-chunk was tougher to pin down, so it's good to see it's reproducible. Thanks so much for this. I think the memory usage / flat performance is accounted for by the fact that I'm hoovering up all available threads by default. I think there's an issue for making this configurable? I'm not sure what the best way is, though: an env variable or an API. An API is tough because the current Rust + Python bridge doesn't have a public API.
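One low-friction option while the bridge has no public API: read a thread-count override from the environment and fall back to all cores (the current default). A minimal sketch; the variable name `ZARRS_NUM_THREADS` is hypothetical, not an existing setting:

```python
import os

def num_threads(default=None):
    # Hypothetical env-var override; ZARRS_NUM_THREADS is an assumed name.
    value = os.environ.get("ZARRS_NUM_THREADS")
    if value is not None:
        return max(1, int(value))
    # Fall back to "hoover up all available threads", as described above.
    return default if default is not None else os.cpu_count()
```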
@LDeakin looking into the parallelism a bit on my end. We are basically following their directions to a T, at least on the Rust side, if performance is our concern: https://pyo3.rs/v0.22.2/parallelism Something that pops out to me: any thoughts on why the sharding might be so performant across the board for the compiled stuff?
Re: the above link, it could also be some overhead of async + careless holding of the GIL, as Phil pointed out. Maybe we could release the GIL and allow "true" Python-level threading as the example shows? That doesn't really account for the …
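For illustration of the "true Python-level threading" idea from the pyo3 page, here's a plain-Python analogy (not the actual bridge code): `zlib` stands in for the compiled codec because, like a Rust extension wrapped in `allow_threads`, it releases the GIL while it works, so a thread pool can decode chunks concurrently.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Synthetic chunk data; each "chunk" is the same compressed payload.
data = bytes(range(256)) * 256
chunks = [zlib.compress(data) for _ in range(8)]

def decode(chunk: bytes) -> bytes:
    # The GIL is released inside the C decompress call, so threads overlap.
    return zlib.decompress(chunk)

with ThreadPoolExecutor(max_workers=4) as pool:
    decoded = list(pool.map(decode, chunks))
```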
Although that then doesn't account for the sharding, though that might not be working at all... I don't think I accounted for it, so it's possible my code is erroring silently, unless the code I have somehow accounts for it.
There are two areas where parallelism can be applied: in codecs and across chunks. Sharding is an example of a codec extremely well-suited to parallel encoding/decoding. That benchmark has many "inner chunks" per "shard" (chunk), so the cores are getting well utilised by the compiled implementations even if only decoding one chunk at a time. I'd assume … Relevant documentation for …
If parallelism is external to …
Also, looking at that benchmark again, do you have an idea of where the large allocation (8.7 GB) is occurring in …?
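One way to chase down a large allocation like that from the Python side is `tracemalloc`: snapshot around the read and list the biggest allocation sites (it only sees Python-level allocations, so Rust-side memory won't show up). `read_all` here is a placeholder for whatever benchmark call is under test:

```python
import tracemalloc

def read_all():
    # Placeholder for the real benchmark read; allocates ~10 MB.
    return bytearray(10_000_000)

tracemalloc.start()
data = read_all()
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Largest Python-level allocation sites, biggest first.
top = snapshot.statistics("lineno")[:3]
```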
Quoting myself... but "thrashing" is not the right term here. That does not really happen with Rayon work stealing. It is more just a suboptimal work distribution. Defaults might be perfectly okay!
The current benchmark on my Mac differs wildly from that on Linux... not much more to say. A lot of users are on Mac; it would be great to understand this.