-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate why Solana is so slow. #6
Comments
Here is the generated index file for one of the particularly problematic crates that was filtered out. |
Focusing in on crates that had a high ratio of runtime to Edit: solana-transaction-status@1.3.17 was fixed with 7d6a3f0 |
It seems that there is a lot a backtracking in the cargo resolver for specific versions of After checking the dependencies, I can see why:
Since the resolver tries to get the most recent compatible version if possible, if we set For versions |
I guess I'm missing a detail, for versions The general question is can we find a smaller example that has the same reason for being slow? The easiest to debug are samples where every crate has a regular structure, and the overall runtime is exponential (or at least cubic) in the number of crates/versions. But those are hard to create from real-world examples. I've also been able to find the problem by identifying a small change that makes a big difference. Like "removing this one feature of one dependency of one crate makes things twice as fast" help me track down the solana-transaction-status@1.3.17 problem. Looking at crates with a high |
By looking at the logs using Here is a test showing this behavior: #[test]
fn test() {
let mut reg = Vec::new();
const V: usize = 100;
reg.push(pkg_dep_with(("cc", "1.0.0"), vec![], &[("f", &[])]));
for v in 1..V {
reg.push(pkg(("cc", format!("1.0.{v}"))));
}
reg.push(pkg_dep(
("solana-core", "1.0.0"),
vec![
dep("cc"), // resolve is instantaneous without this dependency
dep_req("solana-1", "1.0.0"),
],
));
const N: usize = 100;
const C: usize = 3;
for v in 0..N {
let version = format!("1.0.{v}");
for c in 1..C {
reg.push(pkg_dep(
(format!("solana-{c}"), &version),
vec![dep_req(&format!("solana-{}", c + 1), &version)],
));
}
reg.push(pkg_dep(
(format!("solana-{C}"), &version),
vec![dep_req("cc", "1.0.0").with(&["f"])],
));
}
let deps = vec![dep("solana-core")];
let res = resolve(deps, ®).unwrap();
println!("{:?}", res);
} Here is the results with various values of With We can see that when
When plugging these test cases in the SAT resolver, there is no slowdown at all: a solution is always found in less than 10ms. Another thing: |
When removing the |
(Note: I tried to run the sat resolver using infos from the |
After more tests, it seems that the threshold occurs when So in the example above, setting |
That is very intriguing results! I wonder if similar slowdowns happen with the PubGrub based resolver from this crate.
It is really unfortunate that we have two different testing infrastructures one here and one in Cargo's repo. I wonder how we could bring them closer to being inter-compatible.
There are various heuristics that use "smallest number of candidates" to decide which requirements to resolve first. I wonder if that's why this cut off occurs. |
After a quick perf, 70% of the time is spent hashing in this structure: |
I like to collect perf results with https://github.com/mstange/samply so that the results can be shared with just a few clicks. |
Here is two |
That seems like an easy win! I look forward to your a PR to Cargo. |
For this test it takes less than 2s to resolve using the PubGrub resolver with But I observe the same threshold when |
I don't know if this is helpful, but I just ran this project for the Solana crates. out_solana.zip here's a CSV of the output. There may be valuable insight by comparing consecutive versions of crates. For example:
If you can push the |
Here is my branch: https://github.com/x-hgg-x/cargo/tree/resolver-perf
Firefox perf profiles: perf.tar.gz |
Running all crates in the index went from |
Versions between (Note: I needed to write a small Rust program to sort the csv by increasing name and version to better analyze it, so it should be done before writing the csv.) |
|
Improve resolver speed ### What does this PR try to resolve? This PR improves the resolver speed after investigations from Eh2406/pubgrub-crates-benchmark#6. ### How should we test and review this PR? Commit 1 adds a test showing a slow case in the resolver, where resolving time is cubic over the number of versions of a crate. It can be used for benchmarking the resolver. Comparison of the resolving time with various values of `N=LAST_CRATE_VERSION_COUNT` and `C=TRANSITIVE_CRATES_COUNT`: | | N=100 | N=200 | N=400 | |------|-------|-------|-------| | C=1 | 0.25s | 0.5s | 1.4s | | C=2 | 7s | 44s | 314s | | C=3 | 12s | 77s | 537s | | C=6 | 30s | 149s | 1107s | | C=12 | 99s | 447s | 2393s | Commit 2 replaces the default hasher with the hasher from `rustc-hash`, decreasing resolving time by more than 50%. Performance comparison with the test from commit 1 by setting `LAST_CRATE_VERSION_COUNT = 100`: | commit | duration | |-----------------|----------| | master | 16s | | with rustc-hash | 6.7s | Firefox profiles, can be read with https://profiler.firefox.com: [perf.tar.gz](https://github.com/user-attachments/files/17318243/perf.tar.gz) r? Eh2406
Improve resolver speed ### What does this PR try to resolve? This PR improves the resolver speed after investigations from Eh2406/pubgrub-crates-benchmark#6. ### How should we test and review this PR? Commit 1 adds a test showing a slow case in the resolver, where resolving time is cubic over the number of versions of a crate. It can be used for benchmarking the resolver. Comparison of the resolving time with various values of `N=LAST_CRATE_VERSION_COUNT` and `C=TRANSITIVE_CRATES_COUNT`: | | N=100 | N=200 | N=400 | |------|-------|-------|-------| | C=1 | 0.25s | 0.5s | 1.4s | | C=2 | 7s | 44s | 314s | | C=3 | 12s | 77s | 537s | | C=6 | 30s | 149s | 1107s | | C=12 | 99s | 447s | 2393s | Commit 2 replaces the default hasher with the hasher from `rustc-hash`, decreasing resolving time by more than 50%. Performance comparison with the test from commit 1 by setting `LAST_CRATE_VERSION_COUNT = 100`: | commit | duration | |-----------------|----------| | master | 16s | | with rustc-hash | 6.7s | Firefox profiles, can be read with https://profiler.firefox.com: [perf.tar.gz](https://github.com/user-attachments/files/17318243/perf.tar.gz) r? Eh2406
For This may explain the long time spent dropping the |
By moving the conflict cache out of the loop, the resolving time for |
And the resolving time for |
The loop has to do with requesting data from the network. By processing all the existing data it determines what new data is required. Then it waits for that data to be available and repeats. This should all be deterministic. Whether the conflict cash can be shared will require more thought than I can give it at the moment. It has to do with whether new facts could ever invalidate a conflict that sorry been registered. (I think you're right that it can be shared, but I'm not confident.) I don't see how it could affect the long dropping time for And it definitely is not related to the slow times were observing when running this tool, because this tool peloads all of the data and does not do that loop. |
The
Here is two consecutive runs for Run 1 (236s):
Run 2 (310s):
Yes, so I should use this repo to do my benchmarks. |
Ah, yes that makes sense. Thanks.
One thing we observed in this repo is that using
There's definitely non-determinism. When looking at individual packages, I've been observing it too. I'm not convinced that the restarts are the cause of the non-determinism, or if there's another root cause we should be looking for.
I didn't mean to go that far. On one hand, this repo is set up to make it |
There is an option in rustc to enable
There is at least a difference between a fully offline registry where everything is always ready and the online registry which needs to download from
It's all good, I also think that both approaches have their own advantages. |
I have found the non-determinist bug: this code is incorrect:
We can see it in the above log:
|
So that's what used for the compiler and LLVM. The next question is what used for cargo? Which appears to be the system default allocator https://github.com/search?q=repo%3Arust-lang%2Fcargo%20global_allocator&type=code We should try jemalloc and see if it helps!
That's very intriguing. There are several layers of caching that might explain this difference. The most direct is the reuse of the RegistryQueryer, but also registries themselves cash reading from disk. I don't know if that's enough to explain a 2x difference.
In my understanding, the final loop of resolution and the only loop when off-line should be identical. Turning on logging and comparing the output between the two different scenarios should make this clear. ( Or adding debug statements that print out the precisely relevant info if the logs are too verbose. ) If they're not the same, I would really love to understand why! By definition, it's the final loop because nothing returned But the more I talk about it, the more I'm convinced that you are right about being able to share the conflict cash. The code is set up so that if an incremental resolution fails then we know that waiting for the network will not cause it to succeed, we can report it to the user as soon as we find the error. Which is incompatible with the conflict cash storing a problem that would be invalidated by future network data. |
Here is the conflict cache logs for We can see that iteration 4 to 8 are the same using the online registry, and the conflict cache using the offline registry is much bigger than for the iteration 8. Here is the conflict cache log when sharing the conflict cache between iterations: |
That is so interesting! I wonder why the last loop didn't match the full run. It's always so intriguing to discover one's knowledge is incomplete. Thank you! I wonder where the two runs first diverge? Not just for the conflict clauses they had, but for every decision they make. The fact that for 4 iterations we generate the same conflicts, definitely backs up your measured evidence of huge performance savings by sharing it. Add that to your list of future PR's! Due to the holiday, I will probably have limited availability to converse on the subject, despite its deep interest to me, until Monday. |
After analysis the difference is that adding |
When using an online registry, deps which are not yet fetched are not added to the candidate summary: On the other hand, if a dependency query returns This means we cannot add a conflict in the cache which would be incompatible with future fetches, and therefore the conflict cache should be able to be shared. Comparing to an offline registry or the last iteration for an online registry, doing the resolve with a shared conflict cache and an online registry makes the order of operations slightly differs: the content of the cache is different but the resolve result after the last iteration is the same. Here is the sorted conflict cache logs for when sharing it: shared_sorted.log |
Here is the perf profiles for 4 profiles are compared:
Output of
We see that the next thing to optimize will be the hashing of |
I am convinced. Looking forward to seeing a PR for this. Looks like |
fix(resolver): share conflict cache between activation retries ### What does this PR try to resolve? Found in Eh2406/pubgrub-crates-benchmark#6. Currently the resolve is done in a loop, restarting if there are pending dependencies which are not yet fetched, and recreating the conflict cache at each iteration. This means we do a lot of duplicate work, especially if the newly fetched dependencies has zero or few dependencies which doesn't conflict. This also means that the maximum depth of the dependency graph has a big impact on the total resolving time, since adding a dependency not yet seen means we will restart the resolve at the next iteration. Here is the output of the resolve for the master branch using `solana-core = "=1.2.18"` in `Cargo.toml`, printing the duration from the start after each iteration: ``` Updating crates.io index 103.142341ms [=====> ] 1 complete; 0 pending 118.486144ms [========> ] 2 complete; 0 pending 785.389055ms [===========> ] 62 complete; 0 pending 4.916033377s [==============> ] 277 complete; 0 pending 9.13101404s [=================> ] 439 complete; 0 pending 50.083251549s [====================> ] 520 complete; 0 pending 133.401303265s [=======================> ] 561 complete; 0 pending 214.120483345s [==========================> ] 565 complete; 0 pending 294.180677785s [=============================> ] 566 complete; 0 pending Locking 536 packages to latest compatible versions ``` To improve the situation, this PR moves the conflict cache outside the activation loop so it can be reused for all iterations. This is possible since when using an online registry, dependencies which are not yet fetched are not added to the candidate summary: https://github.com/rust-lang/cargo/blob/82c489f1c612096c2dffc48a4091d21af4b8e046/src/cargo/core/resolver/dep_cache.rs#L248-L266 On the other hand, if a dependency query returns `Poll::Ready`, then all compatible summaries are returned, so we cannot have a partial view where only some compatible summaries would be returned. This means we cannot add a conflict in the cache which would be incompatible with future fetches, and therefore the conflict cache can be shared. Here is the output of the resolve with this PR: ``` Updating crates.io index 98.239126ms [=====> ] 1 complete; 0 pending 127.528982ms [========> ] 2 complete; 0 pending 821.601257ms [===========> ] 62 complete; 0 pending 4.67917132s [==============> ] 277 complete; 0 pending 5.933230172s [=================> ] 431 complete; 0 pending 14.74321904s [====================> ] 508 complete; 0 pending 24.607428893s [=======================> ] 546 complete; 0 pending 24.700610469s [==========================> ] 550 complete; 0 pending 24.757651875s [=============================> ] 551 complete; 0 pending Locking 536 packages to latest compatible versions ``` Besides the huge performance savings between iterations, sharing the conflict cache has the side effect of eliminating dead paths earlier in the resolve after a restart. We can see that the duration of all iterations using this PR is less than the duration of the last iteration on master, and that it needs to resolve less dependencies overall.
fix(resolver): avoid cloning when iterating using RcVecIter ### What does this PR try to resolve? Follow up of Eh2406/pubgrub-crates-benchmark#6 (comment). This PR modifies the internal `RcVecIter` struct so that we can iterate on its items without cloning them. This improves performance of the resolver because eagerly cloning `Arc<T>` is not free. Comparison of performance using `solana-core = "=1.0.5"` in `Cargo.toml`: | branch | duration | |------------|----------| | master | 213s | | PR | 202s | r? Eh2406
Thank you for all the work you've done so far. It's made a dramatic improvement to the real-world performance for users of cargo. The Jewish holy days come altogether this time of year. I will be unable to respond until Monday. I promise, it's almost over just this week and next week, and then I'll be back to real availability. In addition to my lack of availability, Eric and Ed also have limited availability for a while. (Vacation and a conference in China respectively.) You seem to be coming up with ways to improve things on your own, I'm excited to see what else you come up with. If you're looking for things to do, the next step on the SAT tests would be adding features to PropTest. Alternatively, you could investigate the bug you made such a clear report for. If you want to stay focused on performance, I had a thought to reduce the drop time of
In this issue before we have also had some other leads. Which you could follow-up on if you wanted to. Like trying to create a minimal example for why some Solana crates take >20 sec on the PubGrub resolver. Or trying to improve interoperability between the testing infrastructure here and in the Cargo repo. Anyway just some ideas and encouragement. |
I experimented with the I have done two main improvements:
By doing all this, there is almost no remaining allocations in the resolver hot loop, so I expect the resolving time to be divided by 5 to 10. Flamegraph before (total time 30s): flamegraph.1.svg.tar.gz Flamegraph when using u32 as package type (total time 15s): flamegraph.2.svg.tar.gz Flamegraph when using u64 as version set: TODO |
During this process I also found how to implement weak dependencies in pubgrub. We need to lower this clause: For package
|
Thank you for spiking those out! I'm still hoping to get to an understanding of why these versions are algorithmically so much slower than similar nearby versions. In the meantime I'm happy to see any kind of speed up the comes along! I did not have a chance to fully understand your fork. But the first suggestion sounds similar to pubgrub-rs/pubgrub#242. Similarly, at one time I had a branch that did thread local interning in this project. Given that your seeing a 50% improvement, I should really rebased both of those efforts. Using a bitset as a |
Yes, but processing the requirement once par package per call to |
These two lines seem incorrect: They cause the following tests to fail: #[test]
fn test_is_disjoint() {
let p1 = Term::Positive(Range::<u32>::higher_than(20u32));
let n1 = Term::Negative(p1.unwrap_positive().complement());
let p2 = Term::Positive(Range::<u32>::lower_than(10u32));
let n2 = Term::Negative(p2.unwrap_positive().complement());
assert!(p1.is_disjoint(&p2)); // true
assert!(p1.is_disjoint(&n2)); // true
assert!(n1.is_disjoint(&p2)); // true
assert!(n1.is_disjoint(&n2)); // false
}
#[test]
fn test_subset_of() {
let p1 = Term::Positive(Range::<u32>::higher_than(20u32));
let n1 = Term::Negative(p1.unwrap_positive().complement());
let p2 = Term::Positive(Range::<u32>::higher_than(10u32));
let n2 = Term::Negative(p2.unwrap_positive().complement());
assert!(p1.subset_of(&p2)); // true
assert!(p1.subset_of(&n2)); // true
assert!(n1.subset_of(&p2)); // false
assert!(n1.subset_of(&n2)); // true
} As the docs, But there is also a special handling when the term is positive: https://github.com/pubgrub-rs/pubgrub/blob/be526674f1039309be04da191abd822a7d00bfeb/src/internal/partial_solution.rs#L246 Can |
Sorry for the slow reply. The holidays are over so hopefully I can be more prompt going forward.
Everyone has been confused by this when learning the PubGrub code, myself included. The rest of the paragraph you linked to tries to explain when |
I made a PR to improve documentation: pubgrub-rs/pubgrub#264. |
I implemented multiple optimizations in pubgrub in https://github.com/x-hgg-x/pubgrub/tree/perf (including the two above optimizations), and I applied it for cargo resolution in a new repo: https://github.com/x-hgg-x/pubgrub-bench. Resolution time for
Sorted list of packages in the solution: solution.txt.tar.gz |
Associated PR: pubgrub-rs/pubgrub#274 |
Currently the Solana crates are filtered out of our test suite here. For some reason they take an extraordinarily long time to resolve (both for the existing resolver and even more so for PubGrub). It is not clear why they are difficult. Presumably they have some share dependency with some pattern that the resolvers do not like. What dependency, what pattern, we do not know. Perhaps if we had a smaller example that still exemplified the problematic behavior it would be easier to see the problem.
The text was updated successfully, but these errors were encountered: