Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

state sync: improvements from mocknet testing #12507

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

saketh-are
Copy link
Collaborator

@saketh-are saketh-are commented Nov 24, 2024

This PR contains several fixes which improve the speed and robustness of state sync:

  • All part requests to peers are now made before all cloud attempts. Previously we focused on obtaining specific parts one by one, which could cause a thread to block for a long time until a particular part was uploaded to cloud storage. It takes tens of minutes after the epoch ends for dumper nodes to write all state parts to cloud storage, whereas peer hosts are ready to serve all requests as soon as the epoch ends.
  • Part request order is randomized at each syncing node, preventing spikes in demand to specific hosts.
  • Removes an unnecessary check for state headers when serving state parts. In some cases this was preventing peer hosts which do not track all shards from responding successfully to part requests.

Before these changes, it took up to 75 minutes for nodes to download parts for the largest shard (38.8 GiB in 1324 parts). After these changes:

  • Nodes consistently finish downloading parts in under 15 min,
  • State requests to peer hosts have a failure rate below 1%,
  • and 100% of parts are successfully obtained from peer hosts within three requests.
Screenshot 2024-11-24 at 7 17 39 AM

Additional minor improvements:

  • Adds a separate config parameter to specify how long to wait after a failed cloud download. This allows nodes to avoid spamming requests to cloud storage before parts have been uploaded.
  • Adds metrics recording the number of cache hits and misses when serving state part requests.
  • Distinguishes different types of errors collected in the near_state_sync_download_result metric.

Closes issues #12497, #12498, #12499

Once merged this PR should be cherry-picked into the 2.4 release. cc @staffik

Copy link

codecov bot commented Nov 24, 2024

Codecov Report

Attention: Patch coverage is 85.83333% with 17 lines in your changes missing coverage. Please review.

Project coverage is 69.97%. Comparing base (af6559a) to head (1dd1435).

Files with missing lines Patch % Lines
chain/chain/src/metrics.rs 50.00% 7 Missing ⚠️
chain/client/src/sync/state/network.rs 0.00% 4 Missing ⚠️
chain/client/src/sync/state/external.rs 84.21% 2 Missing and 1 partial ⚠️
chain/client/src/sync/state/downloader.rs 90.47% 1 Missing and 1 partial ⚠️
chain/chain/src/chain.rs 50.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #12507       +/-   ##
===========================================
+ Coverage    1.59%   69.97%   +68.38%     
===========================================
  Files         667      838      +171     
  Lines      118261   169224    +50963     
  Branches   118261   169224    +50963     
===========================================
+ Hits         1883   118412   +116529     
+ Misses     116276    45678    -70598     
- Partials      102     5134     +5032     
Flag Coverage Δ
backward-compatibility 0.16% <0.00%> (-0.01%) ⬇️
db-migration 0.16% <0.00%> (-0.01%) ⬇️
genesis-check 1.28% <5.08%> (+<0.01%) ⬆️
linux 69.26% <80.00%> (+67.67%) ⬆️
linux-nightly 69.52% <85.83%> (?)
macos 51.08% <12.71%> (?)
pytests 1.59% <5.08%> (+<0.01%) ⬆️
sanity-checks 1.40% <5.08%> (+<0.01%) ⬆️
unittests 69.79% <85.83%> (?)
upgradability 0.20% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@saketh-are saketh-are marked this pull request as ready for review November 24, 2024 17:41
@saketh-are saketh-are requested a review from a team as a code owner November 24, 2024 17:41
Copy link
Contributor

@VanBarbascu VanBarbascu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

}
res
}
.instrument(tracing::debug_span!("StateSyncDownloader::ensure_shard_part_downloaded"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change name of debug span

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants