
Fix OSS sealunwrapper adding extra get + put request to all storage get requests #29050

Merged (2 commits) on Nov 29, 2024

Conversation

andybao-dd (Contributor)

Problem

Starting after commit 0ec3524, for each GET request Vault makes to the storage backend, the OSS seal unwrapper will make 1 additional GET request plus 1 additional PUT request. This is because the seal unwrapper now tries to unwrap all storage entries, even if the entry was not wrapped in the first place! Prior to 0ec3524, there was a check to see if the storage entry was wrapped before proceeding with the unwrap procedure, which prevented this behavior.

The increased storage latency, writes, and traffic resulting from the extra requests cause noticeable performance degradation for some workloads. 0ec3524 was first included in Vault v1.15.0, and after upgrading from Vault v1.14.9 to v1.15.6, some of our Vault clusters would take 2-3x longer to start up.

Other users have also reported storage-related issues after upgrading to Vault v1.15+: #24726.

Description

This PR restores the check to prevent Vault from unwrapping a storage entry if the storage entry is not wrapped. This should eliminate the extra GET and PUT requests.
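
To illustrate the shape of the restored behavior, here is a rough sketch only: the type and helper names below (sealUnwrapper, sealWrapMarker, the prefix check) are made up for illustration, and the real logic in vault/sealunwrapper.go decides by decoding the stored value rather than inspecting a prefix. The point is that the read path only does the unwrap-and-rewrite when the value is actually wrapped:

```go
package sealunwrapsketch

import (
	"bytes"
	"context"
)

// Minimal stand-ins for Vault's physical storage types; the real ones live in
// github.com/hashicorp/vault/sdk/physical.
type Entry struct {
	Key   string
	Value []byte
}

type Backend interface {
	Get(ctx context.Context, key string) (*Entry, error)
	Put(ctx context.Context, entry *Entry) error
}

// Hypothetical marker: the actual unwrapper decides by attempting to decode
// the stored value as a seal-wrapped message, not by inspecting a prefix.
var sealWrapMarker = []byte("sealwrap:")

type sealUnwrapper struct {
	underlying Backend
}

func (u *sealUnwrapper) Get(ctx context.Context, key string) (*Entry, error) {
	entry, err := u.underlying.Get(ctx, key)
	if err != nil || entry == nil {
		return entry, err
	}

	// Restored check: if the value was never seal-wrapped, return it directly
	// instead of doing the extra unwrap work plus a write-back PUT.
	if !bytes.HasPrefix(entry.Value, sealWrapMarker) {
		return entry, nil
	}

	// Unwrap and persist the plain value so later reads skip this path.
	unwrapped := bytes.TrimPrefix(entry.Value, sealWrapMarker)
	if err := u.underlying.Put(ctx, &Entry{Key: key, Value: unwrapped}); err != nil {
		return nil, err
	}
	return &Entry{Key: key, Value: unwrapped}, nil
}
```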

Testing

A small test is included in this PR to confirm that at least the extra PUT request is no longer being made.
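
To illustrate the spirit of such a test (again a hypothetical sketch reusing the Entry, Backend, and sealUnwrapper types from the snippet above, not the PR's actual test code), one can wrap the backend with a PUT counter and assert that reading a never-wrapped entry issues no writes:

```go
// Requires "context" and "testing" in addition to the types from the sketch above.

// countingBackend wraps another Backend and counts PUT calls.
type countingBackend struct {
	inner Backend
	puts  int
}

func (c *countingBackend) Get(ctx context.Context, key string) (*Entry, error) {
	return c.inner.Get(ctx, key)
}

func (c *countingBackend) Put(ctx context.Context, entry *Entry) error {
	c.puts++
	return c.inner.Put(ctx, entry)
}

// mapBackend is a trivial in-memory Backend used only for this sketch.
type mapBackend struct{ m map[string][]byte }

func (b *mapBackend) Get(_ context.Context, key string) (*Entry, error) {
	v, ok := b.m[key]
	if !ok {
		return nil, nil
	}
	return &Entry{Key: key, Value: v}, nil
}

func (b *mapBackend) Put(_ context.Context, entry *Entry) error {
	b.m[entry.Key] = entry.Value
	return nil
}

func TestGetOfUnwrappedEntryDoesNotPut(t *testing.T) {
	backing := &countingBackend{inner: &mapBackend{m: map[string][]byte{}}}

	// Seed a plain, never seal-wrapped value directly into the backend.
	_ = backing.Put(context.Background(), &Entry{Key: "foo", Value: []byte("bar")})
	backing.puts = 0

	u := &sealUnwrapper{underlying: backing}
	if _, err := u.Get(context.Background(), "foo"); err != nil {
		t.Fatal(err)
	}
	if backing.puts != 0 {
		t.Fatalf("expected no extra PUT on read, got %d", backing.puts)
	}
}
```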

I also ran some benchmarks using vault-benchmark (awesome tool BTW!) + a small wrapper script. The chart below shows the p90 of the time needed to perform different benchmark operations on a sample Vault instance. The Vault instance was configured with Amazon S3 for storage and caching was explicitly disabled.

[Chart: p90 operation latency for Vault on S3 with caching disabled, compared across multiple Vault versions.]
Additional benchmark results can be found here.

From the graph, it seems like:

  • Operations on commit 0ec3524 of Vault (labeled post-0ec3524 in the graph) take significantly longer than the same operations on a build of the parent commit (labeled pre-0ec3524).
  • As of today, Vault built from the main branch continues to exhibit the same slowdown.
  • Vault built from this PR branch (labeled main+fix) shows improved performance, matching that of pre-0ec3524.

Other notes

Why did you disable caching during the benchmarks?

Due to how vault-benchmark benchmarks are structured, it is difficult to evaluate storage performance without disabling caching. vault-benchmark benchmarks start off with an empty cluster and then write a bunch of state to it. Then they might read that state back or perform other write operations.

The problem is that with a few exceptions, all write operations will write to both the storage and the in-memory cache. So by starting off with an empty cluster and then setting up the state by performing writes, we have effectively written the entire cluster state to the in-memory cache. Therefore most read benchmarks we perform will be benchmarking the in-memory cache instead of the actual storage codepath.

However, I have still included the benchmark results with the cache enabled, in case they are helpful.
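
For reference, disabling the cache for this kind of storage benchmark happens in the Vault server configuration. The snippet below is only an illustration of that setup, not the exact configuration used for the results above; the bucket name, region, and listener address are placeholders:

```hcl
# Illustrative only. disable_cache turns off Vault's in-memory physical cache
# so reads exercise the real storage backend instead of cached entries.
disable_cache = true

storage "s3" {
  bucket = "example-vault-benchmark-bucket"
  region = "us-east-1"
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = true
}
```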

@andybao-dd requested review from a team as code owners on November 28, 2024 at 00:32
@andybao-dd requested review from erikapierr and removed the request for a team on November 28, 2024 at 00:32
@victorr self-assigned this on Nov 29, 2024
victorr (Contributor) commented Nov 29, 2024

Hi @andybao-dd, thank you for submitting this PR. I will be reviewing it.

Good job identifying the missing check!

victorr (Contributor) left a review comment

Thank you for being so thorough, your attention to detail is very appreciated.

victorr (Contributor) commented Nov 29, 2024

The failed linters job is due to an unrelated change that has already been fixed on main.
