Memory leak in moka 0.12 #329
Thank you for reporting. I saw the pull request for Symbolicator to revert the moka version to v0.11, and I started to look into this issue. I was thinking of running your stress tool myself and modifying it until I can reproduce the issue. I was reading the manuals for storing debug symbol files on S3 and for getting a minidump from a native app, but they could take me some time to understand. Anyway, to investigate the issue, we could use Lines 43 to 46 in 13bc37f: […]
By periodically calling […]. One of the issues with […]. I have not used […].
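As an illustration of the kind of periodic housekeeping discussed here, the sketch below spawns a background Tokio task that repeatedly runs moka v0.12's pending maintenance tasks. It assumes `run_pending_tasks` is the method being referred to above; the cache type, interval, and Cargo features are illustrative, not taken from this thread.

```rust
// Sketch: periodically run moka v0.12's pending maintenance tasks from a
// background task so that eviction/expiration work cannot pile up.
// Assumed dependencies: moka = { version = "0.12", features = ["future"] },
// tokio = { version = "1", features = ["full"] }.
use std::time::Duration;

use moka::future::Cache;

#[tokio::main]
async fn main() {
    let cache: Cache<u32, String> = Cache::new(10_000);

    // Cloning is cheap; both handles point to the same underlying cache.
    let maintenance_cache = cache.clone();
    tokio::spawn(async move {
        loop {
            tokio::time::sleep(Duration::from_secs(5)).await;
            maintenance_cache.run_pending_tasks().await;
        }
    });

    cache.insert(1, "one".to_string()).await;
    assert_eq!(cache.get(&1).await, Some("one".to_string()));
}
```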
I presume the memory leak occurred due to stalled cache maintenance task processing in […]. When it is stalled, the following will happen: […]
Details of 2: […]
Yesterday, I tried Symbolicator's stress tool, but I did not reproduce the problem.
caches:
  downloaded:
    max_unused_for: 168m # 24 x 7
    retry_misses_after: 1m
    retry_malformed_after: 24m
  derived:
    max_unused_for: 168m
    retry_misses_after: 1m
    retry_malformed_after: 24m
I will continue using/modifying the stress tool to see if I can get the cache maintenance task processing to stall.
Action items for reproducing the problem:
- Try to trigger the problem: […]
- Track some metrics: […]
To ensure this will not happen, I spent hours running mokabench, moka’s own stress test tool, during the beta period of v0.12.0-beta.X. For example:

$ ./target/release/mokabench --num-clients 16,32 --ttl 3 --tti 1 \
    --per-key-expiration --invalidate --insert-once --entry-api \
    --eviction-listener immediate

and confirmed that v0.12.0-beta.3 with #323 never got the maintenance tasks to stall. I ran it on both Linux x86_64 and macOS arm64. I also ran an integration test of the Materialize database with moka v0.12 for ~67 hours so far, and it never failed (Linux x86_64).
Some thoughts: […]
I updated mokabench: moka-rs/mokabench#14. I will run more tests tomorrow, but for now, I ran the tests only for […]. I ran the tests with the following command to mimic the […]:

$ cargo run --release --features metrics -- --num-clients 8 --repeat 20 \
    --ttl 9 --per-key-expiration --size-aware --insert-once --entry-api

This plot shows the allocated memory from the global allocator (jemalloc) while running the test with three different cache sizes. There were between 112 and 280 million inserts to each cache in about 6 minutes, and the memory usage was stable during the tests. (The left one was slightly increasing. I will run the same test on v0.11.3 to see if there is any difference.)
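For reference, one way such an "allocated MiB" series could be sampled is sketched below. This assumes the `tikv-jemallocator` and `tikv-jemalloc-ctl` crates and is not necessarily how mokabench's `metrics` feature is implemented.

```rust
// Sketch: sample jemalloc's "allocated" statistic once per second.
// Assumed dependencies: the tikv-jemallocator and tikv-jemalloc-ctl crates.
use std::{thread, time::Duration};

use tikv_jemalloc_ctl::{epoch, stats};

#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;

fn main() {
    for _ in 0..10 {
        // Most statistics are cached; advancing the epoch refreshes them.
        epoch::advance().unwrap();
        let allocated_mib = stats::allocated::read().unwrap() as f64 / 1024.0 / 1024.0;
        println!("allocated: {:.1} MiB", allocated_mib);
        thread::sleep(Duration::from_secs(1));
    }
}
```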
v0.11.3 does not have such a trend. Perhaps this is the memory leak you saw? I will look closely into this trend in v0.12.0.
[Plot: v0.11.3 with hit ratio ~15%; allocated MiB]
[Plot: v0.12.0 with hit ratio ~15%; allocated MiB]
Note that the scales of both the X and Y axes differ between the v0.11.3 and v0.12.0 plots.
I believe I might be onto something with the reentrant cache theory. I modified the […]:

use std::sync::Arc;
use std::time::Duration;

// Use the asynchronous cache.
use moka::future::Cache;
use rand::Rng;
use tokio::sync::Semaphore;

const NUM_TASKS: usize = 8;
const NUM_KEYS_PER_TASK: usize = 4;

fn value(n: usize) -> String {
    format!("value {}", n)
}

#[derive(Clone)]
struct GlobalCache(Arc<Cache<u8, Vec<u8>>>);

async fn handle_a(a: &GlobalCache, b: &GlobalCache) {
    let fut = async {
        let init = Box::pin(async { handle_b(b).await });
        let key = rand::random();
        let _entry = a.0.entry(key).or_insert_with(init).await;
    };
    let timeout = rand::thread_rng().gen_range(0..25);
    let _res = tokio::time::timeout(Duration::from_millis(timeout), fut).await;
}

async fn handle_b(b: &GlobalCache) -> Vec<u8> {
    let init = Box::pin(async {
        let delay = rand::thread_rng().gen_range(0..50);
        tokio::time::sleep(Duration::from_millis(delay)).await;
        Vec::with_capacity(1024)
    });
    let key = rand::random();
    let entry = b.0.entry(key).or_insert_with(init).await;
    entry.value().clone()
}

async fn run_once(a: GlobalCache, b: GlobalCache) {
    // Create a cache that can store up to 10,000 entries.
    let cache = Cache::new(10_000);

    // Spawn async tasks and write to and read from the cache.
    let tasks: Vec<_> = (0..NUM_TASKS)
        .map(|i| {
            // To share the same cache across the async tasks, clone it.
            // This is a cheap operation.
            let my_cache = cache.clone();
            let start = i * NUM_KEYS_PER_TASK;
            let end = (i + 1) * NUM_KEYS_PER_TASK;
            let a = a.clone();
            let b = b.clone();

            tokio::spawn(async move {
                // Insert NUM_KEYS_PER_TASK entries per task.
                for key in start..end {
                    my_cache
                        .entry(key)
                        .or_insert_with(async {
                            handle_a(&a, &b).await;
                            value(key)
                        })
                        .await;
                    // get() returns Option<String>, a clone of the stored value.
                    assert_eq!(my_cache.get(&key).await, Some(value(key)));
                }

                // Invalidate every 4th element of the inserted entries.
                for key in (start..end).step_by(4) {
                    // invalidate() is an async method, so await it.
                    my_cache.invalidate(&key).await;
                }
            })
        })
        .collect();

    // Wait for all tasks to complete.
    futures_util::future::join_all(tasks).await;

    // Verify the result.
    for key in 0..(NUM_TASKS * NUM_KEYS_PER_TASK) {
        if key % 4 == 0 {
            assert_eq!(cache.get(&key).await, None);
        } else {
            assert_eq!(cache.get(&key).await, Some(value(key)));
        }
    }

    drop(cache)
}

#[tokio::main]
async fn main() {
    let a = GlobalCache(Arc::new(Cache::new(100)));
    let b = GlobalCache(Arc::new(Cache::new(100)));
    let semaphore = Arc::new(Semaphore::new(1_000));

    loop {
        let guard = semaphore.clone().acquire_owned().await.unwrap();
        let a = a.clone();
        let b = b.clone();
        tokio::spawn(async move {
            run_once(a, b).await;
            drop(guard);
        });
    }
}

Running this with macOS […]. According to that, the leaks seem to be related to async RwLock guards created in […], as well as other allocations related to it, like the owned key: […]
I have so far only run this for ~3 minutes, and it is quite possible that these things are being freed eventually given more time, but it does not look like it at first glance.
Also, the interesting thing here is that it does not seem like the actual cache values are being leaked, just these […].
Thank you. It was very helpful! I found the source of the bug and fixed it with this commit, fef6a77ed5, on a topic branch. I am going to open a pull request for it. This bug was introduced by this PR for v0.12.0, which made […]. When the map gets more than half full with live and deleted keys (or the number of deleted keys exceeds a threshold), it reallocates a memory region and copies the live key-values into it, using the correct hash values for the keys. After this happens, […]
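To illustrate why inserting with one hash value but removing with another leaks entries, here is a toy sketch of a bucketed table. This is not moka's actual internal map; it only demonstrates the failure mode described above: the removal probes the wrong bucket, never finds the entry, and the entry stays in the table.

```rust
// Toy bucketed table: entries are located by `hash % bucket_count`.
struct Bucket<K, V> {
    hash: u64,
    key: K,
    value: V,
}

struct ToyTable<K, V> {
    buckets: Vec<Vec<Bucket<K, V>>>,
}

impl<K: PartialEq, V> ToyTable<K, V> {
    fn new(n: usize) -> Self {
        Self { buckets: (0..n).map(|_| Vec::new()).collect() }
    }

    fn insert(&mut self, hash: u64, key: K, value: V) {
        let idx = (hash as usize) % self.buckets.len();
        self.buckets[idx].push(Bucket { hash, key, value });
    }

    fn remove(&mut self, hash: u64, key: &K) -> Option<V> {
        let idx = (hash as usize) % self.buckets.len();
        let pos = self.buckets[idx]
            .iter()
            .position(|b| b.hash == hash && &b.key == key)?;
        Some(self.buckets[idx].swap_remove(pos).value)
    }

    fn len(&self) -> usize {
        self.buckets.iter().map(|b| b.len()).sum()
    }
}

fn main() {
    let mut table = ToyTable::new(8);

    // Insert a waiter-like entry using one hash value...
    table.insert(3, "key", "waiter");

    // ...but try to remove it using a different hash value. The lookup probes
    // the wrong bucket, finds nothing, and the entry is never removed.
    assert!(table.remove(5, &"key").is_none());
    assert_eq!(table.len(), 1); // the entry has leaked
}
```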
I merged the following pull requests into the […]:
I believe the following was fixed by #330.
I have not run the reproducer in #329 (comment) with macOS […].
This does a `cargo update`, and also updates the `moka` dependency again. It was downgraded in #1310 because of a memory leak, which should now be solved. See moka-rs/moka#329.
We have updated to […]. Thanks so much for the help here.
I tried upgrading from moka 0.11 to 0.12, which led to a memory leak.
As we use moka extensively in multiple places, both sync and async variants, it's hard to pinpoint this to a very specific usage.
One indication that I do have, though, is that the leak only occurred in our native processing pipeline, and not in the JS processing pipeline.
The native pipeline uses moka in a couple more places than the JS pipeline, both sync and async.
So far I'm a bit at a loss as to how to pinpoint the exact leak. I will try to run this through a local stress test to see if that also exhibits the leak we were observing in production.
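For context, here is a minimal illustration of the two moka variants mentioned above (hypothetical keys and values, not Symbolicator's actual code; it assumes moka v0.12 with the `future` feature and Tokio):

```rust
use moka::{future, sync};

#[tokio::main]
async fn main() {
    // Synchronous cache: insert/get are plain method calls.
    let sync_cache: sync::Cache<String, Vec<u8>> = sync::Cache::new(1_000);
    sync_cache.insert("debug-file".to_string(), vec![0u8; 16]);
    assert!(sync_cache.get("debug-file").is_some());

    // Asynchronous cache: in v0.12, insert/get are async and must be awaited.
    let async_cache: future::Cache<String, Vec<u8>> = future::Cache::new(1_000);
    async_cache.insert("symcache".to_string(), vec![0u8; 16]).await;
    assert!(async_cache.get("symcache").await.is_some());
}
```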