CPU Soft Lockup when doing heavy IO via Kernel NFS server on local ephemeral storage #4307
Comments
No idea if this is related, but googling for CPU soft lockup and a few other keywords led me to this thread: https://bbs.archlinux.org/viewtopic.php?id=264127&p=3 which appears to affect kernels up to 5.15 (which is what we appear to be running).
Hey @snowzach, thanks for reporting this issue! A question:
@koooosh the ENA "TX hasn't completed" error is probably the most relevant here. awslabs/amazon-eks-ami#1704 mentions upgrading the ENA driver to 2.12.0 as a potential fix, but the 5.15 kernel is already on 2.13.0. We should try the repro.
Hi @koooosh I have experienced it with older versions as well; I could try with v1.26.2 too. What's also interesting: I believe I experienced it when using AL2 nodes as well, but I don't recall the versions of those. I am not super familiar with how kernel versions map to Kubernetes versions. Is kernel 5.15 the highest I would be able to run with Kube 1.25? I am not super keen to upgrade Kubernetes, but would consider it if I knew it would for sure fix the problem.
Here's another interesting one that VERY closely matches the issue I am having: awslabs/amazon-eks-ami#454 that was resolved with a kernel patch from quite a few years ago.
Just another thought: since the NFS server runs in the kernel, is there some sort of throttling/policing of system resources by kubelet that could be causing this?
That's correct - the Bottlerocket k8s-1.25 variant specifically uses kernel 5.15. If you'd like to use kernel 6.1, you would have to use k8s 1.28.
Alright... so I decided to try creating a new cluster with 1.25 and was going to upgrade one version at a time to see if the issue stopped happening. I've been having the issue in us-west-2, and I spun this new cluster up in us-east-2 to try to reproduce it... and it didn't have the issue. Same NFS server container, same Kube version 1.25, same Bottlerocket version (testing with 1.27.1 now). The CNI plugin wasn't the same; I tried upgrading to match, and the problem cluster still had the issue. I'm at a loss unless there's some underlying hardware difference you can't see. The only differences I can think of are that some of the add-ons differ between them (the cluster without the issue has newer versions of the EFS driver, kube-proxy, the EBS driver, and CoreDNS), and the cluster having the issue has Grafana Agent collecting logs/metrics. That's it... anything else I can check!?
Edit: I noticed the platform version is different:
Hey @snowzach, could you please provide more details about your setup? You mentioned the EFS (CSI?) driver, but aren't you running your own NFS server? I'm trying to understand how the EFS driver comes into the picture if the goal is to use your own NFS server. On a side note, I did a first test with … Another question I have for you relates to your experience of using and configuring the ephemeral storage. I tried to use …
I build the node using Karpenter with the following settings (I'm using a slightly old version of Karpenter):
To mount the NVMes I just specify an …
As far as configuring ephemeral storage, I think it would be good to loosen the directories, and I also think it would be good to allow setting mount options. I mount with …
Alright, so I am seeing that this is NOT necessarily related to Bottlerocket. Info:
I'm starting to wonder if this is the ENA driver or something to do with the NVMes (or them fighting with each other). Attached is a log from the AL2 instance with the same failure. It looks almost exactly the same, except it says xfs instead of ext4.
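For what it's worth, one way to confirm which ENA driver a node is actually running is to check from a shell on the node (on Bottlerocket, e.g. via the admin container). This is just a generic sketch, not something from the original report:

```sh
# Report the driver name/version bound to the suspect interface (eth2 in the ENA error above)
ethtool -i eth2

# Show the version of the ena kernel module itself
modinfo ena | grep -i version

# Look for ENA resets / TX timeouts and soft lockups around the failure window
dmesg | grep -iE 'ena|soft lockup' | tail -n 50
```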
Platform I'm building on:
Running a very simple NFS server container on bottlerocket-aws-k8s-1.25-x86_64-v1.26.1-943d9a41
Dockerfile:
Entrypoint script:
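(The Dockerfile and entrypoint contents aren't reproduced here. As a minimal sketch of what an entrypoint for a kernel-NFS-server container typically looks like — the export path, export options, and thread count below are assumptions, not the reporter's actual script:)

```sh
#!/bin/sh
# Minimal sketch only; assumes the instance-store volume is already mounted at /data
# and the container runs privileged so it can drive the host kernel's nfsd.
set -e

mkdir -p /data/export
echo "/data/export *(rw,sync,no_subtree_check,no_root_squash)" > /etc/exports

rpcbind                      # start the portmapper
rpc.nfsd 16                  # start kernel nfsd threads (thread count is a guess)
exportfs -ra                 # publish the exports
exec rpc.mountd --foreground # keep the container alive handling MOUNT requests
```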
Essentially I run this on an AWS i3en with local flash provisioned as ephemeral storage, shared via this NFS server. It's a high-performance cache drive. Testing with i3en.2xlarge.
What I expected to happen:
It would be a super fast NFS server sharing this ephemeral storage.
What actually happened:
I can mount this storage from another i3en.2xlarge instance and mostly it works unless we really push it. If I run the disk testing tool bonnie++ -d /the/nfs/share -u nobody and wait, within a minute or two the machine will start displaying errors in the logs about watchdog: BUG: soft lockup - CPU#7 stuck for 22s! as well as ena 0000:00:06.0 eth2: TX hasn't completed, qid 5, index 801. 26560 msecs since last interrupt, 41910 msecs since last napi execution, napi scheduled: 1
How to reproduce the problem:
Run the container, run bonnie++ on the NFS share.
It's very reliably reproduced.
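As a rough sketch of the client-side steps (the server address, export path, and mount options here are illustrative assumptions, not the exact commands from this report):

```sh
# On the client i3en.2xlarge: mount the export served by the kernel NFS server.
# Server IP, export path, and mount options are assumptions for illustration.
sudo mkdir -p /mnt/nfs-cache
sudo mount -t nfs -o nfsvers=4.1,hard,noatime 10.0.0.10:/data/export /mnt/nfs-cache

# Drive heavy IO against the share; within a minute or two the server node starts
# logging "soft lockup" and ENA "TX hasn't completed" errors as described above.
sudo bonnie++ -d /mnt/nfs-cache -u nobody
```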
Attached is a log:
bottlerocket-log.txt