
CPU Soft Lockup when doing heavy IO via Kernel NFS server on local ephemeral storage #4307

Open
snowzach opened this issue Nov 19, 2024 · 11 comments
Labels
status/needs-triage Pending triage or re-evaluation type/bug Something isn't working

Comments

@snowzach

Platform I'm building on:

Running a very simple NFS server container on bottlerocket-aws-k8s-1.25-x86_64-v1.26.1-943d9a41

Dockerfile:

# syntax=docker/dockerfile:1.9
FROM public.ecr.aws/debian/debian:bookworm

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends \
    nfs-common nfs-kernel-server curl ca-certificates zip unzip make time jq yq netbase iproute2 net-tools bind9-dnsutils procps xz-utils nano && rm -rf /var/lib/apt/lists/*

# Copy the entrypoint script
COPY --chmod=755 ./tools/docker/cloud-nfs/kernel/cloud-nfs-entrypoint.sh /entrypoint.sh
RUN mkdir /exports

# Install the AWS CLI
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
    && unzip awscliv2.zip \
    && ./aws/install \
    && rm -rf awscliv2.zip

# Expose ports for NFSD/StatD/MountD/QuotaD
EXPOSE 2049 2050 2051 2052
VOLUME /exports

# Entrypoint
ENTRYPOINT ["/entrypoint.sh"]
CMD [ "/exports" ]

Entrypoint script:

#!/bin/bash
set -e

NFS_THREADS=${NFS_THREADS:-64}

function start() {

    # prepare /etc/exports
    fsid=0
    for i in "$@"; do
        echo "$i *(rw,fsid=$fsid,no_subtree_check,no_root_squash)" >> /etc/exports
        if [ -v gid ] ; then
            chmod 070 $i
            chgrp $gid $i
        fi
        echo "Serving $i"
        fsid=$((fsid + 1))
    done

    # start rpcbind if it is not started yet
    set +e
    /usr/sbin/rpcinfo 127.0.0.1 > /dev/null; s=$?
    set -e
    if [ $s -ne 0 ]; then
       echo "Starting rpcbind"
       /sbin/rpcbind -w
    fi

    mount -t nfsd nfsd /proc/fs/nfsd

    # run mountd on a fixed port so it can be exposed
    /usr/sbin/rpc.mountd -p 2050

    /usr/sbin/exportfs -r
    # -G 10 to reduce grace time to 10 seconds (the lowest allowed)
    /usr/sbin/rpc.nfsd -G 10 -p 2049 $NFS_THREADS
    /sbin/rpc.statd --no-notify -p 2051 -o 2052 -T 2053
    echo "NFS started with $NFS_THREADS threads"
}

function stop()
{
    echo "Stopping NFS"

    /usr/sbin/rpc.nfsd 0
    /usr/sbin/exportfs -au
    /usr/sbin/exportfs -f

    kill $( pidof rpc.mountd )
    umount /proc/fs/nfsd
    echo > /etc/exports
    exit 0
}

trap stop TERM

start "$@"

# Keep the container running
sleep infinity

Essentially I run this on an AWS i3en instance with the local flash provisioned as ephemeral storage and shared via this NFS server; it acts as a high-performance cache drive. I'm testing with an i3en.2xlarge.

What I expected to happen:

It would be a super fast NFS server sharing this ephemeral storage.

What actually happened:

I can mount this storage from another i3en.2xlarge instance and it mostly works unless we really push it.
If I run the disk testing tool bonnie++ -d /the/nfs/share -u nobody and wait, within a minute or two the server starts logging errors such as watchdog: BUG: soft lockup - CPU#7 stuck for 22s! as well as ena 0000:00:06.0 eth2: TX hasn't completed, qid 5, index 801. 26560 msecs since last interrupt, 41910 msecs since last napi execution, napi scheduled: 1

How to reproduce the problem:

Run the container, run bonnie++ on the NFS share.

It's very reliably reproduced.
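For reference, a minimal client-side repro sketch, assuming the export path /exports from the entrypoint script, a placeholder server address NFS_SERVER_IP, and bonnie++ installed on the client instance:

# mount the export from the client instance (placeholder address and mount point)
sudo mkdir -p /mnt/nfs-test
sudo mount -t nfs4 -o rw,hard ${NFS_SERVER_IP}:/exports /mnt/nfs-test

# hammer the share; the soft lockup typically shows up on the server within a couple of minutes
sudo bonnie++ -d /mnt/nfs-test -u nobody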

Attached is a log:
bottlerocket-log.txt

@snowzach snowzach added status/needs-triage Pending triage or re-evaluation type/bug Something isn't working labels Nov 19, 2024
@snowzach (Author)

No idea if this is related, but googling for CPU soft lockup and a few other keywords led me to this thread: https://bbs.archlinux.org/viewtopic.php?id=264127&p=3 which appears to describe a problem affecting kernels up to 5.15 (which appears to be what we are running).

@koooosh (Contributor) commented Nov 19, 2024

Hey @snowzach, thanks for reporting this issue! A question:

  1. Have you experienced this issue with older versions of Bottlerocket (< v1.26.1) or even our latest release v1.26.2?

@bcressey (Contributor)

@koooosh the ENA "TX hasn't completed" error is probably the most relevant here. awslabs/amazon-eks-ami#1704 mentions upgrading the ENA driver to 2.12.0 as a potential fix, but the 5.15 kernel is already on 2.13.0.

We should try the repro. aws-k8s-1.25 is still fully supported and if it's a driver issue, moving to aws-k8s-1.28 and a newer kernel may not help.
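For anyone trying to confirm the driver side, the ENA driver version and per-queue counters on the affected node can be checked with standard ethtool commands; the interface name eth2 comes from the log line above and the grep pattern is only a suggestion:

# driver and firmware versions for the interface reporting the TX timeouts
ethtool -i eth2

# per-queue statistics; look for timeout or stall counters
ethtool -S eth2 | grep -i -E 'timeout|stop|miss'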

@snowzach (Author)

Hi @koooosh, I have experienced it with older versions as well; I could try with v1.26.2 too. What's also interesting is that I believe I experienced it on AL2 nodes as well, but I don't recall which versions those were. I'm not super familiar with how kernel versions map to Kubernetes versions: is kernel 5.15 the highest I can run with Kube 1.25? I'm not super keen to upgrade Kubernetes, but I would consider it if I knew it would fix this problem for sure.

@snowzach (Author) commented Nov 21, 2024

Here's another interesting one that very closely matches the issue I am having: awslabs/amazon-eks-ami#454, which was resolved with a kernel patch quite a few years ago.

@snowzach (Author) commented Nov 21, 2024

Just another thought: since the NFS server runs in the kernel, is there some kind of throttling or policing of system resources by kubelet that could be causing this?
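One way to rule out the container side (a sketch, not the author's actual setup; the values are hypothetical) is to give the NFS server pod CPU requests but no CPU limits, so kubelet's CFS quota cannot throttle the container:

# fragment of the NFS server container spec (hypothetical values)
resources:
  requests:
    cpu: "4"
    memory: 4Gi
  # deliberately no "limits:" block, so no CFS quota is applied

Note that the nfsd threads themselves run as kernel threads outside the pod's cgroup, so container CPU limits should not throttle them directly; this only rules out the container side of the question.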

@koooosh (Contributor) commented Nov 21, 2024

is Kernel 5.15 the highest I would be able to run with Kube 1.25?

That's correct - the Bottlerocket k8s-1.25 variant specifically uses kernel 5.15.

If you'd like to use kernel 6.1, you would have to use k8s 1.28.
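For completeness, the kernel version and OS image each node is actually running can be confirmed from the cluster with standard kubectl:

# the KERNEL-VERSION and OS-IMAGE columns show e.g. 5.15.168 and the Bottlerocket variant
kubectl get nodes -o wide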

@snowzach (Author) commented Nov 21, 2024

Alright, so I decided to try creating a new cluster with 1.25 and planned to upgrade one version at a time to see if the issue stopped happening. I've been having the issue in us-west-2; I spun this new cluster up in us-east-2 to try to reproduce it... and it didn't have the issue.

Same NFS server container, same Kube version 1.25, same Bottlerocket version (testing with 1.27.1 now). The CNI plugin wasn't the same; I tried upgrading it to match and still had the issue. I'm at a loss unless there's some underlying hardware difference you can't see. The only other differences I can think of are that some of the add-ons differ between the clusters (the no-issue cluster has newer versions of the EFS driver, kube-proxy, EBS driver, and CoreDNS) and the cluster having the issue has Grafana Agent collecting logs/metrics. That's it...

Anything else I can check!?

Linux ip-10-250-7-127.us-east-2.compute.internal 5.15.168 #1 SMP Fri Nov 15 19:12:46 UTC 2024 x86_64 GNU/Linux
[    2.132461] ena 0000:00:05.0: Elastic Network Adapter (ENA) v2.13.0g
bottlerocket-aws-k8s-1.25-x86_64-v1.27.1-efd46c32
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz

Linux ip-172-27-57-172.us-west-2.compute.internal 5.15.168 #1 SMP Fri Nov 15 19:12:46 UTC 2024 x86_64 GNU/Linux
[    4.942529] ena 0000:00:05.0: Elastic Network Adapter (ENA) v2.13.0g
bottlerocket-aws-k8s-1.25-x86_64-v1.27.1-efd46c32
Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz

Edit: I noticed the platform version is different (eks.25 vs eks.35), but I'm not sure if that should matter.

@arnaldo2792 (Contributor)

Hey @snowzach, I'm taking a look at this, sorry for the late reply 👍 . I'm reproducing as @bcressey suggested.

@arnaldo2792 (Contributor)

Hey @snowzach, could you please provide more details about your setup? You mentioned the EFS (CSI?) driver, but aren't you running your own NFS server? I'm trying to understand how the EFS driver comes into the picture if the goal is to use your own NFS server.

On a side note, I did a first test with bonnie++ with the NFS server using 8 threads instead of 64, on Bottlerocket k8s-1.25 v1.27.1, and I didn't see the failure. I'll run another test with 64 threads to get as close as I can to your environment. It took me a while to get the NFS server up and running; it looks like you are using a slightly modified version of the script shipped in the registry.k8s.io/volume-nfs image (using NFS v4 instead of v3 and increasing the number of threads).
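For reference, matching the 64-thread setup only requires setting the NFS_THREADS environment variable that the entrypoint script reads; a minimal container-spec fragment, where the names, image, and privileged setting are assumptions rather than details from the issue:

# fragment of the NFS server container spec (hypothetical names)
containers:
  - name: nfs-server
    image: my-registry/cloud-nfs:latest   # placeholder for the image built from the Dockerfile above
    env:
      - name: NFS_THREADS
        value: "64"
    securityContext:
      privileged: true   # likely needed so the container can mount /proc/fs/nfsd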

Another question I have for you is related to your experience of using and configuring the ephemeral storage. I tried to use apiclient ephemeral-storage init and apiclient ephemeral-storage bind --dirs /mnt/block but the later failed since we are strict on the directories that can be used to mount the RAID array configured with apiclient ephemeral-storage. My question for you is do you use bootstrap containers to configure the ephemeral storage? I'm trying to get data to see if we should discuss whether it would be valuable to loosen the dirs constrains a bit.

@snowzach (Author) commented Nov 23, 2024

I build the node using Karpenter with the following settings (I'm using a slightly old version of Karpenter):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: deck
  labels:
    app.kubernetes.io/name: deck
    app.kubernetes.io/component: deck
spec:
  securityGroupSelector:
    karpenter.sh/discovery: {{ .Values.environment }},{{ .Values.environment }}-nfs
  subnetSelector:
    Type: private
  amiFamily: Bottlerocket
  userData: |
    [settings.kernel.sysctl]
    "net.core.rmem_max" = "67108864"
    "net.core.wmem_max" = "67108864"
    "net.ipv4.tcp_rmem" = "4096 87380 33554432"
    "net.ipv4.tcp_wmem" = "4096 87380 33554432"
    "net.ipv4.tcp_congestion_control" = "htcp"
    "net.ipv4.tcp_mtu_probing" = "1"
    "net.core.default_qdisc" = "fq"
    "fs.file-max" = "512000000"
    "vm.min_free_kbytes" = "524288"

    [settings.bootstrap-commands.k8s-ephemeral-storage]
    commands = [
      ["apiclient", "ephemeral-storage", "init", "-t", "ext4"],
      ["apiclient", "ephemeral-storage" ,"bind", "--dirs", "/var/lib/kubelet"]
    ] 
    essential = true
    mode = "always"

To mount the NVMes I just specify emptyDir: {}, and since the node is set up to use ephemeral storage it mounts the NVMe into the container for me. I only mentioned EFS because I have the EFS plugin enabled in Kubernetes, so there are a couple of daemonsets that run on every node (including the one in question). I doubt it's related, but I figured I should mention any differences.
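For reference, a minimal sketch of how the export directory ends up on the NVMe-backed ephemeral storage via emptyDir; the names are hypothetical since the actual pod spec wasn't posted:

# fragment of the NFS server pod spec (hypothetical names)
volumes:
  - name: exports
    emptyDir: {}   # lands under /var/lib/kubelet, which is bound to the ephemeral-storage RAID above
containers:
  - name: nfs-server
    volumeMounts:
      - name: exports
        mountPath: /exports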

As far as configuring ephemeral storage, I think it would be good to loosen the directory restrictions, and also to allow setting mount options; I mount with noatime and nodiratime when I can to increase speed. Making XFS work and allowing mkfs options to be specified would be nice too, so the filesystem can be tuned for size.
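As an illustration of what that request would replace, this is roughly the manual equivalent on a generic Linux host (a sketch with placeholder device names, not apiclient ephemeral-storage syntax):

# placeholder device and mount point; not what apiclient currently supports
mkfs.xfs -f /dev/md0
mount -o noatime,nodiratime /dev/md0 /mnt/ephemeral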

Alright, so I am seeing that this is NOT necessarily related to Bottlerocket.

Info:

  • I can reproduce this issue on Bottlerocket and AL2 Kubernetes nodes
  • I can reproduce this error with XFS or EXT4 underlying filesystem
  • I could NOT reproduce this in us-east-2 when building another cluster there (only on my cluster/node in us-west-2b)
  • It runs for a while, then a kernel worker (kworker) thread spikes to 100% (the machine as a whole is not fully loaded) and the whole machine hangs

I'm starting to wonder if this is the ENA driver or something to do with the NVMe drives (or the two fighting with each other). Attached is a log from the AL2 instance with the same failure. It looks almost exactly the same, except it says xfs instead of ext4.
failure-nfs-AL2.txt
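For anyone else trying to catch this, a simple way to watch for the kworker spike and capture kernel state when it happens, using standard tools (paths and counts are generic suggestions):

# watch kernel threads; a kworker pinned near 100% on one CPU precedes the lockup
top -b -H -n 1 | head -n 20

# stream kernel messages so the soft-lockup backtraces are captured even if the node hangs
dmesg --follow

# dump blocked-task stacks via sysrq (requires kernel.sysrq to allow it) for comparison with the attached logs
echo w > /proc/sysrq-trigger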
