
Wrong VPA is attempting to update pod resources and failing #7499

Open
WesCossick opened this issue Nov 14, 2024 · 4 comments
Labels
area/vertical-pod-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@WesCossick

Which component are you using?:

vertical-pod-autoscaler

What version of the component are you using?:

Component version: Not sure; using the one auto-installed by GKE.

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.30.5
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.5-gke.1699000

What environment is this in?:

Google Cloud Platform's GKE.

What did you expect to happen?:

For the resources to be updated by the correct VPA.

What happened instead?:

The wrong VPA is updating pod resources and failing.

How to reproduce it (as minimally and precisely as possible):

Create the following two VPAs:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-1-vpa
spec:
  targetRef:
    apiVersion: batch/v1
    kind: CronJob
    name: example-1
  updatePolicy:
    updateMode: "Initial"
    minReplicas: 1
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-2-vpa
spec:
  targetRef:
    apiVersion: batch/v1
    kind: CronJob
    name: example-2
  updatePolicy:
    updateMode: "Initial"
    minReplicas: 1

Create the following two CronJobs:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-1
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: example-1
              image: example:latest
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-2
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: example-2
              image: example:latest

Using kubectl describe pod, see the following annotations on one of the CronJobs' pods:

Annotations:          vpaObservedContainers: example-1
                      vpaUpdates: Pod resources updated by example-2-vpa: container 0: 

When this happens, the other CronJob's pods look correct and have their CPU and memory correctly set by the VPA:

Annotations:          vpaObservedContainers: example-2
                      vpaUpdates: Pod resources updated by example-2-vpa: container 0: cpu request, memory request

Which one gets mixed up seems to depend on the order in which the VPAs are created.

Anything else we need to know?:

I pared down the two CronJob resources to simplify them.

@WesCossick WesCossick added the kind/bug label Nov 14, 2024
@adrianmoisey
Member

/area vertical-pod-autoscaler

@raywainman
Contributor

raywainman commented Nov 15, 2024

I believe this issue would happen with the VPA deployed from OSS as well.

The way VPA knows which pods to target is that it will fetch the CronJob object:

https://github.com/bskiba/autoscaler/blob/8ff3b4fd47bcba514c9e904b6bee481ff0f3703e/vertical-pod-autoscaler/pkg/target/fetcher.go#L187-L192

It then reads the apiObj.Spec.JobTemplate.Spec.Template.Labels value and uses it for pod selection.

From your example above, you are not setting any labels in your jobTemplate, which basically means it matches all pods in the same namespace (I'd have to confirm in the code to be 100% sure, but this is fairly likely).

So now the two VPAs are essentially racing and competing against each other.

To fix this, you'll need to set some labels in your two CronJob templates to make sure you are differentiating them from each other, along the lines of the sketch below.
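Something like this (a minimal sketch; the app label key and its values are just illustrative, any label that is unique to each CronJob's pod template should work):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-1
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: example-1  # unique per CronJob, so each VPA only selects its own pods
        spec:
          restartPolicy: OnFailure
          containers:
            - name: example-1
              image: example:latest

And the same for example-2 with its own value (for example app: example-2), so the two VPAs no longer select each other's pods.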

Hope that helps :)

@WesCossick
Author

@raywainman Yep, that seems to be the issue! In the real environment where we were doing experimentation, the CronJob objects did actually have one label, but it was identical for each CronJob. Providing unique labels to each worked around the issue.

Not sure if y'all are considering this a bug, though. If not, at a minimum it would be helpful if this behavior was documented somewhere clearly so that it doesn't confuse others in the same way.

@raywainman
Contributor

That's a great idea, it's an easy pitfall.

I thought we had something documenting this but I actually couldn't find anything.

Let me see if I can put together a PR with this.

Thanks @WesCossick!
