Rolling restart results in 502 for ~minute from LoadBalancer #2222
Comments
Kubernetes implemented a new feature that allows zero downtime during rolling updates. It is called TerminatingEndpoints and was implemented in 1.26: https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/ This presentation explains the internal details: https://docs.google.com/presentation/d/1DRZ6PFJHPq7Z3av0HyHVCjeXJ_v5xYYleq2Q4IZStDk/edit?usp=sharing Basically, your pods have to handle the SIGTERM signal and terminate gracefully, which lets the load balancers remove the Pod from the backend rotation before it dies, so new traffic is not blackholed. |
@aojea What about the second scenario, where the old pod is gracefully terminated and removed, and the new pod's NEG is created but not properly attached to the LB backend service? See the second failure mode described here #1718 (comment) How does what you posted help with this case? I distributed the application across the zones, and I still see failures of this kind. |
@Tyrion85 How often are you seeing this? Typically only the first pod in a new zone will have a long startup time. One option is to add a preStop hook to the pods so that the old pod stays up longer and gives the new pod more time to come up. |
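A minimal sketch of the kind of preStop hook suggested above, for readers following along. The container name, image, and the 60-second and 90-second values are illustrative assumptions, not taken from the reporter's manifest. The hook lives under `lifecycle.preStop` in the container spec, and `terminationGracePeriodSeconds` has to be longer than the sleep, because the grace period (30s by default) also covers the time spent in the hook.

```yaml
# Pod template fragment of a Deployment (hypothetical names and values).
spec:
  template:
    spec:
      # Must exceed the preStop sleep below; otherwise the container is
      # killed when the grace period expires, before the sleep finishes.
      terminationGracePeriodSeconds: 90
      containers:
        - name: app                 # hypothetical container name
          image: example/app:1.0    # hypothetical image
          lifecycle:
            preStop:                # field is `preStop`, nested under `lifecycle`
              exec:
                # Keeps the old pod serving while the load balancer
                # removes it from rotation; needs `sleep` in the image.
                command: ["sleep", "60"]
```

SIGTERM is only sent to the container after the hook returns, so the application keeps serving in-flight and late-arriving requests during the sleep.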
@swetharepakula we've started seeing this on every rolling restart now. Previously (no hard data, but until roughly two weeks ago), these were only sporadic, hardly even noticeable. There was no observable change on our side that would cause this, so my feeling is that something is wrong internally on Google's side (as Google manages both the NEG and Ingress controllers) for these to have become so frequent lately in our region/cluster.

preStop hooks are actually in place. It was 20s initially (looking at internal git history, it has been 20s for a year or more) and was bumped to 60s recently, but no dice. I heavily edited the above manifest to remove sensitive info and must have deleted the preStop hook from the example manifest by mistake, my bad. The preStop hook config is a classic sleep, for 60s. The app is just a basic express server, absolutely nothing fancy there. It has been working fine for years with GCE Ingress.

What I noticed today (after scaling to 3 replicas spread across zones, and pinging my endpoint while the restart is in progress) is that 502s are returned only when the last old pod is in Terminating state, with three new pods in Running state. This tells me that the backend is without serving pods, even though three pods have been created and are running (in the span of ~3 minutes). It seems super slow for the Ingress Controller to not attach NEGs in 3 minutes, wouldn't you agree? Yet this is what happens, and has been happening for the last two weeks.

The cluster is very small, 30+ nodes with autoscaler, GKE stable release, but we do use Ingress GCE a lot comparatively (20+ Ingresses, and thus 20+ LoadBalancers in this GCP project). Maybe some issue with scaling for the Ingress Controller?

As it stands right now, our best option seems to be to use some third-party Ingress Controller (such as nginx or Emissary-ingress), which itself would sit behind a NEG (we actually have the nginx ingress controller deployed with a GCP HTTPS LoadBalancer, but don't really use it). As those are restarted fairly infrequently, at the price of one additional internal network hop we potentially gain a lot in stability. |
@Tyrion85 are you able to see that the endpoints are not in the NEG when the 502s are returned? You can look at the NetworkEndpointGroup cloud audit logs to see all the NEG-related operations. I would also recommend opening a support ticket so we can investigate your case more closely. |
I'm in |
@swetharepakula Opening a support ticket seems like the only option at this point. I don't think this even has much to do with the Ingress controller per se. These two images were taken within a couple of seconds of each other, with the Nginx Ingress Controller and an HTTPS LoadBalancer with NEG. As you can see, all pods are up, but the backend has no healthy endpoints. This is an extreme degradation of service reliability IMHO. Right now, it's been 10 minutes, and the NEG controller still hasn't picked up the endpoints correctly. |
This can potentially be of relevance: It explains what might be happening and how to tackle it. |
New home for the documentation: https://cloud.google.com/kubernetes-engine/docs/troubleshooting/load-balancing#500-series-errors @Tyrion85, to clarify, the NEG controller does not control the health status of the endpoints. That is determined by the health check that the Ingress controller sets up when using Ingress. Are you saying that the endpoints specified in the NEG controller are incorrect, or that they aren't healthy immediately? Again, a support case is likely the best approach, as this requires case-by-case analysis, especially if the preStop hook is not working. |
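Since the health status comes from the load balancer health check rather than from the NEG controller, it can help to make that check explicit instead of relying on what the Ingress controller derives. The following is a hedged sketch using the GKE `BackendConfig` resource. All names, the path, the port, and the interval and threshold values are hypothetical placeholders, not values from this issue.

```yaml
# Explicit load balancer health check for a NEG-backed Service.
# Names, path, port, and timings are illustrative assumptions.
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: app-backendconfig          # hypothetical
spec:
  healthCheck:
    type: HTTP
    requestPath: /healthz          # hypothetical; should match the readiness probe
    port: 8080                     # container port when using NEGs (hypothetical)
    checkIntervalSec: 5
    timeoutSec: 3
    healthyThreshold: 1            # probes needed before a new endpoint serves
    unhealthyThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: app                        # hypothetical
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    cloud.google.com/backend-config: '{"default": "app-backendconfig"}'
spec:
  selector:
    app: app                       # hypothetical label
  ports:
    - port: 80
      targetPort: 8080
```

With settings like these, a new endpoint needs roughly `checkIntervalSec` times `healthyThreshold` seconds of passing probes before it receives traffic, which gives a lower bound for how long old pods need to stay in rotation.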
@Tyrion85 is that deployment referenced by more than one |
https://cloud.google.com/kubernetes-engine/docs/troubleshooting/load-balancing#500-series-errors mentions a step adding a preStop hook to all the containers:
The field is named preStop, not preStopHook. Copying the code from the docs results in this error: |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
GKE version v1.24.12-gke.1000, us-west1 regional cluster
Deployment:
Service:
Ingress:
Whenever there is a rolling restart, there is a decent chance that a 502 is returned to clients. A couple of months ago it was sporadic; these days it's every time. During that time, pods are healthy from the standpoint of Kubernetes (the restart of the second pod wouldn't start if the first one was not marked as ready). During the restart, health checks in the GCP UI sometimes show unhealthy NEGs, even though the application is responding to checks (the same checks as those for the kubelet).
502 is returned even when only one pod is replaced - the second old pod is still "alive", and yet the LB returns 502, as if the NEG were already destroyed.
After some unspecified time (15s-60s), NEGs seem to be created/attached correctly and a proper 200 response is returned to clients.
Service events don't have anything useful:
Events in pods:
How do I even approach debugging this issue? Is it a race condition somewhere, or an issue with timeouts in the health checks? From the standpoint of Kubernetes, everything is fine.
Looks same-ish as #1718
I opened another issue for visibility, as this is literally a show-stopper in using this product. Intermittent downtime on every deployment is simply not acceptable in my use-case.