Switching service selector causes small amount of downtime with NEGs #638
Comments
@freehan Any thoughts here?
Can you allow some period of overlap during the transition? For instance, suppose you have blue pods and green pods, and the service initially points to the blue pods. During the switch, let the selector match both blue and green pods for a short window before dropping the blue pods. This should help. Please let us know if this still causes downtime.
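A minimal sketch of what that overlap could look like, assuming hypothetical `app`/`track` labels and a Service named `my-service` (names are illustrative, not from this thread): the selector temporarily drops the `track` key so it matches both colors, then is narrowed to green.

```yaml
# Stage 1: selector matches only the blue pods.
# Stage 2 (overlap): remove the `track` key so both blue and green pods receive traffic.
# Stage 3: set `track: green` so only green pods remain.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    cloud.google.com/neg: '{"ingress": true}'  # container-native load balancing (NEG)
spec:
  type: ClusterIP
  selector:
    app: my-app
    track: blue      # stage 1; delete this line for the overlap, then set it to "green"
  ports:
    - port: 80
      targetPort: 8080
```

The idea is that the new (green) endpoints get added to the NEG and pass health checks while the old (blue) endpoints are still serving, so endpoint removal never races ahead of endpoint addition.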
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen
@freehan @bowei I'm also facing this issue, and I'm not sure how to best revive this discussion. Re: a short overlap, the point of a blue/green deploy is to not have any overlap at all. It seems that with NEGs there are always a few seconds of 502s when changing the selector; it varies between 1~5s to sometimes more towards 20~30s. Without NEGs (in that case using a NodePort service, of course) there's a pretty large window of time in which requests continue to flow to the old pods. The only thing that actually works properly is a LoadBalancer service without an Ingress, in which case it switches perfectly.
It seems like the issue is really that when the new endpoints are added to the NEG, their health checks are UNKNOWN for a few seconds (regardless of whether the pods already exist / are ready or not), and the load balancer returns 502s for those ~5 seconds. Changing the health check interval doesn't matter; whether it's 1s or 30s, it's always ~5s of 502s...
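One way to observe this is to poll the health of the NEG-backed backends while the selector is being switched. A rough sketch, assuming the backend service created by the Ingress is named `my-backend-service` (look up the real name first; the name here is illustrative):

```shell
# List the backend services the Ingress controller created.
gcloud compute backend-services list

# Poll endpoint health during the switch: old endpoints drop out of HEALTHY
# while new ones appear as UNKNOWN/UNHEALTHY until the health checks pass.
watch -n 1 gcloud compute backend-services get-health my-backend-service --global
```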
@rubensayshi Yes. You pointed out the key gap. The 502s are caused by the race between 2 parallel operations: 1. old endpoints getting removed, and 2. new endpoints getting added. If 2 is slower than 1, then you end up with 502s during the transition. I would suggest replacing
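One way to avoid that race, building on the overlap idea sketched earlier (this is an assumption-laden sketch, reusing the hypothetical `my-service` / `track` names, not something the thread spells out), is to sequence the two operations yourself: first widen the selector so the green endpoints are added and become healthy, then narrow it to drop the blue ones.

```shell
# 1. Widen the selector: drop the `track` key so both blue and green pods match.
kubectl patch service my-service --type=json \
  -p='[{"op": "remove", "path": "/spec/selector/track"}]'

# 2. Wait until the green endpoints are reported HEALTHY by the load balancer
#    (e.g. via `gcloud compute backend-services get-health ...` as above).

# 3. Narrow the selector to green only; the blue endpoints are removed while
#    healthy green endpoints are already serving.
kubectl patch service my-service --type=merge \
  -p='{"spec": {"selector": {"app": "my-app", "track": "green"}}}'
```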
FYI, we are building out something that would allow a more seamless transition for blue/green deployments. It is going to be exposed in a higher-level API, so one would not need to tweak the service selector.
Hey @freehan, if there's something better on the horizon then I think we'll settle for the almost-perfect solution with the. Is that "something better" to be expected in the near future? End of Q2 or beginning of Q3? And how can I make sure I won't miss its release; will it be listed in the GKE release notes? https://cloud.google.com/kubernetes-engine/docs/release-notes And thanks for replying, especially to such an old issue <3
Google acknowledged they have an issue, and there is a public bug we can track: https://issuetracker.google.com/issues/180490128 (i.e. people still stumble upon this one).
Nice, thanks for sharing @haizaar, I'll subscribe to that issue.
Had a chance to talk to a GKE PM regarding this problem. While they won't fix the current issue (which is caused by NEG switching), they do have plans to alleviate it by supporting the K8s Gateway API, which is really great since we would be able to do at least basic traffic management and routing without bringing in heavyweights like Istio. If you are on GKE, reach out to your GCP account manager to join a private preview of this feature once it's available, or use other K8s Gateway implementations that already support it. /CC @keidrych
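For context, the kind of traffic management the Gateway API enables for blue/green is weighted routing between two Services, rather than flipping one Service's selector. A minimal sketch (the `my-route`, `my-gateway`, `my-app-blue`, and `my-app-green` names are illustrative, and the Gateway itself is assumed to exist):

```yaml
# Shift traffic from the blue Service to the green Service by adjusting weights.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  parentRefs:
    - name: my-gateway
  rules:
    - backendRefs:
        - name: my-app-blue
          port: 80
          weight: 0     # drain blue
        - name: my-app-green
          port: 80
          weight: 100   # send all traffic to green
```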
I have implemented a blue/green update strategy for my deployment by updating a version label in the deployment template. I then update the service selector to use the new version label when enough pods of the new version are ready.
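A rough sketch of this kind of setup (names like `my-app` and the `version` values are illustrative), assuming container-native load balancing via the NEG annotation:

```yaml
# Deployment for the new version; the old deployment is identical apart from
# the `version` label and image tag.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-v2
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
      version: v2
  template:
    metadata:
      labels:
        app: my-app
        version: v2
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:v2
          ports:
            - containerPort: 8080
---
# Service used by the Ingress; the cut-over is done by changing
# spec.selector.version from v1 to v2 once enough v2 pods are Ready.
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
spec:
  selector:
    app: my-app
    version: v1   # flipped to v2 during the cut-over
  ports:
    - port: 80
      targetPort: 8080
```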
I have found that this strategy causes about 5-10 seconds of downtime (HTTP 502s) when using NEGs. The service events show:
There were two pods of the old version (matched by the old service selector) and one pod of the new version at this point.
I'm wondering if there is a way to avoid the downtime caused by changing service selectors?
I suspect this issue is related to https://cloud.google.com/kubernetes-engine/docs/how-to/container-native-load-balancing#scale-to-zero_workloads_interruption (#583).