Kill k8s pods earlier when specific errors appear #795
Comments
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗
Kubernetes is designed to work asynchronously so I don't think we should kill the pods straight away. For example, if some other resource takes time to update, or if there's a temporary glitch with the container registry, retrying is the correct behaviour.
I agree that ideally, we shouldn't kill the pods straight away. However, in rare cases where retrying is not effective regardless of the underlying issue, I believe that we could kill the pod directly in order to save time and resources. I'm probably noticing this more because I have a timeout of 20 minutes.
I'll +1 this just because even with a 5 minute timeout, it does get a little cumbersome if you accidentally start a server with an incorrect configuration, e.g. a misspelled image name, a wrong config value, etc. Maybe another solution is to introduce an alternative "max-retries" instead of "timeout"?
Proposed change
In our current setup, when we create a server, I can encounter any of the following errors:
"probe failed", "ErrImagePull", "ImagePullBackOff".
Example:
2023-09-21T14:50:07Z [Warning] Readiness probe failed: Get "http://{my_ip}:8658/health/ready": dial tcp {my_ip}:8658: connect: connection refused
When this occurs, I have to wait for kubespawner to time out. In my case, the timeout is set to 20 minutes because occasionally I need to pull a large docker image.
It would be great to have a feature that automatically deletes the pods when certain Kubernetes errors (error messages provided by the users) are detected to avoid unnecessary waiting.
Alternative options
Who would use this feature?
Anyone that wants to kill their pods earlier if certain error messages appear.
(Optional): Suggest a solution
My simple patch is for version 4.3.0 (I can't update to a newer version because of our k8s version). I think the code explains it better than words, but the idea is to have a variable like
self.kill_messages = ["probe failed", "ErrImagePull", "ImagePullBackOff"]
with the error messages that must kill the pod when found. If any of these errors is found, make exponential_backoff raise an error (the code can definitely be improved).
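A rough, self-contained sketch of the idea follows. It is not the actual patch and does not use kubespawner's API: `FatalPodError`, `find_kill_message`, `wait_for_pod`, and the `is_ready` / `get_event_messages` callables are all hypothetical names made up for illustration. It only shows the fail-fast logic: poll until the pod is ready, but raise as soon as an event message contains one of the configured kill messages instead of waiting for the full timeout.

```python
import asyncio

# Hypothetical default, mirroring the kill_messages list from the issue.
KILL_MESSAGES = ["probe failed", "ErrImagePull", "ImagePullBackOff"]


class FatalPodError(Exception):
    """Raised when a pod event matches one of the kill messages."""


def find_kill_message(event_messages, kill_messages=KILL_MESSAGES):
    """Return the first kill message contained in any event text, else None."""
    for text in event_messages:
        for needle in kill_messages:
            # Case-insensitive substring match on the event text.
            if needle.lower() in text.lower():
                return needle
    return None


async def wait_for_pod(is_ready, get_event_messages, timeout=1200, interval=1.0):
    """Poll until is_ready() returns True, failing fast on a kill message.

    is_ready and get_event_messages are caller-supplied callables
    (assumptions for this sketch, not real kubespawner hooks).
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        if is_ready():
            return
        matched = find_kill_message(get_event_messages())
        if matched is not None:
            # The caller would catch this and delete the pod.
            raise FatalPodError(f"pod event matched {matched!r}; giving up early")
        await asyncio.sleep(interval)
    raise TimeoutError(f"pod did not become ready within {timeout} seconds")
```

Note that "probe failed" can also show up transiently while a healthy server is still starting, so in practice the list would need to be opt-in and chosen carefully by whoever configures it.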