
Investigate why certain runners are offline, including cases where runners are actually online and consuming jobs #14248

Closed
tt-rkim opened this issue Oct 24, 2024 · 3 comments
Assignees: tt-rkim
Labels: infra-ci (infrastructure and/or CI changes), machine-management, P1

tt-rkim (Collaborator) commented on Oct 24, 2024

No description provided.

tt-rkim added the infra-ci, P1, and machine-management labels on Oct 24, 2024
tt-rkim self-assigned this on Oct 24, 2024
tt-rkim (Collaborator, Author) commented on Oct 25, 2024

An example was vm-77. I don't have the links, but it's not a big deal.

It looks like when a GitHub runner's host is completely out of disk space, the runner's various internal services start to die. I noticed stack traces in the runner service's logs; it was probably unable to write its own service logs.

This prevents the runner from calling home to GitHub to listen for jobs.
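
For anyone hitting this later, a quick way to confirm the condition on a suspect host (assuming the runner was installed as a systemd service; the exact unit name depends on how it was configured):

```bash
# Check whether the root filesystem is effectively full.
df -h /

# Look for stack traces / write failures in the runner service logs.
# Unit names vary (e.g. actions.runner.<org>-<repo>.<name>.service),
# so glob-match them -- adjust to whatever the host actually uses.
journalctl -u 'actions.runner.*' --since "1 hour ago" | tail -n 50
```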

The fix was to clear disk space.

@ttmchiou @TT-billteng I put a check into the reset scripts for VMs that looks for >=99% disk usage and, when it triggers, does an `rm -rf .cache`, a `docker system prune`, and deletes some of the Docker logs.
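
For context, a minimal sketch of what that check could look like (the threshold and cleanup steps mirror the description above; paths and specifics are illustrative, not the actual reset-script code):

```bash
#!/bin/bash
# Hypothetical sketch of the disk-usage guard described above;
# not the actual reset script.

usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')

if [ "${usage}" -ge 99 ]; then
    echo "Disk usage at ${usage}%; cleaning up"
    rm -rf "${HOME}/.cache"
    docker system prune -f
    # Truncate container logs instead of deleting the files, so
    # running containers keep valid file handles on their logs.
    sudo find /var/lib/docker/containers -name '*-json.log' \
        -exec truncate -s 0 {} +
fi
```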

tt-rkim closed this as completed on Oct 25, 2024
TT-billteng (Collaborator) commented

Should the check kick in earlier?

tt-rkim (Collaborator, Author) commented on Oct 25, 2024

How much earlier?

It's right after the check for whether the cards are in use in the reset script.
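
Roughly, the flow is (a sketch; the function names are illustrative, not the real script):

```bash
# Illustrative ordering only -- names are not from the actual script.
check_cards_not_in_use || exit 0   # existing guard: bail if cards are busy
check_disk_usage_and_clean         # the new >=99% disk-usage check
reset_cards                        # proceed with the reset itself
```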
