
Investigate why certain runners are offline, including cases where runners are actually online and consuming jobs #14248

Closed
tt-rkim opened this issue Oct 24, 2024 · 3 comments
Assignees: tt-rkim
Labels: infra-ci (infrastructure and/or CI changes), machine-management, P1

tt-rkim (Collaborator) commented on Oct 24, 2024

No description provided.

tt-rkim added the infra-ci, P1, and machine-management labels on Oct 24, 2024
tt-rkim self-assigned this on Oct 24, 2024
tt-rkim (Collaborator, Author) commented on Oct 25, 2024

An example was vm-77. I don't have the links, but it's not a big deal.

It looks like when a GitHub runner's host is completely out of disk space, the runner's various internal services start to die. I noticed stack traces in the runner service's logs; it was probably unable to write its own service logs.

This prevents the runner from calling home to GitHub to listen for jobs.
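
For anyone hitting this later, a quick way to confirm the condition on a suspect host (assuming the runner was installed as a systemd service; the exact unit name depends on how it was configured):

```bash
# Check whether the root filesystem is effectively full.
df -h /

# Look for stack traces / write failures in the runner service logs.
# Unit names vary (e.g. actions.runner.<org>-<repo>.<name>.service),
# so glob-match them -- adjust to whatever the host actually uses.
journalctl -u 'actions.runner.*' --since "1 hour ago" | tail -n 50
```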

The fix was to clear disk space.

@ttmchiou @TT-billteng I put a check into the reset scripts for VMs that looks for >=99% disk usage and, when it triggers, does an `rm -rf .cache`, a `docker system prune`, and deletes some of the Docker logs.
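
For context, a minimal sketch of what that check could look like (the threshold and cleanup steps mirror the description above; paths and specifics are illustrative, not the actual reset-script code):

```bash
#!/bin/bash
# Hypothetical sketch of the disk-usage guard described above;
# not the actual reset script.

usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')

if [ "${usage}" -ge 99 ]; then
    echo "Disk usage at ${usage}%; cleaning up"
    rm -rf "${HOME}/.cache"
    docker system prune -f
    # Truncate container logs instead of deleting the files, so
    # running containers keep valid file handles on their logs.
    sudo find /var/lib/docker/containers -name '*-json.log' \
        -exec truncate -s 0 {} +
fi
```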

tt-rkim closed this as completed on Oct 25, 2024
TT-billteng (Collaborator) commented

Should the check kick in earlier?

tt-rkim (Collaborator, Author) commented on Oct 25, 2024

How much earlier?

It's right after the check for whether the cards are in use in the reset script.
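
Roughly, the flow is (a sketch; the function names are illustrative, not the real script):

```bash
# Illustrative ordering only -- names are not from the actual script.
check_cards_not_in_use || exit 0   # existing guard: bail if cards are busy
check_disk_usage_and_clean         # the new >=99% disk-usage check
reset_cards                        # proceed with the reset itself
```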
