-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add dispatchable workflow for deleting test model runs #60
Changes from 21 commits
a35baca
c5c2404
3f0b8c1
d70deea
26c61bb
3297673
51843f3
49ea1a6
85bfcf8
61b9352
a3a28fa
5c9ddf6
a090cc1
c624b0b
17fcf90
b6e695d
0e56190
cf4ff3c
e23d34c
6cdb449
b02de7d
a4895c2
09cfa5b
759550d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -0,0 +1,58 @@ | ||||||||||||||
# Workflow that can be manually dispatched to delete test model runs that | ||||||||||||||
# do not need to be persisted indefinitely. | ||||||||||||||
# | ||||||||||||||
# Gated such that it's impossible to delete runs older than the current | ||||||||||||||
# assessment cycle, where each assessment cycle starts in April. | ||||||||||||||
|
||||||||||||||
name: delete-model-runs | ||||||||||||||
|
||||||||||||||
on: | ||||||||||||||
workflow_dispatch: | ||||||||||||||
inputs: | ||||||||||||||
run-ids: | ||||||||||||||
description: > | ||||||||||||||
Comma-delimited list of IDs of model runs to delete (no spaces). Note | ||||||||||||||
that the workflow assumes these IDs correspond to model runs for the | ||||||||||||||
current upcoming assessment cycle, and if that's not the case the | ||||||||||||||
deletion script will raise an error. | ||||||||||||||
required: true | ||||||||||||||
type: string | ||||||||||||||
|
||||||||||||||
jobs: | ||||||||||||||
delete-model-runs: | ||||||||||||||
runs-on: ubuntu-latest | ||||||||||||||
permissions: | ||||||||||||||
# Needed to interact with GitHub's OIDC Token endpoint so we can auth AWS | ||||||||||||||
contents: read | ||||||||||||||
id-token: write | ||||||||||||||
steps: | ||||||||||||||
- name: Checkout repo code | ||||||||||||||
uses: actions/checkout@v4 | ||||||||||||||
|
||||||||||||||
- name: Setup R | ||||||||||||||
uses: r-lib/actions/setup-r@v2 | ||||||||||||||
with: | ||||||||||||||
use-public-rspm: true | ||||||||||||||
|
||||||||||||||
- name: Install system dependencies | ||||||||||||||
run: sudo apt-get install libgit2-dev | ||||||||||||||
shell: bash | ||||||||||||||
jeancochrane marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||
|
||||||||||||||
- name: Disable renv sandbox to speed up install time | ||||||||||||||
run: echo "RENV_CONFIG_SANDBOX_ENABLED=FALSE" >> .Renviron | ||||||||||||||
shell: bash | ||||||||||||||
jeancochrane marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||
|
||||||||||||||
- name: Setup renv | ||||||||||||||
uses: r-lib/actions/setup-renv@v2 | ||||||||||||||
jeancochrane marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||
|
||||||||||||||
- name: Configure AWS credentials | ||||||||||||||
uses: aws-actions/configure-aws-credentials@v4 | ||||||||||||||
with: | ||||||||||||||
role-to-assume: ${{ secrets.AWS_IAM_ROLE_MODEL_DELETION_ARN }} | ||||||||||||||
aws-region: us-east-1 | ||||||||||||||
|
||||||||||||||
- name: Delete model runs | ||||||||||||||
run: Rscript ./R/delete_current_year_model_runs.R "$RUN_IDS" | ||||||||||||||
shell: bash | ||||||||||||||
env: | ||||||||||||||
RUN_IDS: ${{ inputs.run-ids }} | ||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nitpick: Can't you just pass the input directly here?
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This would be indeed be cleaner! I decided to switch the format of the |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
# Script to delete a list of model runs by ID from AWS. | ||
# | ||
# Accepts one argument, a comma-delimited list of run IDs for model runs | ||
# whose artifacts should be deleted. | ||
# | ||
# Assumes that model runs are restricted to the current assessment cycle, where | ||
# each assessment cycle starts in April. Raises an error if no objects matching | ||
# a given ID for the current year could be located in S3. This error will get | ||
# raised before any deletion occurs, so if one or more IDs are invalid then | ||
# no objects will be deleted. | ||
# | ||
# Example usage (delete the runs 123, 456, and 789 in the current year): | ||
# | ||
# delete_current_year_model_runs.R 123,456,789 | ||
|
||
library(glue) | ||
library(here) | ||
library(magrittr) | ||
source(here("R", "helpers.R")) | ||
|
||
current_date <- as.POSIXct(Sys.Date()) | ||
current_month <- current_date %>% format("%m") | ||
current_year <- current_date %>% format("%Y") | ||
|
||
# The following heuristic determines the current upcoming assessment cycle year: | ||
# | ||
# * From April to December (post assessment), `year` = next year | ||
# * From January to March (during assessment), `year` = current year | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this heuristic seem reasonable to you? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Totally! I would bump the cutoff month to May though, as sometimes we still do model touchup stuff in April. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done in a4895c2! |
||
year <- if (current_month < "03") { | ||
current_year | ||
} else { | ||
as.character(as.numeric(current_year) + 1) | ||
} | ||
|
||
# Convert the comma-delimited input to a vector of run IDs. Accepting one or | ||
# more positional arguments would be a cleaner UX, but since this script is | ||
# intended to be called from a dispatched GitHub workflow, it's easier to parse | ||
# one comma-delimited string than split a space-separated string passed as a | ||
# workflow input | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agonized over this interface, perhaps unnecessarily; I'm curious if you agree that a single comma-delimited string is the right compromise! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's fine! Though we might want to use a space-delimited string in order to match build-and-test-dbt. It would have to be quoted i.e. not positional args. I'm really fine with either though as long as it's clear what the input is supposed to be. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Cool, I switched the workflow input to accept a space-delimited string in a4895c2. |
||
raw_run_ids <- commandArgs(trailingOnly = TRUE) | ||
run_ids <- raw_run_ids %>% | ||
strsplit(split = ",", fixed = TRUE) %>% | ||
unlist() | ||
|
||
"Confirming artifacts exist for run IDs in year {year}: {raw_run_ids}" %>% | ||
glue::glue() %>% | ||
print() | ||
|
||
# We consider a run ID to be valid if it has any matching data in S3 for | ||
# the current year | ||
run_id_is_valid <- function(run_id, year) { | ||
return( | ||
model_get_s3_artifacts_for_run(run_id, year) %>% | ||
sapply(aws.s3::object_exists) %>% | ||
any() | ||
) | ||
} | ||
|
||
# We check for validity separate from the deletion operation for two reasons: | ||
# | ||
# 1. The aws.s3::delete_object API does not raise an error if an object does | ||
# not exist, so a delete operation alone won't alert us for an incorrect | ||
# ID | ||
# 2. Even if aws.s3::delete_object could raise an error for missing objects, | ||
# we want to alert the caller that one or more of the IDs were incorrect | ||
# before deleting any objects so that this script is nondestructive | ||
# in the case of a malformed ID | ||
valid_run_ids <- run_ids %>% sapply(run_id_is_valid, year = year) | ||
|
||
if (!all(valid_run_ids)) { | ||
invalid_run_ids <- run_ids[which(valid_run_ids == FALSE)] %>% | ||
paste(collapse = ", ") | ||
|
||
"Some run IDs are missing all S3 artifacts for {year}: {invalid_run_ids}" %>% | ||
glue::glue() %>% | ||
stop() | ||
jeancochrane marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
|
||
"Deleting S3 artifacts run IDs in year {year}: {run_ids}" %>% | ||
glue::glue() %>% | ||
print() | ||
Comment on lines
+79
to
+81
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To test deletion, I copied over run artifacts for two different model runs from In order to delete existing test model runs from There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That works for me! Let's also delete other test runs created during the model infra construction. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sounds good, will do! |
||
|
||
run_ids %>% sapply(model_delete_run, year = year) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: IMO it'd be really helpful here to have a default input showing an example (nonexistent) run ID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, done in a4895c2! Note that the trade-off here is that default arguments are populated as actual values in the workflow input form, rather than placeholder text, so there's a chance of a caller misinterpreting the form and submitting it with the (fake) default values. This should just raise an error and not actually delete any data, however, so the risk is worth it in my view.