
Improvements: low hanging fruits #104

Open
PathosEthosLogos opened this issue Dec 21, 2023 · 5 comments
Labels
pipeline Code that glues all the model bits together

Comments

@PathosEthosLogos

PathosEthosLogos commented Dec 21, 2023

Hi there, I saw the talk at Rconf. It looks like you've done a lot of work here, and it looks solid, though there are improvements that can be made. After a quick glance at the repo, I have some low hanging fruit suggestions that I think should be seriously considered.

I will start with a bit of a sanity check -- in the README, it says:

The total amount of RAM needed for overall model fitting is around 6GB

Is 6GB RAM a typo? In the talk, Dan says that this model takes a few days to run, but 6GB is below average (some might argue below half the average) for today's $500 consumer-grade home laptop. It was confusing why AWS would be necessary, or why the run would take a few days with LGBM. I saw the 5 download files, and their combined size was well below 1GB; the joined data is probably a different story, which would make 6GB of RAM sound about right for modelling (and that is what makes it puzzling). For an office handling $16B in assets, a GPU plus some extra RAM at a cost of $400 to $1,500 is virtually nothing in comparison and would serve the public much better. Some clarification on this would be appreciated.

So this leads me to the low hanging fruit: better code/lib choices.

I see a lot of dplyr functions, magrittr's %>% instead of |>, and so on. Switching to tidytable would be an extremely low hanging fruit while keeping 99% of the dplyr codebase intact. You can remove all the dplyr::, tidyr::, tidyselect::, and purrr:: prefixes (I might be missing a few more), add library(tidytable); conflict_prefer_all(winner = "tidytable", quiet = TRUE), and the remaining 1% of changes would be the 12 places where you use group_by. Here is a demo of the very simple changes, with a sketch of the setup after it:

# dplyr: group, mutate, then ungroup
library(dplyr)
df |>
  group_by(column_group) |>
  mutate(column_new = column_old + 1) |>
  ungroup()

# tidytable: grouping is scoped to the single mutate() call via .by
library(tidytable)
df |>
  mutate(.by = column_group,
         column_new = column_old + 1)
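
And the one-time setup is just as small; a minimal sketch (conflict_prefer_all() comes from the conflicted package, which would be a new but tiny dependency):

# One-time setup: attach tidytable and resolve name collisions in its
# favor, so existing dplyr-style calls dispatch to tidytable's
# data.table-backed implementations unchanged.
library(conflicted)
library(tidytable)

conflict_prefer_all(winner = "tidytable", quiet = TRUE)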

And the performance gains would feel like switching from XGBoost to LGBM, while leaving 99% of the code untouched.

Plus, though a very minor gain, |> is faster than %>%. This also removes the dependency on the magrittr library.
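
To see the (admittedly tiny) difference for yourself, here is a micro-benchmark sketch; it assumes the bench package is installed, and the exact numbers will vary by machine:

# |> is expanded to a plain function call at parse time, so it has no
# runtime overhead; %>% is an ordinary magrittr function call.
library(magrittr)
bench::mark(
  native   = 1:1e5 |> sqrt() |> sum(),
  magrittr = 1:1e5 %>% sqrt() %>% sum()
)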

And lastly, base R is slower than C++-based libraries like tidytable. Though this is not as low of a hanging fruit, converting base R code to tidytable makes the code much easier to read, more collaborative, and reduces human communication time.
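
As a sketch of what such a conversion looks like (using the built-in mtcars data; the grouping is handled by tidytable's .by argument):

# base R: grouped mean via aggregate()
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# tidytable: same result, arguably easier to read at a glance
library(tidytable)
mtcars |>
  summarize(.by = cyl, mean_mpg = mean(mpg))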

And as a minor nitpick bonus, some of your code comments are 2x to 5x longer than the code itself. This shouldn't be necessary, especially with tidy-piped syntax, where the code is already easy to read.

@dfsnow
Member

dfsnow commented Dec 21, 2023

Hi @PathosEthosLogos! Thanks for the input. We really do appreciate anyone who takes the time to give us feedback. To answer your questions/suggestions:

Is 6GB RAM a typo? […] It was confusing why AWS would be necessary, or why the run would take a few days with LGBM. […] Some clarification on this would be appreciated.

To clarify: it takes roughly 30 hours to run a full cross-validation cycle that searches a huge portion of the hyperparameter space. Training a single LightGBM model using the finalized parameters and data takes <10 minutes.

We have some open issues (#36, #38) to reduce the CV time, but they aren't high priority. Boosted tree models tend to have a hyperparameter "plateau", i.e. a stable region of parameters where tuning further doesn't change performance much.
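
For concreteness, the cheap final step looks roughly like the sketch below; the data objects and parameter values are placeholders, not the pipeline's actual configuration:

# Rough sketch of training a single LightGBM model with finalized
# hyperparameters (X, y, and all parameter values are placeholders)
library(lightgbm)
dtrain <- lgb.Dataset(data = as.matrix(X), label = y)
params <- list(
  objective = "regression",
  learning_rate = 0.05,
  num_leaves = 100
)
fit <- lgb.train(params = params, data = dtrain, nrounds = 1000)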

RE: Hardware and AWS. I did a deep dive benchmarking different models, hardware, and costs. The tl;dr is that it's cheapest and most efficient to simply provision AWS Batch jobs (with any hardware specs we want) to run our models. This lets us run many jobs in parallel and spin down the jobs when we aren't modeling (which is most of the year).

So this leads me to the low hanging fruit: better code/lib choices.

I'm a big data.table booster myself (I recently tried a native data.table rewrite of some stages of the pipeline), but the performance improvements (while substantial) aren't really worth the refactoring cost.

The vast majority of the pipeline's runtime comes from the model itself, or calculating feature importance, SHAP values, etc. Given the choice between refactoring for a marginal speedup of something that already works and trying new modeling methods/features/diagnostics, we're opting for the latter.

And as a minor nitpick bonus, some of your code comments are 2x to 5x longer than the code itself. This shouldn't be necessary, especially with tidy-piped syntax, where the code is already easy to read.

This is a totally fair point 😄. At some point we need to go through and clean up the old comments.

@PathosEthosLogos
Author

It seems that there is a bit of a misunderstanding, so I will try to clarify. (The AWS + consumer-grade GPU part isn't really a low hanging fruit anymore, so I will skip it, as well as the LGBM part, since it requires a bit more work around saving model objects for production.)

I want to emphasise the low hanging fruit of tidytable (definitely not data.table(!), which is tidytable's backend). The code changes would be minimal (link provided) and would work identically, just faster, so I think there is a wee bit of misunderstanding about what you described as costs. With conflicted overrides of the dplyr functions (among a few others) used inside tidymodels, it should also speed up the modelling time at least a bit.

@dfsnow
Member

dfsnow commented Dec 23, 2023

Ah, got it. Sure, I'm happy to test out tidytable in the next few weeks. I'd also be delighted to review a PR with your proposed changes 👀

@dfsnow dfsnow added this to the 2024 pipeline upgrades milestone Dec 27, 2023
@dfsnow dfsnow added the pipeline Code that glues all the model bits together label Dec 27, 2023
@dfsnow
Member

dfsnow commented Jan 14, 2024

@Damonamajor or @wagnerlmichael, do one of you want to take a crack at this in the upcoming week?

@Damonamajor Damonamajor self-assigned this Jan 19, 2024
@dfsnow
Member

dfsnow commented Jan 24, 2024

Looks like using tidytable would require some significant refactoring to handle the complicated ifelse() logic we have in the assess stage. Right now, the speed benefit isn't really worth it, so I'm punting this issue to later this year.
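
For the curious, the pattern at issue looks roughly like the hypothetical snippet below (made-up column names, not the actual assess-stage code):

# Nested base ifelse() branching of this kind is what makes the swap a
# restructuring job (e.g. into case_when()) rather than a find-and-replace.
df |>
  mutate(
    status = ifelse(
      is.na(assessed_value), "missing",
      ifelse(assessed_value > prior_value * 1.5, "review", "ok")
    )
  )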

@dfsnow dfsnow removed this from the 2024 pipeline upgrades milestone Jan 24, 2024