
Optimizing for duplicate groupby-aggregate operations #8

Open
tedmiddleton opened this issue Jul 21, 2022 · 0 comments


For example, agg::stddev() involves calculating a mean, and agg::mean() involves calculating a sum. Likewise, agg::corr() ends up calculating two means. In the case of something like

```cpp
auto gf = fr1.groupby(_1, _2);
gf.aggregate(sum(_3), sum(_4), mean(_3), mean(_4), stddev(_3), corr(_3, _4));
```

...how many times will we be summing up the elements of each group in _3 and _4? Naively, it would be 4 times for _3 (with sum, mean, stddev, and corr) and 3 times for _4 (with sum, mean, and corr), but it seems like we should be able to cut it down to 1 time each for _3 and _4.

I think the key here is to

  1. do ops in passes: all sums first, then mins and maxes, then means, then stddevs, then regressions, and then corrs
  2. cache each computed value in a dict keyed by column, and check the dict before doing the calculation.

Whatever I do here, I have to make sure I'm not accidentally making it slower with the dictionary lookups themselves.
