
Optimizing for duplicate groupby-aggregate operations #8

Open
tedmiddleton opened this issue Jul 21, 2022 · 0 comments


For example, agg::stddev() involves calculating a mean, and agg::mean() involves calculating a sum. Likewise, agg::corr() ends up calculating two means. In the case of something like

```cpp
auto gf = fr1.groupby(_1, _2);
gf.aggregate(sum(_3), sum(_4), mean(_3), mean(_4), stddev(_3), corr(_3, _4));
```

...how many times will we be summing up the elements of each group in _3 and _4? Naively, it would be 4 times for _3 (with sum, mean, stddev, and corr) and 3 times for _4 (with sum, mean, and corr), but it seems like we should be able to cut it down to 1 time each for _3 and _4.

I think the key here is to

  1. do ops in passes: all sums first, then mins and maxes, then means, then stddevs, then regressions, and then corrs
  2. cache each computed value in a dict keyed by column, and check the dict before doing the calculation.

Whatever I do here, I have to make sure I'm not accidentally making it slower with the dictionary lookups themselves.
