[ENH] Adding binning capabilities to decision trees #23
@jovo and @jshinm here's a draft issue for the binning problem for decision trees. I am parsing @p-teng's proposed OG changes. Currently there is too much copy-pasting from the original "binning" files to make it work for decision trees; the sklearn devs will prefer/want something with a short diff. Basically, we need to determine whether we can just import (or lightly modify) the existing binning code so that decision trees can also be subbed in. @p-teng if you have any comments based on when you tried it, please let me know, e.g. whether there were any technical reasons you needed to copy/paste the existing binning files. CC @PSSF23 in case you're interested.
As mentioned in my comment in #22, this branch should be more helpful if you want to compare the changes Philip made: https://github.com/PSSF23/scikit-learn-stream/tree/hist. Overall, the work does not look complete to me.
Unfortunately, as with #22, Philip's changes are still very much incomplete. There wasn't a need to copy certain code. Moreover, the binning capability needs to be consistent between the fit and predict phases. Finally, the code itself does not work or compile and is simply a "skeleton" of what needs to be done. To eventually make things work, I think this can be as simple as adding a few LOC to each Python function to enable the trees to bin the data, possibly refactoring out a private function that does this. I've begun doing this in #25. I think if @jshinm is interested in this kind of scientific software work and in getting experience developing OSS, then this is a good opportunity, since the changes needed to make everything work are considerably smaller than those for Oblique Forests. We'll see what @jovo thinks. This will probably require a bit more work + testing, so I don't know if I have the time to dedicate to it.
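To make the fit/predict consistency point concrete, here is a minimal sketch (with hypothetical helper names, not the actual #25 implementation): bin edges are learned once at fit time, stored on the estimator, and reused verbatim at predict time.

```python
# Minimal sketch -- hypothetical helper names, not the #25 implementation.
import numpy as np


class BinnedDataMixin:
    """Bins X with quantile edges learned during fit and reused at predict."""

    def _bin_fit(self, X, max_bins=255):
        # Learn per-feature quantile bin edges from the training data.
        quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
        self.bin_edges_ = [np.quantile(X[:, j], quantiles) for j in range(X.shape[1])]
        return self._bin_transform(X)

    def _bin_transform(self, X):
        # Map each value to its bin index using the *stored* edges, so
        # predict-time data lands in the same bins as the training data.
        X_binned = np.empty(X.shape, dtype=np.float64)
        for j, edges in enumerate(self.bin_edges_):
            X_binned[:, j] = np.searchsorted(edges, X[:, j])
        return X_binned
```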
@adam2392 Thank you for your insights, Adam! This is definitely an interesting project, and some guidance on your end will certainly keep the ball rolling 😄
Another note from the sklearn devs: in order to add binning capabilities to the trees, we would most likely need to enable it for all trees, meaning RandomForests and ExtraTrees as well. Then I think we would want to show that there is a relevant performance/efficiency tradeoff. This seems like quite a bit of work, so I think a simple sketch of what the code needs to look like is the ideal first step, and discussing with the sklearn devs is the next one. Ideally someone from NDD can do this, because it's a straightforward coding project IMO (assuming you know some Cython and how to structure Python files).
It seems the issue is that the BinMapper (https://github.com/scikit-learn/scikit-learn/blob/b75b5aaddef89931d372ab6a22759eaf8385e6b1/sklearn/ensemble/_hist_gradient_boosting/binning.py#L276-L279) transforms the data to uint8 to conserve RAM, but reusing it here would require generalizing the tree Splitter to accept uint8 dtypes, which is not trivial. I think for now, to enable binning at this scale, we can just sacrifice RAM (i.e. create a binned copy of the dataset as np.float64, which obviously uses more RAM than necessary). A path to getting around this is adding a fused type definition for places like this: https://github.com/scikit-learn/scikit-learn/blob/b75b5aaddef89931d372ab6a22759eaf8385e6b1/sklearn/tree/_splitter.pyx#L708. That way, the class could work for both DTYPE_t (i.e. np.float64) and e.g. uint8.
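A rough sketch of the RAM tradeoff described above. Note that `_BinMapper` is a private scikit-learn class (import path and `n_bins` parameter are subject to change between versions), and the array shapes here are purely illustrative:

```python
# Sketch of the RAM tradeoff: bin via the existing (private) _BinMapper,
# then cast back to float64 because the current Splitter does not take uint8.
import numpy as np
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 100))  # ~80 MB as float64

mapper = _BinMapper(n_bins=256)          # bins each feature into uint8
X_binned_u8 = mapper.fit_transform(X)    # ~10 MB, 1/8 the RAM

# Until the Splitter accepts uint8 via a fused type, sacrifice RAM by
# casting the binned copy to the dtype the existing code path handles:
X_binned = X_binned_u8.astype(np.float64)
print(X_binned_u8.nbytes / 1e6, X_binned.nbytes / 1e6)  # ~10 MB vs ~80 MB
```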
#### Reference Issues/PRs
Closes: #25
Closes: #23

#### What does this implement/fix? Explain your changes.
Adds preliminary capability for binning features. The downside is that we need to "bin" again at predict time. I've documented how we can get around this in the README, but it will involve some heavy-duty Cython coding.

- Removed the Oblique tree models and migrated them to scikit-tree
- Fixed up CI so that unnecessary workflows are not run on the fork
- Updated the documentation with the current limitation of binning

---------

Signed-off-by: Adam Li <adam2392@gmail.com>
Describe the workflow you want to enable
I am part of the @neurodata team. Binning features has proven highly efficient, with little loss in performance, in gradient-boosted trees. This feature should not be limited to gradient-boosted trees; it should be available within all decision trees [1].
By including binning as a feature for decision trees, we would enable massive speedups for decision trees that operate on high-dimensional data (in both features and sample sizes). This would be an additional tradeoff that users can opt into. The intuition behind binning for decision trees is exactly that of gradient-boosted trees: split search only has to scan at most `max_bins` candidate thresholds per feature, rather than one per unique sample value.
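This is not the proposed implementation, but the tradeoff can already be approximated today in user code with the public `KBinsDiscretizer`, which is roughly what binning inside the tree would automate (and do far more efficiently):

```python
# Rough user-level approximation of the tradeoff using public sklearn pieces
# (not the proposed implementation): pre-bin features, then fit a tree.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 50))
y = (X[:, 0] + 0.1 * rng.standard_normal(50_000) > 0).astype(int)

# Quantile binning into at most 255 ordinal levels per feature.
binner = KBinsDiscretizer(n_bins=255, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

tree = DecisionTreeClassifier(random_state=0).fit(X_binned, y)
# At predict time the SAME binner must be applied (the consistency issue above).
print(tree.score(binner.transform(X), y))
```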
Describe your proposed solution
We propose introducing binning to the decision tree classifier and regressor.
An initial PR is proposed here: #24. However, it seems that many of the files were copied, and it is not 100% clear whether that was needed. Perhaps we can explore how to consolidate the `_binning.py`/`.pyx` files with the current versions under `ensemble/_hist_gradient_boosting/*`.

Changes to the Cython codebase
TBD
Changes to the Python API
Two new parameters would be added to DecisionTreeClassifier and DecisionTreeRegressor, where the default number of bins follows that of HistGradientBoosting.
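A hypothetical sketch of what the API could look like. The parameter names below are illustrative only (they do not exist upstream); the default of 255 mirrors `max_bins` in the HistGradientBoosting estimators:

```python
# Hypothetical API sketch -- parameter names are illustrative, not final,
# and this does not run against upstream scikit-learn.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    hist_binning=True,  # hypothetical: opt in to pre-binning the features
    max_bins=255,       # hypothetical: mirrors HistGradientBoosting's default
)
```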
Additional context
These changes can also trivially be applied to Oblique Trees.
References:
[1] Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree," NeurIPS 2017. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree