[ENH] Adding binning capabilities to decision trees #23
@jovo and @jshinm here's a draft issue for the binning problem for decision trees. I am parsing @p-teng's proposed OG changes. Currently there is too much copy-pasting from the original "binning" files to make it work for decision trees; the sklearn devs will prefer/want something with a short diff. Basically, we need to determine whether we can just import (or lightly modify) the existing binning code so that decision trees can also be subbed in. @p-teng if you have any comments based on when you tried it, please let me know, e.g. whether there were any technical reasons you needed to copy/paste the existing binning files. CC @PSSF23 in case you're interested.
As mentioned in my comment in #22, this branch should be more helpful if you want to compare the changes Philip made: https://github.com/PSSF23/scikit-learn-stream/tree/hist. Overall, the work does not look complete to me.
Unfortunately, as with #22, Philip's changes are still very much incomplete. There wasn't a need to copy certain code. Moreover, the binning capability needs to be consistent between the fit and predict phases. Finally, the code itself does not work or compile and is simply a "skeleton" of what needs to be done. To eventually make things work, I think this can be as simple as adding a few LOC to each Python function to enable the trees to bin the data, possibly refactoring out a private function that does this. I've begun doing this in #25. I think if @jshinm is interested in this kind of scientific software work and in getting experience developing OSS, then this is a good opportunity, since the changes needed to make everything work are considerably smaller than those for Oblique Forests. We'll see what @jovo thinks. This will probably require a bit more work + testing, so I don't know if I have the time to dedicate to it.
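To make the fit/predict consistency point concrete, here is a minimal sketch (with hypothetical helper names, not the actual #25 implementation): bin edges are learned once at fit time, stored on the estimator, and reused verbatim at predict time.

```python
# Minimal sketch -- hypothetical helper names, not the #25 implementation.
import numpy as np


class BinnedDataMixin:
    """Bins X with quantile edges learned during fit and reused at predict."""

    def _bin_fit(self, X, max_bins=255):
        # Learn per-feature quantile bin edges from the training data.
        quantiles = np.linspace(0, 1, max_bins + 1)[1:-1]
        self.bin_edges_ = [np.quantile(X[:, j], quantiles) for j in range(X.shape[1])]
        return self._bin_transform(X)

    def _bin_transform(self, X):
        # Map each value to its bin index using the *stored* edges, so
        # predict-time data lands in the same bins as the training data.
        X_binned = np.empty(X.shape, dtype=np.float64)
        for j, edges in enumerate(self.bin_edges_):
            X_binned[:, j] = np.searchsorted(edges, X[:, j])
        return X_binned
```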
@adam2392 Thank you for your insights, Adam! This is definitely an interesting project, and some guidance on your end will certainly keep the ball rolling 😄
Another note from the sklearn devs: in order to add binning capabilities to the trees, we would most likely need to enable it for all trees, meaning RandomForests and ExtraTrees as well. Then I think we would want to show that there is a relevant performance/efficiency tradeoff. This seems like quite a bit of work, so I think a simple sketch of what the code needs to look like is the ideal first step, and discussing with the sklearn devs is the next one. Ideally someone from NDD can do this, because it's a straightforward coding project IMO (assuming you know some Cython and how to structure Python files).
It seems the issue is that the BinMapper (https://github.com/scikit-learn/scikit-learn/blob/b75b5aaddef89931d372ab6a22759eaf8385e6b1/sklearn/ensemble/_hist_gradient_boosting/binning.py#L276-L279) transforms the data to uint8 to conserve RAM, but reusing it here would require generalizing the tree Splitter to accept uint8 dtypes, which is not trivial. I think for now, to enable binning at this scale, we can just sacrifice RAM (i.e. create a binned copy of the dataset as np.float64, which obviously uses more RAM than necessary). A path to getting around this is adding a fused type definition for places like this: https://github.com/scikit-learn/scikit-learn/blob/b75b5aaddef89931d372ab6a22759eaf8385e6b1/sklearn/tree/_splitter.pyx#L708. That way, the class could work for both DTYPE_t (i.e. np.float64) and e.g. uint8.
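A rough sketch of the RAM tradeoff described above. Note that `_BinMapper` is a private scikit-learn class (import path and `n_bins` parameter are subject to change between versions), and the array shapes here are purely illustrative:

```python
# Sketch of the RAM tradeoff: bin via the existing (private) _BinMapper,
# then cast back to float64 because the current Splitter does not take uint8.
import numpy as np
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 100))  # ~80 MB as float64

mapper = _BinMapper(n_bins=256)          # bins each feature into uint8
X_binned_u8 = mapper.fit_transform(X)    # ~10 MB, 1/8 the RAM

# Until the Splitter accepts uint8 via a fused type, sacrifice RAM by
# casting the binned copy to the dtype the existing code path handles:
X_binned = X_binned_u8.astype(np.float64)
print(X_binned_u8.nbytes / 1e6, X_binned.nbytes / 1e6)  # ~10 MB vs ~80 MB
```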
#### Reference Issues/PRs
Closes: #25
Closes: #23

#### What does this implement/fix? Explain your changes.
Adds preliminary capability for binning features. The downside is that we need to "bin" again at predict time. I've documented how we can get around this in the README, but it will involve some heavy-duty Cython coding.

- Removed the Oblique tree models and migrated them to scikit-tree
- Fixed up CI so that unnecessary workflows are not run on the fork
- Updated the documentation with the current limitation of binning

---------

Signed-off-by: Adam Li <adam2392@gmail.com>
Describe the workflow you want to enable
I am part of the @neurodata team. Binning features has proven highly efficient, with little loss in performance, in gradient-boosted trees. This feature should not be limited to gradient-boosted trees; it should be available within all decision trees [1].
By including binning as a feature for decision trees, we would enable massive speedups for decision trees that operate on high-dimensional data (in both features and sample sizes). This would be an additional tradeoff that users can opt into. The intuition behind binning for decision trees is exactly that of gradient-boosted trees: split search only has to scan at most `max_bins` candidate thresholds per feature, rather than one per unique sample value.
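This is not the proposed implementation, but the tradeoff can already be approximated today in user code with the public `KBinsDiscretizer`, which is roughly what binning inside the tree would automate (and do far more efficiently):

```python
# Rough user-level approximation of the tradeoff using public sklearn pieces
# (not the proposed implementation): pre-bin features, then fit a tree.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 50))
y = (X[:, 0] + 0.1 * rng.standard_normal(50_000) > 0).astype(int)

# Quantile binning into at most 255 ordinal levels per feature.
binner = KBinsDiscretizer(n_bins=255, encode="ordinal", strategy="quantile")
X_binned = binner.fit_transform(X)

tree = DecisionTreeClassifier(random_state=0).fit(X_binned, y)
# At predict time the SAME binner must be applied (the consistency issue above).
print(tree.score(binner.transform(X), y))
```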
Describe your proposed solution
We propose introducing binning to the decision tree classifier and regressor.
An initial PR is proposed here: #24. However, it seems that many of the files were copied, and it is not 100% clear whether that was needed. Perhaps we can explore how to consolidate the `_binning.py`/`.pyx` files with the current versions under `ensemble/_hist_gradient_boosting/*`.

Changes to the Cython codebase
TBD
Changes to the Python API
Two new parameters would be added to DecisionTreeClassifier and DecisionTreeRegressor, where the default number of bins follows that of HistGradientBoosting.
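A hypothetical sketch of what the API could look like. The parameter names below are illustrative only (they do not exist upstream); the default of 255 mirrors `max_bins` in the HistGradientBoosting estimators:

```python
# Hypothetical API sketch -- parameter names are illustrative, not final,
# and this does not run against upstream scikit-learn.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    hist_binning=True,  # hypothetical: opt in to pre-binning the features
    max_bins=255,       # hypothetical: mirrors HistGradientBoosting's default
)
```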
Additional context
These changes can also trivially be applied to Oblique Trees.
References:
[1] Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree," NeurIPS 2017. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree