
ENH add partial_fit for DecisionTreeClassifier #50

Closed
wants to merge 147 commits

Conversation

PSSF23
Member

@PSSF23 PSSF23 commented Aug 9, 2023

@adam2392 I have revamped the `update` Cython function into `build` as we discussed in #35. Right now I have only modified `DepthFirstTreeBuilder`, as there's no way to control max depth in streaming trees.

@PSSF23
Member Author

PSSF23 commented Aug 9, 2023

Thanks for the notes! I'm still cleaning up the merge conflicts and will get back to your reviews once I have the code running again.

Member Author

@PSSF23 PSSF23 left a comment


@adam2392 In `_update_node` the tree finds the node id of the current node instead of adding a new one. In my opinion, combining the two functions would not necessarily simplify the code, as there would be many `if` statements.

But `_add_node` looks almost exactly the same. Wouldn't it be easier to add an `overwrite=False` parameter to `_add_node`?
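For illustration, here is a plain-Python sketch (hypothetical names, not the real sklearn `_tree.pyx` internals) of how an `overwrite` flag folded into `_add_node` might behave:

```python
# Toy model of the suggestion: one _add_node that either appends a new
# node or overwrites the statistics of an existing child in place.
# All names here are illustrative, not the actual _tree.pyx API.
class Tree:
    def __init__(self):
        self.nodes = []  # each node is a dict of split statistics

    def _add_node(self, parent, is_left, stats, overwrite=False):
        if overwrite:
            # reuse the existing child's id instead of appending
            node_id = self._child_id(parent, is_left)
            self.nodes[node_id].update(stats)
            return node_id
        node_id = len(self.nodes)
        self.nodes.append(dict(stats, parent=parent, is_left=is_left))
        return node_id

    def _child_id(self, parent, is_left):
        # linear scan for the sketch; the real tree stores child ids
        for i, node in enumerate(self.nodes):
            if node["parent"] == parent and node["is_left"] == is_left:
                return i
        raise KeyError((parent, is_left))
```

Updating a node then costs one lookup instead of a second code path, at the price of an `overwrite` branch inside the hot loop.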

sklearn/tree/_tree.pxd (outdated)
Comment on lines 402 to 412
with gil:
if parent in self.initial_roots:
node_id = tree._update_node(parent, is_left, is_leaf,
split_ptr, impurity, n_node_samples,
weighted_n_node_samples,
split.missing_go_to_left)
else:
node_id = tree._add_node(parent, is_left, is_leaf,
split_ptr, impurity, n_node_samples,
weighted_n_node_samples,
split.missing_go_to_left)
Collaborator


This will significantly slow down tree building because you have to reacquire the GIL.

If this `if parent in self.initial_roots` check is necessary, then I think we should just build a C++ hashmap instead, so we don't need the GIL.

Member Author

@PSSF23 PSSF23 Aug 9, 2023


It's an annoying thing, as we have node splitting and node updating mixed together. I agree that it's time consuming, and I will work on simplifying it. Which way do you think is more applicable: separating the node updates, or pushing an indicator onto the stack?

Collaborator


Indicator seems reasonable?
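As a rough plain-Python sketch of the indicator approach (hypothetical names; the real builder works on a Cython stack without the GIL):

```python
from collections import deque

class ToyTree:
    """Toy stand-in for the tree; records which path each node took."""
    def __init__(self):
        self.added, self.updated = [], []

    def add_node(self, node):
        self.added.append(node)

    def update_node(self, node):
        self.updated.append(node)

def build(tree, initial_roots, work):
    # The add-vs-update decision is made once, when the record is
    # pushed, so the hot loop needs no membership check (and, in the
    # Cython version, no GIL).
    stack = deque((node, node["parent"] in initial_roots) for node in work)
    while stack:
        node, is_update = stack.pop()
        if is_update:
            tree.update_node(node)
        else:
            tree.add_node(node)
```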

Alternatively, why not just make `initial_roots` a hashmap? Using ChatGPT, the code seems not too bad:

# hashmap_cython.pyx

# Import the C++ unordered_map so lookups can run without the GIL
from libcpp.unordered_map cimport unordered_map

# MyClass stands in for the tree builder
cdef class MyClass:
    cdef unordered_map[int, int] map

    def __cinit__(self):
        self.map = unordered_map[int, int]()

    cpdef int insert(self, int key, int value) nogil:
        # the unordered_map can be written directly without the GIL
        self.map[key] = value
        return 1

    cpdef int find(self, int key) nogil:
        # count() avoids at(), whose C++ exception could not be
        # caught here: try/except requires the GIL
        if self.map.count(key) > 0:
            return self.map[key]
        return -1

Member Author


It seems very complicated to use tuples as hashmap keys, so I separated the node updates from the node additions.
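For what it's worth, one way around the tuple-key problem (a sketch, not code from this PR) is to pack the `(parent_id, is_left)` pair into a single integer, so a plain `int -> int` hashmap suffices:

```python
def pack_key(parent_id, is_left):
    # keep is_left in the low bit and the parent id in the rest
    return (parent_id << 1) | int(is_left)

def unpack_key(key):
    return key >> 1, bool(key & 1)
```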

Collaborator


What is `initial_roots` storing? A hashmap with the node id as the key, mapped to what value?

Collaborator


Just thinking out different approaches. If the code works as is, then these are just ideas on how to simplify the implementation so we don't have constant maintenance work.

Member Author

@PSSF23 PSSF23 Aug 10, 2023


One advantage of the update method is that it would not be affected as much by upstream changes. Any change to `build` is unlikely to affect it.

Collaborator

@adam2392 adam2392 Aug 11, 2023


True. The issue is whether you will be able to account for monotonic constraints and any other added features easily, though?

I'm actually okay with an implementation as a separate method for now, just to demonstrate that RF, OF and MORF work as we might expect. I can help you refactor it into a consolidated version later on. How does that sound?

This strategy does require you to verify that it works on OF and other downstream trees, though.

Member Author


As I have already implemented the initial roots method, I would stick with it for now. It seems that the errors come from `fit` underperforming, which my code should have no effect on. I'll compare it with the sklearn code tomorrow and see if I can resolve it. It is affecting both the update method and the build/roots method.

Collaborator


Let's see if #52 will fix it

sklearn/model_selection/_split.py (outdated)
sklearn/metrics/tests/test_classification.py (outdated)
Merging latest changes from sklearn main


---------

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Collaborator

Okay, I cleaned up the diff by merging changes from sklearn:main into submodulev2 and then into this branch.

Do you have unit tests you can port from your old PR? That way the CIs can help check things whenever new changes come in.

@adam2392 adam2392 changed the base branch from submodulev2 to submodulev3 August 11, 2023 14:49
@PSSF23 PSSF23 closed this Aug 11, 2023
adam2392 added a commit that referenced this pull request Aug 14, 2023
…fier (#54)

Supersedes: #50 

Implements partial_fit API for all classification decision trees.

---------

Signed-off-by: Adam Li <adam2392@gmail.com>
Co-authored-by: Haoyin Xu <haoyinxu@gmail.com>
adam2392 added a commit that referenced this pull request Sep 8, 2023