-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH add partial_fit for DecisionTreeClassifier #50
Conversation
This reverts commit 6c0a3e3.
Thanks for the notes! I'm still cleaning up the merging conflicts and will get back to your reviews once I have the code running again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adam2392 In
_update_node
the tree finds the node id of the current node instead of adding a new one. In my opinion, combining the two functions would not necessarily simplify the code as there would be many if statements..
But _add_node
looks almost exactly the same. Wouldn't it be easier to add an overwrite=False
to the _add_node
function?
sklearn/tree/_tree.pyx
Outdated
with gil: | ||
if parent in self.initial_roots: | ||
node_id = tree._update_node(parent, is_left, is_leaf, | ||
split_ptr, impurity, n_node_samples, | ||
weighted_n_node_samples, | ||
split.missing_go_to_left) | ||
else: | ||
node_id = tree._add_node(parent, is_left, is_leaf, | ||
split_ptr, impurity, n_node_samples, | ||
weighted_n_node_samples, | ||
split.missing_go_to_left) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will significantly slow down the tree building because you have to access the GIL again.
If this if parent in self.initial_roots check is necessary, then I think we should just build a C++ hashmap instead, so we don't need the GIL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an annoying thing as we have node splitting and node updating mixed together. I agree that it's time consuming, and will work on simplifying it. Which way you think is more applicable, separating the node updates or pushing an indicator onto the stack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indicator seems reasonable?
Alternatively, why not just make initial_roots
a hashmap? Using chatgpt, the code seems not too bad:
# hashmap_cython.pyx
# Import necessary Cython headers
from libcpp.unordered_map cimport unordered_map
# myclass is the treebuilder
cdef class MyClass:
cdef unordered_map[int, int] map
def __cinit__(self):
self.map = unordered_map[int, int]()
cpdef int insert(self, int key, int value) nogil:
# Use NOGIL to allow direct access to the unordered_map
self.map[key] = value
return 1
cpdef int find(self, int key) nogil:
# Use NOGIL to allow direct access to the unordered_map
try:
return self.map.at(key)
except KeyError:
return -1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems very complicated to use tuple as hashmap keys, so I separated the node updates and node additions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is initial roots storing? Hash map of node id as key to what as value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thinking out diff approaches. If the code as is works now then it's just ideas on how to simplify the implementation so we don't have maintainence work constantly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One advantage of the update
method is that it would not be affected as much by any upstream changes. Any changes to build is unlikely to affect it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. The issue is that will you be able to account for monotonic constraints and any other added features easily tho?
I'm okay w/ an implementation as a separate method actually for now tho just to demonstrate that RF, OF and MORF works as we might expect. I can help you refactor into a consolidated version later on. How does that sound?
This strategy requires you to verify that it works tho on OF and other downstream trees tho.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I already implemented the initial roots method I would stick with it for now. It seems that the errors come from fit
underperforming, which my code should have no effects on. I'll compare it with the sklearn code tomorrow and see if I can resolve it. It is affecting both the update method and the build/roots method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see if #52 will fix it
Co-Authored-By: Adam Li <3460267+adam2392@users.noreply.github.com>
Merging latest changes from sklearn main #### What does this implement/fix? Explain your changes. #### Any other comments? <!-- Please be aware that we are a loose team of volunteers so patience is necessary; assistance handling other issues is very welcome. We value all user contributions, no matter how minor they are. If we are slow to review, either the pull request needs some benchmarking, tinkering, convincing, etc. or more likely the reviewers are simply busy. In either case, we ask for your understanding during the review process. For more information, see our FAQ on this topic: http://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention. Thanks for contributing! --> --------- Signed-off-by: Adam Li <adam2392@gmail.com>
Kay I cleaned the diff by merging in changes from Do you have unit-tests you can port from your old PR? The CIs can help check these whenever we get new stuff that way. |
@adam2392 I have revamped the
update
cython function intobuild
as we discussed in #35. Right now I only modifiedDepthFirstTreeBuilder
as there's no way to control max depth in streaming trees.