
ENH add partial_fit for DecisionTreeClassifier #50

Closed
wants to merge 147 commits

Conversation

PSSF23
Member

@PSSF23 PSSF23 commented Aug 9, 2023

@adam2392 I have revamped the `update` Cython function into `build` as we discussed in #35. Right now I have only modified `DepthFirstTreeBuilder`, as there's no way to control max depth in streaming trees.

@PSSF23
Member Author

PSSF23 commented Aug 9, 2023

Thanks for the notes! I'm still cleaning up the merge conflicts and will get back to your reviews once I have the code running again.

Member Author

@PSSF23 PSSF23 left a comment


@adam2392 In `_update_node` the tree finds the node id of the current node instead of adding a new one. In my opinion, combining the two functions would not necessarily simplify the code, as there would be many `if` statements.

But `_add_node` looks almost exactly the same. Wouldn't it be easier to add an `overwrite=False` parameter to `_add_node`?
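For illustration, here is a plain-Python sketch (hypothetical names, not the real sklearn `_tree.pyx` internals) of how an `overwrite` flag folded into `_add_node` might behave:

```python
# Toy model of the suggestion: one _add_node that either appends a new
# node or overwrites the statistics of an existing child in place.
# All names here are illustrative, not the actual _tree.pyx API.
class Tree:
    def __init__(self):
        self.nodes = []  # each node is a dict of split statistics

    def _add_node(self, parent, is_left, stats, overwrite=False):
        if overwrite:
            # reuse the existing child's id instead of appending
            node_id = self._child_id(parent, is_left)
            self.nodes[node_id].update(stats)
            return node_id
        node_id = len(self.nodes)
        self.nodes.append(dict(stats, parent=parent, is_left=is_left))
        return node_id

    def _child_id(self, parent, is_left):
        # linear scan for the sketch; the real tree stores child ids
        for i, node in enumerate(self.nodes):
            if node["parent"] == parent and node["is_left"] == is_left:
                return i
        raise KeyError((parent, is_left))
```

Updating a node then costs one lookup instead of a second code path, at the price of an `overwrite` branch inside the hot loop.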

sklearn/tree/_tree.pxd (outdated)
Comment on lines 402 to 412
with gil:
if parent in self.initial_roots:
node_id = tree._update_node(parent, is_left, is_leaf,
split_ptr, impurity, n_node_samples,
weighted_n_node_samples,
split.missing_go_to_left)
else:
node_id = tree._add_node(parent, is_left, is_leaf,
split_ptr, impurity, n_node_samples,
weighted_n_node_samples,
split.missing_go_to_left)
Collaborator


This will significantly slow down tree building because you have to reacquire the GIL.

If this `if parent in self.initial_roots` check is necessary, then I think we should just build a C++ hashmap instead, so we don't need the GIL.

Member Author

@PSSF23 PSSF23 Aug 9, 2023


It's an annoying thing, as we have node splitting and node updating mixed together. I agree that it's time consuming, and I will work on simplifying it. Which way do you think is more applicable: separating the node updates, or pushing an indicator onto the stack?

Collaborator


Indicator seems reasonable?
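As a rough plain-Python sketch of the indicator approach (hypothetical names; the real builder works on a Cython stack without the GIL):

```python
from collections import deque

class ToyTree:
    """Toy stand-in for the tree; records which path each node took."""
    def __init__(self):
        self.added, self.updated = [], []

    def add_node(self, node):
        self.added.append(node)

    def update_node(self, node):
        self.updated.append(node)

def build(tree, initial_roots, work):
    # The add-vs-update decision is made once, when the record is
    # pushed, so the hot loop needs no membership check (and, in the
    # Cython version, no GIL).
    stack = deque((node, node["parent"] in initial_roots) for node in work)
    while stack:
        node, is_update = stack.pop()
        if is_update:
            tree.update_node(node)
        else:
            tree.add_node(node)
```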

Alternatively, why not just make `initial_roots` a hashmap? Using ChatGPT, the code seems not too bad:

# hashmap_cython.pyx

# Import the C++ unordered_map so lookups can run without the GIL
from libcpp.unordered_map cimport unordered_map

# MyClass stands in for the tree builder
cdef class MyClass:
    cdef unordered_map[int, int] map

    def __cinit__(self):
        self.map = unordered_map[int, int]()

    cpdef int insert(self, int key, int value) nogil:
        # the unordered_map can be written directly without the GIL
        self.map[key] = value
        return 1

    cpdef int find(self, int key) nogil:
        # count() avoids at(), whose C++ exception could not be
        # caught here: try/except requires the GIL
        if self.map.count(key) > 0:
            return self.map[key]
        return -1

Member Author


It seems very complicated to use tuples as hashmap keys, so I separated the node updates from the node additions.
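For what it's worth, one way around the tuple-key problem (a sketch, not code from this PR) is to pack the `(parent_id, is_left)` pair into a single integer, so a plain `int -> int` hashmap suffices:

```python
def pack_key(parent_id, is_left):
    # keep is_left in the low bit and the parent id in the rest
    return (parent_id << 1) | int(is_left)

def unpack_key(key):
    return key >> 1, bool(key & 1)
```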

Collaborator


What is `initial_roots` storing? A hashmap with the node id as the key, mapped to what value?

Collaborator


Just thinking out different approaches. If the code works as is, then these are just ideas on how to simplify the implementation so we don't have constant maintenance work.

Member Author

@PSSF23 PSSF23 Aug 10, 2023


One advantage of the update method is that it would not be affected as much by upstream changes. Any change to `build` is unlikely to affect it.

Collaborator

@adam2392 adam2392 Aug 11, 2023


True. The issue is whether you will be able to account for monotonic constraints and any other added features easily, though?

I'm actually okay with an implementation as a separate method for now, just to demonstrate that RF, OF and MORF work as we might expect. I can help you refactor it into a consolidated version later on. How does that sound?

This strategy does require you to verify that it works on OF and other downstream trees, though.

Member Author


As I have already implemented the initial roots method, I would stick with it for now. It seems that the errors come from `fit` underperforming, which my code should have no effect on. I'll compare it with the sklearn code tomorrow and see if I can resolve it. It is affecting both the update method and the build/roots method.

Collaborator


Let's see if #52 will fix it

sklearn/model_selection/_split.py (outdated)
sklearn/metrics/tests/test_classification.py (outdated)
Merging latest changes from sklearn main


---------

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Collaborator

Okay, I cleaned up the diff by merging changes from sklearn:main into submodulev2 and then into this branch.

Do you have unit tests you can port from your old PR? That way the CIs can help check things whenever new changes come in.

@adam2392 adam2392 changed the base branch from submodulev2 to submodulev3 August 11, 2023 14:49
@PSSF23 PSSF23 closed this Aug 11, 2023
adam2392 added a commit that referenced this pull request Aug 14, 2023
…fier (#54)

Supersedes: #50 

Implements partial_fit API for all classification decision trees.

---------

Signed-off-by: Adam Li <adam2392@gmail.com>
Co-authored-by: Haoyin Xu <haoyinxu@gmail.com>
adam2392 added a commit that referenced this pull request Sep 8, 2023