Add partial_fit function to decision trees #35
Conversation
        y,
        sample_weight=None,
        check_input=True,
        classes=None,
Do we need to add the classes keyword argument, or is this due to an outdated API?
This is to accommodate the tree building, not the sklearn API, I think. The tree needs to know all the potential classes from the beginning, as not all classes are necessarily included in the first batch of data. If new classes are added later, there are errors on the y dimensions.
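For illustration, a minimal sketch of the failure mode being avoided, assuming this fork's partial_fit follows the usual scikit-learn classes=... convention:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # this fork's tree, with partial_fit

# First batch contains only class 0; classes 1 and 2 arrive later.
X1, y1 = np.array([[0.0], [1.0]]), np.array([0, 0])
X2, y2 = np.array([[2.0], [3.0]]), np.array([1, 2])

clf = DecisionTreeClassifier()
# Declaring every potential class up front sizes the internal label arrays
# correctly; without it, the second call would hit a dimension mismatch on y
# once the unseen classes appear.
clf.partial_fit(X1, y1, classes=np.array([0, 1, 2]))
clf.partial_fit(X2, y2)
```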
    builder_ : TreeBuilder instance
        The underlying TreeBuilder object.
Is there any state in the initial builder that we want to preserve when we call update? I.e., do we need to keep the builder around at all?
I need to check, but maybe not.
The builder object is normally initialized in fit and contains some parameters like min_samples_split, splitter, and criterion. The initialization of these parameters (checks and so on) spans many lines of code, so I'll just leave the object as it is for now. We can optimize it later by modularizing those steps.
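For illustration, a minimal sketch of the reuse being described, with stand-in names (builder_ comes from the diff above; everything else here is hypothetical):

```python
class BuilderStub:
    """Stand-in for the Cython TreeBuilder; the real one also holds the
    splitter and criterion objects mentioned above."""

    def __init__(self, min_samples_split=2):
        # Parameter checks like this are the "many lines" fit() pays for once.
        self.min_samples_split = min_samples_split

    def build(self, tree, X, y):
        pass  # actual tree construction happens here


class PartialFitTreeSketch:
    def fit(self, X, y):
        # fit() validates parameters, constructs the builder, and caches it.
        self.builder_ = BuilderStub()
        self.tree_ = []
        self.builder_.build(self.tree_, X, y)
        return self

    def partial_fit(self, X, y):
        # Later batches reuse builder_ so none of that setup is re-run.
        self.builder_.build(self.tree_, X, y)
        return self
```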
I think I understand a bit more what needs to be changed. If you want a light-weight addition, you can implement:

- an extra cpdef method on the "builder", initialize_node_queue, which takes a list of false roots that we will call initial_roots. To determine a node, all you store in false_roots seems to be the "parent node" and the "is_left" flag(?), so this can just be a list of (integer, bool) tuples.
- initial_roots as an attribute on the builder class. This should be instantiated during __cinit__ to some empty data structure.
- a modification to the existing build function to check if initial_roots is empty: if so, build as is; otherwise, add the false roots to the PriorityHeap or Stack, depending on whether it is the BestFirstBuilder or the DepthFirstBuilder. This is the "extra code" you need to initialize the queues in each builder.
In this way, partial_fit can just call initialize_node_queue before calling build, which handles initializing the queue of false roots (see the sketch below). WDYT?
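A minimal Python sketch of that plan (the real builders are Cython cdef classes; initialize_node_queue and initial_roots are the names proposed above, and everything else is simplified):

```python
from collections import deque

class DepthFirstBuilderSketch:
    def __init__(self):
        # Proposed attribute, set up empty in __cinit__ in the real Cython
        # class: (parent_node_id, is_left) pairs marking the "false roots".
        self.initial_roots = []

    def initialize_node_queue(self, false_roots):
        # Proposed cpdef method: record which existing leaves partial_fit
        # should grow new subtrees under.
        self.initial_roots = list(false_roots)

    def build(self, tree, X, y):
        stack = deque()
        if self.initial_roots:
            # partial_fit path: seed the stack (or the PriorityHeap, for the
            # best-first builder) with the false roots instead of the root.
            stack.extend(self.initial_roots)
            self.initial_roots = []
        else:
            # fit path: unchanged, start from the actual root.
            stack.append((None, False))
        while stack:
            parent, is_left = stack.pop()
            ...  # split the node's data, add the node to `tree`, push children
```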
Looks good to me. Thanks a lot. I will see if the technical part can work out this way. How do you feel about the partial_fit method in …
Not as familiar w/ the API for partial_fit, but it looks pretty simple, so I would say it should be okay.
FYI @PSSF23 I've migrated the entire set of changes into the …
Thanks @adam2392! I was focusing on ICML reviews and will have more time to optimize the code.
Started working on the root initialization and method modularization. The original … My current attempt begins by updating …
I don't think it matters. You only use …
Yeah, that sounds right I think? Cuz the builder just needs to keep track of where you are in the tree. The splitter just splits the data it's given and pumps out a new node. The builder then should know where to put the node and make sure the tree is connected well. Perhaps start with writing the … By the way, I would also start from …
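To make that division of labor concrete, a hypothetical toy version of the contract (none of these are scikit-learn's real signatures):

```python
import numpy as np

def node_split(X_slice, y_slice):
    # Splitter's whole job: look only at the slice it is handed and
    # propose one split for it.
    return {"feature": 0, "threshold": float(np.median(X_slice[:, 0]))}

def add_node(tree, parent_id, is_left, split):
    # Builder's whole job: know where the new node belongs and keep the
    # tree connected, whether building started from the root or a false root.
    node_id = len(tree)
    tree.append({"parent": parent_id, "is_left": is_left, **split})
    return node_id
```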
Actually, I screwed up that branch currently, cuz I was trying to incorporate the latest API changes in scikit-learn 1.4dev0 (i.e. monotonicity constraints in trees), but it didn't work. I would start a new branch off of … You can name it …
Updated version in #50
Reference Issues/PRs
neurodata/treeple#40
What does this implement/fix? Explain your changes.
Enables DecisionTreeClassifier to incrementally update the tree structure with new data batches.
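For reference, a hedged sketch of the intended workflow (assuming this PR's partial_fit mirrors scikit-learn's usual classes= convention):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier  # this fork's tree

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# iris is sorted by class, so the first batch holds only class 0;
# declaring all classes up front is what makes the later batches legal.
clf.partial_fit(X[:50], y[:50], classes=np.unique(y))
for start in range(50, len(X), 50):
    clf.partial_fit(X[start:start + 50], y[start:start + 50])

print(clf.score(X, y))
```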
Any other comments?
@adam2392 This is a branch copy I made, so feel free to modify it.