
Add partial_fit function to decision trees #35

Closed · wants to merge 58 commits

Conversation

@PSSF23 (Member) commented Mar 3, 2023

Reference Issues/PRs

neurodata/treeple#40

What does this implement/fix? Explain your changes.

  • Allow DecisionTreeClassifier to incrementally update the tree structure with new data batches

Any other comments?

@adam2392 This is a branch copy I made so feel free to modify it.

y,
sample_weight=None,
check_input=True,
classes=None,
Collaborator

Do we need to add the classes keyword argument, or is this due to an outdated API?

Member Author

This is to accommodate the tree building, not the sklearn API, I think. The tree needs to know all the potential classes from the beginning, because not all classes are necessarily included in the first batch of data. If new classes are added later, errors are raised on the y dimensions.
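To illustrate why the full class list must be declared up front, here is a minimal plain-Python sketch (the class and attribute names are illustrative stand-ins, not the actual tree code): per-class state is allocated on the first batch, so a class that only appears in a later batch would not fit unless it was declared at the start.

```python
class IncrementalCounter:
    """Toy stand-in for an incrementally fitted classifier (illustrative only)."""

    def partial_fit(self, y, classes=None):
        if not hasattr(self, "classes_"):
            # First batch: the label space is fixed here, once and for all.
            self.classes_ = sorted(set(y)) if classes is None else sorted(classes)
            self.counts_ = {c: 0 for c in self.classes_}
        for label in y:
            if label not in self.counts_:
                raise ValueError(
                    f"unseen class {label!r}: declare every class on the first call"
                )
            self.counts_[label] += 1
        return self

clf = IncrementalCounter()
clf.partial_fit([0, 0, 1], classes=[0, 1, 2])  # class 2 declared but not yet seen
clf.partial_fit([2, 2])                        # a later batch can introduce it safely
```

Without `classes=[0, 1, 2]` on the first call, the second call would raise, which mirrors the y-dimension errors described above.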

Comment on lines +1323 to +1325
builder_ : TreeBuilder instance
The underlying TreeBuilder object.

Collaborator

Is there any state preserved in the initial builder that we want to preserve when we call update?

I.e. do we need to preserve the builder necessarily?

Member Author

I need to check but maybe not.

Member Author

The builder object is normally initialized in fit and holds parameters such as min_samples_split, splitter, and criterion. The initialization of these parameters (checks and so on) spans many lines of code, so I'll leave the object as it is for now. We can optimize it later by modularizing those steps.

@adam2392 (Collaborator) left a comment

I think I understand a bit more what needs to be changed. If you want a lightweight addition, you can implement:

  • an extra cpdef method on the builder, initialize_node_queue, which takes the list of false_roots; we will call it initial_roots. To identify a node, all you store in false_roots seems to be the parent node and the is_left flag(?), so this can just be a list of (integer, bool) tuples.
  • initial_roots added as an attribute of the builder class, instantiated during __cinit__ to some empty data structure.
  • a modification of the existing build function to check whether initial_roots is empty: if so, build as is; otherwise, add the false roots to the PriorityHeap or the Stack, depending on whether it is the BestFirstBuilder or the DepthFirstBuilder. This is the "extra code" needed to initialize the queues in each builder.

In this way, partial_fit can just call initialize_node_queue before calling build, which handles initializing a queue of the false roots. WDYT?
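The three steps above can be sketched in plain Python (all names here are hypothetical stand-ins for the Cython TreeBuilder; no actual sklearn internals are used): initialize_node_queue records the false roots as (parent_id, is_left) tuples, and build() seeds its work stack from them instead of starting from a single true root.

```python
class DepthFirstBuilderSketch:
    """Illustrative stand-in for a depth-first tree builder."""

    def __init__(self):
        # Instantiated empty, as proposed for __cinit__.
        self.initial_roots = []

    def initialize_node_queue(self, false_roots):
        # false_roots: list of (parent_id, is_left) pairs identifying the
        # leaves to re-expand on the next build() call.
        self.initial_roots = list(false_roots)

    def build(self):
        stack = []
        if not self.initial_roots:
            stack.append((None, True))        # fresh fit: one true root
        else:
            stack.extend(self.initial_roots)  # partial_fit: resume from false roots
        visited = []
        while stack:
            parent, is_left = stack.pop()
            visited.append((parent, is_left))
            # ... a real builder would split the node here and push children ...
        return visited
```

A BestFirstBuilder would seed a priority heap instead of a stack in the same place; the rest of the build loop is unchanged.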

@PSSF23 (Member Author) commented Mar 3, 2023

Looks good to me. Thanks a lot. I will see if the technical part can work out this way.

How do you feel about the partial_fit method in classes.py? Can it be simplified too?

@adam2392 (Collaborator) commented Mar 5, 2023

How do you feel about the partial_fit method in classes.py? Can it be simplified too?

Not as familiar with the API for partial_fit, but it looks pretty simple, so I would say it should be okay.

@adam2392 (Collaborator)

FYI @PSSF23 I've migrated the entire set of changes into the fork branch, which will serve as the main branch for any PRs into the sklearn fork. I've added documentation regarding changes into the README file. Lmk if you need any assistance finishing this out.

@PSSF23 (Member Author) commented Mar 29, 2023

Thanks @adam2392! I was focusing on ICML reviews and will have more time to optimize the code now.

@adam2392 changed the base branch from tree-featuresv2 to fork on April 4, 2023 at 21:51
@PSSF23 (Member Author) commented Jul 6, 2023

Started working on the root initialization and method modularization. The original false_roots was a dict object; should I cdef dict self.initial_roots, or just declare it as object?

My current attempt begins by updating self.splitter with the new X and y, while storing the leaves in self.initial_roots for the builders to use in build.

@adam2392 (Collaborator) commented Jul 6, 2023

Started working on the root initialization and method modulation. The original false_roots was a dict object, should I cdef dict self.initial_roots or just object?

I don't think it matters. You only use false_roots inside code that holds the GIL, so it's basically all Python.

My current attempt begins by updating self.splitter with the new X and y, meanwhile storing the leaves in self.initial_roots for builders to use in build.

Yeah, that sounds right, I think, because the builder just needs to keep track of where you are in the tree. The splitter just splits the data it's given and pumps out a new node. The builder then should know where to put the node and make sure the tree stays connected.
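The connectivity point can be made concrete with a tiny sketch (again, all names are illustrative, not the real Cython tree): when the builder resumes from a stored (parent_id, is_left) slot, the first node it adds must be wired into the existing tree at exactly that slot.

```python
class TreeSketch:
    """Toy tree that only tracks parent/child wiring (illustrative only)."""

    def __init__(self):
        self.left_child = {}   # parent node_id -> left child node_id
        self.right_child = {}  # parent node_id -> right child node_id
        self.next_id = 0

    def add_node(self, parent, is_left):
        node_id = self.next_id
        self.next_id += 1
        if parent is not None:
            # Wire the new node into its parent's (is_left) slot.
            (self.left_child if is_left else self.right_child)[parent] = node_id
        return node_id

tree = TreeSketch()
root = tree.add_node(None, True)      # initial fit: the true root
leaf = tree.add_node(root, True)      # a leaf produced by the first batch
# partial_fit: the new subtree root replaces that leaf in the same slot
new_sub = tree.add_node(root, True)
```

After the last call, root's left slot points at the new subtree rather than the stale leaf, which is exactly the bookkeeping the stored false roots make possible.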

Perhaps start by writing the initialize_node_queue function along with a unit test for it. Will it be cpdef or cdef? If you plan on making it a cdef function, you can write a cpdef wrapper so you can call it from pytest.
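A pytest-style test for this could look roughly as follows, written here against a plain-Python stub (the real builder would be a cdef class, with the method exposed through a cpdef wrapper so pytest can reach it; BuilderStub and the test name are hypothetical):

```python
class BuilderStub:
    """Plain-Python stand-in for the cdef builder class."""

    def __init__(self):
        # Mirrors initial_roots being set to an empty structure in __cinit__.
        self.initial_roots = []

    def initialize_node_queue(self, false_roots):
        self.initial_roots = list(false_roots)

def test_initialize_node_queue():
    builder = BuilderStub()
    assert builder.initial_roots == []  # empty right after construction
    builder.initialize_node_queue([(0, True), (0, False)])
    assert builder.initial_roots == [(0, True), (0, False)]

test_initialize_node_queue()
```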

By the way, I would also start from submodulev2 branch instead, so you don't break anything by accident.

@adam2392 (Collaborator) commented Jul 6, 2023

By the way, I would also start from submodulev2 branch instead, so you don't break anything by accident.

Actually, I broke that branch while trying to incorporate the latest API changes in scikit-learn 1.4dev0 (i.e. monotonicity constraints in trees), and it didn't work. I would start a new branch off of v1.3, which contains the working sklearn submodule we use in scikit-tree.

You can name it v1.4, so it's slated to get added whenever we upgrade to use sklearn v1.4 in our submodule.

@PSSF23 (Member Author) commented Aug 9, 2023

Updated version in #50

@PSSF23 closed this on Aug 9, 2023