Nonlinear regression simulations for existing split criteria #29

Open
wants to merge 5 commits into
base: master


@morgsmss7 morgsmss7 commented Feb 24, 2020

Reference Issues/PRs

Fixes Issue 16370 in scikit-learn. Also see Issue 2 in tealeaf.

What does this implement/fix? Explain your changes.

This PR adds simulations and plots that compare split criteria on several nonlinear regression simulations, including sinusoidal, logarithmic, multiplicative, and independence. scikit-learn's documentation offers little guidance on which value to choose for the criterion parameter (mse, mae, or friedman mse). This example demonstrates how to look for differences between the criteria and shows that the choice may not always matter.

Any other comments?

The corresponding PR to sklearn will include these files:

sklearn/datasets/tests/test_samples_generator.py 
sklearn/datasets/samples_generator.py
sklearn/datasets/__init__.py
examples/ensemble/plot_random_forest_regression_criteria_comparison.py
examples/datasets/plot_nonlinear_regression_datasets.py 
doc/modules/classes.rst 
doc/datasets/index.rst

The other files that were changed for Vivek's PR will not be changed in sklearn.

@morgsmss7
Author

morgsmss7 commented Feb 24, 2020

Split Criteria Comparison Experiment and Results

For each simulation type (Logarithmic, Sine, Square, Multiplicative, Independence)
and for each split criterion (mse, mae, friedman mse):

  1. Generate 30 noisy training sets (10 dimensions, 50 samples each).
  2. Generate 1 noisy test set (10 dimensions, 1000 samples).
  3. Train random forests (500 trees) on all 30 training sets, varying the number of training samples ( np.arange(5, 51, 3) ), and evaluate using mse.

[Figure: nonlinearSimPlots]
[Figure: splitter_comparison_02_17]


@eigenvivek eigenvivek left a comment

Overall, it looks great! I left a few minor review comments asking for stylistic changes to the code comments.

@eigenvivek eigenvivek changed the title from "Nonlinear regression simulations for Existing Split Criteria" to "Nonlinear regression simulations for existing split criteria" on Mar 17, 2020
@j1c
Member

j1c commented Mar 30, 2020

This looks great. I think you wanted to make a PR to the real sklearn, right? My only concern is that the current NDD master has a bunch of changes from other people that would be merged in as well. I think the best course of action is to make a new branch, fetch the latest sklearn, and add these two examples along with the data generation code. Then make the PR from that branch.

@morgsmss7 morgsmss7 dismissed eigenvivek’s stale review May 14, 2020 19:30

These changes have been made in both the real sklearn version and the NDD version

3 participants