- preprocessing: scaling and encoding, example set of routines here
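For the scaling and encoding pieces, a minimal sketch of the corresponding sklearn routines (StandardScaler for scaling, OneHotEncoder for encoding):

```python
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scale numeric features to zero mean / unit variance.
X_scaled = StandardScaler().fit_transform([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# One-hot encode categorical features (sparse by default, hence toarray).
X_encoded = OneHotEncoder().fit_transform([["red"], ["green"], ["red"]]).toarray()
```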
A useful top-level enumeration of functionality for reference is the sklearn top-level API doc.
- cross validation
  - provide helpers to separate a dataset into train, test, and validation sets
  - k-fold cross-validation helpers (also stratified, maintaining the same class balance as in the full set), e.g. from sklearn.model_selection import StratifiedKFold
- sklearn can be verbose and painful here; it would be nice to destructure by position or keys, as in a doseq or let, for a particular scope defined by the k-fold (macro?). Manual indexing in particular is painful; a let-style destructuring would let you:
  - indicate that you're in the scope of a k-fold CV that will iterate over the ks
  - indicate in-fold and out-of-fold partitions of the data
  - make out-of-fold predictions (e.g. for ensembles, where you need to avoid information leakage); see the sketch below
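For reference, this is the manual index bookkeeping sklearn requires today; a minimal sketch of out-of-fold predictions with StratifiedKFold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=100, random_state=0)
oof_preds = np.zeros(len(y))

# split() yields raw index arrays; all partitioning is manual indexing.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    # Predict only on the held-out fold, so every row gets exactly one
    # out-of-fold prediction (no information leakage for later ensembling).
    oof_preds[test_idx] = model.predict_proba(X[test_idx])[:, 1]
```

This index shuffling is exactly the boilerplate a let-style k-fold scope could hide.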
- grid search
  - exhaustive search across parameter space
  - random sampling search
  - smart sampling using Gaussian processes or some kind of importance sampling
  - Gaussian process optimization a la Spearmint?
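For the first two search strategies, sklearn's GridSearchCV and RandomizedSearchCV are the reference implementations; a minimal sketch:

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)

# Exhaustive search over a discrete parameter grid.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=3)
grid.fit(X, y)

# Random sampling from a distribution over the same space.
rand = RandomizedSearchCV(SVC(), {"C": uniform(0.1, 10.0)}, n_iter=10, cv=3,
                          random_state=0)
rand.fit(X, y)
print(grid.best_params_, rand.best_params_)
```

The Gaussian-process variants would need something outside sklearn (e.g. Spearmint, as noted above).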
- model serialization
  - extend model serialization protocols to other forms of models
- metrics: assess prediction errors for different model types
  - classification
  - multilabel ranking
  - regression
  - clustering
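sklearn's metrics module already covers each of these families; a small sketch of one representative scorer per family:

```python
from sklearn import metrics

# Classification: true labels vs. predicted labels.
print(metrics.accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))

# Multilabel ranking: binary relevance indicators vs. predicted scores.
print(metrics.label_ranking_average_precision_score(
    [[1, 0, 0], [0, 0, 1]], [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]]))

# Regression: true values vs. predicted values.
print(metrics.mean_squared_error([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))

# Clustering: agreement between two label assignments.
print(metrics.adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))
```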
- AdaBoost
- hooks for JavaXGBoost?
- implement multi-output meta-regression that trains many single-output regression models and then combines them into a meta-model
  - MultiOutputRegressor a la sklearn:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Fits one GradientBoostingRegressor per target column of y.
X, y = make_regression(n_samples=10, n_targets=3, random_state=1)
MultiOutputRegressor(GradientBoostingRegressor(random_state=0)).fit(X, y).predict(X)
```
- A meta-ensembler that supports boosting/bagging/blending methods as described here
  - Something like this could be a component of a dataflow graph whose execution model could support caching.
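For the blending case specifically, sklearn's StackingRegressor is one reference point: base models are combined through a meta-estimator trained on their out-of-fold predictions. A minimal sketch:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, random_state=1)

# The final estimator is trained on cross-validated predictions of the
# base estimators, which avoids leaking training labels into the blend.
ensemble = StackingRegressor(
    estimators=[("lr", LinearRegression()),
                ("rf", RandomForestRegressor(random_state=0))],
    final_estimator=Ridge(),
)
ensemble.fit(X, y).predict(X)
```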
- sklearn uses (fit) plus (predict) or (transform)
  - further breakdown:
    - all models implement fit
    - PCA, etc. implement transform
    - preprocessors (e.g. linear scaling) also use transform
    - regressors implement predict
    - classifiers implement predict and predict_proba (a goofy name; it returns class probabilities)
    - clustering a la K-Means and GMM use predict as well
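Concretely, how those verbs combine in sklearn (a small sketch):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, random_state=0)

X_scaled = StandardScaler().fit(X).transform(X)    # preprocessor: fit, then transform
X_reduced = PCA(n_components=2).fit_transform(X)   # PCA: fit + transform in one call
clf = LogisticRegression().fit(X, y)               # classifier: fit...
labels = clf.predict(X)                            # ...predict class labels
probs = clf.predict_proba(X)                       # ...and class probabilities
clusters = KMeans(n_clusters=3).fit(X).predict(X)  # clustering uses predict as well
```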
- further breakdown of the dimensionality reduction side:
- linear autoencoder (~ PCA)
- PCA
- ICA
- non-linear autoencoder
- sparse autoencoder
- denoising autoencoder
- t-SNE
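sklearn covers the PCA/ICA/t-SNE subset of this list directly (the autoencoder variants would need a neural-network library); a minimal sketch:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)    # linear projection
X_ica = FastICA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2).fit_transform(X)  # non-linear embedding
```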
You want parameter search methods to have access to:
- Model generation (using hyperparameters), e.g. a HOF/constructor for the model and the parameters of its inputs
- Data flow description (preprocessing first, etc.?)
- Accuracy/loss (the model-level score differs from an NN optimizer's internal loss; useful for scoring models in ensembles, a la boosting model weights, as well as for parameter search)
These abstractions could be protocol/interface level, data descriptors a la spec, etc. Ideally you could pipe any of this data flow description into a DAG:
- An execution model that can cache or otherwise take advantage of previously executed work.
  - Kagglers manually hack this in by ensembling CSVs (example)
- An execution model that can evaluate model components for search in parallel as appropriate.
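A minimal sketch of what a data-as-description pipeline plus a caching executor might look like (all names here are hypothetical, not an existing API):

```python
import hashlib
import pickle

cache = {}  # keyed on a content hash of (step name, params, input data)

def run_pipeline(steps, data):
    """Run a pipeline described as data: a list of (name, fn, params) steps."""
    for name, fn, params in steps:
        key = hashlib.sha256(
            pickle.dumps((name, sorted(params.items()), data))).hexdigest()
        if key not in cache:
            # Re-execute a step only when its inputs or params changed; a
            # parameter search over later steps reuses earlier stages' work.
            cache[key] = fn(data, **params)
        data = cache[key]
    return data

# Two runs that share preprocessing hit the cache for the "scale" step.
scale = lambda d, k: [x * k for x in d]
shift = lambda d, b: [x + b for x in d]
print(run_pipeline([("scale", scale, {"k": 0.5}), ("shift", shift, {"b": 1.0})],
                   [1.0, 2.0, 3.0]))
print(run_pipeline([("scale", scale, {"k": 0.5}), ("shift", shift, {"b": 2.0})],
                   [1.0, 2.0, 3.0]))
```

The same description could be handed to a parallel scheduler, since step dependencies are explicit in the data.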