HyperImpute simplifies choosing a data imputation algorithm for your ML pipelines. It includes several novel algorithms for handling missing data and is compatible with sklearn.
- Fast and extensible dataset imputation algorithms, compatible with sklearn.
- New iterative imputation method: HyperImpute.
- Classic methods: MICE, MissForest, GAIN, MIRACLE, MIWAE, Sinkhorn, SoftImpute, etc.
- Pluggable architecture.
The library can be installed from PyPI using
$ pip install hyperimpute
or from source, using
$ pip install .
List available imputers
from hyperimpute.plugins.imputers import Imputers
imputers = Imputers()
imputers.list()
Impute a dataset using one of the available methods
import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers
X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
method = "gain"
plugin = Imputers().get(method)
out = plugin.fit_transform(X.copy())
print(method, out)
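After imputing, it can help to confirm that the result is a valid completion of the input. The helper below is a minimal sketch, not part of HyperImpute: `check_imputation` is a hypothetical name, and it assumes the imputer returns an array-like of the same shape and leaves observed entries untouched. A column-mean fill stands in for `plugin.fit_transform(X.copy())` so the snippet runs without the library.

```python
import numpy as np
import pandas as pd


def check_imputation(original, imputed) -> None:
    """Hypothetical helper: raise AssertionError if `imputed` is not a valid completion."""
    orig = np.asarray(original, dtype=float)
    imp = np.asarray(imputed, dtype=float)
    assert orig.shape == imp.shape, "shape changed during imputation"
    assert not np.isnan(imp).any(), "missing values remain"
    # Assumption: the imputer only fills gaps and does not alter observed cells.
    observed = ~np.isnan(orig)
    assert np.allclose(orig[observed], imp[observed]), "observed values were altered"


X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

# Stand-in for `plugin.fit_transform(X.copy())`: fill each gap with its column mean.
out = X.fillna(X.mean())

check_imputation(X, out)
```

The same call should work on the output of any imputer, as long as the assumptions above hold for that method.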
Specify the baseline models for HyperImpute
import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers
X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
plugin = Imputers().get(
    "hyperimpute",
    optimizer="hyperband",
    classifier_seed=["logistic_regression"],
    regression_seed=["linear_regression"],
)
out = plugin.fit_transform(X.copy())
print(out)
Use an imputer with a SKLearn pipeline
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from hyperimpute.plugins.imputers import Imputers
X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])
imputer = Imputers().get("hyperimpute")
estimator = Pipeline(
    [
        ("imputer", imputer),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)
estimator.fit(X, y)
Write a new imputation plugin
from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin
imputers = Imputers()
knn_imputer = "custom_knn"
class KNN(ImputerPlugin):
    def __init__(self) -> None:
        super().__init__()
        self._model = KNNImputer(n_neighbors=2, weights="uniform")

    @staticmethod
    def name():
        return knn_imputer

    @staticmethod
    def hyperparameter_space():
        return []

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)
        return self

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)
imputers.add(knn_imputer, KNN)
assert imputers.get(knn_imputer) is not None
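To see what the wrapped model computes, the sketch below runs sklearn's `KNNImputer` directly, with the same settings as the plugin above, on the small example frame used earlier. It does not go through HyperImpute, so it shows only the underlying estimator, not the plugin wrapper.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

# Same configuration as the custom plugin: two neighbours, unweighted average.
model = KNNImputer(n_neighbors=2, weights="uniform")
out = model.fit_transform(X)  # NumPy array with the same shape as X

print(out[1])  # the row that had two missing entries
```

Row 1 is compared with the other rows on its two observed columns. Its nearest neighbours are rows 2 and 3, so each gap is filled with the average of 9 and 2, that is 5.5.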
Benchmark imputation models on a dataset
from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models
X, y = load_iris(as_frame=True, return_X_y=True)
imputer = Imputers().get("hyperimpute")
compare_models(
    name="example",
    evaluated_model=imputer,
    X_raw=X,
    ref_methods=["ice", "missforest"],
    scenarios=["MAR"],
    miss_pct=[0.1, 0.3],
    n_iter=2,
)
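The general idea behind a benchmark like this can be sketched with NumPy alone: hide known values, impute them, and score the imputer on the hidden cells. The code below is an illustration under stated assumptions, not the implementation of `compare_models`. It masks values completely at random (MCAR) instead of the MAR scenario requested above, uses a column-mean imputer as a stand-in model, and scores with RMSE on the masked cells. All function names are hypothetical.

```python
import numpy as np


def mcar_mask(shape, miss_pct, rng):
    """Boolean mask, True where a value is hidden (missing completely at random)."""
    return rng.random(shape) < miss_pct


def mean_impute(X_missing):
    """Stand-in imputer: replace each NaN with the mean of its column."""
    col_means = np.nanmean(X_missing, axis=0)
    return np.where(np.isnan(X_missing), col_means, X_missing)


def masked_rmse(X_true, X_imputed, mask):
    """RMSE computed only over the cells that were hidden."""
    diff = (X_true - X_imputed)[mask]
    return float(np.sqrt(np.mean(diff**2)))


rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))  # synthetic complete dataset

results = {}
for miss_pct in (0.1, 0.3):  # mirrors miss_pct=[0.1, 0.3]
    scores = []
    for _ in range(2):  # mirrors n_iter=2
        mask = mcar_mask(X_true.shape, miss_pct, rng)
        X_missing = np.where(mask, np.nan, X_true)
        scores.append(masked_rmse(X_true, mean_impute(X_missing), mask))
    results[miss_pct] = float(np.mean(scores))

for miss_pct, score in results.items():
    print(f"miss_pct={miss_pct}: mean RMSE over 2 runs = {score:.3f}")
```

Averaging over several random masks per missingness level reduces the noise from any single mask, which is presumably the purpose of `n_iter`.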
The following table contains the default imputation plugins:
Strategy | Description | Code |
---|---|---|
HyperImpute | Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets | plugin_hyperimpute.py |
Mean | Replace the missing values using the mean along each column with SimpleImputer | plugin_mean.py |
Median | Replace the missing values using the median along each column with SimpleImputer | plugin_median.py |
Most-frequent | Replace the missing values using the most frequent value along each column with SimpleImputer | plugin_most_freq.py |
MissForest | Iterative imputation method based on Random Forests using IterativeImputer and ExtraTreesRegressor | plugin_missforest.py |
ICE | Iterative imputation method based on regularized linear regression using IterativeImputer and BayesianRidge | plugin_ice.py |
MICE | Multiple imputations based on ICE using IterativeImputer and BayesianRidge | plugin_mice.py |
SoftImpute | Low-rank matrix approximation via nuclear-norm regularization | plugin_softimpute.py |
EM | Iterative procedure which uses other variables to impute a value (Expectation), then checks whether that is the value most likely (Maximization) - EM imputation algorithm | plugin_em.py |
Sinkhorn | Missing Data Imputation using Optimal Transport | plugin_sinkhorn.py |
GAIN | GAIN: Missing Data Imputation using Generative Adversarial Nets | plugin_gain.py |
MIRACLE | MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms | plugin_miracle.py |
MIWAE | MIWAE: Deep Generative Modelling and Imputation of Incomplete Data | plugin_miwae.py |
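Several rows in the table name the sklearn estimators they build on. The sketch below calls those estimators directly to show what the simple strategies and an ICE-style setup compute on the small example matrix. It is an illustration only: the hyperparameters (`max_iter=10`, `random_state=0`) are assumptions, and the actual plugins may configure the estimators differently.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge

X = np.array(
    [[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]],
    dtype=float,
)

# Mean, Median and Most-frequent rows: one SimpleImputer strategy each.
simple = {
    strategy: SimpleImputer(strategy=strategy).fit_transform(X)
    for strategy in ("mean", "median", "most_frequent")
}

# ICE-style setup: each column with gaps is regressed on the others with BayesianRidge.
# Assumption: max_iter and random_state are illustrative, not the plugin's defaults.
ice_like = IterativeImputer(
    estimator=BayesianRidge(), max_iter=10, random_state=0
).fit_transform(X)

for strategy, filled in simple.items():
    print(strategy, filled[1])
print("ice-like", ice_like[1])
```

On this data the observed values in column 2 are 1, 9 and 2, so the mean strategy fills the gap with 4.0 and the median strategy with 2.0. Swapping `BayesianRidge` for `ExtraTreesRegressor` gives the MissForest-style variant described in the table.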
Install the testing dependencies using
pip install .[testing]
The tests can be executed using
pytest -vsx
If you use this code, please cite the associated paper:
@inproceedings{Jarrett2022HyperImpute,
  doi = {10.48550/ARXIV.2206.07769},
  url = {https://arxiv.org/abs/2206.07769},
  author = {Jarrett, Daniel and Cebere, Bogdan and Liu, Tennison and Curth, Alicia and van der Schaar, Mihaela},
  keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {HyperImpute: Generalized Iterative Imputation with Automatic Model Selection},
  year = {2022},
  booktitle = {39th International Conference on Machine Learning},
}