HyperImpute - A library for NaNs and nulls.


License: MIT · Python 3.7+


HyperImpute simplifies choosing a data imputation algorithm for your ML pipelines. It includes several novel algorithms for missing data and is compatible with sklearn.

HyperImpute features

  • 🚀 Fast and extensible dataset imputation algorithms, compatible with sklearn.
  • 🔑 New iterative imputation method: HyperImpute.
  • 🌀 Classic methods: MICE, MissForest, GAIN, MIRACLE, MIWAE, Sinkhorn, SoftImpute, etc.
  • 🔥 Pluggable architecture.

🚀 Installation

The library can be installed from PyPI using

$ pip install hyperimpute

or from source, using

$ pip install .

💥 Sample Usage

List available imputers

from hyperimpute.plugins.imputers import Imputers

imputers = Imputers()

# Print the names of all registered imputation plugins.
print(imputers.list())

Impute a dataset using one of the available methods

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

method = "gain"

plugin = Imputers().get(method)
out = plugin.fit_transform(X.copy())

print(method, out)

Specify the baseline models for HyperImpute

import pandas as pd
import numpy as np
from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

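# The argument list was truncated in the source. The keyword names below
# (optimizer, classifier_seed, regression_seed) are assumed from the HyperImpute
# plugin's constructor; check them against your installed version.
plugin = Imputers().get(
    "hyperimpute",
    optimizer="hyperband",
    classifier_seed=["logistic_regression"],
    regression_seed=["linear_regression"],
)
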
out = plugin.fit_transform(X.copy())

Use an imputer with a SKLearn pipeline

import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor

from hyperimpute.plugins.imputers import Imputers

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])

imputer = Imputers().get("hyperimpute")

estimator = Pipeline(
    [
        ("imputer", imputer),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)

# Fit the imputer and the regressor in one call.
estimator.fit(X, y)
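
The same pattern applies to any imputer that follows the sklearn transformer interface. As a minimal sketch that runs without HyperImpute installed, the block below swaps in scikit-learn's `SimpleImputer` (a stand-in for illustration, not the HyperImpute plugin) and shows that the fitted pipeline can then predict on new rows that still contain NaNs. The names `pipeline` and `X_new` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])
y = pd.Series([1, 2, 1, 2])

# Stand-in imputer (assumption for illustration): column means instead of HyperImpute.
pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="mean")),
        ("forest", RandomForestRegressor(random_state=0, n_estimators=100)),
    ]
)
pipeline.fit(X, y)

# New rows may contain NaNs; the imputer step fills them before the forest sees them.
X_new = pd.DataFrame([[2, np.nan, 3, np.nan]])
pred = pipeline.predict(X_new)
print(pred.shape)
```

Because imputation is a pipeline step, the imputer is fitted only on the training data passed to `fit`, which avoids leaking statistics from held-out rows.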

Write a new imputation plugin

from sklearn.impute import KNNImputer
from hyperimpute.plugins.imputers import Imputers, ImputerPlugin

imputers = Imputers()

knn_imputer = "custom_knn"

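class KNN(ImputerPlugin):
    def __init__(self) -> None:
        super().__init__()
        # Wrap the sklearn KNN imputer as the underlying model.
        self._model = KNNImputer(n_neighbors=2, weights="uniform")

    @staticmethod
    def name():
        return knn_imputer

    @staticmethod
    def hyperparameter_space():
        # No tunable hyperparameters for this example.
        return []

    def _fit(self, *args, **kwargs):
        self._model.fit(*args, **kwargs)
        return self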

    def _transform(self, *args, **kwargs):
        return self._model.transform(*args, **kwargs)

imputers.add(knn_imputer, KNN)

assert imputers.get(knn_imputer) is not None
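
To see what the wrapped model does on its own, the sketch below runs scikit-learn's `KNNImputer` directly (no HyperImpute involved) with the same settings, on the small frame used in the earlier examples. Each missing cell is filled with the uniform average of that column over the two nearest rows, where distance is computed on the columns both rows have observed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

model = KNNImputer(n_neighbors=2, weights="uniform")
filled = model.fit_transform(X)  # returns a NumPy array

# On its observed columns, row 1 is closest to rows 2 and 3,
# so each missing cell becomes the mean of 9 and 2, i.e. 5.5.
print(filled[1])
```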

Benchmark imputation models on a dataset

from sklearn.datasets import load_iris
from hyperimpute.plugins.imputers import Imputers
from hyperimpute.utils.benchmarks import compare_models

X, y = load_iris(as_frame=True, return_X_y=True)

imputer = Imputers().get("hyperimpute")

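# The opening of this call was lost in the source. The keyword names name,
# evaluated_model, and X_raw are assumed from the library's compare_models
# signature; ref_methods and miss_pct come from the source.
compare_models(
    name="example",
    evaluated_model=imputer,
    X_raw=X,
    ref_methods=["ice", "missforest"],
    miss_pct=[0.1, 0.3],
)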
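
For intuition about what a benchmark of this kind measures, here is a minimal sketch, not the library's `compare_models` implementation: hide a fraction of known cells completely at random, impute, and score RMSE on the hidden cells only. The helpers `mask_at_random` and `rmse_on_masked` are hypothetical names, the data is synthetic, and column-mean imputation stands in for a real imputer.

```python
import numpy as np
import pandas as pd


def mask_at_random(X: pd.DataFrame, miss_pct: float, seed: int = 0):
    """Hide roughly miss_pct of the cells; return the masked frame and the boolean mask."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < miss_pct
    return X.mask(mask), mask


def rmse_on_masked(X_true: pd.DataFrame, X_imputed: pd.DataFrame, mask: np.ndarray) -> float:
    """RMSE restricted to the cells that were hidden."""
    diff = (X_true.to_numpy() - X_imputed.to_numpy())[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic, fully observed data (assumption for illustration).
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))

for miss_pct in [0.1, 0.3]:
    X_miss, mask = mask_at_random(X, miss_pct)
    X_imp = X_miss.fillna(X_miss.mean())  # stand-in imputer: column means
    print(miss_pct, round(rmse_on_masked(X, X_imp, mask), 3))
```

Scoring only the hidden cells matters: observed cells pass through an imputer unchanged, so including them would make every method look better than it is.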

📓 Tutorials

⚡ Imputation methods

The following table contains the default imputation plugins:

Strategy | Description
HyperImpute | Iterative imputer using both regression and classification methods based on linear models, trees, XGBoost, CatBoost and neural nets
Mean | Replace the missing values using the mean along each column with SimpleImputer
Median | Replace the missing values using the median along each column with SimpleImputer
Most-frequent | Replace the missing values using the most frequent value along each column with SimpleImputer
MissForest | Iterative imputation method based on Random Forests using IterativeImputer and ExtraTreesRegressor
ICE | Iterative imputation method based on regularized linear regression using IterativeImputer and BayesianRidge
MICE | Multiple imputations based on ICE using IterativeImputer and BayesianRidge
SoftImpute | Low-rank matrix approximation via nuclear-norm regularization
EM | Iterative procedure that uses other variables to impute a value (Expectation), then checks whether that value is the most likely one (Maximization)
Sinkhorn | Missing Data Imputation using Optimal Transport
GAIN | GAIN: Missing Data Imputation using Generative Adversarial Nets
MIRACLE | MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms
MIWAE | MIWAE: Deep Generative Modelling and Imputation of Incomplete Data
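
Several rows of the table name their scikit-learn building blocks directly. As a rough sketch of those descriptions using scikit-learn alone (not the HyperImpute plugins themselves, whose defaults may differ), the block below builds mean and median imputers with `SimpleImputer`, an ICE-style imputer with `IterativeImputer` and `BayesianRidge`, and a MissForest-style imputer with `IterativeImputer` and `ExtraTreesRegressor`. The `max_iter`, `n_estimators`, and `random_state` values and the dictionary keys are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import BayesianRidge

X = pd.DataFrame([[1, 1, 1, 1], [4, 5, np.nan, np.nan], [3, 3, 9, 9], [2, 2, 2, 2]])

sketches = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "ice_style": IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0),
    "missforest_style": IterativeImputer(
        estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
        max_iter=10,
        random_state=0,
    ),
}

results = {name: imp.fit_transform(X) for name, imp in sketches.items()}
for name, filled in results.items():
    # Row 1 held the two missing cells; print what each strategy put there.
    print(name, filled[1, 2:])
```

On a frame this small the iterative imputers may warn about early stopping or convergence; the point is only to show how the pieces named in the table fit together.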

🔨 Tests

Install the testing dependencies using

pip install .[testing]

The tests can be executed using

pytest -vsx


If you use this code, please cite the associated paper:

@inproceedings{Jarrett2022HyperImpute,
  doi = {10.48550/ARXIV.2206.07769},
  url = {},
  author = {Jarrett, Daniel and Cebere, Bogdan and Liu, Tennison and Curth, Alicia and van der Schaar, Mihaela},
  keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {HyperImpute: Generalized Iterative Imputation with Automatic Model Selection},
  year = {2022},
  booktitle = {39th International Conference on Machine Learning},
}