datawig.SimpleImputer.complete is not imputing any columns #153

Open
imsazzad opened this issue Jul 23, 2021 · 2 comments

Comments

@imsazzad

imsazzad commented Jul 23, 2021

import datawig

# prepare_training_data() is my own helper; only its first 12 columns are used here
df_with_missing = prepare_training_data().iloc[:, :12]
print("Null values per column before imputation\n", df_with_missing.isnull().sum(axis=0))

# impute missing values
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing, precision_threshold=0.8)
print("Null values per column after imputation\n", df_with_missing_imputed.isnull().sum(axis=0))

There are two main problems:

  1. The null counts are the same before and after running the model.
  2. With the 12 features above, the call never seems to finish (I ran the code for 30 minutes and it was still running).

Versions

  • python 3.7.11
  • sklearn-pandas==1.8.0
  • numpy==1.14.6
  • pandas==0.25.3
  • scikit-learn==0.22.1
  • mxnet==1.4.0
  • datawig==0.2.0

The input contains string, float, and integer columns.
Am I missing something?
@felixbiessmann

@maqboolkhan

Facing the same problem.

@felixbiessmann
Contributor

felixbiessmann commented Dec 10, 2021

When the precision threshold is set above 0.0, datawig only imputes a value when it is 'certain' enough that the imputation will be correct. With a threshold of 0.8, you only get an imputation for values whose precision reached 0.8 on an independent validation set. The threshold is calibrated for each value separately. So if datawig cannot impute values with high enough precision, you will get Nones/NaNs instead. If you'd like more imputations (with lower precision), you can lower the precision threshold.
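
To make the effect of the threshold concrete, here is a simplified sketch in plain Python. It is not datawig's implementation: the function name `apply_precision_threshold` and the validation precisions are made up for illustration, and the calibration step itself is not shown. It only shows why a high threshold can leave every missing cell empty.

```python
# Simplified illustration only; this is NOT how datawig is implemented.
# The per-value validation precisions below are invented for the example.

def apply_precision_threshold(predictions, validation_precision, threshold):
    """Keep a predicted value only if its validation precision meets the threshold."""
    kept = []
    for value in predictions:
        precision = validation_precision.get(value, 0.0)
        # Values that were not precise enough on validation data stay missing.
        kept.append(value if precision >= threshold else None)
    return kept


# Hypothetical predictions for four missing cells of one column.
predictions = ["red", "blue", "red", "green"]
# Hypothetical precision of each predicted value on a held-out validation set.
validation_precision = {"red": 0.9, "blue": 0.6, "green": 0.75}

strict = apply_precision_threshold(predictions, validation_precision, 0.8)
lenient = apply_precision_threshold(predictions, validation_precision, 0.0)

print(strict)   # ['red', None, 'red', None]
print(lenient)  # ['red', 'blue', 'red', 'green']
```

With the threshold at 0.8, only "red" survives; with 0.0, every prediction is kept. If no value in a column reaches the threshold, the null count for that column does not change.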

As for the long runtime: model selection / hyperparameter optimization (HPO) can take a long time. You can try turning off HPO or reducing the number of dimensions when calling complete.
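
A rough sketch of both suggestions follows. It rests on several assumptions: that "dimensions" means the columns passed to `complete`, and that the installed datawig version accepts the `hpo` and `num_epochs` keyword arguments. The helpers `select_columns` and `complete_quickly` are hypothetical names, not part of datawig.

```python
def select_columns(null_counts, max_complete_columns):
    """Keep every column with nulls plus a few fully observed columns as inputs.

    null_counts maps column name to number of nulls, for example
    df.isnull().sum().to_dict().
    """
    with_nulls = [name for name, count in null_counts.items() if count > 0]
    complete = [name for name, count in null_counts.items() if count == 0]
    return with_nulls + complete[:max_complete_columns]


def complete_quickly(df, max_complete_columns=3, num_epochs=10):
    """Hypothetical wrapper: fewer columns, no HPO, fewer epochs."""
    import datawig  # imported lazily so select_columns works without datawig

    columns = select_columns(df.isnull().sum().to_dict(), max_complete_columns)
    return datawig.SimpleImputer.complete(
        df[columns],
        precision_threshold=0.0,  # keep all imputations, even low-precision ones
        hpo=False,                # assumed keyword: skip hyperparameter optimization
        num_epochs=num_epochs,    # assumed keyword: cap training epochs per column
    )


# Column selection on made-up null counts.
example_counts = {"age": 3, "city": 0, "income": 5, "zip": 0, "id": 0}
selected = select_columns(example_counts, 1)
print(selected)  # ['age', 'income', 'city']
```

Dropping fully observed columns trades imputation quality for speed, since the dropped columns can no longer serve as predictors.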
