
Rename/Alias GeneralImputer to MICE #59

Open
ParadaCarleton opened this issue Oct 15, 2023 · 5 comments

Comments

@ParadaCarleton

The algorithm listed as GeneralImputer here is more widely-known as MICE (Multiple imputation by chained equations) in statistics. I'm not sure if the name used here is standard in ML, but the lack of a solid MICE implementation is a common complaint in the Julia statistics ecosystem, so I was very surprised to stumble across this pure-Julia implementation of MICE under a completely different name. Would it make sense to either rename or alias GeneralImputer to make this easier to discover?

@sylvaticus
Owner

Hmmm... I am aware of the MICE package in R, but there the idea is that the multiple imputations are "chained" along the whole statistical procedure.
Also, I am not a big fan of their usage in ML models in general.
The issue is that there is no guarantee about the origin of the differences between the various imputations: there is no probabilistic model determining them. Sometimes they even depend on parameters of the imputation algorithm, so the variance between imputations cannot be taken as a measure of the quality of, or trust in, the imputation.
But for sure I should add MICE to the models' docstring...
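(Editor's note: a toy illustration of the point above. This is hypothetical code, not from BetaML or Mice.jl: an imputer that fills a missing value with the mean of a bootstrap resample of the observed values. Its between-imputation spread is driven entirely by its own `n_boot` hyperparameter, even though the data, and hence the real uncertainty, never change.)

```python
import random
import statistics

def bootstrap_mean_impute(observed, n_boot, rng):
    # Impute one missing value with the mean of a bootstrap resample
    # (size n_boot) of the observed values: a toy imputer whose
    # between-imputation variability is a pure artifact of n_boot.
    sample = [rng.choice(observed) for _ in range(n_boot)]
    return sum(sample) / n_boot

observed = [1.0, 2.0, 3.0, 4.0, 5.0]
rng = random.Random(0)
spread = {}
for n_boot in (5, 500):
    imputations = [bootstrap_mean_impute(observed, n_boot, rng)
                   for _ in range(200)]
    spread[n_boot] = statistics.stdev(imputations)
# The spread between imputations shrinks as n_boot grows, so it reflects
# an algorithm setting rather than statistical uncertainty in the data.
```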

@ParadaCarleton
Author

> Hmmm... I am aware of the MICE package in R, but there the idea is that the multiple imputations are "chained" along the whole statistical procedure.

I'm not sure what you mean here, sorry! 😅 Is this different from GeneralImputer? The docstring is a bit vague.

> The issue is that there is no guarantee about the origin of the differences between the various imputations: there is no probabilistic model determining them. Sometimes they even depend on parameters of the imputation algorithm, so the variance between imputations cannot be taken as a measure of the quality of, or trust in, the imputation.

If you're doing cross-validation or some other resampling strategy, shouldn't that give a good estimate of the model-based uncertainty? Though you could also try something fancier, like a Bayesian bootstrap or another ensemble method.

@sylvaticus
Owner

sylvaticus commented Nov 10, 2023

You may be interested in this new package: https://github.com/tom-metherell/Mice.jl

Compared to the imputers in BetaML, it provides pooling of the analyses you perform using the imputed values, which you don't get here (you just get the multiple imputations in a vector).

Conversely, BetaML supports random forests, which in my (limited) experience do a better job than PMM (predictive mean matching) on real datasets where I erased some data at random and then checked the quality of the imputation.

@ParadaCarleton
Author

> Compared to the imputers in BetaML, it provides pooling of the analyses you perform using the imputed values, which you don't get here (you just get the multiple imputations in a vector).

As in, BetaML just performs one imputation per missing data point, by randomly sampling a possible imputed value?

@sylvaticus
Owner

sylvaticus commented Nov 13, 2023

> As in, BetaML just performs one imputation per missing data point, by randomly sampling a possible imputed value?

No. Let's consider some tabular data with N rows (records) and C columns (dimensions).
For each imputation, BetaML builds C supervised models, one predicting each column c from the remaining columns, and then uses these models to predict the missing values.
There is no "sampling" of the missing values. Each imputation is an independent set of models and their predictions, and the output is a vector of the imputed tables. What distinguishes the imputations is the randomness specific to each supervised model: for a random forest, the records used to train each individual decision tree and the subset of dimensions employed for that tree; for a neural network estimator, the initial weights of the layers; and so on.
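(Editor's note: the per-column scheme described above can be sketched as follows. This is an illustrative Python toy, not BetaML's actual code: BetaML is Julia and plugs in random forests or neural networks as the per-column learners, whereas this sketch uses ordinary least squares and gets its per-model randomness from bootstrapping the training rows. The `impute_once`/`impute` names are made up for the example, and it assumes the predictor columns of a row with a missing entry are themselves observed.)

```python
import numpy as np

def impute_once(X, rng):
    """One imputation: for each column c with missing entries, fit a model
    of c on the remaining columns using the complete rows, then predict
    c's missing entries. Here the model is OLS; BetaML would use e.g. a
    random forest instead."""
    X = X.copy()
    n, C = X.shape
    complete = ~np.isnan(X).any(axis=1)          # fully observed rows
    for c in range(C):
        miss = np.isnan(X[:, c])
        if not miss.any():
            continue
        other = [j for j in range(C) if j != c]
        # Bootstrap the complete rows: this per-model randomness is what
        # makes the imputations differ from one another.
        idx = rng.choice(np.flatnonzero(complete),
                         size=int(complete.sum()), replace=True)
        A = np.column_stack([np.ones(len(idx)), X[np.ix_(idx, other)]])
        beta, *_ = np.linalg.lstsq(A, X[idx, c], rcond=None)
        # Predict the missing entries (assumes predictors observed there).
        rows = np.flatnonzero(miss)
        B = np.column_stack([np.ones(len(rows)), X[np.ix_(rows, other)]])
        X[rows, c] = B @ beta
    return X

def impute(X, n_imputations=5, seed=0):
    # Each imputation is an independent set of models and predictions;
    # the output is a vector (list) of imputed tables.
    rng = np.random.default_rng(seed)
    return [impute_once(X, rng) for _ in range(n_imputations)]
```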
