Advice for improving m-probabilities using EM algorithm #2512
Replies: 2 comments 1 reply
-
This is fairly common and generally fixable either by modifying the training approach (the blocking rules provided to estimate_parameters_using_expectation_maximisation) or with data cleaning. On the data-cleaning side, names can be a particular issue, and we generally 're-parse' them by concatenating the full name and then re-splitting it across columns (name1, name2, etc.) using a consistent approach. I can probably dig out some code if that would be useful. It would also be useful if you could share your training code and a couple of example records (feel free to fake them) - then perhaps we can give some pointers.
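The 're-parse' approach described above can be sketched in plain Python (column names `first_name`, `middle_name`, `surname` and the helper itself are hypothetical, not from Splink):

```python
# Sketch of "re-parsing" names: concatenate whatever name fields are
# populated, then re-split the full name into name1, name2, ... on
# whitespace, so every record is tokenised the same way regardless of
# how the source system originally split the name.

def reparse_names(record, name_cols=("first_name", "middle_name", "surname"), n_out=3):
    """Concatenate the populated name columns, then re-split consistently."""
    full_name = " ".join(
        str(record[c]).strip() for c in name_cols if record.get(c)
    )
    tokens = full_name.split()
    # Pad with None so every record gets the same output columns.
    reparsed = {
        f"name{i + 1}": (tokens[i] if i < len(tokens) else None)
        for i in range(n_out)
    }
    reparsed["full_name"] = full_name or None
    return reparsed

# A double-barrelled first name ends up split consistently:
print(reparse_names({"first_name": "Mary Jane", "middle_name": None, "surname": "Smith"}))
# {'name1': 'Mary', 'name2': 'Jane', 'name3': 'Smith', 'full_name': 'Mary Jane Smith'}
```

The point is consistency: after re-parsing, "Mary Jane / Smith" and "Mary / Jane Smith" produce identical name1/name2/name3 columns, so comparisons on those columns behave sensibly.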
-
Just adding a comment here to outline the changes I made to improve m-probability estimation, following Robin's advice above, in case someone else comes along with the same issue.

For me the improvement came from updating the blocking rules used by linker.estimate_parameters_using_expectation_maximisation(). My observation is that once an m-probability has been estimated, it is not updated by further runs of linker.estimate_parameters_using_expectation_maximisation().

My updated strategy is to identify the match variables that are useful as blocking variables for the EM runs: those with low missingness and high numbers of distinct values. My first EM run blocks on all of the identified variables, generating m-probability estimates for all of the other match variables. I then drop out one blocking variable at a time, running EM again each time to generate an m-probability estimate for that variable. This produced satisfactory results for my case. Very interested in feedback from experts if my strategy has flaws!
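The leave-one-out strategy above can be sketched as generating one SQL blocking rule per held-out variable (column names here are hypothetical; each rule string would then be passed to linker.estimate_parameters_using_expectation_maximisation()):

```python
# Sketch of the leave-one-out blocking strategy described above. Each EM
# session blocks on all of the "reliable" columns (low missingness, many
# distinct values) except one, so the held-out column's m-probability
# can be estimated in that session.

def leave_one_out_blocking_rules(block_cols):
    """Return (held_out_column, SQL blocking rule) pairs."""
    rules = []
    for held_out in block_cols:
        kept = [c for c in block_cols if c != held_out]
        rules.append((held_out, " and ".join(f"l.{c} = r.{c}" for c in kept)))
    return rules

block_cols = ["first_name", "surname", "dob"]

# First session: block on every reliable column, which estimates m for
# all the remaining comparisons (email, phone, address, ...).
first_rule = " and ".join(f"l.{c} = r.{c}" for c in block_cols)

# Then one session per held-out column:
for held_out, rule in leave_one_out_blocking_rules(block_cols):
    # e.g. linker.estimate_parameters_using_expectation_maximisation(rule)
    print(held_out, "->", rule)
```

Note the caveat from the comment above: because an already-estimated m-probability is not revised by later sessions, the order of these runs determines which session's estimate "wins" for each variable.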
-
Hello Splink Community,
I have results from alternative de-duplication software which meet our requirements for accuracy, but not for explainability. We are looking at Splink because it is highly transparent: once the model is trained, it is clear to consumers how match scores are generated, for example through the waterfall charts.
When I estimate m-probabilities using estimate_m_from_label_column with labels from the alternative software, the Splink results are great.
When I estimate m-probabilities using estimate_parameters_using_expectation_maximisation instead, the results are not accurate enough for use, and the match_weights_chart and m_u_parameters_chart make less sense.
Does anyone have any advice about how to tune the EM algorithm to produce better m-probabilities? What do others in the community do when trying to improve their m-probabilities when estimating using the EM algorithm?
Some context: my data set has approx. 3M entities. I have family name, three first names, date of birth, gender, e-mail, phone and street address. E-mail, phone and street address are arrays, and I am using exact-match array comparisons for these. I am using exact match for gender and date of birth, and ctl.name_comparison for the names. I run two EM training sessions, one blocked on first name and family name, the other blocked on date of birth (following the tutorials, without really understanding why).
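For concreteness, here is a minimal sketch of how I understand that set-up in Splink 3's DuckDB API. The column names, and the use of array_intersect_at_sizes to get an exact-match-style level on the array columns, are my assumptions, not a verbatim config:

```python
# Hedged sketch only - column names and the array comparison choice are
# assumptions; adapt to your actual schema and Splink version.
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        ctl.name_comparison("family_name"),
        ctl.name_comparison("first_name_1"),
        cl.exact_match("dob"),
        cl.exact_match("gender"),
        cl.array_intersect_at_sizes("email", [1]),  # any shared element
        cl.array_intersect_at_sizes("phone", [1]),
    ],
}

# linker = DuckDBLinker(df, settings)  # df: the ~3M-entity dataframe
# linker.estimate_u_using_random_sampling(max_pairs=1e7)
#
# The two EM sessions from the tutorial pattern; each session estimates
# m-probabilities only for the comparisons NOT used in its blocking rule:
# linker.estimate_parameters_using_expectation_maximisation(
#     "l.first_name_1 = r.first_name_1 and l.family_name = r.family_name"
# )
# linker.estimate_parameters_using_expectation_maximisation("l.dob = r.dob")
```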
I am very happy to provide more detail if that can be helpful.
Thank you so much for any advice!