Advice for improving m-probabilities using EM algorithm #2512
Replies: 2 comments 1 reply
-
This is fairly common and generally fixable either by modifying the training approach (the blocking rules provided to estimate_parameters_using_expectation_maximisation) or with data cleaning. On the data-cleaning side, names can be a particular issue, and we generally 're-parse' them by concatenating the full name and then re-splitting it across columns (name1, name2, etc.) using a consistent approach. I can probably dig out some code if that would be useful. It would also be useful if you could share your training code and a couple of example records (feel free to fake them) - then perhaps we can give some pointers.
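The 're-parse' approach described above can be sketched in plain Python (column names `first_name`, `middle_name`, `surname` and the helper itself are hypothetical, not from Splink):

```python
# Sketch of "re-parsing" names: concatenate whatever name fields are
# populated, then re-split the full name into name1, name2, ... on
# whitespace, so every record is tokenised the same way regardless of
# how the source system originally split the name.

def reparse_names(record, name_cols=("first_name", "middle_name", "surname"), n_out=3):
    """Concatenate the populated name columns, then re-split consistently."""
    full_name = " ".join(
        str(record[c]).strip() for c in name_cols if record.get(c)
    )
    tokens = full_name.split()
    # Pad with None so every record gets the same output columns.
    reparsed = {
        f"name{i + 1}": (tokens[i] if i < len(tokens) else None)
        for i in range(n_out)
    }
    reparsed["full_name"] = full_name or None
    return reparsed

# A double-barrelled first name ends up split consistently:
print(reparse_names({"first_name": "Mary Jane", "middle_name": None, "surname": "Smith"}))
# {'name1': 'Mary', 'name2': 'Jane', 'name3': 'Smith', 'full_name': 'Mary Jane Smith'}
```

The point is consistency: after re-parsing, "Mary Jane / Smith" and "Mary / Jane Smith" produce identical name1/name2/name3 columns, so comparisons on those columns behave sensibly.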
-
Just adding a comment here to outline the changes I made to improve m-probability estimation, following Robin's advice above, in case someone else comes along with the same issue.

For me the improvement came from updating the blocking rules used by linker.estimate_parameters_using_expectation_maximisation(). My observation is that once an m-probability has been estimated, it is not updated by further runs of linker.estimate_parameters_using_expectation_maximisation().

My updated strategy is to identify the match variables that are useful as blocking variables for the EM runs: those with low missingness and high numbers of distinct values. My first EM run blocks on all of the identified variables, generating m-probability estimates for all of the other match variables. I then drop out one blocking variable at a time, running EM again each time to generate an m-probability estimate for that variable. This produced satisfactory results for my case. Very interested in feedback from experts if my strategy has flaws!
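The leave-one-out strategy above can be sketched as generating one SQL blocking rule per held-out variable (column names here are hypothetical; each rule string would then be passed to linker.estimate_parameters_using_expectation_maximisation()):

```python
# Sketch of the leave-one-out blocking strategy described above. Each EM
# session blocks on all of the "reliable" columns (low missingness, many
# distinct values) except one, so the held-out column's m-probability
# can be estimated in that session.

def leave_one_out_blocking_rules(block_cols):
    """Return (held_out_column, SQL blocking rule) pairs."""
    rules = []
    for held_out in block_cols:
        kept = [c for c in block_cols if c != held_out]
        rules.append((held_out, " and ".join(f"l.{c} = r.{c}" for c in kept)))
    return rules

block_cols = ["first_name", "surname", "dob"]

# First session: block on every reliable column, which estimates m for
# all the remaining comparisons (email, phone, address, ...).
first_rule = " and ".join(f"l.{c} = r.{c}" for c in block_cols)

# Then one session per held-out column:
for held_out, rule in leave_one_out_blocking_rules(block_cols):
    # e.g. linker.estimate_parameters_using_expectation_maximisation(rule)
    print(held_out, "->", rule)
```

Note the caveat from the comment above: because an already-estimated m-probability is not revised by later sessions, the order of these runs determines which session's estimate "wins" for each variable.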
-
Hello Splink Community,
I have results from alternative de-duplication software which meet our requirements for accuracy, but not for explainability. We are looking at Splink because it is highly transparent: once the model is trained, it is clear to consumers how match scores are generated, for example through the waterfall charts.
When I estimate m-probabilities using estimate_m_from_label_column with labels from the alternative software, the Splink results are great.
When I estimate m-probabilities using estimate_parameters_using_expectation_maximisation instead, the results are not accurate enough for use, and the match_weights_chart and m_u_parameters_chart make less sense.
Does anyone have any advice about how to tune the EM algorithm to produce better m-probabilities? What do others in the community do when trying to improve their m-probabilities when estimating using the EM algorithm?
Some context: my data set has approx. 3M entities. I have family name, three first names, date of birth, gender, e-mail, phone and street address. E-mail, phone and street address are arrays, and I am using exact-match array comparisons for these. I am using exact match for gender and date of birth, and ctl.name_comparison for the names. I run two EM training sessions, one blocked on first name and family name, the other blocked on date of birth (following the tutorials, without really understanding why).
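For concreteness, here is a minimal sketch of how I understand that set-up in Splink 3's DuckDB API. The column names, and the use of array_intersect_at_sizes to get an exact-match-style level on the array columns, are my assumptions, not a verbatim config:

```python
# Hedged sketch only - column names and the array comparison choice are
# assumptions; adapt to your actual schema and Splink version.
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        ctl.name_comparison("family_name"),
        ctl.name_comparison("first_name_1"),
        cl.exact_match("dob"),
        cl.exact_match("gender"),
        cl.array_intersect_at_sizes("email", [1]),  # any shared element
        cl.array_intersect_at_sizes("phone", [1]),
    ],
}

# linker = DuckDBLinker(df, settings)  # df: the ~3M-entity dataframe
# linker.estimate_u_using_random_sampling(max_pairs=1e7)
#
# The two EM sessions from the tutorial pattern; each session estimates
# m-probabilities only for the comparisons NOT used in its blocking rule:
# linker.estimate_parameters_using_expectation_maximisation(
#     "l.first_name_1 = r.first_name_1 and l.family_name = r.family_name"
# )
# linker.estimate_parameters_using_expectation_maximisation("l.dob = r.dob")
```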
I am very happy to provide more detail if that can be helpful.
Thank you so much for any advice!