Advice: creating unique identifiers in a 'sparse' dataset #2482
Unanswered
pierpaolocreanza asked this question in Q&A
Replies: 2 comments 1 reply
-
Are you able to share the script you're using to train the model? It'd also be useful if you could share a couple of example records (feel free to share fake records if the data is sensitive, just so I can get a sense of what the data looks like).
-
Hi all, I'm new to entity resolution (particularly in Splink) and am struggling to find the most sensible approach for my case.
I need to create unique person IDs in a large historical dataset, at the output-person level. I have these people's full names, plus a number of other characteristics such as year, location, and employers.
Given the nature of my data, I expect a large fraction of the people in my dataset to appear only once: maybe 40-60% of people have only one observation / piece of output (for example, most artists have only one work of art; that is not my subject matter, just an illustration). Most other people will have a few observations (2-10), and a minority will have a great many (100-1000).
The most important information here is thus the name: even though I have other data, I can expect the individuals with the most observations to move around in space, change employers, and span several years.
The first difficulty I'm encountering is odd parameter estimates. See below. The match weights look okay, but the m probability on exact name matches seems too low (and the m probability on the "all other comparisons" name level seems too high).
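For context on how a match weight can look fine while the m probability looks low: in a Fellegi-Sunter-style model, the match weight of a comparison level is log2(m / u), so a small u can produce a large weight even when m is modest. The sketch below uses invented m and u values (they are not estimates from this dataset) and a hypothetical helper name, purely to show the arithmetic.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Return the log2 Bayes factor (match weight) for one comparison level."""
    return math.log2(m / u)


# Hypothetical m/u values for two levels of a name comparison.
# m = P(level observed | records are the same person)
# u = P(level observed | records are different people)
levels = {
    "exact_name": {"m": 0.50, "u": 0.001},
    "all_other": {"m": 0.30, "u": 0.95},
}

weights = {name: match_weight(p["m"], p["u"]) for name, p in levels.items()}

for name, weight in weights.items():
    print(f"{name}: match weight = {weight:.2f}")
```

With these made-up numbers the exact-name level still gets a strongly positive weight despite m being only 0.5, because exact name agreement between different people is rare (tiny u). A high m on the catch-all level, however, means the model believes true matches often have dissimilar names, which is worth checking against the training setup.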
The second, related difficulty is what the resulting clusters look like. I find both people with long careers (and clearly the same name) split across different clusters (it is not immediately clear along which dimension), and clusters containing people with markedly different names.
Intuitively, I think the optimal approach would be this: whenever the name is identical or very similar, the starting assumption is that it is the same person; then, if other information differs substantially (e.g. the same person reportedly in two far-apart locations in the same year), we revise that belief down.
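This intuition maps fairly directly onto the Bayesian update that a Fellegi-Sunter-style model performs: start from prior odds of a match, multiply by one Bayes factor (m / u) per comparison, and convert back to a probability. The following is a minimal sketch of that arithmetic only; the prior and the Bayes factors are invented for illustration, and the function name is hypothetical rather than part of any library.

```python
def posterior_match_probability(prior: float, bayes_factors: list[float]) -> float:
    """Combine a prior match probability with per-comparison Bayes factors."""
    odds = prior / (1 - prior)  # probability -> odds
    for bf in bayes_factors:
        odds *= bf  # each comparison scales the odds up or down
    return odds / (1 + odds)  # odds -> probability


# Hypothetical inputs: a random pair of records is rarely the same person.
prior = 0.001
name_exact_bf = 1000.0       # exact name agreement: strong evidence for a match
location_conflict_bf = 0.05  # far-apart locations in the same year: evidence against

p_name_only = posterior_match_probability(prior, [name_exact_bf])
p_name_and_conflict = posterior_match_probability(
    prior, [name_exact_bf, location_conflict_bf]
)

print(f"name only:            {p_name_only:.3f}")
print(f"name + location clash: {p_name_and_conflict:.3f}")
```

Under these assumed numbers, name agreement lifts the pair to roughly even odds, and the conflicting location then pulls it well back down, which is the "start high on name, revise down on contradiction" behaviour described above.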
What shouldn't happen, but ends up occurring, is people with entirely different names ending up in the same cluster!
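One possible cause, offered as a hypothesis rather than a diagnosis: if clusters are formed as connected components over all pairs scoring above a threshold (which, as far as I understand, is how Splink's default clustering works), then a chain of pairwise links can pull together two records that score very low against each other directly. The sketch below is plain Python with a small union-find, not Splink code, and the names and scores are made up.

```python
def connected_components(pairs, threshold):
    """Cluster records by linking every pair whose score is >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)  # registers both records
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return sorted(clusters.values(), key=sorted)


# Hypothetical pairwise match probabilities.
pairs = [
    ("John Smith", "Jon Smith", 0.97),
    ("Jon Smith", "Jon Smyth", 0.96),
    ("Jon Smyth", "Joan Smythe", 0.95),
    ("John Smith", "Joan Smythe", 0.02),  # direct comparison says "different"
]

loose = connected_components(pairs, threshold=0.90)
strict = connected_components(pairs, threshold=0.965)

print(loose)   # all four records in one cluster
print(strict)  # only the strongest link survives
```

In this toy example, "John Smith" and "Joan Smythe" end up in the same cluster at the lower threshold even though their direct score is 0.02, because each step along the chain passes the threshold on its own. Raising the clustering threshold breaks the chain, at the cost of splitting some true matches.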
Any advice for how to make progress here?
TL;DR: Splink ends up disregarding name similarity when clustering individuals. How can I change this?