Advice: creating unique identifiers in a 'sparse' dataset #2482

pierpaolocreanza · 2024-10-25T21:01:57Z

pierpaolocreanza
Oct 25, 2024

Hi all, I'm new to entity resolution (particularly in Splink) and struggling to find the most sensible approach for my case.

I need to create unique person ids in a large historical dataset, at the output-person level. I have these people's full names, plus a number of other characteristics like year, location, employers etc.

Given the nature of my data, I expect a large fraction of the people in my dataset to appear only once: maybe 40-60% of people only have one observation / piece of output (e.g. most artists only have one work of art--not my subject matter, but just to be clear). Most other people will have a few observations (2-10), and a minority will have a ton (100-1000).
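
To make the sparsity concrete, here is a rough sketch of how a distribution like this translates into truly matching record pairs. The cluster counts are invented for illustration (they are not from my data); the point is only that the few very large clusters account for almost all matching pairs, while singletons contribute none.

```python
# Hypothetical distribution: {records per person: number of such people}.
# These counts are made up to roughly mirror the description above.
cluster_sizes = {1: 50_000, 2: 20_000, 5: 15_000, 10: 10_000, 100: 4_000, 1000: 1_000}


def pairs(k: int) -> int:
    """Number of unordered pairs among k records."""
    return k * (k - 1) // 2


n_records = sum(k * n for k, n in cluster_sizes.items())
true_pairs = {k: pairs(k) * n for k, n in cluster_sizes.items()}
total_true = sum(true_pairs.values())

# Share of all truly matching pairs that come from the 100+ record clusters.
share_from_big = (true_pairs[100] + true_pairs[1000]) / total_true

# Implied probability that two randomly chosen records are the same person.
prior = total_true / pairs(n_records)

print(f"records: {n_records}, matching pairs: {total_true}")
print(f"share of matching pairs from big clusters: {share_from_big:.4f}")
print(f"implied prior: {prior:.2e}")
```

Under these assumed numbers, a model trained on pairs is dominated by the prolific individuals, which may be worth keeping in mind when reading the m estimates.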

The most important information here is thus the name: even though I have other data, I can expect the individuals with the most observations to move around in space, change employers and span several years.

The first difficulty I'm encountering is weird parameter estimates. See below. Weights look okay, but the m probability on exact name matches seems too low (and too high on the "all other comparisons" level of the name comparison).

The second, related difficulty is what the resulting clusters look like. I find both people with long careers (and clearly the same name) being split across different clusters (not immediately clear along which dimension), and clusters containing people with markedly different names.

Intuitively, I think the optimal approach would be that whenever the name is identical or very similar, the prior is that it is the same person; and then if other information differs substantially (e.g. same person reportedly in two far away locations in the same year) then we revise this down.
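
As I understand it, this intuition is close to what a Fellegi-Sunter style model does: prior odds are multiplied by one Bayes factor (m/u) per comparison, so strong name agreement pushes the probability up and conflicting evidence pulls it back down. A minimal sketch of that arithmetic, where the prior and all m/u values are hypothetical numbers chosen only for illustration:

```python
import math


def posterior_probability(prior: float, bayes_factors: list[float]) -> float:
    """Combine a prior match probability with per-comparison Bayes factors (m / u)."""
    prior_odds = prior / (1 - prior)
    odds = prior_odds * math.prod(bayes_factors)
    return odds / (1 + odds)


# Hypothetical values, not estimates from my model.
prior = 1e-4                      # P(two random records are the same person)
name_exact = 0.9 / 1e-5           # m / u for an exact name match
far_apart_same_year = 0.02 / 0.6  # m / u for "distant locations in the same year"

p_name_only = posterior_probability(prior, [name_exact])
p_name_and_conflict = posterior_probability(prior, [name_exact, far_apart_same_year])

# The match weight shown in Splink's charts is log2 of the Bayes factor.
name_match_weight = math.log2(name_exact)

print(p_name_only, p_name_and_conflict, name_match_weight)
```

With these assumed values the exact name match alone gives a high match probability, and the conflicting location evidence revises it down substantially. If the estimated m on the exact name level is too low, the first factor shrinks and the name stops dominating.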

What shouldn't happen, but ends up occurring, is people with entirely different names ending up in the same cluster!

Any advice for how to make progress here?

TL;DR: Splink ends up disregarding name similarity when clustering individuals. How can I change this?

RobinL · 2024-10-26T07:18:27Z

RobinL
Oct 26, 2024
Maintainer

Are you able to share the script you're using to train the model? It'd also be useful if you could share a couple of example records (feel free to share fake records if the data is sensitive, just so I can get a sense of what the data looks like).

pierpaolocreanza · 2024-10-30T14:41:59Z

pierpaolocreanza
Oct 30, 2024
Author

Hi Robin, thank you and apologies for a late reply. Sure!

These are my comparisons:


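import splink.comparison_library as cl
from splink.comparison_level_library import (
    AbsoluteDifferenceLevel,
    ArrayIntersectLevel,
    CustomLevel,
    ElseLevel,
    ExactMatchLevel,
    NullLevel,
)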

# 1. Name Comparisons
# Define custom name comparison with Jaccard and Jaro conditions
name_comparison = cl.CustomComparison(
    comparison_levels=[
        # Null level for handling missing values
        NullLevel("inv_name_clean_final"),

        # Exact match for names
        ExactMatchLevel("inv_name_clean_final"),

        # Custom level for Jaccard or Jaro similarity >= 0.9
        CustomLevel(
            sql_condition="""
                JACCARD(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.9
                OR JARO_SIMILARITY(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.9
            """,
            label_for_charts="Jaccard or Jaro >= 0.9"
        ),

        # Custom level for Jaccard or Jaro similarity >= 0.8
        CustomLevel(
            sql_condition="""
                JACCARD(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.8
                OR JARO_SIMILARITY(inv_name_clean_final_l, inv_name_clean_final_r) >= 0.8
            """,
            label_for_charts="Jaccard or Jaro >= 0.8"
        ),

        # Fallback for all other cases
        ElseLevel()
    ],
    output_column_name="name_comparison",
    comparison_description="Inventor name comparison with token reordering and typo tolerance using Jaccard and Jaro"
)

# 2. CPC Class Comparisons
cpc_comparison = cl.CustomComparison(
    comparison_levels=[
        # Null
        NullLevel("cpc_subclass_array"),
        NullLevel("cpc_class_array"),
        NullLevel("cpc_section_array"),
        
        # Subclass-level comparison (most specific and informative)
        ArrayIntersectLevel("cpc_subclass_array", min_intersection=1),  # Non-zero overlap

        # Class-level comparison (medium informative)
        ArrayIntersectLevel("cpc_class_array", min_intersection=1),  # Non-zero overlap

        # Section-level comparison (less informative)
        ArrayIntersectLevel("cpc_section_array", min_intersection=1),  # Non-zero overlap

        # Else Level - no overlap
        ElseLevel()
        
    ],
    comparison_description="CPC class hierarchy comparison without scores"
)

# 3. Assignee Comparisons
assignee_comparison = cl.CustomComparison(
    comparison_levels=[
        # Null
        NullLevel("assignee_ids"),
        NullLevel("unique_assignee_name"),
        
        # Assignee IDs: Non-zero overlap
        ArrayIntersectLevel("assignee_ids", min_intersection=1),

        # Custom level for Jaccard or Jaro similarity >= 0.85
        CustomLevel(
            sql_condition="""
                JACCARD(unique_assignee_name_l, unique_assignee_name_r) >= 0.85
                OR JARO_SIMILARITY(unique_assignee_name_l, unique_assignee_name_r) >= 0.85
            """,
            label_for_charts="Jaccard or Jaro >= 0.85"
        ),
        
        # Fallback
        ElseLevel()
    ],
    output_column_name="assignee_comparison",
    comparison_description="Assignee comparison using IDs and unique names"
)

# 4. Inventor-to-Inventor Distance Comparison
inv_to_inv_distance_comparison = cl.DistanceInKMAtThresholds("inv_lat", "inv_long", [30, 150, 750])

# 5. Inventor-to-Assignee Distance Comparison using dist_to_asg
inv_to_asg_distance_comparison = cl.CustomComparison(
    comparison_levels=[
        NullLevel("dist_to_asg"),
        AbsoluteDifferenceLevel("dist_to_asg", 10),
        AbsoluteDifferenceLevel("dist_to_asg", 50),
        AbsoluteDifferenceLevel("dist_to_asg", 500),
        ElseLevel()
    ],
    output_column_name="inv_to_asg_distance_comparison",
    comparison_description="Distance between inventors and their respective assignees in km"
)

# 6. Year of issue
iyear_comparison = cl.CustomComparison(
    comparison_levels=[
        NullLevel("iyear"),
        AbsoluteDifferenceLevel("iyear", 10),
        AbsoluteDifferenceLevel("iyear", 25),
        AbsoluteDifferenceLevel("iyear", 50),
        ElseLevel()
    ],
    comparison_description="Year of issue comparison using absolute difference"
)

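And these are my training steps (I have tried slight variations of this):

from splink import Linker, SettingsCreator, block_on

# Create Splink settings (blocking_rules, df and db_api are defined elsewhere)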
settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=blocking_rules,
    comparisons = [
        name_comparison,
        cpc_comparison,
        assignee_comparison,
        inv_to_inv_distance_comparison,
        inv_to_asg_distance_comparison,
        iyear_comparison]
)

# Initialize the linker
test_linker = Linker(df, settings, db_api=db_api)

deterministic_rules = [
    block_on("inv_name_stripped", "iyear_decade","inv_state","assignee_ids")
]
test_linker.training.estimate_probability_two_random_records_match(
    deterministic_rules, recall=0.6
)

test_linker.training.estimate_u_using_random_sampling(max_pairs=1e9)  # 1e9 to cover the full sample

# First EM session
br_training1 = block_on("inv_fips","iyear_decade")
test_linker.training.estimate_parameters_using_expectation_maximisation(br_training1)

# Second EM session
br_training2 = block_on("inv_name_clean_final")
test_linker.training.estimate_parameters_using_expectation_maximisation(br_training2)
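
For context on the `recall=0.6` argument above: as I read the Splink docs, `estimate_probability_two_random_records_match` counts the pairs matched by the deterministic rules, divides by the assumed recall to estimate the total number of true matches, and then divides by the number of possible pairs. A rough sketch of that arithmetic, where the function name and all input numbers are hypothetical:

```python
def estimate_prior(n_records: int, deterministic_pairs: int, recall: float) -> float:
    """Approximate P(two random records match) for a dedupe-only job.

    deterministic_pairs: pairs matched by the deterministic rules (assumed count).
    recall: assumed share of all true matches that those rules find.
    """
    total_pairs = n_records * (n_records - 1) // 2
    estimated_true_pairs = deterministic_pairs / recall
    return estimated_true_pairs / total_pairs


# Invented inputs, just to show how sensitive the prior is to the recall guess.
n_records = 1_000_000
deterministic_pairs = 300_000

prior_at_60 = estimate_prior(n_records, deterministic_pairs, recall=0.6)
prior_at_30 = estimate_prior(n_records, deterministic_pairs, recall=0.3)

print(f"recall=0.6 -> {prior_at_60:.2e}")
print(f"recall=0.3 -> {prior_at_30:.2e}")
```

Since my deterministic rule requires the same decade, state and assignee, it probably misses many true matches for people with long careers, so the true recall may be well below 0.6. Halving the recall guess doubles the estimated prior.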

And a couple of entries (from a slightly updated dataset):

RobinL Nov 6, 2024
Maintainer

I'm not sure, but one thing I'd definitely do is split the name across columns (we typically go for columns like name1, name2, surname).

Similarity functions work less well on full names (i.e. multiple tokens with spaces).

Maybe start with that and see if you get more sensible m values.
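
To illustrate the full-name problem, here is a small pure-Python sketch of a character-set Jaccard score, which is how I understand DuckDB's `jaccard` to behave (its documented example, `jaccard('duck', 'luck') = 0.6`, is consistent with this). The `char_jaccard` helper and the example names are made up for illustration; it is an approximation, not DuckDB's implementation.

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over the sets of characters in each string."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Reordered tokens score 1.0, which is the intended behaviour.
reordered = char_jaccard("john smith", "smith john")

# Two clearly different (invented) names that happen to use the same letters
# also score 1.0, so they would pass a `JACCARD(...) >= 0.9` condition.
different = char_jaccard("arnold mathis", "martin holdas")

print(reordered, different, char_jaccard("duck", "luck"))
```

Because the fuzzy levels are `JACCARD(...) >= 0.9 OR JARO_SIMILARITY(...) >= 0.9`, pairs like the second one could land in the top fuzzy level, which may be one reason for clusters with markedly different names.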

Something like this for the split:

import duckdb

# Create a test database in DuckDB
con = duckdb.connect()

# Create a table with example data
con.execute("""
CREATE TABLE names (first_name VARCHAR, middle_names VARCHAR, last_name VARCHAR);
""")
con.execute("""
INSERT INTO names VALUES ('john', 'd', 'smith');
""")

# Define and execute the SQL query to split the name into separate columns
result = con.execute("""
WITH name_split AS (
    SELECT
        first_name,
        middle_names,
        last_name,
        CONCAT_WS(' ', first_name, middle_names, last_name) AS full_name,
        
        -- Clean punctuation and split to array
        REGEXP_SPLIT_TO_ARRAY(
            REPLACE(REPLACE(CONCAT_WS(' ', first_name, middle_names, last_name), '-', ' '), '''', ''), 
            '\\s+'
        ) AS names_split
    FROM names
)
SELECT
    full_name,
    
    -- Extract standardized names
    names_split[1] AS name_1_std,
    CASE 
        WHEN array_length(names_split) > 2 THEN names_split[2]
        ELSE NULL 
    END AS name_2_std,
    CASE 
        WHEN array_length(names_split) > 3 THEN names_split[3]
        ELSE NULL 
    END AS name_3_std,
    CASE 
        WHEN array_length(names_split) > 1 THEN names_split[array_length(names_split)]
        ELSE NULL 
    END AS last_name_std
FROM name_split;
""").fetchdf()

print(result)
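
If it helps to check the edge cases, the same rules can be mirrored in plain Python. This is only a sketch of the SQL logic above; the `split_name` helper is hypothetical, and the `.strip()` call is an addition that the SQL does not have.

```python
import re


def split_name(full_name: str) -> dict:
    """Mirror the SQL above: drop apostrophes, turn hyphens into spaces, split on whitespace."""
    cleaned = full_name.replace("-", " ").replace("'", "")
    tokens = re.split(r"\s+", cleaned.strip())  # strip() is not in the SQL version
    n = len(tokens)
    return {
        "name_1_std": tokens[0],
        "name_2_std": tokens[1] if n > 2 else None,
        "name_3_std": tokens[2] if n > 3 else None,
        "last_name_std": tokens[-1] if n > 1 else None,
    }


for example in ["john d smith", "mary o'neil", "cher", "jean-luc de la tour"]:
    print(example, "->", split_name(example))
```

Note that with five or more tokens, anything between the third token and the last one is dropped (the "la" in the final example), and a single-token name ends up with no `last_name_std`.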
