Clustering is dropping input nodes #2457

AdamChapnik1 · 2024-10-07T16:39:14Z

I know that the output of cluster_pairwise_predictions_at_threshold is supposed to have the same number of rows as the input nodes, but I've noticed this isn't always the case. If I save the prediction file and run the clustering using the code recommended here, then whenever the clustering threshold is above the prediction threshold, the resulting clustered table seems to drop some nodes (including some nodes whose edge weights are above the clustering threshold).

So for example, I have some match predictions that look like this. There are three source datasets, and as you can see all the records from source "e" and source "l" are matched to a single record from source "o" with the exact same match probability. The prediction threshold to generate this was very low, something like 0.1, and the prediction table was saved as a parquet file.

When I register the prediction table to the DuckDB connection and run cluster_pairwise_predictions_at_threshold, all of these nodes are included in the same cluster when the threshold is set to 0.1 or even 0.125. But when I increase the threshold to 0.2 or 0.75 (both still below the edge weights), all of the records from source "e" are dropped from the resulting node list. They are not even assigned a different cluster_id. I have also noticed that, regardless of the threshold, the node list contains no clusters with only a single node. Single-node clusters would be records that were not linked to any other record and are therefore absent from the match predictions table, so this suggests that cluster_pairwise_predictions_at_threshold is not using the raw input data at all.

I can't figure out why this would be happening — it seems to contradict everything I have read about the clustering function — and would really appreciate any help!

And for additional clarification, my code basically looks like this:

import duckdb
import pandas as pd
from splink import DuckDBAPI, Linker

# db.duckdb contains the raw tables that I linked
con = duckdb.connect('db.duckdb')

linker = Linker(
    ['e', 'l', 'o'],
    settings='model.json',
    db_api=DuckDBAPI(con),
)

# Load the saved predictions and register them with the linker
predictions_pd = pd.read_parquet('predictions.parquet')
predictions_sdf = linker.table_management.register_table_predict(predictions_pd, overwrite=True)

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predictions_sdf, threshold_match_probability=0.15
)

AdamChapnik1 · 2024-10-07T16:50:23Z

FWIW, I should also mention I've noticed some additional weird behavior in this particular case. Regardless of the threshold I use for clustering (below the match probability of 0.98), and regardless of whether there are any nodes from source 'e' in the resulting cluster table, the cluster_id that is getting assigned to this particular cluster with the record from source 'o' above is always e-__-17446182. That is very odd behavior and seems to imply that the node with addid = 17446182 from source 'e' is being included in the cluster, as it should be, and then being dropped before returning the cluster table. I don't know how to interpret this but maybe it points to whether this is a bug or I'm doing something wrong.

RobinL Oct 7, 2024
Maintainer

You're not doing anything obviously wrong and you're right that the number of output rows should be equal to the number of input rows.

My suspicion is that something about the format of your input data (the tables o, l, e, or the predictions) is not quite right. Possibly some IDs are not unique, or there are predictions without a corresponding node.
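The two suspected problems can be checked before clustering. The following is a minimal plain-Python sketch, not part of splink: the helper names `composite_id` and `check_nodes_and_edges` are hypothetical, the column names follow the tables in this thread, and the `source-__-id` format is assumed from the cluster_id values reported above.

```python
from collections import Counter


def composite_id(source_dataset, unique_id):
    # Assumed format, mirroring cluster_id values such as "e-__-17446182".
    return f"{source_dataset}-__-{unique_id}"


def check_nodes_and_edges(nodes, edges, id_col="addid", source_col="source_dataset"):
    """Return (duplicate node IDs, edge endpoint IDs that have no node)."""
    node_ids = [composite_id(n[source_col], n[id_col]) for n in nodes]
    counts = Counter(node_ids)
    duplicates = sorted(i for i, c in counts.items() if c > 1)

    known = set(node_ids)
    missing = set()
    for e in edges:
        for side in ("l", "r"):
            cid = composite_id(e[f"{source_col}_{side}"], e[f"{id_col}_{side}"])
            if cid not in known:
                missing.add(cid)
    return duplicates, sorted(missing)


# Tiny illustrative sample with one problem of each kind.
nodes = [
    {"addid": 178930885, "source_dataset": "o"},
    {"addid": 45327171, "source_dataset": "l"},
    {"addid": 17446182, "source_dataset": "e"},
    {"addid": 17446182, "source_dataset": "e"},  # deliberate duplicate ID
]
edges = [
    {"source_dataset_l": "l", "addid_l": 45327171,
     "source_dataset_r": "o", "addid_r": 178930885},
    {"source_dataset_l": "e", "addid_l": 17529674,  # no matching node
     "source_dataset_r": "o", "addid_r": 178930885},
]

duplicates, missing = check_nodes_and_edges(nodes, edges)
print("duplicate node IDs:", duplicates)
print("edge IDs without a node:", missing)
```

If either list is non-empty on the real tables, the node list and the edge list disagree about IDs, which would be worth fixing before looking for a bug in the clustering itself.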

This is a bit difficult for us to debug without a reprex but there are a few things you can try - if I have time I will try to create one from the tables you've pasted. [EDIT] here's an example which seems to work correctly:

import duckdb
import pandas as pd
from splink import Linker, DuckDBAPI, SettingsCreator

# fmt: off
data1 = [
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45327171, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45326851, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45356970, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45292037, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45282974, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45340223, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45296117, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45279354, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45338333, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45327653, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45354398, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45321167, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45346173, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45298365, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45312593, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45354010, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45351030, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45309395, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45292918, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "l", "source_dataset_r": "o", "addid_l": 45359252, "addid_r": 178930885, "match_key": 8}
]
# fmt: on
df1 = pd.DataFrame(data1)



# fmt: off
data2 = [
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17529674, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17537425, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17458376, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17448874, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17529667, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17618791, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17447769, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17533354, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17446182, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17618789, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17574671, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17537434, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17543103, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17577089, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17545885, "addid_r": 178930885, "match_key": 8},
    {"match_weight": 5.623113885106846, "match_probability": 0.980113859752097, "source_dataset_l": "e", "source_dataset_r": "o", "addid_l": 17582332, "addid_r": 178930885, "match_key": 8}
]
# fmt: on
df2 = pd.DataFrame(data2)


# Create a table called o that contains the addid_r values from df1 and df2
# -- UNION (rather than UNION ALL) to ensure they're distinct
sql = """
select addid_r as addid, 'o' as source_dataset from df1
UNION
select addid_r as addid, 'o' as source_dataset from df2
"""
o = duckdb.sql(sql)


sql = """
select addid_l as addid, 'l' as source_dataset from df1
"""
l = duckdb.sql(sql)


sql = """
select addid_l as addid, 'e' as source_dataset from df2
"""
e = duckdb.sql(sql)

sql = """
select * from df1
UNION ALL
select * from df2
"""
df_predictions = duckdb.sql(sql)


settings = SettingsCreator(link_type='link_only', unique_id_column_name='addid', source_dataset_column_name='source_dataset')
con = DuckDBAPI(":default:")
linker = Linker([o,l,e], settings, con)
df_predictions_sdf = linker.table_management.register_table_predict(df_predictions)

res = linker.clustering.cluster_pairwise_predictions_at_threshold(df_predictions_sdf, threshold_match_probability=0.5)
res.as_duckdbpyrelation()


sql = """
with all_rows as (
select * from o
UNION ALL
select * from e
UNION ALL
select * from l
)
select count(*) from all_rows
"""

count_all_rows = duckdb.sql(sql).fetchone()[0]
count_nodes = res.as_duckdbpyrelation().count("*").fetchone()[0]
assert count_all_rows == count_nodes

res = linker.clustering.cluster_pairwise_predictions_at_threshold(df_predictions_sdf, threshold_match_probability=0.99)
res.as_duckdbpyrelation()

In 4.0.3 we added clustering without a linker.

You could try using that function to see if you get a different result. It uses the same underlying code/algorithm but it's a bit more of a 'pure' test

Another important thing to check is that your IDs are definitely unique prior to clustering. Note that they should be globally unique across all input dataframes. (When using the linker, the prefixes like e-__- should ensure this.)

from splink import DuckDBAPI
from splink.internals.clustering import cluster_pairwise_predictions_at_threshold

db_api = DuckDBAPI()

nodes = [
    {"my_id": 1},
    {"my_id": 2},
    {"my_id": 3},
    {"my_id": 4},
    {"my_id": 5},
    {"my_id": 6},
]

edges = [
    {"n_1": 1, "n_2": 2, "match_probability": 0.8},
    {"n_1": 3, "n_2": 2, "match_probability": 0.9},
    {"n_1": 4, "n_2": 5, "match_probability": 0.99},
]

cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="my_id",
    edge_id_column_name_left="n_1",
    edge_id_column_name_right="n_2",
    db_api=db_api,
    threshold_match_probability=0.5,
).as_pandas_dataframe()

nodes = [
    {"abc": 1},
    {"abc": 2},
    {"abc": 3},
    {"abc": 4},
]

edges = [
    {"abc_l": 1, "abc_r": 2, "match_probability": 0.8},
    {"abc_l": 3, "abc_r": 2, "match_probability": 0.9},
]

cluster_pairwise_predictions_at_threshold(
    nodes,
    edges,
    node_id_column_name="abc",
    db_api=db_api,
    threshold_match_probability=0.5,
).as_pandas_dataframe()
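To cross-check the output on a small sample, the expected result can be computed independently. The following is a plain-Python union-find sketch, not splink's implementation: `reference_clusters` is a hypothetical helper, it assumes an edge is kept when its match_probability is greater than or equal to the threshold, and it labels each cluster by its smallest member, so the labels need not match splink's cluster_id values. What should match is the grouping, and the fact that every input node appears in the output, singletons included.

```python
def reference_clusters(nodes, edges, node_id, left, right, threshold):
    """Map every node ID to a cluster label (the smallest ID in its cluster)."""
    # Every node starts in its own cluster, so unlinked nodes are never lost.
    parent = {n[node_id]: n[node_id] for n in nodes}

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for e in edges:
        # Assumption: edges at or above the threshold are kept.
        if e["match_probability"] >= threshold:
            # A KeyError here means the edge refers to an ID with no node.
            a, b = find(e[left]), find(e[right])
            if a != b:
                parent[max(a, b)] = min(a, b)

    return {n: find(n) for n in parent}


nodes = [{"my_id": i} for i in range(1, 7)]
edges = [
    {"n_1": 1, "n_2": 2, "match_probability": 0.8},
    {"n_1": 3, "n_2": 2, "match_probability": 0.9},
    {"n_1": 4, "n_2": 5, "match_probability": 0.99},
]

clusters = reference_clusters(nodes, edges, "my_id", "n_1", "n_2", threshold=0.5)
print(clusters)

# One output row per input node, whatever the threshold.
assert len(clusters) == len(nodes)
```

Raising the threshold only splits clusters apart; it should never reduce the number of nodes returned.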

Answer selected by AdamChapnik1

AdamChapnik1 Oct 9, 2024
Author

Thanks for the help once again! It turned out that I had made a basic coding error that mixed up the alignment of one set of IDs between the node list and the edge list, but your explanation of how the backend of the clustering algorithm works really helped with debugging. I'm also very happy to see the new function for clustering without a linker. Sorry about not providing a reprex, but I'm now very curious: how did you extract the values from the table images I included?

RobinL Oct 9, 2024
Maintainer

Pasted image into chatgpt and asked for 'this data as a list of python dicts'!
