Clustering is dropping input nodes #2457
I know that

So, for example, I have some match predictions that look like this.

There are three source datasets and, as you can see, all the records from source "e" and source "l" are matched to a single record from source "o" with the exact same match probability. The prediction threshold used to generate this was very low, something like 0.1, and the prediction table was saved as a parquet file.

When I register the prediction table to the DuckDB connection and run

I can't figure out why this would be happening (it seems to contradict everything I have read about the clustering function) and would really appreciate any help!

For additional clarification, my code basically looks like this:

import duckdb
import pandas as pd
from splink import DuckDBAPI, Linker

# db.duckdb contains the raw tables that I linked
con = duckdb.connect('db.duckdb')
linker = Linker(
    ['e', 'l', 'o'],
    settings='model.json',
    db_api=DuckDBAPI(con),
)
# Load the saved pairwise predictions and register them with the linker
predictions_pd = pd.read_parquet('predictions.parquet')
predictions_sdf = linker.table_management.register_table_predict(predictions_pd, overwrite=True)
# Cluster the registered predictions at the chosen threshold
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(predictions_sdf, threshold_match_probability=0.15)
Replies: 1 comment 3 replies
FWIW, I should also mention I've noticed some additional weird behavior in this particular case. Regardless of the threshold I use for clustering (below the match probability of 0.98), and regardless of whether there are any nodes from source 'e' in the resulting cluster table, the
You're not doing anything obviously wrong and you're right that the number of output rows should be equal to the number of input rows.
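To illustrate why the row counts should match, here is a minimal sketch of threshold clustering as connected components. It is plain Python, not Splink's actual implementation; the function name `cluster_at_threshold`, the sample node IDs, and the use of `>=` for the threshold comparison are all assumptions made for illustration. The point is that a node with no edge at or above the threshold still comes out as its own single-node cluster, so every input node appears once in the output.

```python
def cluster_at_threshold(nodes, edges, threshold):
    """Assign each node a cluster representative using union-find.

    nodes: iterable of hashable node IDs (hypothetical IDs in this sketch)
    edges: iterable of (node_a, node_b, match_probability) tuples
    Returns a dict mapping every input node to its cluster representative.
    """
    # Every node starts out as its own cluster, including unmatched ones.
    parent = {n: n for n in nodes}

    def find(n):
        # Walk up to the root, shortening the path along the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, prob in edges:
        # Assumption: edges at or above the threshold are kept.
        if prob >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Use the smaller ID as representative so results are stable.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    return {n: find(n) for n in parent}


# Made-up data: "e-2" only has an edge below the threshold.
nodes = ["e-1", "e-2", "l-1", "o-1"]
edges = [("e-1", "o-1", 0.98), ("l-1", "o-1", 0.98), ("e-2", "o-1", 0.05)]
clusters = cluster_at_threshold(nodes, edges, threshold=0.15)
```

In this sketch `clusters` has one entry per input node, and `"e-2"` ends up in a cluster of its own rather than disappearing.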
My suspicion is that there's something about the format of your input data (the tables o, l, e, or the predictions) which is not quite right. Possibly some IDs are not unique, or there are predictions without a corresponding node.
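The two suspicions above can be checked with something like the following rough sketch. It uses plain Python on lists of dicts; the helper names are hypothetical, and the column names (`source_dataset`, `unique_id`, and the `_l`/`_r` suffixed versions in the predictions) are assumed to follow Splink's usual defaults, so adjust them if your tables differ. It also assumes each input record has been tagged with the name of the table it came from.

```python
from collections import Counter


def find_duplicate_ids(records, id_cols=("source_dataset", "unique_id")):
    """Return node keys that occur more than once in the input records."""
    counts = Counter(tuple(r[c] for c in id_cols) for r in records)
    return sorted(key for key, n in counts.items() if n > 1)


def find_orphan_prediction_ids(records, predictions):
    """Return node keys referenced by predictions but missing from the inputs."""
    known = {(r["source_dataset"], r["unique_id"]) for r in records}
    referenced = set()
    for p in predictions:
        referenced.add((p["source_dataset_l"], p["unique_id_l"]))
        referenced.add((p["source_dataset_r"], p["unique_id_r"]))
    return sorted(referenced - known)


# Made-up data, purely for illustration.
records = [
    {"source_dataset": "e", "unique_id": 1},
    {"source_dataset": "e", "unique_id": 1},  # duplicate ID within source "e"
    {"source_dataset": "o", "unique_id": 7},
]
predictions = [
    {"source_dataset_l": "e", "unique_id_l": 1,
     "source_dataset_r": "o", "unique_id_r": 7, "match_probability": 0.98},
    {"source_dataset_l": "l", "unique_id_l": 3,  # no input row for ("l", 3)
     "source_dataset_r": "o", "unique_id_r": 7, "match_probability": 0.98},
]

duplicates = find_duplicate_ids(records)
orphans = find_orphan_prediction_ids(records, predictions)
```

If you are working with pandas DataFrames, `df.to_dict("records")` gives a list of dicts in this shape. A non-empty `duplicates` or `orphans` list would point to the kind of input problem described above.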
This is a bit difficult for us to debug without a reprex, but there are a few things you can try. If I have time I will try to create one from the tables you've pasted. [EDIT] Here's an example which seems to work correctly: