[ENH] Add exact match columns constraint on Joiner #1113

Open
Vincent-Maladiere opened this issue Oct 16, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@Vincent-Maladiere
Member

Vincent-Maladiere commented Oct 16, 2024

Problem Description

Some applications call for a partially fuzzy join, meaning fuzzy joining within groups of exactly matched entities.

For instance, consider matching loans across two tables in which users have multiple loans and there is no loan_id. In this scenario, it would make sense to constrain the fuzzy join to loans belonging to the same user (identified by a user_id). Within these groups, we would then fuzzy join on loan prices and loan creation dates, for example.
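
To make the loan scenario concrete, here is a minimal sketch in plain pandas. It is not the skrub API: `blocked_fuzzy_join`, its parameters, the toy data and the scaling of the two distances are all assumptions made for illustration. Candidate pairs are first restricted by an exact merge on `user_id`, then the closest candidate is picked within each group.

```python
import pandas as pd

# Hypothetical toy data: there is no loan_id, only user_id, price and date.
left = pd.DataFrame({
    "user_id": [1, 1, 2],
    "price": [1000.0, 5000.0, 300.0],
    "created": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-02-10"]),
})
right = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "price": [4990.0, 1005.0, 310.0, 1000.0],
    "created": pd.to_datetime(
        ["2024-03-02", "2024-01-01", "2024-02-10", "2024-01-01"]
    ),
})


def blocked_fuzzy_join(left, right, exact_on, price_scale=100.0, day_scale=1.0):
    """Illustrative helper (not part of skrub): exact match, then nearest."""
    left = left.assign(left_idx=range(len(left)))
    right = right.assign(right_idx=range(len(right)))
    # Exact constraint: only pairs sharing `exact_on` become candidates.
    pairs = left.merge(right, on=exact_on, suffixes=("_l", "_r"))
    # Assumed distance: scaled price gap plus scaled date gap (in days).
    price_dist = (pairs["price_l"] - pairs["price_r"]).abs() / price_scale
    days = (pairs["created_l"] - pairs["created_r"]).abs().dt.total_seconds() / 86400
    pairs["dist"] = price_dist + days / day_scale
    # Keep the closest candidate for each left row.
    best = pairs.loc[pairs.groupby("left_idx")["dist"].idxmin()]
    return best[["left_idx", "right_idx", "dist"]].reset_index(drop=True)


matches = blocked_fuzzy_join(left, right, exact_on="user_id")
```

In this toy data, the last row of `right` (user 3) has exactly the same price and date as the first row of `left` (user 1). Without the exact constraint on `user_id`, it would be the nearest neighbor and the loan would be attributed to the wrong user.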

Feature Description

We could have multiple strategies to use constraints and units that have a business meaning:

  1. The user could pass a custom distance function that weights the loan price distance against the creation date distance.
  2. The user could indicate a single column whose distance is minimized, while constraining the other columns to stay within a threshold distance. For instance, minimizing the price distance while keeping the date distance within one day.
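
As a rough illustration of the second strategy, the sketch below minimizes the distance on one column while rejecting candidates whose distance on another column exceeds a threshold expressed in its own unit. It uses plain pandas; `constrained_match` and its parameters are hypothetical names, not an existing or proposed skrub signature.

```python
import pandas as pd


def constrained_match(left, right, minimize, constrain, max_gap):
    """Illustrative helper (not part of skrub).

    For each left row, pick the right row with the smallest absolute
    difference on `minimize`, among the rows whose absolute difference
    on `constrain` is at most `max_gap`.
    """
    left = left.assign(left_idx=range(len(left)))
    right = right.assign(right_idx=range(len(right)))
    # All candidate pairs; shared column names get the _l / _r suffixes.
    pairs = left.merge(right, how="cross", suffixes=("_l", "_r"))
    gap = (pairs[f"{constrain}_l"] - pairs[f"{constrain}_r"]).abs()
    # Hard constraint, in the unit of the constrained column.
    pairs = pairs[gap <= max_gap]
    cost = (pairs[f"{minimize}_l"] - pairs[f"{minimize}_r"]).abs()
    pairs = pairs.assign(cost=cost)
    best = pairs.loc[pairs.groupby("left_idx")["cost"].idxmin()]
    return best[["left_idx", "right_idx", "cost"]].reset_index(drop=True)


# Hypothetical toy data.
loans_a = pd.DataFrame({
    "price": [1000.0],
    "created": pd.to_datetime(["2024-01-01"]),
})
loans_b = pd.DataFrame({
    "price": [1000.0, 1020.0],
    "created": pd.to_datetime(["2024-01-10", "2024-01-01"]),
})

result = constrained_match(
    loans_a, loans_b,
    minimize="price", constrain="created", max_gap=pd.Timedelta(days=1),
)
```

Here the first row of `loans_b` has the identical price but was created nine days later, so it is rejected and the second row is selected. In this sketch, left rows with no candidate inside the threshold are simply dropped from the result; a real implementation would have to decide how to report them.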

Alternative Solutions

No response

Additional Context

Fuzzy joining different columns in the same L2 space currently limits how well Joiner and fuzzy_join apply to real-world use cases.

Vincent-Maladiere added the enhancement label Oct 16, 2024
@GaelVaroquaux
Member

GaelVaroquaux commented Oct 16, 2024 via email

@Vincent-Maladiere
Member Author

Vincent-Maladiere commented Oct 16, 2024

I'm not sure that exact matching is what you are looking for.

I guess it is when IDs must match exactly before performing the fuzzy join, right? That is, in a scenario where joining on an ID that is close but different would be a mistake.

@jeromedockes
Member

Could there also be situations where this helps narrow down the nearest-neighbor search and thus reduces computation and memory? In the example you give above, we would only compute pairwise distances between loans of a given user, not of all users.
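
A back-of-the-envelope way to see the saving is to count candidate pairs with and without the exact-match constraint. The helper below is a hypothetical illustration in plain pandas (`candidate_pair_counts` is not a skrub function), and it only counts pairs; it does not measure actual runtime or memory.

```python
import pandas as pd


def candidate_pair_counts(left_keys, right_keys):
    """Return (pairs without constraint, pairs within exact-match groups)."""
    left_sizes = pd.Series(left_keys).value_counts()
    right_sizes = pd.Series(right_keys).value_counts()
    # Without the constraint, every left row is compared to every right row.
    all_pairs = len(left_keys) * len(right_keys)
    # With it, only rows sharing a key are compared; keys present on one
    # side only contribute zero pairs.
    blocked_pairs = int(left_sizes.mul(right_sizes, fill_value=0).sum())
    return all_pairs, blocked_pairs


# Hypothetical user_id columns of the two loan tables.
counts = candidate_pair_counts([1, 1, 2, 3], [1, 2, 2, 4])
```

With these keys, the unconstrained search considers 16 pairs, while the grouped search considers 4.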

@GaelVaroquaux
Member

Good point!

3 participants