How to run JOSIE to find joinable tables (columns) given table csv? #3
Comments
This repo is mainly for reproducibility in academic settings. If you are
interested in building a real application, you could take a look at:
1. ekzhu/SetSimilaritySearch: All-pair set similarity search on millions of
sets in Python and on a laptop
<https://github.com/ekzhu/setsimilaritysearch>
2. MinHash LSH (datasketch 1.5.9 documentation)
<https://ekzhu.com/datasketch/lsh.html>
Neither of the above implements JOSIE, but either should be good enough
depending on your use case.
On Sat, Apr 8, 2023 at 5:53 PM v4ray <***@***.***> wrote:
Hi, this is great work! I am trying to experiment with JOSIE to find
joinable tables and am unsure about the data pipeline. Could you briefly
explain how to use this JOSIE codebase to find joinable tables given a
query column, if the input data are several raw CSV files representing
tables?
This codebase seems to depend on Postgres dump files representing tables.
Is it necessary to generate these dump files for the above purpose, and if
so, how do I do it?
Thank you!
I recommend starting with MinHashLSH for finding joinable tables. First, create a MinHash for every column. Then index all the MinHashes in a MinHashLSH index. After that, you can query the index with the MinHash of your query column to find columns with high Jaccard similarity.
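To make the three steps above concrete, here is a minimal from-scratch sketch using only the Python standard library. It is not JOSIE and it is not datasketch's API (datasketch provides `MinHash` and `MinHashLSH` classes that cover the same steps). Everything named here is an assumption made for illustration: the helper names (`column_sets`, `minhash`, `LSHIndex`, `estimate_jaccard`), the parameters (`NUM_PERM`, `BANDS`), and the tiny in-memory CSV tables.

```python
import csv
import hashlib
import io
from collections import defaultdict

NUM_PERM = 32             # signature length (number of seeded hash functions); illustrative value
BANDS = 8                 # number of LSH bands; illustrative value
ROWS = NUM_PERM // BANDS  # signature rows per band


def column_sets(table, csv_text):
    """Return {(table, column): set of distinct non-empty values} for one CSV."""
    columns = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col, value in row.items():
            if value:
                columns[(table, col)].add(value)
    return dict(columns)


def minhash(values):
    """MinHash signature: for each seeded hash function, keep the minimum hash value."""
    signature = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(v.encode("utf8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for v in values
        ))
    return tuple(signature)


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


class LSHIndex:
    """Banded LSH: two columns become candidates if any band of their signatures matches."""

    def __init__(self):
        self.buckets = [defaultdict(set) for _ in range(BANDS)]
        self.signatures = {}

    def insert(self, key, signature):
        self.signatures[key] = signature
        for b in range(BANDS):
            self.buckets[b][signature[b * ROWS:(b + 1) * ROWS]].add(key)

    def query(self, signature):
        candidates = set()
        for b in range(BANDS):
            candidates |= self.buckets[b].get(signature[b * ROWS:(b + 1) * ROWS], set())
        return candidates


# Hypothetical sample tables standing in for raw CSV files.
ORDERS_CSV = "order_id,customer\n1,alice\n2,bob\n3,carol\n4,dave\n"
CUSTOMERS_CSV = "name,city\nalice,Oslo\nbob,Lima\ncarol,Pune\ndave,Kyiv\n"
PRODUCTS_CSV = "sku,label\nA1,widget\nB2,gadget\nC3,gizmo\n"

# Steps 1 and 2: one MinHash per column, all inserted into the index.
index = LSHIndex()
for table, text in [("customers", CUSTOMERS_CSV), ("products", PRODUCTS_CSV)]:
    for key, values in column_sets(table, text).items():
        index.insert(key, minhash(values))

# Step 3: query with the MinHash of the query column.
query_sig = minhash(column_sets("orders", ORDERS_CSV)[("orders", "customer")])
candidates = index.query(query_sig)
for key in sorted(candidates):
    print(key, estimate_jaccard(query_sig, index.signatures[key]))
# prints: ('customers', 'name') 1.0
```

In this toy data the query column `orders.customer` holds exactly the same values as `customers.name`, so their signatures are identical and the estimate is 1.0. For real CSV files you would read each file from disk instead of from a string, and for partially overlapping columns the number of bands and rows controls how similar two columns must be before they are likely to be returned as candidates.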
thanks
Hi, this is great work! I am trying to experiment with JOSIE to find joinable tables and am unsure about the data pipeline. Could you briefly explain how to use this JOSIE codebase to find joinable tables given a query column, if the input data are several raw CSV files (from another dataset) representing tables?
This codebase seems to depend on Postgres dump files representing tables. Is it necessary to generate these dump files for the above purpose, and if so, how do I do it?
Thank you!