-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question about similarity of multiple sketches #13
Comments
@ocadaruma Thanks for your question! It's certainly a very good one.
While we haven't written up a mathematical proof, we've done a lot of testing internally at LiveRamp and empirically observed that Jaccard index estimation for multiple sets works just as well as for pairs of sets.
I have some numbers on this that I'd be more than happy to share once I get some time to polish up the findings. I'll close this issue with that accuracy report and corresponding code. This being said, you might find this blog post and the accompanying whitepaper helpful in that AdRoll also tested the same idea (but using MinHash and HLL separately instead of HyperMinHash) and both seem to imply the same results we found with our internal testing, although they do not explicitly discuss accuracy of multiple sets. |
Hi there! I'm the author of arXiv:1710.08436. The short answer is that it does indeed still work, for the same reason that MinHash alone allows you to do multiway Jaccard. The long answer is that the additional error correction we do (i.e. computing expected collisions) does not work as written for >2 sets. It's not particularly difficult to write up an algorithm for that (though it is nontrivial), but we didn't bother. The reason we didn't bother is that when you have >2 sets, the chance that there will be an additional collision introduced by HyperMinHash compression drops rapidly. Thus, the correction factor becomes less necessary. And, of course, so long as the number of additional collisions from using HyperMinHash is close to zero, HyperMinHash behaves exactly the same as MinHash for doing multiway Jaccard. -Yun William Yu |
Thanks @yunwilliamyu ! |
This is great, could it be incorporated in some FAQ.md document? |
@sherifnada Great to hear that.
That would be very helpful. I'm looking forward to that. Thanks.
And I will check AdRoll's paper as well. @yunwilliamyu Thank you for detailed explanation ! I understood. |
First of all, thanks a lot for your useful, nice and interesting Java implementation of HyperMinHash.
My question is about this: https://github.com/LiveRamp/HyperMinHash-java/blob/master/src/main/java/com/liveramp/hyperminhash/BetaMinHashCombiner.java#L54
As I read
arXiv:1710.08436
, while mergeability is trivial, I think it's not trivial that Jaccard Index estimation for multiple (> 2) sets works properly.Sorry for obscure question. Thanks again.
The text was updated successfully, but these errors were encountered: