Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about similarity of multiple sketches #13

Open
ocadaruma opened this issue Jul 11, 2019 · 5 comments
Open

Question about similarity of multiple sketches #13

ocadaruma opened this issue Jul 11, 2019 · 5 comments

Comments

@ocadaruma
Copy link

First of all, thanks a lot for your useful, nice and interesting Java implementation of HyperMinHash.

My question is about this: https://github.com/LiveRamp/HyperMinHash-java/blob/master/src/main/java/com/liveramp/hyperminhash/BetaMinHashCombiner.java#L54

As I read arXiv:1710.08436, while mergeability is trivial, I think it's not trivial that Jaccard Index estimation for multiple (> 2) sets works properly.

  1. Does the estimation still have same accuracy as of 2-set Jaccard Index ?
  2. If so, is there any proof ?

Sorry for obscure question. Thanks again.

@sherifnada
Copy link
Contributor

sherifnada commented Jul 11, 2019

@ocadaruma Thanks for your question! It's certainly a very good one.

Does the estimation still have same accuracy as of 2-set Jaccard Index ?

While we haven't written up a mathematical proof, we've done a lot of testing internally at LiveRamp and empirically observed that Jaccard index estimation for multiple sets works just as well as for pairs of sets.

If so, is there any proof ?

I have some numbers on this that I'd be more than happy to share once I get some time to polish up the findings. I'll close this issue with that accuracy report and corresponding code.

This being said, you might find this blog post and the accompanying whitepaper helpful in that AdRoll also tested the same idea (but using MinHash and HLL separately instead of HyperMinHash) and both seem to imply the same results we found with our internal testing, although they do not explicitly discuss accuracy of multiple sets.

@yunwilliamyu
Copy link

Hi there! I'm the author of arXiv:1710.08436. The short answer is that it does indeed still work, for the same reason that MinHash alone allows you to do multiway Jaccard.

The long answer is that the additional error correction we do (i.e. computing expected collisions) does not work as written for >2 sets. It's not particularly difficult to write up an algorithm for that (though it is nontrivial), but we didn't bother. The reason we didn't bother is that when you have >2 sets, the chance that there will be an additional collision introduced by HyperMinHash compression drops rapidly. Thus, the correction factor becomes less necessary. And, of course, so long as the number of additional collisions from using HyperMinHash is close to zero, HyperMinHash behaves exactly the same as MinHash for doing multiway Jaccard.

-Yun William Yu

@sherifnada
Copy link
Contributor

Thanks @yunwilliamyu !

@joshk0
Copy link
Contributor

joshk0 commented Jul 11, 2019

This is great, could it be incorporated in some FAQ.md document?

@ocadaruma
Copy link
Author

@sherifnada Great to hear that.

empirically observed that Jaccard index estimation for multiple sets works just as well as for pairs of sets.

That would be very helpful. I'm looking forward to that. Thanks.

I'll close this issue with that accuracy report and corresponding code.

And I will check AdRoll's paper as well.

@yunwilliamyu Thank you for detailed explanation ! I understood.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants