Question about similarity of multiple sketches #13

ocadaruma · 2019-07-11T07:14:24Z

First of all, thanks a lot for your useful, nice and interesting Java implementation of HyperMinHash.

My question is about this: https://github.com/LiveRamp/HyperMinHash-java/blob/master/src/main/java/com/liveramp/hyperminhash/BetaMinHashCombiner.java#L54

As I read arXiv:1710.08436, while mergeability is trivial, I think it's not trivial that Jaccard Index estimation for multiple (> 2) sets works properly.

Does the estimation still have same accuracy as of 2-set Jaccard Index ?
If so, is there any proof ?

Sorry for obscure question. Thanks again.

The text was updated successfully, but these errors were encountered:

sherifnada · 2019-07-11T18:24:09Z

@ocadaruma Thanks for your question! It's certainly a very good one.

Does the estimation still have same accuracy as of 2-set Jaccard Index ?

While we haven't written up a mathematical proof, we've done a lot of testing internally at LiveRamp and empirically observed that Jaccard index estimation for multiple sets works just as well as for pairs of sets.

If so, is there any proof ?

I have some numbers on this that I'd be more than happy to share once I get some time to polish up the findings. I'll close this issue with that accuracy report and corresponding code.

This being said, you might find this blog post and the accompanying whitepaper helpful in that AdRoll also tested the same idea (but using MinHash and HLL separately instead of HyperMinHash) and both seem to imply the same results we found with our internal testing, although they do not explicitly discuss accuracy of multiple sets.

yunwilliamyu · 2019-07-11T19:06:09Z

Hi there! I'm the author of arXiv:1710.08436. The short answer is that it does indeed still work, for the same reason that MinHash alone allows you to do multiway Jaccard.

The long answer is that the additional error correction we do (i.e. computing expected collisions) does not work as written for >2 sets. It's not particularly difficult to write up an algorithm for that (though it is nontrivial), but we didn't bother. The reason we didn't bother is that when you have >2 sets, the chance that there will be an additional collision introduced by HyperMinHash compression drops rapidly. Thus, the correction factor becomes less necessary. And, of course, so long as the number of additional collisions from using HyperMinHash is close to zero, HyperMinHash behaves exactly the same as MinHash for doing multiway Jaccard.

-Yun William Yu

sherifnada · 2019-07-11T20:10:11Z

Thanks @yunwilliamyu !

joshk0 · 2019-07-11T21:24:25Z

This is great, could it be incorporated in some FAQ.md document?

ocadaruma · 2019-07-12T15:38:57Z

@sherifnada Great to hear that.

empirically observed that Jaccard index estimation for multiple sets works just as well as for pairs of sets.

That would be very helpful. I'm looking forward to that. Thanks.

I'll close this issue with that accuracy report and corresponding code.

And I will check AdRoll's paper as well.

@yunwilliamyu Thank you for detailed explanation ! I understood.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question about similarity of multiple sketches #13

Question about similarity of multiple sketches #13

ocadaruma commented Jul 11, 2019

sherifnada commented Jul 11, 2019 •

edited

Loading

yunwilliamyu commented Jul 11, 2019

sherifnada commented Jul 11, 2019

joshk0 commented Jul 11, 2019

ocadaruma commented Jul 12, 2019

Question about similarity of multiple sketches #13

Question about similarity of multiple sketches #13

Comments

ocadaruma commented Jul 11, 2019

sherifnada commented Jul 11, 2019 • edited Loading

yunwilliamyu commented Jul 11, 2019

sherifnada commented Jul 11, 2019

joshk0 commented Jul 11, 2019

ocadaruma commented Jul 12, 2019

sherifnada commented Jul 11, 2019 •

edited

Loading