-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add intersection and similarity #30
Comments
These features seem fine though I will need to more research on implementations for computing intersection cardinality. |
Seems like there is no need for special implementations for computing intersection cardinality
For |
HyperLogLogs do not store the set, which is why they are memory efficient, so you cannot compute expressions like Substituting the registers of the HyperLogLog for these expressions (e.g. For example:
Suppose |
Wait, then what |
Yes but this isn't done using a union operation. Instead there is a register by register comparison and the maximum value is the winner. Then the cardinality estimation of the merged HyperLogLog approximates the true cardinality of the union of the sets. |
I should add that because you take the maximum value estimating the cardinality of a union is straightforward since the algorithm only needs the maximum values. However for intersections the maximum value doesn't give you sufficient information to accurately and reliably estimate the intersection cardinality. Of course, this is an area of active research so that could change in the future. |
Knowing cardinality of sets union you can find cardinality of sets intersection. There is a formula in math:
For |
And what happens when each estimate is 2% off in the same direction? The total error is then approximately 6% which is 3 times the error you normally get. Now extending this example, what happens when you use this approach on 1000 sets? Even if the error is low it is easy to see how the error quickly compounds to produce inaccurate results. This was why my original preference was to "wait and see" and let people implement intersection estimates themselves. In the case of two sets, using inclusion-exclusion principle, it is trivial to do:
All this being said, you're not the first person to ask for an intersection feature. Maybe I should re-consider my original stance since "perfect is the enemy of good." If we used inclusion-exclusion, I would need to put a disclaimer in the README regarding the accuracy of this method. |
Oh, yes good point, did not consider nonoptimal error estimate |
Hello All, I think the answer is not to use that math equation above (also in the Dashing paper) for Jaccard index, which is know to have much larger variation than MinHash. the joint MLE estimator for cardinality in the Ertl 2017 paper is the most accurate, a well written paper with example code in c++, a C version can be derived from my perspective. FYI the Ultraloglog (https://arxiv.org/abs/2308.16862) is out with new FGRA estimator, even more space efficient while more accurate than the improved estimator and raw estimator, the closest one to the Camera-Rao lower bound (MLE method meets the lower bound but very slow, see paper for details). Should be very useful if we have a C-python version to use. The best for set similarity is MinHash (more recent is the optimal densification MinHash), the much smaller variance than whatever HLL based Jaccard, limited by the sqrt(1.079/m) RMSE bound. Thanks, Jianshu |
@jianshu93 Thanks for bringing to my attention this paper. I gave it a quick skim and it looks promising. I think it would be best implemented as a separate class (e.g. |
First of all, this library is great. Thank you!
I wanted to suggest two features found in some other HLL libraries that (as far as I can tell) are missing here. It is possible to compute an estimated cardinality of the intersection between two HLLs and an estimated similarity between two HLLs.
Here is a Go library that focuses on these two features: https://github.com/axiomhq/hyperminhash
It would be awesome to have C implementations of these operations exposed to python as part of the HyperLogLog object.
The text was updated successfully, but these errors were encountered: