Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add intersection and similarity #30

Open
erp12 opened this issue Jun 4, 2019 · 11 comments
Open

Add intersection and similarity #30

erp12 opened this issue Jun 4, 2019 · 11 comments

Comments

@erp12
Copy link

erp12 commented Jun 4, 2019

First of all, this library is great. Thank you!

I wanted to suggest two features found in some other HLL libraries that (as far as I can tell) are missing here. It is possible to compute an estimated cardinality of the intersection between two HLLs and an estimated similarity between two HLLs.

Here is a Go library that focuses on these two features: https://github.com/axiomhq/hyperminhash

It would be awesome to have C implementations of these operations exposed to python as part of the HyperLogLog object.

hll1.intersection(hll2)
hll1.similarity(hll2)
@ascv
Copy link
Owner

ascv commented Jun 4, 2019

These features seem fine though I will need to more research on implementations for computing intersection cardinality.

@kuk
Copy link

kuk commented Dec 10, 2021

Seems like there is no need for special implementations for computing intersection cardinality

cardinality(A | B) = cardinality(A) + cardinality(B) - cardinality(A & B)
-> cardinality(A & B) = cardinality(A) + cardinality(B) - cardinality(A | B)

For cardinality(A | B) you have already implemented .merge method.

@ascv
Copy link
Owner

ascv commented Dec 10, 2021

@kuk

HyperLogLogs do not store the set, which is why they are memory efficient, so you cannot compute expressions like cardinality(A ∪ B) or cardinality(A ∩ B) with the data available.

Substituting the registers of the HyperLogLog for these expressions (e.g. cardinality(A ∪ B) --> cardinality(registers(A) ∪ registers(B))) doesn't work because the registers only store the maximum leading zero count observed. There is no way to know what values were previously observed which would be needed to compute an intersection or union of the sets.

For example:

Let H = a HyperLogLog 
Let A = {1, 2, ..., 2^32}
Let R = registers of H

Suppose k=4. Then max(cardinality(R)) = 2^4 since there are 2^4 registers which can at most store 2^4 distinct values. However cardinality(A) = 2^32 so it follows that cardinality(A) > max(cardinality(R)) and therefore cardinality(A) != cardinality(R).

@kuk
Copy link

kuk commented Dec 10, 2021

... cannot compute expressions like cardinality(A ∪ B) ...

Wait, then what .merge is for? I thought is it for computing cardinality(A ∪ B)

@ascv
Copy link
Owner

ascv commented Dec 10, 2021

Yes but this isn't done using a union operation. Instead there is a register by register comparison and the maximum value is the winner. Then the cardinality estimation of the merged HyperLogLog approximates the true cardinality of the union of the sets.

@ascv
Copy link
Owner

ascv commented Dec 10, 2021

I should add that because you take the maximum value estimating the cardinality of a union is straightforward since the algorithm only needs the maximum values. However for intersections the maximum value doesn't give you sufficient information to accurately and reliably estimate the intersection cardinality. Of course, this is an area of active research so that could change in the future.

@kuk
Copy link

kuk commented Dec 10, 2021

Knowing cardinality of sets union you can find cardinality of sets intersection. There is a formula in math:

cardinality(A ∩ B) = cardinality(A) + cardinality(B) - cardinality(A ∪ B)

For cardinality(A ∪ B) we have .merge, for cardinality(X) we have .cardinality.

@ascv
Copy link
Owner

ascv commented Dec 10, 2021

And what happens when each estimate is 2% off in the same direction? The total error is then approximately 6% which is 3 times the error you normally get. Now extending this example, what happens when you use this approach on 1000 sets? Even if the error is low it is easy to see how the error quickly compounds to produce inaccurate results.

This was why my original preference was to "wait and see" and let people implement intersection estimates themselves. In the case of two sets, using inclusion-exclusion principle, it is trivial to do:

card_a = a.cardinality()
card_b = b.cardinality()
a.merge(b)
card_ab = card_a + card_b - a.cardinality() 

All this being said, you're not the first person to ask for an intersection feature. Maybe I should re-consider my original stance since "perfect is the enemy of good." If we used inclusion-exclusion, I would need to put a disclaimer in the README regarding the accuracy of this method.

@kuk
Copy link

kuk commented Dec 10, 2021

And what happens when each estimate is 2% off in the same direction? The total error is then approximately 6% which is 3 times the error you normally get

Oh, yes good point, did not consider nonoptimal error estimate

@jianshu93
Copy link

jianshu93 commented Dec 6, 2023

Hello All,

I think the answer is not to use that math equation above (also in the Dashing paper) for Jaccard index, which is know to have much larger variation than MinHash. the joint MLE estimator for cardinality in the Ertl 2017 paper is the most accurate, a well written paper with example code in c++, a C version can be derived from my perspective. FYI the Ultraloglog (https://arxiv.org/abs/2308.16862) is out with new FGRA estimator, even more space efficient while more accurate than the improved estimator and raw estimator, the closest one to the Camera-Rao lower bound (MLE method meets the lower bound but very slow, see paper for details). Should be very useful if we have a C-python version to use. The best for set similarity is MinHash (more recent is the optimal densification MinHash), the much smaller variance than whatever HLL based Jaccard, limited by the sqrt(1.079/m) RMSE bound.

Thanks,

Jianshu

@ascv
Copy link
Owner

ascv commented Apr 17, 2024

@jianshu93 Thanks for bringing to my attention this paper. I gave it a quick skim and it looks promising. I think it would be best implemented as a separate class (e.g. UltraLogLog). This project is basically a side hobby of mine that I work on very infrequently so I can't make any promises. The current plan is to fix #46 and then I'll look into the UltraLogLog implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants