A compression-based measure of mutual information between strings.
There is a close relationship between the entropy of a string and the extent to which it can be losslessly compressed. Namely, according to Shannon's source-coding theorem, there exists no lossless code able to compress a string beyond its entropy. Compression schemes hence allow us to obtain an upper bound on the entropy of a string.
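To make this concrete (a small sketch assuming Python's zlib as the compressor), a repetitive, low-entropy string compresses to a fraction of its length, while incompressible random bytes do not:

```python
import os
import zlib

low_entropy = b"ab" * 500        # 1000 highly predictable bytes
high_entropy = os.urandom(1000)  # 1000 random bytes

# The compressed length upper-bounds the entropy: the repetitive
# string shrinks dramatically, while the random bytes barely
# compress at all (and may even grow slightly from header overhead).
print(len(zlib.compress(low_entropy)))
print(len(zlib.compress(high_entropy)))
```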
The mutual information between two random variables, denoted I(X; Y), is defined as

    I(X; Y) = H(X) - H(X | Y),

where H(X) is the entropy of X and H(X | Y) is the conditional entropy of X given Y. It tells us how much our uncertainty in X is reduced by learning the value of Y.

To compute an analogous quantity for strings, we replace each entropy with a compressed length. Writing L(x) for the compressed length of a string x, and L(x | y) for the additional length needed to encode x when y has already been encoded, we define the approximate mutual information (AMI) as

    AMI(x; y) = L(x) - L(x | y),

where L(x | y) can be estimated as L(yx) - L(y), the extra cost of compressing the concatenation yx over compressing y alone. Note that we may choose to use different codes for encoding x, y, and yx; the approximation below is the simple case in which separate codes are used.
Here's an example of ami implemented in Python, using zlib as the encoder.
import zlib

def ami(x: str, y: str) -> float:
    """
    Returns the normalized approximate mutual information between
    strings x and y.
    """
    # zlib operates on bytes, so encode the strings first.
    xb, yb = x.encode(), y.encode()
    lx = len(zlib.compress(xb))
    ly = len(zlib.compress(yb))
    lyx = len(zlib.compress(yb + xb))
    lx_y = min(lx, max(0, lyx - ly))  # 0 <= L(x|y) <= L(x)
    ixy = lx - lx_y                   # I(x;y) = L(x) - L(x|y)
    return ixy / lx
AMI measures the amount of information shared between strings. If we sample from two independent random variables, the respective samples should share approximately no information, and their normalized AMI score will be very close to 0. In contrast, two identical strings should have a score close to 1, indicating that they are highly correlated. Scores between 0 and 1 therefore indicate the degree of correlation between the strings.
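These endpoints can be checked directly (a quick sketch, with the ami function restated so the snippet is self-contained; the sample strings are invented for illustration):

```python
import os
import zlib

def ami(x: str, y: str) -> float:
    """Normalized approximate mutual information (restated from above)."""
    xb, yb = x.encode(), y.encode()
    lx = len(zlib.compress(xb))
    ly = len(zlib.compress(yb))
    lyx = len(zlib.compress(yb + xb))
    return (lx - min(lx, max(0, lyx - ly))) / lx

text = "the quick brown fox jumps over the lazy dog " * 50

# Identical strings: almost everything about x is predictable
# from y, so the score lands near 1.
print(ami(text, text))

# Samples from two independent sources share essentially no
# information, so the score lands near 0.
a = os.urandom(400).hex()
b = os.urandom(400).hex()
print(ami(a, b))
```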
Unlike some other measures, this correlation need not be linear, or even numerical in nature. In fact, any statistical correlation can be modelled simply by changing the compression scheme. Zlib's standard encoder is a reasonable choice due to its efficiency and ability to significantly compress a broad range of file types. However, a priori knowledge about the strings in question can be used to inform the choice of encoder.
One use of AMI is document classification. For example, the distribution of French-language strings is sufficiently different from the distribution of Spanish-language strings that even a simple encoder such as Zlib can accurately distinguish between them. For distributions which are closer, a more specific encoder may be required.
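The idea can be sketched as a nearest-reference classifier. This is a toy illustration: the reference strings below are invented stand-ins for real French and Spanish corpora, and classify is a hypothetical helper, not part of any library. A sample is assigned to whichever reference it shares the most information with:

```python
import zlib

def ami(x: str, y: str) -> float:
    """Normalized approximate mutual information (restated from above)."""
    xb, yb = x.encode(), y.encode()
    lx = len(zlib.compress(xb))
    ly = len(zlib.compress(yb))
    lyx = len(zlib.compress(yb + xb))
    return (lx - min(lx, max(0, lyx - ly))) / lx

def classify(sample: str, references: dict[str, str]) -> str:
    """Label the sample with the reference sharing the most information."""
    return max(references, key=lambda label: ami(sample, references[label]))

# Toy stand-ins; in practice these would be large reference corpora.
references = {
    "french": "le chat dort sur la table pendant que le chien mange "
              "dans la cuisine et les enfants jouent dans le jardin",
    "spanish": "el gato duerme sobre la mesa mientras el perro come "
               "en la cocina y los ninos juegan en el jardin",
}

sample = "le chat dort dans la cuisine pendant que le chien mange"
print(classify(sample, references))
```

With references this short, the separation rests on zlib finding shared substrings between the sample and one corpus; a real classifier would use much larger reference texts (or a language-specific encoder) to make the decision robust.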