-
Notifications
You must be signed in to change notification settings - Fork 38
Compression StringWindows
String Windows predict the probability of particular string values by using a move-to-front cache of strings.
The cache is of fixed size, and is initialized to contain no strings. As the tree is encoded and strings are observed, the cache is updated with each observed string.
The update procedure is a simple move-to-front strategy:
If a string is present in the table at index K
, it is
moved to index 0
, shifting all values previously at
indexes 0
to K-1
up by 1.
The String Windows prediction model introduces its own
alphabet, whose symbols are the set {0 ... K-1, MISS}
for a window of size K.
The symbols {0 ... K-1}
correspond to the indexes of
the string window cache. The MISS
symbol corresponds
to cases where the string reference being encoded is
not found within the cache.
Probabilities are generated on these symbols through analysis of a corpus of source code.
The string window method is used to accumulate hit-counts for each index in the cache, as well as a miss count.
These hit-counts are then converted into a frequency-based probability distribution over the indexes of the cache (plus the MISS).
This derived probability distribution is static, and assumed by the format.
When encoding or decoding a string under this model, we do the following:
- Look up the string in the string window.
- If the string is found, encode the index at which it using the static index probability table.
- If the string is not found, encode the
MISS
symbol, and then encode the string-ref using some fallback model (not specified).
We maintain a separate string window for each different string type - see the String Values section in Compression/TreeModel for details.
Measurements indicate a window size of 64 would be appropriate for our needs.
(TODO: Add reports from corpus analysis)