Skip to content

Compression StringWindows

Kannan Vijayan edited this page Sep 6, 2018 · 1 revision

String Windows predict the probability of particular string values by using a move-to-front cache of strings.

The cache is of fixed size, and is initialized to contain no strings. As the tree is encoded and strings are observed, the cache is updated with each observed string.

The update procedure is a simple move-to-front strategy: If a string is present in the table at index K, it is moved to index 0, shifting all values previously at indexes 0 to K-1 up by 1.

Alphabet

The String Windows prediction model introduces its own alphabet, whose symbols are the set {0 ... K-1, MISS} for a window of size K.

The symbols {0 ... K-1} correspond to the indexes of the string window cache. The MISS symbol corresponds to cases where the string reference being encoded is not found within the cache.

Prediction

Probabilities are generated on these symbols through analysis of a corpus of source code.

The string window method is used to accumulate hit-counts for each index in the cache, as well as a miss count.

These hit-counts are then converted into a frequency-based probability distribution over the indexes of the cache (plus the MISS).

This derived probability distribution is static, and assumed by the format.

Coding

When encoding or decoding a string under this model, we do the following:

  1. Look up the string in the string window.
  2. If the string is found, encode the index at which it using the static index probability table.
  3. If the string is not found, encode the MISS symbol, and then encode the string-ref using some fallback model (not specified).

Notes

We maintain a separate string window for each different string type - see the String Values section in Compression/TreeModel for details.

Measurements indicate a window size of 64 would be appropriate for our needs.

(TODO: Add reports from corpus analysis)

Clone this wiki locally