Skip to content

Compression TreeModel

Kannan Vijayan edited this page Sep 6, 2018 · 2 revisions

For notational clarity, we introduce terms and several concepts here.

Values

BinaryAST trees can be considered as being composed of a collection of related values.

Values in the tree are partitioned into a set of types, which constrain values and their relationships.

These types are described below.

Value Types

  1. Primitive Types

Primitive types include Null, Bool, Uint, Int, and Float64.

I'll assume the definition of these types does not require exposition.

One thing to note is that the Uint and Int types are ostensibly unlimited precision, but in practice will be limited to some implementation-specific limit.

The set of primitive types and their representation is fixed.

  1. Enum Types

Enum types are named types which permit a finite list of candidate names as their value set. Each enum type has a distinct set of names, and each enum name within an enum type is unique to that enum.

The set of enum types and their contents are specified by the BinAST schema being used to encode or decode a file.

  1. String Types

String types are simply type-tagged strings. The tag serves to distinguish and segregate strings which serve represent different domains of data in the tree.

For example, a schema may want to distinguish between literal strings, identifier names, and regular expression strings - although all values have underlying string representations.

  1. Array Types

Array types identify an array of values. Constraints on the length of the array are not present, but a type bound constrains the values that can occur within the array.

Array values carry their length and a mapping from every index 0 <= i < length to a contained value.

  1. Struct Types

Struct types are collections of named values. Every struct type is associated with a set of names, and a map from every name to a type bound.

Struct types fall into two categories: Node structs which identify AST nodes, and Plain structs which identify simple structured collections of values which don't represent an AST node.

Type Groupings

It's useful to distinguish between different subgroups of types based on their contents and behaviour.

  1. Scalar Types

Scalar types are those that describe values which are terminal - they contain no children.

Scalar values are pure and have no distinct identity. Two identical scalar values occurring at different locations in the tree are treated as two copies of the underlying scalar value, not two references to the same scalar value.

  1. Composite Types

Composite types describe values which contain other values. This containment.

Composite values carry a unique identity. As the structure we are modeling is a tree, the a composite value can be referred to from exactly one location in the tree.

When the value being contained is a Node structure, then the contained value is described as a child, and the container described as the parent.

Trees

Under the above regimen, the tree structure we model is composed of scalar values at the leaves, and composite values at inner nodes.

Edges

Every edge in the tree represents the containment of one value within the other. Edges are labeled by unsigned integers or names depending on whether they describe containment within an array or structure, respectively.

Array values implicitly contain a length value along an edge of the same name.

Values contained within an array are labeled by their position within the array (0 to length - 1).

Values contained within a struct are labeled by the name of their field.

We consider a "typed edge" as a pair:

  TypedEdge = (ContainerType, EdgeName)

where ContainerType identifies the composite type for the container and Edge identifies the name of the edge from the container to a contained value.

For example: the typed edge identifying the value at index 5 in an array of type Array<Uint> would be (Array<Uint>, 5).

A given value matches an edge if that value is contained along an edge with name EdgeName, within a composite value of type ContainerType.

Paths

A Path is a sequence of typed edges identifying a path in the tree.

  Path = TypedEdge[0..]

Every value in a tree is identified by a unique path.

A tree is said to contain a path if starting from the root value of the tree, the sequence of TypeEdges selects a valid path down the tree to some subtree.

The root value of an AST is associated with the empty path ([]).

Consider the following AST:

// Code: foo + bar.zoom

Script {
  statements: [
    ExpressionStatement {
      expression: BinaryExpression {
        operator: BinaryOperator.Add,
        left: IdentifierExpression {
          name: ident("foo")
        },
        right: StaticMemberExpression {
          object: IdentifierExpression {
            name: ident("bar")
          },
          property: prop("zoom")
        }
      }
    }
  ]
}

The path for the ident("bar") value would be:

[ (Script,                  "statements"),
  (Array<Statement>,        0),
  (ExpressionStatement,     "expression"),
  (BinaryExpression,        "right"),
  (StaticMemberExpression,  "object"),
  (IdentifierExpression,    "name")
]

Path Suffixes

A path suffix is some suffix of a sequence of paths (which makes them paths themsleves).

The path suffix of length K for a given element identifies some path from a containing subtree to a value.

From the example above, the path suffix of length 2 for the same ident("bar") element would be:

[ (StaticMemberExpression,  "object"),
  (IdentifierExpression,    "name")
]

This suffix identifies the path to the value within the subtree

StaticMemberExpression {
  object: IdentifierExpression {
    name: ident("bar")
  },
  property: prop("zoom")
}

String Values

There is no universal string type. Instead each tree schema must define a collection of distinguished string types, each tagged with a unique name.

These string type tags serve to classify strings for different purposes within the tree, allowing for different policy treatments of strings based on category.

This decision allows us to avoid pattern-matching on subtrees simply to identify the role of a string (e.g. a property name, identifier, literal string, interpolated string fragment, etc.).

All unique string values are organized into a string table, and strings in the content replaced by references into this table.