Skip to content

Implementation

Muqsit Rayyan edited this page Jul 12, 2024 · 15 revisions

Tokens

A token is a piece of string that has an influence on the expression it occurs within. For example, in the expression x + y * 2, the tokens are x, +, y, *, and 2. Whitespace is not considered a token as it has no impact on the expression. This method of composing tokens gives the writer the liberty to format an expression as they desire. Below is a table of all tokens within arithmexp.

Token Name Token Symbols
Binary Operator + / ** % * - > >= < <= == != === !== <=> && || and or xor
Boolean Literal true false
Function Call [A-Z] [a-z] [0-9] _ followed by parenthesis
Function Call Argument Separator ,
Identifier [A-Z] [a-z] [0-9] _
Numeric Literal [0-9] .
Opcode + && and / == != ** > >= === !== < <= % * || or - xor !
Parenthesis ( ) [ ] { }
Unary Operator + - !

Tokenization

The Scanner class takes the role of constructing an array of tokens from a given expression string (in other words, the Scanner performs tokenization). The Scanner consists of an array of token builders such as a numeric literal token builder (which scans for 1, 2.34, etc.), a parenthesis token builder (which scans for (, and )), etc. Tokenization is the very first stage of parsing.

Expression Notation

The parser spends most of its time converting a given expression from infix notation (x + y) to postfix notation (x y +).

In infix notation, binary operators occur in-between their operands (this method of writing expressions is the one most of us are familiar with). For example, in the infix expression 2 + 3, the operator + occurs in between the left operand 2 and the right operand 3. In postfix notation, binary operators occur after their operands. For example, in the expression 2 3 +, the operator + is placed after the left operand 2 and the right operand 3. This notation is commonly used internally in computing for a few reasons.

In the postfix expression 2 3 add, the operand 2 is evaluated first, followed by the operand 3, finally followed by the operator add which executes the operation of addition. In contrast to the infix notation 2 add 3, or even the prefix notation add(2, 3), the postfix notation accurately describes the order of operations being performed, leaving no room for uncertainty (how do we know which function is called first in the prefix expression f(g(x), h(x))?). Furthermore, parentheses, operator precedence, and operator associativity are not required during the evaluation of a postfix expression. This shifts the computational cost of performing such operations to the parser rather than the evaluator. The expression 2 + 3 * 4 in postfix notation with * taking precedence over + can be written as 2 3 4 * +. The expression (2 + 3) * 4 would be 2 3 + 4 * in postfix notation. Postfix converts a binary tree structure such as ((2 * 3) / (4 + 2)) - 7 to a linear structure 2 3 * 4 2 + / 7 -, allowing low-memory and high-performant runtime (evaluation) using stack operations (see stack-oriented programming).

Expression Conversion

The Parser takes the role of performing infix-to-postfix notation conversion. Before this conversion is performed, the series of tokens are transformed to a binary tree structure following three steps:

  1. In the first step, the expression is de-parenthesized by introducing nesting. A token array for an expression such as (x + y) * z looks like [LP, VAR(x), OP(+), VAR(y), RP, OP(*), VAR(z)]. After this operation is performed, the linear token array becomes a tree-like structure consisting of no parenthesis tokens [[VAR(x), OP(+), VAR(y)], OP(*), VAR(z)].
  2. In the second step, the unary operators are transformed into binary operators. The result of this transformation is that an infix expression such as x + -y gets interpreted as x + (-1 * y). After this operation is performed, no unary operator tokens exist within the token tree.
  3. In the third step, binary operators are grouped based on their precedence and associativity values. An infix expression such as x + y * z + w becomes (x + (y * z)) + w. After this operation is performed, each nested entry would have a size of 3 if it were a valid binary expression.

With these transformations performed, it becomes a lot simpler to perform the conversion. Each nested entry is visited and the position of the binary operator is swapped with its right operand. The resulting token tree is then flattened in the order of occurrence of tokens, and this becomes the postfix notation for the given infix expression.

Expression Evaluation

The Parser returns an Expression instance on successfully parsing. The Expression holds the postfix token array for the given expression. Evaluation of the expression is done by performing stack operations on the postfix token array.

stack = []
for each token in postfixTokens do
    if token is an operator then
        left ← pop stack
        right ← pop stack
        valueleft operator right
        push value to stack
    otherwise
        assert token is numeric
        push token to stack
value ← pop stack
return value

It is possible for an expression to completely skip postfix expression evaluation and yield the value directly when the parser is built using optimizations (i.e., Parser::createDefault() instead of Parser::createUnoptimized()). An expression such as 3 - 2 or x / x is precomputed during parsing and is resolved to a ConstantExpression and directly returns the value 1 during evaluation without traversing any token whatsoever.

1. Quick Start

1.1. Installation

2. Documentation Notes

2.1. Constant
2.2. Function
2.3. Macro
2.4. Operator
2.5. Variable
2.6. Expression Optimization
2.7. Implementation Internals

Clone this wiki locally