Implementation
A token is a substring of an expression that influences how the expression is interpreted. For example, in the expression `x + y * 2`, the tokens are `x`, `+`, `y`, `*`, and `2`. Whitespace is not considered a token as it has no impact on the expression, which gives the writer the liberty to format an expression as they desire. Below is a table of all tokens within `arithmexp`.
Token Name | Token Symbols |
---|---|
Binary Operator | `+` `/` `**` `%` `*` `-` `>` `>=` `<` `<=` `==` `!=` `===` `!==` `<=>` `&&` `\|\|` `and` `or` `xor` |
Boolean Literal | `true` `false` |
Function Call | `[A-Z]` `[a-z]` `[0-9]` `_` followed by parenthesis |
Function Call Argument Separator | `,` |
Identifier | `[A-Z]` `[a-z]` `[0-9]` `_` |
Numeric Literal | `[0-9]` `.` |
Opcode | `+` `&&` `and` `/` `==` `!=` `**` `>` `>=` `===` `!==` `<` `<=` `%` `*` `\|\|` `or` `-` `xor` `!` |
Parenthesis | `(` `)` `[` `]` `{` `}` |
Unary Operator | `+` `-` `!` |
The `Scanner` class constructs an array of tokens from a given expression string (in other words, the `Scanner` performs tokenization). The `Scanner` consists of an array of token builders, such as a numeric literal token builder (which scans for `1`, `2.34`, etc.) and a parenthesis token builder (which scans for `(` and `)`). Tokenization is the very first stage of parsing.
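The tokenization stage can be illustrated with a minimal Python sketch. This is not `arithmexp`'s actual `Scanner`: the names `TOKEN_PATTERNS` and `tokenize` are hypothetical, the token builders are modeled as regular expressions, only a subset of the token table is covered, and the rule that an identifier cannot start with a digit is an assumption.

```python
import re

# Hypothetical stand-ins for token builders. Order matters: "**" is listed
# before "*" so that it is not split into two separate "*" tokens.
TOKEN_PATTERNS = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),  # assumption: no leading digit
    ("OP", r"\*\*|[+\-*/%]"),
    ("PAREN", r"[()]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))


def tokenize(expression):
    """Return a list of (token_name, text) pairs for the expression."""
    tokens = []
    pos = 0
    while pos < len(expression):
        if expression[pos].isspace():
            pos += 1  # whitespace is skipped, it never becomes a token
            continue
        match = MASTER.match(expression, pos)
        if match is None:
            raise ValueError(f"Unexpected character {expression[pos]!r} at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Because whitespace is skipped rather than tokenized, `tokenize("x+y*2")` and `tokenize("x + y * 2")` produce the same token list.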
The parser spends most of its time converting a given expression from infix notation (`x + y`) to postfix notation (`x y +`).
In infix notation, binary operators occur in-between their operands (this method of writing expressions is the one most of us are familiar with). For example, in the infix expression `2 + 3`, the operator `+` occurs in between the left operand `2` and the right operand `3`. In postfix notation, binary operators occur after their operands. For example, in the expression `2 3 +`, the operator `+` is placed after the left operand `2` and the right operand `3`. This notation is commonly used internally in computing for a few reasons.
In the postfix expression `2 3 add`, the operand `2` is evaluated first, followed by the operand `3`, and finally the operator `add`, which executes the operation of addition. In contrast to the infix notation `2 add 3`, or even the prefix notation `add(2, 3)`, the postfix notation accurately describes the order of operations being performed, leaving no room for uncertainty (how do we know which function is called first in the prefix expression `f(g(x), h(x))`?). Furthermore, parentheses, operator precedence, and operator associativity are not required during the evaluation of a postfix expression. This shifts the computational cost of resolving them to the parser rather than the evaluator. The expression `2 + 3 * 4`, with `*` taking precedence over `+`, can be written as `2 3 4 * +` in postfix notation. The expression `(2 + 3) * 4` would be `2 3 + 4 *` in postfix notation. Postfix converts a binary tree structure such as `((2 * 3) / (4 + 2)) - 7` to a linear structure `2 3 * 4 2 + / 7 -`, allowing low-memory, high-performance evaluation at runtime using stack operations (see stack-oriented programming).
The `Parser` performs the infix-to-postfix notation conversion. Before this conversion is performed, the series of tokens is transformed into a binary tree structure in three steps:
- In the first step, the expression is de-parenthesized by introducing nesting. A token array for an expression such as `(x + y) * z` looks like `[LP, VAR(x), OP(+), VAR(y), RP, OP(*), VAR(z)]`. After this operation is performed, the linear token array becomes a tree-like structure containing no parenthesis tokens: `[[VAR(x), OP(+), VAR(y)], OP(*), VAR(z)]`.
- In the second step, the unary operators are transformed into binary operators. The result of this transformation is that an infix expression such as `x + -y` gets interpreted as `x + (-1 * y)`. After this operation is performed, no unary operator tokens exist within the token tree.
- In the third step, binary operators are grouped based on their precedence and associativity values. An infix expression such as `x + y * z + w` becomes `(x + (y * z)) + w`. After this operation is performed, each nested entry has a size of 3 if it is a valid binary expression.
With these transformations performed, it becomes a lot simpler to perform the conversion. Each nested entry is visited and the position of the binary operator is swapped with its right operand. The resulting token tree is then flattened in the order of occurrence of tokens, and this becomes the postfix notation for the given infix expression.
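The swap-and-flatten step can be sketched in Python as a short recursive function. It assumes the token tree is already fully grouped, so that every nested entry is a `[left, operator, right]` triple; the name `to_postfix` and the use of plain strings as tokens are illustrative choices rather than part of `arithmexp`.

```python
def to_postfix(tree):
    """Flatten a fully grouped token tree into a postfix token list."""
    if not isinstance(tree, list):
        return [tree]  # a lone operand is already in postfix form
    left, operator, right = tree
    # Emitting the right operand before the operator is the "swap";
    # concatenating the recursive results is the "flatten".
    return to_postfix(left) + to_postfix(right) + [operator]


# (2 + 3) * 4
print(to_postfix([["2", "+", "3"], "*", "4"]))
```

Applied to the tree for `((2 * 3) / (4 + 2)) - 7`, the same function produces the token sequence `2 3 * 4 2 + / 7 -`.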
The `Parser` returns an `Expression` instance on a successful parse. The `Expression` holds the postfix token array for the given expression. Evaluation of the expression is done by performing stack operations on the postfix token array.

    stack = []
    for each token in postfixTokens do
        if token is an operator then
            right ← pop stack
            left ← pop stack
            value ← left operator right
            push value to stack
        otherwise
            assert token is numeric
            push token to stack
    value ← pop stack
    return value

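A runnable Python sketch of stack-based postfix evaluation is given here for illustration. The operator table `BINARY_OPERATIONS` and the function name `evaluate_postfix` are assumptions, operands are assumed to be already-resolved numbers, and only a few binary operators are covered.

```python
import operator

# Assumed subset of binary operators, mapped to Python callables.
BINARY_OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "%": operator.mod,
    "**": operator.pow,
}


def evaluate_postfix(tokens):
    """Evaluate a postfix token list such as [2, 3, 4, "*", "+"]."""
    stack = []
    for token in tokens:
        if isinstance(token, str) and token in BINARY_OPERATIONS:
            # The right operand was pushed last, so it is popped first.
            right = stack.pop()
            left = stack.pop()
            stack.append(BINARY_OPERATIONS[token](left, right))
        elif isinstance(token, (int, float)):
            stack.append(token)
        else:
            raise TypeError(f"Unexpected token {token!r}")
    if len(stack) != 1:
        raise ValueError("Malformed postfix expression")
    return stack.pop()
```

The pop order matters for non-commutative operators: `evaluate_postfix([3, 2, "-"])` computes `3 - 2`, not `2 - 3`.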
It is possible for an expression to completely skip postfix expression evaluation and yield its value directly when the parser is built with optimizations (i.e., `Parser::createDefault()` instead of `Parser::createUnoptimized()`). An expression such as `3 - 2` or `x / x` is precomputed during parsing and resolved to a `ConstantExpression`, which directly returns the value `1` during evaluation without traversing any tokens.
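The idea of precomputing during parsing can be sketched in Python as a fold over a grouped token tree. How `arithmexp` actually implements its optimizations is not described here, so everything in this sketch is an assumption: the name `fold`, the `[left, operator, right]` tree shape, the operator table, and the single `x / x` rule, which ignores the case where the operand evaluates to zero.

```python
import operator

# Assumed subset of binary operators, mapped to Python callables.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def fold(tree):
    """Precompute every sub-expression whose value is known before evaluation."""
    if not isinstance(tree, list):
        return tree  # a number or a variable name
    left, op, right = fold(tree[0]), tree[1], fold(tree[2])
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPERATIONS[op](left, right)  # both operands are constants
    if op == "/" and left == right:
        return 1  # identical operands; ignores the zero-operand case
    return [left, op, right]
```

When `fold` returns a plain number, as it does for `[3, "-", 2]` and `["x", "/", "x"]`, there is nothing left to evaluate at runtime, which mirrors the role of a constant expression.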
Try out `arithmexp` on the demo site!