theme | class | headingDivider |
---|---|---|
default |
invert |
2 |
Original source is found at https://github.com/dhis2/expression-parser/blob/main/PRESENTATION.md
What is DHIS2 expression language?
// math
1 + 1
// functions
log(100)
// logic
true && 2 > 1
// data
#{abcdefghijk}
Butchered (but useful) terminology:
Terminal:
- something that is not composed of parts
- a leaf in a tree
Examples: literals (numbers, constants, strings)
Non-Terminal:
- something that is composed of parts
- a node in a tree
Examples: operators
NonTerminal = (NonTerminal | Terminal)+
The non-terminal on the left is defined as the sequence of non-terminal and terminals on the right.
More (usually):
a | b
=> a OR b (in order of precedence)+
left block 1 or more times*
left block 0 or more times?
left block 0 or 1 times{n}
left block exactly n times{n,m}
left block at least n but at most m times(a b)
grouping to a block[a-z]
a character set of the letters a-z- or
[a b]
could mean: same as(not here)(a b)?
Terms
- consume: moving current position forward in the input
- gobble: consume and discard input (ignore, like WS)
https://en.wikipedia.org/wiki/Context-free_grammar
- reply on precedence (order of declaration) to express intent
- use a "solver" to find the "best match" given a set of rules
A toy expression language:
expr = term '*' term
| term '+' term
term = number | constant
Example expression:
a + b * c + d
a possible AST
+
/ \
+ d
/ \
a *
/ \
b c
or short (a + (b * c)) + d
; but could also be a + ((b * c) + d)
Parsing is a left to right process...
((a + b) * c) + d
😩 that isn't right... lets go back and try again, this time trying something different
a + ((b * c) + d)
better 😌
Parsing is somewhere between linear and factorial complexity
https://en.wikipedia.org/wiki/Parsing_expression_grammar
- PEGs cannot be ambiguous (unlike CFGs)
- if a string parses, it has exactly one valid parse tree
- (presumably: OR not allowed in production rules)
The toy example again, defined slightly different
expr = term (op term)*
term = number | constant
op = '+' | '*'
After a term
if there is a +
or *
there must be another term
and so forth, otherwise this is an illegal input for the language.
Parsing is linear*
* theoretically one could build backtracking into PEG parsers...
Usually refers to the number of "not yet consumed" characters in the input that need to be considered to make the right choice.
expr = number | date
number = [0-9]+('.'[0-9]+)?
date = [0-9]{4}'.'[0-9]{2}'.'[0-9]{2}
How can we decide that we are reading a number or a date?
=> A date has two .
We need to look ahead and see of if we find the second .
- PEG grammar must be "decidable" purely based on lookahead.
- Ideally the lookahead is 1 character only => fast, simple, unambiguous by design
a + b * c + d
Linearly parsed gives
a + (b * (c + d))
😩
What if operators are leaves? We get:
a, +, b, *, c, +, d
Everything is in a "flat" sequence of typed nodes.
Now we walk the "tree" and merge only the operator with the highest precedence into a structured operator with children:
1. a, +, (b * c), +, d
2. ((a + (b * c)) + d)
Voila. 😌
The time is still linear as there is a fixed number of operators to do.
CFGs (ANTLR)
- bad syntax choices ("collisions") are first recognised much later (solver hides them)
- solver means a framework is used, which means limitations
- multiple transformations because of the layers of abstraction
- whitespace is hard to control as it is implicitly assumed
- worst case complexity is factorial
- => bend your problem to suit the parser
PEGs
- decidability problem forces to recognise and solve collisions right away
- language methods are the "framework"
- direct translation (as complicated as needed but not more)
- just methods calling each other (possibly with a convenience layer on top)
- whitespace is no different but needs explicit consideration and modelling
- generally: "special" handling is not different
- worst case complexity is linear*
- => write the parser to suit the problem
Key idea:
void what(Input in, Context ctx);
- what is the name of the token/block processed
Input
- is whatever is processed and "consumed" while parsing
Context
- is whatever is build, the "output"
- usually emitting the base data for creating nodes in an AST
- also might hold state like lookup by name
Example Grammar:
expr = term (op term)*
term = number | constant
op = '+' | '*'
PEG parser:
void expr(Input in, Context ctx) {
term(in, ctx);
char c = in.lookahead();
while (c == '+' || c == '*') { // isOperator(c)
op(in, ctx);
term(in, ctx);
c = in.lookahead();
}
}
void term(Input in, Context ctx) {
in.consumeWhitespace();
char c = in.lookahead();
if (isDigit(c)) {
number(in, ctx);
} else if (isLetter(c)){
constant(in, ctx);
} else {
throw in.error();
}
in.consumeWhitespace();
}
void op(Input in, Context ctx) {
char op = in.consume();
ctx.emitOperator(op);
}
void number(Input in, Context ctx) {
String n = in.consumeWhile(c -> isDigit(c) || c == '.');
ctx.emitNumber(n);
}
void constant(Input in, Context ctx) {
String c = in.consumeWhile(c -> isLetter(c));
ctx.emitConstant(c);
}