langlang is a parser generator that aspires to provide a toolbox for interacting with computer languages in a more general way.
The core piece, and the only thing described in the roadmap, is a top-down parser generator based on Parsing Expression Grammars. It is rather feature rich and strives to be as minimal and elegant as possible without losing sight of performance.
To shape the core piece and enrich the tooling around extracting and manipulating the meaning of text and structured data, other algorithms and tools will be written along the way. The Rambling section collects ideas about what could be built; once an idea takes enough shape, it becomes an implementation design and gets implemented.
- [X] Straightforward Grammar definition language
- [X] Full Unicode support, you can match emojis
- [X] Matches data structures besides strings
- [X] Full support for left recursive productions
- [X] Operator precedence and associativity definition
- [X] Non-left recursive rules are compiled with zero precedence
- [ ] Indentation based syntax matching (Relative spaces)
- [ ] Transform matched values (Semantic action expressions)
- [-] Error reporting
- [X] Report correct error position (Farthest failure position)
- [ ] Position tracking and reporting (location spans)
- [ ] Option tracking on choice operations
- [X] Throw custom error labels
- [ ] Automatic label insertion
- [-] Error recovery
- [X] Manual Recovery Expression definition
- [X] Associate manually defined labels with manually defined Recovery Expressions
- [ ] Automatic recovery expression generation
- [ ] Pipe multiple grammars through the command line
- [ ] parameterized rules?
- [ ] higher-order rules?
- [ ] packages and import system?
- [X] ABNF (RFC-5234)
- [ ] C
- [ ] C++
- [ ] CPP
- [X] CSV
- [ ] CSS
- [ ] Go
- [ ] Java
- [ ] JavaScript
- [X] JSON
- [ ] JSX
- [ ] HTML (tolerant to failures – invalid)
- [ ] Python (relative space)
- [X] PEG
- [ ] Uniform Resource Identifier (URI): Generic Syntax (RFC-3986)
- [ ] Rust
- [ ] TypeScript
- [ ] XML
- [ ] Allocate values produced from a match within an arena
- [ ] Decode values into the host language’s types
Matches of an input against a grammar produce an output tree. Currently, the Virtual Machine captures all successfully matched values by default and stores them in a stack that is separate from the stack used by the **Production Call** and **Backtracking** mechanisms.
The machine pushes a frame onto the capture stack before it starts looping through the bytecode. Opcodes that implement matching of terminals (`Any`, `Char`, `Span`) push matched values onto the current capture stack frame. A capture stack frame contains a vector of captured values and an index that tracks which captured values have been committed.
When popping a backtrack frame from the stack, the `Fail` instruction drains the uncommitted values from the frame on top of the capture stack. That is how backtracking the values captured within the same frame is implemented. Notice that the `Fail` instruction also pops frames off the capture stack when it pops a call frame from the main stack, to keep it balanced with the capture frame pushed by the `Call` opcode and popped by `Return` when there is no failure.
The `Return` instruction pops the frame on top of the capture stack and wraps all its captures in a `Value::Node` carrying the name of the production and its captured values.
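As an illustration of the mechanism above, here is a minimal sketch of a capture stack frame. The names `CaptureFrame`, `drain_uncommitted`, and `into_node` are hypothetical, as is the `Char` variant; only `Value::Node` and the committed index come from the description above.

```rust
// Hypothetical sketch of a capture stack frame (not the actual VM types).
enum Value {
    Char(char),
    Node { name: String, children: Vec<Value> },
}

struct CaptureFrame {
    values: Vec<Value>, // values captured by Any, Char, and Span
    committed: usize,   // how many of those values have been committed
}

impl CaptureFrame {
    // What `Fail` would do to the top frame: drop everything not yet committed.
    fn drain_uncommitted(&mut self) {
        self.values.truncate(self.committed);
    }

    // What `Return` would do: wrap the frame's captures in a `Value::Node`
    // named after the production that just matched.
    fn into_node(self, production: &str) -> Value {
        Value::Node { name: production.to_string(), children: self.values }
    }
}
```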
In order to support left recursive calls, the `Call` instruction commits all captured values before trying to increment the left recursive bound (rule `inc.1`). A `Call` that has successfully incremented the left recursive bound will also pop all currently committed values off the frame on top of the capture stack and wrap them in a `Value::Node` to be pushed back onto the same frame (rule `lvar.4`).
Code for both the `Optional` (`?`) and `ZeroOrMore` (`*`) operators is emitted surrounded by a pair of `CapPush` and `CapPop` instructions, and has a `CapCommit` instruction that executes after the whole operation is done, upon (handled) failure or success.
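The exact instruction layout isn't shown here, but a rough sketch of that shape for `e?` could look like the following. `Compiler`, `Op`, `Expr`, and `compile_with_backtracking` are made-up placeholders; only `CapPush`, `CapCommit`, and `CapPop` come from the description above, and the exact position of `CapCommit` relative to `CapPop` is an assumption.

```rust
// Rough sketch only: the capture-related shape emitted around `e?`.
enum Op { CapPush, CapCommit, CapPop /* plus the regular matching opcodes */ }

struct Expr; // placeholder for a parsed expression

struct Compiler { code: Vec<Op> }

impl Compiler {
    fn emit(&mut self, op: Op) { self.code.push(op); }

    // Stands in for whatever backtracking handling the real compiler emits so
    // that a failure of `e` is handled instead of aborting the whole match.
    fn compile_with_backtracking(&mut self, _e: &Expr) { /* ... */ }

    fn compile_optional(&mut self, e: &Expr) {
        self.emit(Op::CapPush);            // open a fresh capture frame
        self.compile_with_backtracking(e); // a handled failure drains only this frame
        self.emit(Op::CapCommit);          // commit once the whole operation is done
        self.emit(Op::CapPop);             // close the frame
    }
}
```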
As mentioned, by default all matched values are captured. Notice that the predicates `NOT` and `AND` don't consume any input, so there are no values to be captured from their successful match. There are two operations that might not move the cursor, the `STAR` (`*`) and the `OPTIONAL` (`?`) operators, as they never fail even when the input doesn't match.
Read along to see how one can also transform captured values after they match and before they’re returned.
In the original definition of Parsing Expression Grammars, backtracking is used to reset the input cursor to where it was before a different parsing expression is tried. If no alternative matches, the choice fails and the cursor is left at the position it had at the beginning of the last Ordered Choice.
To improve error reporting, there is a heuristic called the Farthest Failure Position: it introduces a new register in the Virtual Machine that keeps track of the farthest position the cursor has reached and is immune to backtracking. With that, a more accurate position is picked when reporting an error.
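As a rough sketch of that heuristic (not the actual VM code; the names here are made up), the register only ever grows and is simply left untouched when the machine backtracks:

```rust
// Hypothetical sketch of the Farthest Failure Position heuristic.
struct Vm {
    cursor: usize, // current input position, rolled back on backtracking
    ffp: usize,    // farthest position reached; immune to backtracking
}

impl Vm {
    fn advance(&mut self, n: usize) {
        self.cursor += n;
        self.ffp = self.ffp.max(self.cursor); // only ever grows
    }

    fn backtrack_to(&mut self, saved_cursor: usize) {
        self.cursor = saved_cursor; // ffp is intentionally left untouched
    }

    // Position used when reporting an error to the user.
    fn error_position(&self) -> usize {
        self.ffp
    }
}
```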
Still on error reporting, the Throw Operator is also provided so that grammar authors can control how a matching error is reported in specific places. It comes with the burden of annotating the grammar, and over-annotating a grammar means taking less advantage of the unlimited lookahead that PEGs provide.
The natural place for a Throw Operator is the earliest position in an expression at which it is known that, upon failure, no other alternative could match the input. e.g.:
Consider the following piece of a grammar:
IfStatement <- IF LEFTP Expression RIGHTP Body
AllStatements <- IfStatement / ForStatement / WhileStatement ...
The following inputs would unnecessarily trigger the backtrack mechanism in the Ordered Choice of `AllStatements`:
'if x', 'if (', 'if (x'
Even though there is no path to a successful match with the above inputs, the Ordered Choice will still try all the alternatives. With the Throw Operator, one can signal that no more matches should be attempted and interrupt parsing right away if that one expression fails. e.g.:
IfStatement <- IF LEFTP^ Expression^ RIGHTP^ Body
The Throw Operator can also take an optional parameter with a custom error message. e.g.:
IfStatement <- IF LEFTP^ Expression^"Missing Expression" RIGHTP^ Body
Note: the Throw Operator in the input language, `expr^l`, is syntax sugar for `(expr / ⇑l)`.
By default, the parser fails at the first error (as Parsing Expression Grammars originally do). An incremental parsing mode is also included, though, trading annotation cost for precision.
When parsing is done incrementally, the Throw Operator won’t interrupt parsing right away. Instead, it adds a special node to the tree returned by the parser, storing information about the error. The parser then executes the Recovery Expression associated with the (implicitly created) label behind the Throw Operator, which should consume input up to a point where the matching of another expression can be attempted.
The default Recovery Expression for a label introduced by the Throw Operator is the following:
Annotation costs come from the
What would it take to build a parser generator tool capable of self-hosting? It can already take a stream of characters and transform it into a tree, and it can take that tree and traverse it.
There is now the need to emit code that can then be interpreted by the same virtual machine that has been used to run the parsing and the translating. Besides choosing a format for outputting the emitted code, it will also be necessary to provide auxiliary tooling for introspecting the output stream: reading from arbitrary positions within it, and knowing where the writing cursor is (e.g. for creating labels).
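A minimal sketch of what such an introspectable output stream could look like, assuming made-up names (`OutputStream`, `cursor`, `read_at`, `patch`) and a plain byte buffer:

```rust
// Hypothetical sketch of an introspectable output stream (names are made up).
struct OutputStream {
    buffer: Vec<u8>,
}

impl OutputStream {
    fn write(&mut self, bytes: &[u8]) {
        self.buffer.extend_from_slice(bytes);
    }

    // Where the writing cursor currently is, e.g. to record a label's address.
    fn cursor(&self) -> usize {
        self.buffer.len()
    }

    // Read back from an arbitrary position, e.g. to inspect what was emitted.
    fn read_at(&self, offset: usize, len: usize) -> &[u8] {
        &self.buffer[offset..offset + len]
    }

    // Patch a previously emitted placeholder once a label's target is known.
    fn patch(&mut self, offset: usize, bytes: &[u8]) {
        self.buffer[offset..offset + bytes.len()].copy_from_slice(bytes);
    }
}
```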
So, before being capable of self-hosting, this tool has to be able to describe an entire compiler. A good first exercise would be to implement something similar to what is described in the article “PEG-based transformer provides front-, middle and back-end stages in a simple compiler” (Piumarta, 2010). langlang isn’t very far from achieving that. There are two main missing steps:
- semantic actions, to allow transformations to be performed on the values produced upon matches;
- a modular output stream that can encode values in different formats. The backtracking of the parser is what makes side effects in semantic actions complicated. The options for dealing with that are either to build an output stream that is aware of the backtracking, or to apply the output rules after matching, in classic visitor style.
The initial format of the output stream can be text based, and the proof that it works would be the ability to compile grammars into the correct text representation of the bytecode that the virtual machine can interpret. There’s some
- [X] Parse input file with language grammar and get an AST
- [X] Generate the tree traversal grammar off the original grammar
- [ ] Traverse tree grammar and output code (ugly print)
- [ ] Merge overrides to the default settings of the code emitter
- [ ] Command line usage
Usage: langlang print [OPTIONS] INPUT_FILE

OPTIONS
  [-g|--grammar] GRAMMAR       Grammar file used to parse INPUT_FILE
  [-o|--output] OUTPUT         Writes output into OUTPUT file
  --cfg-tab-width NUMBER       Configure indentation size
  --cfg-max-line-width NUMBER  Maximum number of characters in a single line
  --cfg-add-trailing-comma     Add trailing comma to the end of last list entry
- [X] Parse both files and get their AST
- [ ] Apply tree diff algorithm
- [ ] Display results
- [ ] Command line usage
langlang diff file_v1.py file_v2.py
langlang diff file.py file.js
- [ ] Undo/Redo
- [ ] LSP Server
- [ ] CRDT Storage
- [ ] AST diff
- [ ] Language Server Protocol
- [ ] Text Editing
- [ ] Rendering Engine
- [ ] Configuration Language
Suppose we can parse a `.ln` file with a given grammar `lang.peg`. That would give us an AST as output. One option is to write the translator as a tree traversal over that AST that emits code. That would require one such traversal per language to be supported, doubling the burden on the user’s side, since they already had to put the language grammar together.
In order to automate some of the process, one could perhaps take the `lang.peg` file as input and produce a `lang.translator.peg`, in which rules that output trees are translated into rules that can also take structured data as input. Take the following rules as an example:
Program <- _ Statement+ EndOfFile
Statement <- IfStm / WhileStm / AssignStm / Expression
AssignStm <- Identifier EQ Expression
IfStm <- IF PAROP Expression PARCL Body
WhileStm <- WHILE PAROP Expression PARCL Body
Body <- Statement / (CUROP Statement* CURCL)
# (...)
IF <- 'if' _
WHILE <- 'while' _
EQ <- 'eq' _
PAROP <- '(' _
PARCL <- ')' _
CUROP <- '{' _
CURCL <- '}' _
# (...)
The following output would be generated:
Program <- { "Program" _ Statement+ EndOfFile }
Statement <- { "Statement" IfStm / WhileStm / AssignStm / Expression }
AssignStm <- { "AssignStm" Identifier EQ Expression }
IfStm <- { "IfStm" IF PAROP Expression PARCL Body }
WhileStm <- { "WhileStm" WHILE PAROP Expression PARCL Body }
Body <- { "Body" Statement / (CUROP Statement* CURCL) }
# (...)
IF <- { "IF" Atom }
WHILE <- { "WHILE" Atom }
EQ <- { "EQ" Atom }
PAROP <- { "PAROP" Atom }
PARCL <- { "PARCL" Atom }
CUROP <- { "CUROP" Atom }
CURCL <- { "CURCL" Atom }
# (...)
Atom <- !{ .* } .
With that, we’d know how to traverse any tree returned by the original `lang.peg`. We could then build a general traversal that walks down the tree, printing out what was matched.
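Such a general traversal can be quite small. The sketch below assumes a `Value` enum shaped like the `Value::Node` mentioned earlier (the `Char` variant and the `print_tree` name are made up):

```rust
// Hypothetical sketch of a generic traversal that prints back what was matched.
enum Value {
    Char(char),
    Node { name: String, children: Vec<Value> },
}

fn print_tree(value: &Value, out: &mut String) {
    match value {
        // Terminals are written out exactly as they were matched.
        Value::Char(c) => out.push(*c),
        // Non-terminals just recurse into their children, in order.
        Value::Node { children, .. } => {
            for child in children {
                print_tree(child, out);
            }
        }
    }
}
```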
There is one type of information that is not available in the original grammar though: the specifics of each language! For example, in Python, default values for named arguments aren’t supposed to have spaces surrounding the equals sign, e.g.:
def complex(real, imag=0.0):
    return  # (...)
But that’s not the same as in JavaScript:
function multiply(a, b = 1) {
  return a * b;
}
To the same extent, minification rules for Python would differ from most other languages as well, given its indentation-based definition of scopes.
The good news is that most of these differences, if not all, can be encoded as options available for all languages, leaving the user with the much smaller burden of defining, per language, only the overrides to the code emitter’s defaults.
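One way to encode that, sketched here with made-up names that loosely follow the `--cfg-*` flags of the `print` command shown earlier, is a struct of shared defaults plus a table of per-language overrides:

```rust
use std::collections::HashMap;

// Hypothetical sketch: shared emitter defaults plus per-language overrides.
#[derive(Clone)]
struct EmitterConfig {
    tab_width: usize,
    max_line_width: usize,
    add_trailing_comma: bool,
    space_around_default_arg: bool, // e.g. false for Python, true for JavaScript
}

impl EmitterConfig {
    fn defaults() -> Self {
        EmitterConfig {
            tab_width: 4,
            max_line_width: 80,
            add_trailing_comma: false,
            space_around_default_arg: true,
        }
    }
}

// Only languages whose options differ from the defaults need an entry here.
fn overrides() -> HashMap<&'static str, EmitterConfig> {
    let mut map = HashMap::new();
    map.insert("python", EmitterConfig {
        space_around_default_arg: false,
        ..EmitterConfig::defaults()
    });
    map
}

fn config_for(lang: &str) -> EmitterConfig {
    overrides().remove(lang).unwrap_or_else(EmitterConfig::defaults)
}
```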
In langlang, modules are recursive containers for other modules and for grammars.
----------
| Module |
|--------|
| Rule1  |
| Rule2  |
| RuleN  |
----------
use std::collections::HashMap;

type Offset = usize;
type SymbolName = String;

struct Module {
    filename: String,
    // module in which this module was declared (boxed so the type has a finite size)
    parent: Option<Box<Module>>,
    // modules declared within this module
    modules: Vec<Module>,
    // symbols provided by this module
    symbols: HashMap<SymbolName, Offset>,
    // symbols used in this module but declared somewhere else
    missing: HashMap<SymbolName, Offset>,
}
$ mkdir -p ./lib/base # directory structure for user defined grammars
$ edit ./lib/base/rule.langlang # write a grammar
$ llsh lib::base::rule https://url.with.test/case # a file lib/base/rule.binlang will be created
$ llsh -i. lib::base::rule https://url.with.test/case # previous example worked because `-i./' is implicit
$ llsh -i./lib base::rule https://url.with.test/case # full name differs depending on where the root starts
$ MODULE_SEARCH_PATH=./lib llsh base::rule https://url.with.test/case # search path can be extended via env var
When a symbol is requested, a lookup in the symbol table is issued and, if the symbol is present there, its address is returned. If it is not, the `BinaryLoader` looks for it in the bytecode cache and, if it isn’t there either, it goes through each search path and tries to find it in the file system.
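That lookup order could be sketched roughly as follows; apart from `BinaryLoader` and the `.binlang` extension, every name here is made up:

```rust
use std::collections::HashMap;
use std::path::PathBuf;

type Offset = usize;

// Hypothetical sketch of the resolution order: symbol table, then bytecode
// cache, then each entry of the module search path.
struct BinaryLoader {
    symbols: HashMap<String, Offset>,
    bytecode_cache: HashMap<String, Vec<u8>>,
    search_paths: Vec<PathBuf>,
}

impl BinaryLoader {
    fn resolve(&mut self, symbol: &str) -> Option<Offset> {
        // 1. already linked? return its address right away
        if let Some(&offset) = self.symbols.get(symbol) {
            return Some(offset);
        }
        // 2. already loaded but not linked? link it straight from the cache
        if let Some(code) = self.bytecode_cache.get(symbol).cloned() {
            return Some(self.link(symbol, code));
        }
        // 3. otherwise walk the search paths looking for a compiled module
        for path in self.search_paths.clone() {
            let candidate = path.join(symbol.replace("::", "/")).with_extension("binlang");
            if candidate.exists() {
                let code = std::fs::read(&candidate).ok()?;
                return Some(self.link(symbol, code));
            }
        }
        None
    }

    // Placeholder: record the bytecode in the symbol table and return its offset.
    fn link(&mut self, symbol: &str, _code: Vec<u8>) -> Offset {
        let offset = self.symbols.len();
        self.symbols.insert(symbol.to_string(), offset);
        offset
    }
}
```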
# from stdin
echo https://clarete.li/langlang | llsh lib::rfc3986
# from a file
llsh lib::rfc5234 ~/lib/rfc3986.abnf
# from a URL
llsh lib::json https://jsonplaceholder.typicode.com/users
# interactive
llsh lib::peg
>> S <- 'a' / 'b'
state = <pc, s, e, c>
<pc, s, e, c> ── Char a  ──→ <pc+1, s, e, a:c>
<pc, s, e, c> ── Choice i ──→ <pc+1, s, (pc+i, s, c):e, c>
Success

Throw f: <pc, s, e> ──→ Fail<f, s, e>
- Inside a choice, `p / throw(label)`, when `p` fails:
  -> log the error tuple `(location(), label)`
  -> run the expression within `R(label)`
- Inside a predicate, `!(p / throw(label))`:
  - when `p` succeeds: -> return the label `fail`
  - when `p` fails: -> `R` is empty for predicates, so the `throw` doesn’t do anything; `label` is discarded and the operation succeeds.
Once an expression fails to be parsed and `throw` is called, a lookup for `label` is made within `R`. If a recovery expression is found, it is executed with the goal of moving the parser’s input cursor to right before the first symbol of the next parsing expression.
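In code, that lookup could be sketched roughly like this; `Parser`, `handle_throw`, and `run_recovery` are made-up names, and `recovery_table` stands in for `R`:

```rust
use std::collections::HashMap;

type Label = String;

// Placeholder for a compiled recovery expression.
struct RecoveryExpr;

// Stand-in for executing a recovery expression: it should consume input and
// return the new cursor position.
fn run_recovery(_expr: &RecoveryExpr, cursor: usize) -> usize {
    cursor
}

struct Parser {
    cursor: usize,
    errors: Vec<(usize, Label)>,
    // `R`: maps labels to their recovery expressions.
    recovery_table: HashMap<Label, RecoveryExpr>,
}

impl Parser {
    // Hypothetical sketch of what happens when `throw(label)` is reached.
    fn handle_throw(&mut self, label: &str) -> Result<(), Label> {
        // log the error tuple (location, label)
        self.errors.push((self.cursor, label.to_string()));
        match self.recovery_table.get(label) {
            // Run the recovery expression to move the cursor right before the
            // first symbol of the next parsing expression.
            Some(recovery) => {
                self.cursor = run_recovery(recovery, self.cursor);
                Ok(())
            }
            // No recovery expression registered: the labeled failure propagates.
            None => Err(label.to_string()),
        }
    }
}
```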
Follow Sets
An expression `e` has a `FOLLOW` set of symbols that can be intuitively described as the list of possible characters that can be matched right after matching `e`.
- Base Case

  G <- (E / ⇑l) "x"

  The symbol `x` would be the only element of the `FOLLOW` set of symbols of `E`.

- Recursive Case

  G <- (E / ⇑l) (A / B)
  A <- "x" / "y"
  B <- "z" / "k"

  The `FOLLOW` set of `E` in this case is `x, y, z, k`, since any of these symbols could appear right after parsing `E`.