Skip to content

Latest commit

 

History

History
503 lines (410 loc) · 17.6 KB

sketch.org

File metadata and controls

503 lines (410 loc) · 17.6 KB

Introduction

langlang is a parser generator and aspires to provide a toolbox for interacting with computer languages in a more general way.

The core piece, and the only thing that is described in the roadmap is the top-down parser generator (based on Parsing Expression Grammars) that is rather feature rich and strives to be as minimal and elegant as possible, without losing sight of performance.

To shape the core piece, and enrich the tooling around extracting and manipulating the meaning of text and structured data, other algorithms and tools will be written along the way. The Rambling section contains some of the ideas about what could be built. And once these ideas get enough shape, they become implementation designs and get implemented.

Roadmap

Features

  • [X] Straightforward Grammar definition language
  • [X] Full unicode support, you can match emojies
  • [X] Matches data structures besides strings
  • [X] Full support for left recursive productions
  • [X] Operator precedence and associativity definition
  • [X] Non-left recursive rules are compiled with zero precedence
  • [ ] Indentation based syntax matching (Relative spaces)
  • [ ] Transform matched values (Semantic action expressions)
  • [-] Error reporting
    • [X] Report correct error position (Farthest failure position)
    • [ ] Position tracking and reporting (location spans)
    • [ ] Option tracking on choice operations
    • [X] Throw custom error labels
    • [ ] Automatic label insertion
  • [-] Error recovery
    • [X] Manual Recovery Expression definition
    • [X] Associate manually defined labels to manually defined REs
    • [ ] Automatic recovery expression generation
  • [ ] Pipe multiple grammars through the command line
  • [ ] parameterized rules?
  • [ ] higher-order rules?
  • [ ] packages and import system?

Standard Library

  • [X] ABNF (RFC-5234)
  • [ ] C
  • [ ] C++
  • [ ] CPP
  • [X] CSV
  • [ ] CSS
  • [ ] Go
  • [ ] Java
  • [ ] JavaScript
  • [X] JSON
  • [ ] JSX
  • [ ] HTML (tolerant to failures – invalid)
  • [ ] Python (relative space)
  • [X] PEG
  • [ ] Uniform Resource Identifier (URI): Generic Syntax (RFC-3986)
  • [ ] Rust
  • [ ] TypeScript
  • [ ] XML

Chores

  • [ ] Allocate values produced from a match within an arena
  • [ ] Decode values into the host language’s types

Implemented Design

Capturing Values

Matches of an input against a grammar will produce an output tree. Currently, the Virtual Machine captures all values successfully matched by default and store them in a stack that is separate from the stack used in the **Production Call** and **Backtracking** mechanisms.

The machine pushes a frame onto the capture stack before it starts looping through the bytecode. Opcodes that implement matching of terminals (Any, Char, Span) push matched values onto the current capture stack frame. A capture stack frame contains a vector of captured values and an index that tracks which capture values have been committed.

When popping a backtrack frame from the stack, the Fail instruction will drain the values not committed from the frame on top of the capture stack. That’s how backtracking the values captured within the same frame is implemented. Notice that the Fail instruction also pops frames off the capture stack when it pops a call frame from the main stack to keep it ballanced with the capture frame pushed by the opcode Call that’s popped by Return when there’s no failure.

The Return instruction will pop the frame on top of the capture stack, wrap all its captures in a Value::Node with the name of the production and its captured values.

In order to support left recursive calls, the Call instruction will commit all captured values before trying to increment the left recursive bound (rule inc.1). And Call that has successfully incremented the left recursive bound will also pop all currently commited values of the frame on the top of the capture stack and wrap them in a Value::Node to be pushed onto the same frame (rule lvar.4).

Code for both Optional (?) and ZeroOrMore (*) operators is emitted surrounded by a pair of CapPush and CapPop instructions, and have a CapCommit instruction that executes after the whole operation is done, upon (handled) failure or success.

As mentioned, by default all matched values are captured. Notice that predicates NOT and AND don’t consume any input, so there are no values to be captured from their successful match. There are two operations that might not move the cursor, the STAR (*) and the OPTIONAL (?) operators, as they never fail even when the input doesn’t match.

Read along to see how one can also transform captured values after they match and before they’re returned.

Error Handling

In the original definition of Parsing Expression Grammars, backtracking is used to reset the input cursor to where it was before trying a different parsing expression. If there is no match, the backtracking fails and the cursor is left at the position it was at the beginning of the last Ordered Choice.

To improve error reporting, there’s a heuristic called the Farther Failure Position that introduces a new register in the Virtual Machine to keep track of the cursor up to the last successful match that is immune to backtracking. With that, a more accurate position is picked when reporting an error.

Still in error reporting, the Throw Operator is also provided, so grammar authors can control how a matching error will be reported in certain places. It comes with the burden of having to annotate the grammar, and to pay attention to the fact that overly annotating a grammar is to take less advantage of some PEG features provided by its unlimited look ahead.

The general place where a Throw Operator would be desired is the earlier position on an expression where it’s known that a following match wouldn’t move the cursor. e.g.:

Consider the following piece of a grammar:

IfStatement <- IF LEFTP Expression RIGHTP Body
AllStatements <- IfStatement / ForStatement / WhileStatement ...

The following inputs are examples of inputs that would unnecessarily trigger the backtrack mechanism in the Ordered Choice of AllStatements:

'if x', 'if (', 'if (x'

Even though there is no path to a successful match with the above inputs, and the Ordered Choice will still try all the alternatives. With the Throw Operator, one can signal that no more matches should be attempted and interrupt parsing right away if that one expression fails. e.g.:

IfStatement <- IF LEFTP^ Expression^ RIGHTP^ Body

The Throw Operator can also take an optional parameter with a custom error message. e.g.:

IfStatement <- IF LEFTP^ Expression^"Missing Expression" RIGHTP^ Body

Note: the Throw Operator in the input language expr^l is syntax sugar for (expr / ⇑l).

Rambling

Incremental Parsing

The parser will fail at the first error by default (as Parsing Expression Grammars do originally). But an incremental parsing mode is also included, but with annotation costs traded for precision.

When parsing is done incrementally, the Throw Operator won’t interrupt parsing right away. It will instead add a special node to the tree returned by the parser storing information about the error. The parser will then execute the Recovery Expression associated with the (implicitly created) label behind the Throw Operator, which should consume the input until where the matching of another expression can be attempted.

The default Recovery Expression of a label of an instance of the Throw Operator is the following:


Annotation costs come from the

Self Hosting

What would it take to build a parser generator tool capable of self-hosting? It can already take a stream of characters, transform it into a tree, and then it can take the tree and traverse it.

There’s now the need of emitting some code that could then be interpreted by the virtual machine that has been used run the parsing and the translating. Besides choosing a format for outputting the emitted code, it will also be necessary to provide auxiliary tooling for introspecting the output stream. Introspection in the sense of reading from arbitrary positions within the output stream, and knowing where the writing cursor is (e.g.: for creating labels).

So, before being capable of self-hosting, this tool has to be able to describe an entire compiler. A first good exercise would be to try and implement something similar to what is described in the article “PEG-based transformer provides front-, middle and back-end stages in a simple compiler” (Piumarta, 2010). langlang isn’t very far from achieving that. There are two main missing steps:

  1. semantic actions to allow transformations to be performed on the values produced upon matches.
  2. a modular output stream, that can encode values in different formats. The backtracking of the parser is the reason why it’s complicated to allow side effects on semantic actions. Our options to deal with that are to either build an output stream that is aware of the backtracking, or to apply the output rules after matching with a classic visitor style.

The initial format of the output stream can be text based, and the proof that it’d work is the ability to compile grammars into their correct text representation of the bytecode that the virtual machine can interpret. There’s some

Tools

Pretty Print / Minifier

  • [X] Parse input file with language grammar and get an AST
  • [X] Generate the tree traversal grammar off the original grammar
  • [ ] Traverse tree grammar and output code (ugly print)
  • [ ] Merge overrides to the default settings of the code emitter
  • [ ] Command line usage
    Usage: langlang print [OPTIONS] INPUT_FILE
    
    OPTIONS
    
    [-g|--grammar] GRAMMAR
        Grammar file used to parse INPUT_FILE
    
    [-o|--output] OUTPUT
        Writes output into OUTPUT file
    
    --cfg-tab-width NUMBER
        Configure indentation size
    
    --cfg-max-line-width NUMBER
        Maximum number of characters in a single line
    
    --cfg-add-trailing-comma
        Add trailing comma to the end of last list entry
        

Diff

  • [X] Parse both files and get their AST
  • [ ] Apply tree diff algorithm
  • [ ] Display results
  • [ ] Command line usage
    langlang diff file_v1.py file_v2.py
    langlang diff file.py file.js
        

Object Database

  • [ ] Undo/Redo
  • [ ] LSP Server
  • [ ] CRDT Storage
  • [ ] AST diff

Editor

  • [ ] Language Server Protocol
  • [ ] Text Editing
  • [ ] Rendering Engine
  • [ ] Configuration Language

Pretty Print / Minifier

Suppose we can parse a .ln file with a given grammar lang.peg. That’d give us an AST as output. One option is to write the translator as a tree traversal for that AST that emits code. That will take one of those traversals per language that needs to be supported. That’d double the burden on the user’s side, since there was already the need of putting together the language grammar.

In order to automate some of the process, one could maybe take the lang.peg file as input and produce a lang.translator.peg, in which rules that output trees would be translated into rules that could also take structured data as input. Take the following rules as an example:

Program    <- _ Statement+ EndOfFile
Statement  <- IfStm / WhileStm / AssignStm / Expression
AssignStm  <- Identifier EQ Expression
IfStm      <- IF PAROP Expression PARCL Body
WhileStm   <- WHILE PAROP Expression PARCL Body
Body       <- Statement / (CUROP Statement* CURCL)
# (...)
IF         <- 'if'    _
WHILE      <- 'while' _
EQ         <- 'eq'    _
PAROP      <- '('     _
PARCL      <- ')'     _
CUROP      <- '{'     _
CURCL      <- '}'     _
# (...)

The following output would be generated:

Program    <- { "Program" _ Statement+ EndOfFile }
Statement  <- { "Statement" IfStm / WhileStm / AssignStm / Expression }
AssignStm  <- { "AssignStm" Identifier EQ Expression  }
IfStm      <- { "IfStm" IF PAROP Expression PARCL Body }
WhileStm   <- { "WhileStm" WHILE PAROP Expression PARCL Body }
Body       <- { "Body" Statement / (CUROP Statement* CURCL) }
# (...)
IF         <- { "IF" Atom }
WHILE      <- { "WHILE" Atom }
EQ         <- { "EQ" Atom }
PAROP      <- { "PAROP" Atom }
PARCL      <- { "PARCL" Atom }
CUROP      <- { "CUROP" Atom }
CURCL      <- { "CURCL" Atom }
# (...)
Atom       <- !{ .* } .

With that, we’d know how to traverse any tree returned by the original lang.peg. We could then build a general traversal that walks down the tree, printing out what was matched.

There is one type of information that is not available in the original grammar though. The specifics of each language! For example, in Python, default values for named arguments aren’t supposed to have spaces surrounding the equal sign e.g.:

def complex(real, imag=0.0):
    return # (...)

But that’s not the same as in JavaScript:

function multiply(a, b = 1) {
  return a * b;
}

To the same extent, minification rules for Python would be different from most other languages as well, given its indentation based definition of scopes.

The good news is that most of these differences, if not all, can be encoded as options available for all languages, leaving the user with a much smaller burden of defining only the overrides for each language that demands options that differ from the defaults in the code emitter.

Semantic Actions

Modules

In langlang, modules are recursive containers for other modules and for grammars.

--------

Module
Rule1
Rule2
RuleN

--------

type Offset usize;
type SymbolName String;
struct Module {
  filename: String,
  // module in which this module was declared
  parent: Option<Module>,
  // modules declared within this module
  modules: Vec<Module>,
  // symbols provided by this module
  symbols: HashMap<SymbolName, Offset>,
  // symbols used in this module but declared somewhere else
  missing: HashMap<SymbolName, Offset>,
}
$ mkdir -p ./lib/base                                    # directory structure for user defined grammars
$ edit ./lib/base/rule.langlang                          # write a grammar
$ llsh lib::base::rule https://url.with.test/case        # a file lib/base/rule.binlang will be created
$ llsh -i. lib::base::rule https://url.with.test/case    # previous example worked because `-i./' is implicit
$ llsh -i./lib base::rule https://url.with.test/case     # full name differs depending on where the root starts
$ MODULE_SEARCH_PATH=./lib llsh base::rule https://url.with.test/case # search path can be extended via env var

When a symbol is requested, a look up to the symbol table is issued and, if it is present there, its address is returned. If it is not, then the BinaryLoader looks for it within the bytecode cache, and if it’s not there, it will go through each search path and try to find it in the file system.

Shell

# from stdin
echo https://clarete.li/langlang | llsh lib::rfc3986

# from a file
llsh lib::rfc5234 ~/lib/rfc3986.abnf

# from a URL
llsh lib::json https://jsonplaceholder.typicode.com/users

# interactive
llsh lib::peg
>> S <- 'a' / 'b'

Matching

Literal Strings

Left Recursion

Captures

state = <pc, s, e, c>

<pc, s, e, c> – Char a –> <pc+1, s, e, a:c> <pc, s, e, c> – Choice i –> <pc+1, s, (pc+i,s,c):e, c>

Error Handling

Success

Throw f <pc,s,e> -----------→ Fail<f,s,e>

  • inside choice
    p / throw(label)
        

    when p fails: -> log error tuple (location(), label) -> run expression within R(label)

  • inside predicate
    !(p / throw(label))
        

    when p succeeds: -> return label fail when p fails: -> R is empty for predicates, so return throw doesn’t do anything, label is discarded and the operation succeeds.

Once an expression fails to be parsed and throw is called, a look up for label is made within R. If a recovery expression is found, it’s executed with the goal of moving the parser’s input cursor to right before the first symbol of the next parsing expression.

Follow Sets

An Expression e has a FOLLOW set of symbols that can be intuitively described as the list of possible characters to be matched after matching e.

  1. Base Case
    G <- (E / ⇑l) "x"
        

    The symbol x would be the only element of the FOLLOW set of symbols of E.

  2. Recursive Case
    G <- (E / ⇑l) (A / B)
    A <- "x" / "y"
    B <- "z" / "k"
        

    The FOLLOW set of E in this case is x, y, z, k, since any of these symbols could appear right after parsing E.