JLpeg: Pattern Matching and Parsing For Julia

JLpeg provides a fast PEG engine for matching patterns in strings, using a bytecode virtual machine based on the pioneering work of Roberto Ierusalimschy.

Compared to regular expressions, PEGs offer greater power and expressivity, being, in a sense, a formalized and extended version of the deviations from regular languages offered by production regex engines such as PCRE. PEGs are able to parse recursive rule patterns, employ lookahead and lookbehind predicates, and avoid the sort of worst-case complexity the regex is prone to, for most useful patterns.

Compared with parser combinators, a more common algorithm for matching PEG grammars, the approach taken by this package is superior. A bytecode interpreter allows several key optimizations which parser combinators do not allow; generally such libraries choose between a naive backtracking algorithm with bad time complexity and a memorizing packrat algorithm which trades this for bad space complexity, with consequent memory pressure. JLpeg generates programs which may be inspected and modified, and uses an innovative thrown-label pattern to allow excellent error reporting and recovery.

Compared with a "compiler compiler" such as ANTLR or the classic yacc/bison, JLpeg does not generate source code, but rather bytecode, which Julia is able to JIT into near-optimal machine code on the fly. PEGs are also far more suitable for scanning, captures, and other pattern-recognition tasks than these programs, which are best suited to full grammars, a task JLPeg also excels at.

Patterns

Parsing Expression Grammars are built out of patterns, which begin with atomic units of recognition and are combined into complex rules which can recognize context free, and some context sensitive, languages. LPeg, JLpeg's inspiration, uses a Snobol-style set of operator overloads as the basic tool for building up patterns, a practice we also follow.

Although many users of JLpeg may prefer to use one of the Dialects, patterns and their combination are the building block of JLpeg recognition engines, that which dialects compile to.

The API of JLpeg hews closely to Lpeg, with several extensions, refinements, and a more natively Julian character.

The basic operations are as follows:

Operator	Description
`P(string::String)`	match a literal String `string`
`P(n::UInt)`	match any `n` characters
`P(-n)`	match if there are at least `n` characters remaining
`P(sym::Symbol)`	match the rule named :sym
`S(s::String)`	match the set of all characters in `string`
`R("xy")`, `R('x','y')`	matches any character between `x` and `y` (Range)
`B(patt)`	match `patt` behind the cursor, without advancing
`patt^n`	match `n` repetitions of `patt` at most
`patt^-n`	match `n` repetitions of `patt` at least
`patt1 * patt2`	match the sequence `patt1` , `patt2`
`patt1 \| patt2`	match `patt1` or `patt2`, in that order
`patt1 - patt2`	match `patt1` if `patt2` does not match
`!patt`, `¬patt`	negative lookahead, succeeds if `patt` fails
`~patt`	lookahead, match `patt` without advancing
`patt1 >> patt2`	match `patt1`, then search the string for the next `patt2`.
`P(true)` or `ϵ`	always succeed
`P(false)` or `∅`	always fail

In keeping with the spirit of LPeg, P"string" is equivalent to P("string"), and this is true for S and R as well. These basic operations are not recursive, and without further modification merely match the longest substring recognized by the pattern, but suffice to match all regulat languages.

Matching

match(pattern::Pattern, string::String) will attempt to match the pattern against the string, returning a PegMatch <: AbstractMatch. In the event of a failure, it returns a PegFail, with the index of the failure at .errpos. Note that unlike regular expressions, JLpeg will not skip ahead to find a pattern in a string, unless the pattern is so constructed. We offer the easy shorthand "" >> patt to convert a pattern into its searching equivalent; P"" matches the empty string, and JLPeg will convert strings and numbers (but not booleans) into patterns when able.

julia> match(P"123", "123456")
PegMatch(["123"])

julia> match(P"abc" * "123", "abc123")
PegMatch(["abc123"])

julia> match(P"abc" | "123", "123")
PegMatch(["123"])

julia> match(P"abc"^1, "abcabcabc")
PegMatch(["abcabcabc"])

julia> match((!S"123" * R"09")^1, "0987654321")
PegMatch(["0987654"])

julia> match("" >> P"5", "0987654321")
PegMatch(["098765"])

julia> match(~P"abc", "abc123")
PegMatch([""])

julia> match(~P"abc", "123abc") # fails

The operators introduce a pattern 'context', where any a <op> b combination where a or b is a Pattern will attempt to cast the other argument to a Pattern. This UI is adequate for light work, but the macros discussed later are much cleaner.

Note that unlike regular expressions, PEG always starts with the first character, any match (other than nothing) returned by a call to match(patt, string) will therefore be a prefix of the string, up to and including the entire string.

Most interesting uses of pattern recognition will call for more than matching the longest substring. For those more complex cases, we have captures and actions.

Captures

A PegMatch defaults to the longest SubString when no captures are provided, or when the pattern succeeds but all captures within fail (#TODO I may change this behavior). To capture only the substring of interest, use C(patt) or just make a tuple (patt,). Don't forget the comma or Julia will interpret this as a group.

julia> match("" >> (P"56",), "1234567")
PegMatch(["56"])

This matches the empty string, fast-forwards to the first 56, and captures it. Note that the pattern is (P"56",), a tuple, not a group; this is syntax sugar for P("") >> C(P("56")).

[ ]	Operation	What it produces
[X]	`C(patt [, key])`,	captures the substring of `patt`
[X]	`(patt,)`,	same as above, note the comma!
[X]	`(patt, key)`	`key` may be `:symbol` or `"string"`
[X]	`Cg(patt [, key])`,	captures a Vector of values produced by `patt`,
[X]	`[patt], ([patt], key)`	optionally tagged with `key`
[X]	`Cp()`	captures `""` so `PegMatch.offsets` has the position
[ ]	`Cc(any)`	places `any` in `.captures` at the current offset
[X]	`Cr(patt [, key])`	Range of indices [start:end] of `patt`, optional `key`
[ ]	`Ce(patt, :key)`,	groups the captures of`patt` and creates an Expr
[ ]	`patt => :key`	with head `:key` and args `[patt]...`

Actions

A pattern may be modified with an action to be taken, either at runtime, or, more commonly, once the match has completed. These actions are supplied with all captures in patt, or the substring matched by patt itself if patt contains no captures of its own.

T(:label) doesn't capture, but rather, fails the match, records the position of that failure, and throws :label. If there is a rule by that name, it is attempted for error recovery, otherwise :label and the error position are attached to the PegFail struct, in the event that the whole pattern fails. The label :default is reserved by JLpeg for reporting failure of patterns which didn't otherwise throw a label.

[ ]	Action	Consequence
[X]	`A(patt, λ)`,	the returns of function applied to the captures of patt
[X]	`patt \|> λ`
[ ]	`Anow(patt, λ)`,	captures `λ(C(patt)...)` at match time, return
[ ]	`patt > λ`	`nothing` to fail the match
[ ]	`patt ./ λ`	fold/reduces captures with `λ`, captures last return
[X]	`T(:label)`,	fail the match and throw `:label`
[X]	`patt % :label`
[ ]	`M(patt, :label)`	mark a the region of `patt` for later reference
[ ]	`K(patt, :label, op)`	checK `patt` against the last marked region with `op`

Rules

To match recursive structures, patterns must be able to call themselves, which is a Rule. A collection of Rules is a Grammar.

As is the PEG convention, a rule reduction uses the left arrow ←, which you can type as \leftarrow (or in fact \lef[TAB]). We overload <= if you happen to hate Unicode. A simple Grammar can look like this:

abc_and = :a <= P"abc" * (:b | P"")
_123s   = :b ← P"123"^1 * :a
abc123  = Grammar(abc_and, _123s)

match(abc123, "abc123123123abc123abc")

The @grammar and @rule macros are much prettier ways to make a pattern, however:

@grammar abc123 begin
    :a  ←  "abc" * (:b | "")
    :b  ←  "123"^1 * :a
end

@rule :a ← ["abc"^0 * "123"]^1

Always use R"az" and S"abc" forms for ranges and sets in these macros, or you'll get the wrong result.

Dialects

The operator-combining pattern primitives are the basis of JLpeg. Being ordinary Julia code, they offer the maximum level of flexibility and power, and are the basis of the rest of the system.

For many uses, users may prefer one of the several dialects provided. These are parsers written in, of course, JLpeg itself, which compile down to ordinary patterns.

Name		Name	Last commit message	Last commit date
Latest commit History 388 Commits
.github/workflows		.github/workflows
benchmark		benchmark
docs		docs
src		src
test		test
.gitignore		.gitignore
LICENSE		LICENSE
Plans.md		Plans.md
Project.toml		Project.toml
README.md		README.md
bin		bin
kingjames.txt		kingjames.txt
lpegdebug.lua		lpegdebug.lua
profile.pb.gz		profile.pb.gz

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JLpeg: Pattern Matching and Parsing For Julia

Patterns

Matching

Captures

Actions

Rules

Dialects

About

Releases

Packages

Languages

License

mnemnion/JLpeg.jl

Folders and files

Latest commit

History

Repository files navigation

JLpeg: Pattern Matching and Parsing For Julia

Patterns

Matching

Captures

Actions

Rules

Dialects

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages