Covfefe allows context free grammars to be specified using a language that is a superset of BNF.
- Productions
- Terminals
- Alternations
- Sequence Grouping
- Optional Sequences
- Sequence Repetitions
- Character Ranges
- Full Grammar
A production in a context free grammar has the form X -> b
, which indicates that X
can be replaced by b
.
X
is a non-terminal, b
is a string of terminals and non-terminals.
In BNF, a non-terminal is written as <A>
, a terminal is written as 'a'
or "a"
.
An assignment is written as lhs ::= rhs
.
A grammar that produces Hello World
can thereby be expressed as
<S> ::= 'hello' <whitespace> 'world'
<whitespace> ::= ' '
<whitespace> ::= '\t'
Terminals are strings of characters that are delimited either by '
s or "
s.
To express characters such as newlines, tabs and unicode symbols, backslashes can be used:
<S> ::= '\n' | '\t' | '\\' | '\u{1F602}'
This grammar produces a single newline, a single tab, a single \
or a single 😂
.
When adding grammars as strings in Swift code, backslashes need to be escaped.
The above grammar then becomes:
let grammarString = "<S> ::= '\\n' | '\\t' | '\\\\' | '\\u{1F602}'"
In the above example, whitespace can be replaced either by ' '
or by '\t'
.
This can be written as
<whitespace> ::= ' ' | '\t'
Concatenation has a higher precedence than alternations, so the following grammar produces ab
and cd
<S> ::= 'a' 'b' | 'c' 'd'
Parentheses ((
and )
) can be used to group symbols together.
The following grammar produces abd
and acd
:
<S> ::= 'a' ('b' | 'c') 'd'
To mark a sequence as optional, brackets ([
and ]
) can be used.
The following grammar produces abc
and ac
:
<S> ::= 'a' ['b'] 'c'
To repeat a sequence, braces ({
and }
) can be used.
The following grammar produces aba
, abba
, abbba
, etc. but not aa
:
<S> ::= 'a' {'b'} 'a'
To also generate aa
, repetitions can be matched with optional sequences:
<S> ::= 'a' [{'b'}] 'a'
It is preferred to make the entire repetition optional instead of repeating an optional sequence.
The repetition is generated using a left-recursive auxiliary rule.
To make it easier to specify grammars that recognize a large alphabet, character ranges can be used. The following grammar recognizes all upper and lower case roman letters:
<S> ::= 'A' ... 'Z' | 'a' ... 'z'
The lower bound of a character range must be less than or equal the upper bound (e.g. 'b' ... 'a'
is an invalid range).
The bounds of a range must be exactly one character. The characters can consist of multiple unicode scalars.
The grammar that is recognized is the following:
(* Production rules are separated by newlines and optional whitespace *)
syntax = optional-whitespace | newlines | rule | rule, newlines | syntax, newlines, rule, newlines | syntax, newlines, rule;
(* A rule consists of a non-terminal pattern name, an assignment operator and a production expression *)
rule = optional-whitespace, rule-name-container, optional-whitespace, assignment-operator, optional-whitespace, expression, optional-whitespace;
(* Rule names are strings of alphanumeric characters *)
rule-name-container = "<", rule-name, ">";
rule-name = rule-name, rule-name-char | "";
rule-name-char = "[a-zA-Z0-9-_]";
assignment-operator = ":", ":", "=";
(* An expression can either be a concatenation or an alternation *)
expression = concatenation | alternation;
(* An alternation is a sequence of one or more concatenations separated by | *)
alternation = expression, optional-whitespace, "|", optional-whitespace, concatenation;
(* A concatenation is a string of terminals and non-terminals *)
concatenation = expression-element | concatenation, optional-whitespace, expression-element;
(* An atom of a expression can either be a terminal literal, a non-terminal or a group *)
expression-element = literal | rule-name-container | expression-group | expression-repetition | expression-optional;
(* Expression containers *)
expression-group = "(", optional-whitespace, expression, optional-whitespace, ")";
expression-optional = "[", optional-whitespace, expression, optional-whitespace, "]";
expression-repetition = "{", optional-whitespace, expression, optional-whitespace, "}";
(* Terminal literals *)
literal = "'", string-1, "'" | '"', string-2, '"' | range-literal;
range-literal = single-char-literal, optional-whitespace, ".", ".", ".", optional-whitespace, single-char-literal;
(* Strings and characters *)
string-1 = string-1, string-1-char | "";
string-1-char = ? any character except '"' ? | string-escaped-char | escaped-single-quote;
string-2 = string-2, string-2-char | "";
string-2-char = ? any character except "'" ? | string-escaped-char | escaped-double-quote;
single-char-literal = "'", string-1-char, "'" | '"', string-2-char, '"';
(* Escape sequences *)
string-escaped-char = unicode-scalar | carriage-return | line-feed | tab-char | backslash;
backslash = "\\", "\\";
line-feed = "\\", "n";
carriage-return = "\\", "r";
tab-char = "\\", "t";
escaped-double-quote = "\\", '"';
escaped-single-quote = "\\", "'";
unicode-scalar = "\\", "u", "{", unicode-scalar-digits, "}";
unicode-scalar-digits = [digit], [digit], [digit], [digit], [digit], [digit], [digit], digit;
digit = '0' ... '9' | 'A' ... 'F' | 'a' ... 'f';
(* Whitespace *)
newlines = "\n" | "\n", optional-whitespace, newlines;
optional-whitespace = "" | whitespace, optional-whitespace;
whitespace = " " | "\t" | "\n" | comment;
(* Comments *)
comment = "(", "*", comment-content, "*", ")" | "(", "*", "*", "*", ")";
comment-asterisk = comment-asterisk, "*" | "*";
comment-content = comment-content, comment-content-char | "";
comment-content-char = "[^*(]" | comment-asterisk, "[^)]" | comment-open-parenthesis, "[^*]" | comment;
comment-open-parenthesis = comment-open-parenthesis, "(" | "(";