- `%parse-param` can now use types that implement `Clone` (i.e. relaxing the previous stringent requirement that types were `Copy`).
- Document start states in the grmtools book.
- Allow `lrlex` and `nimbleparse` to read from stdin if the path is `-`.
-
Catch incorrectly terminated productions. Previously this led to a confusing situation where productions could be merged together.
-
Give better warnings if lexer/grammar can't be read at build time.
- Allow `%prec` to define a new token in the grammar. Previously if `%prec 'x'` was the first mention of "x" then an error would be raised.
- Tell the user where an incomplete action in a grammar started, not finished (since it always "finishes" at the end of the file).
- Improve error messages for conflicts and the like, giving the span of input related to the error.
- Change generated code to avoid errors about `unsafe` action code in input grammars.
- Hide unstable parts of the API behind an `_unstable_api` feature that is off by default.
-
lrlex now explicitly raises an error when a rule in an input file has leading space. There is a small chance of this breaking existing input files, but it brings lrlex into line with POSIX lex where leading space indicates verbatim code (a concept for which lrlex currently has no support), making porting errors less likely.
- Report errors on `%epp` declarations in terms of the input file (rather than pretending they're all at line 1, column 1).
- Reorganise internal testing framework.
- Add CTLexerBuilder options for configuring regex behavior.
- Support `%empty` in productions. This Bison-ism can be used as a signal to readers that a production really is meant to be empty.
- Allow rules to be repeated, with each occurrence contributing its own production(s). In other words this grammar:
A: 'x'; A: 'y';
is now equivalent to:
A: 'x' | 'y';
This release contains a number of new features and breaking changes. Most of the breaking changes are in advanced/niche parts of the API, which few users will notice. However, four breaking changes might affect a more substantial subset of users: those are highlighted in the "breaking changes (major)" section below.
- Improved error messages, with various parts of grammar and lexer files now carrying around `Span` information to help pinpoint errors.
- A single error can be related to multiple `Span`s. For example, if you duplicate a rule name in a grammar, all duplicates are reported in a single error.
- The new `cfgrammar::NewlineCache` struct makes it easier to store the minimal information needed to convert byte offsets in an input into logical line numbers.
- lrlex now supports start states.
- Unused tokens / rules in a grammar are now detected and, by default, reported as errors. The `%expect-unused` directive can suppress these reports on a per-token/rule basis.
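The idea of storing "the minimal information needed" to map byte offsets to line numbers can be sketched as follows. This is an illustration only: `LineIndex` and its methods are hypothetical names, not the `cfgrammar::NewlineCache` API. It assumes 1-based line numbers and stores just the byte offset at which each line starts.

```rust
/// Hypothetical helper (not the `cfgrammar::NewlineCache` API): it stores only
/// the byte offset at which each line of the input starts.
pub struct LineIndex {
    line_starts: Vec<usize>,
}

impl LineIndex {
    /// Record the start offset of every line in `src`.
    pub fn new(src: &str) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in src.bytes().enumerate() {
            if b == b'\n' {
                // The next line starts just after the newline byte.
                line_starts.push(i + 1);
            }
        }
        LineIndex { line_starts }
    }

    /// Convert a byte offset into a 1-based line number using binary search.
    pub fn line_of(&self, byte_off: usize) -> usize {
        // `partition_point` returns how many line starts are <= `byte_off`.
        self.line_starts.partition_point(|&start| start <= byte_off)
    }
}
```

Because the line starts are sorted, each lookup is a binary search over the number of lines rather than a scan of the whole input.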
- Start states mean that lrlex now interprets `<` in a regular expression differently than before: to restore the previous behaviour, escape `<` with `\`. For example, the lrlex rule `< "<"` now appears to lrlex as an incomplete start state: replacing it with `\< "<"` fixes the problem.
- grmtools now bundles many of its type parameters into a `LexerTypes` trait, to avoid forcing users to endlessly repeat multiple arguments. If you specified a custom `StorageT` with lrlex/lrpar then you will need to change idioms such as `DefaultLexeme<StorageT>, StorageT>` to `DefaultLexerTypes<StorageT>`. For example, you might need to change:
use lrlex::{DefaultLexeme, LRNonStreamingLexer};
...
lexer: &LRNonStreamingLexer<DefaultLexeme<StorageT>, StorageT>,
to:
use lrlex::{DefaultLexerTypes, LRNonStreamingLexer};
...
lexer: &LRNonStreamingLexer<DefaultLexerTypes<StorageT>>,
-
Unused tokens / rules in a grammar are now detected and, by default, reported as errors. For example, the common "trick" to turn lexing errors into parsing errors suggests adding the following to a grammar:
Unmatched -> (): "UNMATCHED" { } ;
Both the `Unmatched` rule and the `UNMATCHED` token will be reported as unused. You can tell grmtools that you expect this to happen with the `%expect-unused` directive:
%expect-unused Unmatched "UNMATCHED"
- `StorageT` is now used to represent parser states (whereas before it was hard-coded to a `u16`). If you used a custom `StorageT`, it may no longer be big enough: if this happens, an error will be reported while the grammar is being built. You will then need to increase the size of your `StorageT` (e.g. you might need to change `StorageT` from `u8` to `u16`).
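To make the "`StorageT` may no longer be big enough" condition concrete, here is a minimal sketch of the kind of check involved. The function `fits_in` is hypothetical and is not how grmtools performs the check; it only illustrates that a grammar with more states than an integer type can index needs a wider type.

```rust
use std::convert::TryFrom;

/// Hypothetical check (not grmtools' code): can every index below `count`
/// be represented in the storage type `T`?
pub fn fits_in<T: TryFrom<usize>>(count: usize) -> bool {
    // The largest index that must be stored is `count - 1`.
    count == 0 || T::try_from(count - 1).is_ok()
}
```

Under this sketch, 256 states fit in a `u8` (indices 0 to 255), whereas 257 states would require moving to `u16`.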
- Serde support for lrpar now requires enabling the `serde` feature.
- `cfgrammar::yacc::grammar::YaccGrammar::token_span` now returns `Option<Span>` rather than `Option<&Span>`.
- `cfgrammar::yacc::ast::{Production, Symbol}` no longer derive `Eq`, `Hash`, and `PartialEq`. Since both now carry a `Span`, it's easy to confuse "two {productions, symbols} have the same name" with "at the same place in the input file."
- `cfgrammar::yacc::ast::add_rule`'s signature has changed from:
pub fn add_rule(&mut self, name: String, actiont: Option<String>) {
to:
pub fn add_rule(&mut self, (name, name_span): (String, Span), actiont: Option<String>) {
- `GrammarValidationError` and `YaccParserError` have been combined into a struct `YaccGrammarError` (which replaces the previous enum of that name). The new `YaccGrammarError` has a private enum, so will mean fewer semver-breaking changes.
- `LexBuildResult` now carries `Err(Vec<LexBuildError>)` on failure rather than `Err(LexBuildError)`.
- The `lrlex::lexemes` module has been renamed to `lrlex::defaults` to better describe what it is providing.
- `Span` has moved from `lrpar` to `cfgrammar`. The import is still available via `lrpar` but is deprecated (though due to rust-lang/rust#30827 this unfortunately does not show as a formal deprecation warning).
- `cfgrammar::yacc::grammar::YaccGrammar::rule_name` has been renamed to `rule_name_str`. The old name is still available but is deprecated.
-
Move to Rust 2021 edition.
-
Various minor clean-ups.
- Explicitly error if the user tries to generate two or more lexers or parsers with the same output file name. Previously the final lexer/parser created "won the race", leading to a confusing situation where seemingly correct code would not compile. Users can explicitly set an output path via `output_path` that allows multiple lexers/parsers to be generated with unique names.
An overview of the changes in this version:
- `Lexeme` is now a trait not a struct, increasing flexibility, but requiring some changes in user code.
- The build API has slightly changed, requiring some changes in user code.
- `%parse-param` is now supported.
- `lrlex` provides a new API to make it easy to use simple hand-written lexers instead of its default lexer.
`lrpar` now defines a `Lexeme` trait not a `Lexeme` struct: this allows the parser to abstract away from the particular data-layout of a lexeme (allowing, for example, a lexer to attach extra data to a lexeme that can be accessed by parser actions) but does add an extra type parameter `LexemeT` to several interfaces. Conventionally the `LexemeT` type parameter precedes the `StorageT` type parameter in the list of type parameters.
`lrlex` defaults to using its new `DefaultLexeme` struct, which provides a generic lexeme struct similar to that previously provided by `lrlex` (though note that you can use `lrlex` with a lexeme struct of your own choosing).
The precise effects of these changes will depend on how you use grmtools' libraries but in general:
- You will need to change your lexeme imports from:
use lrpar::Lexeme;
to:
use lrlex::DefaultLexeme;
use lrpar::Lexeme;
- Most references to `Lexeme` will need to refer to `DefaultLexeme`.
- Any references to `LRNonStreamingLexerDef` will need to change from:
LRNonStreamingLexerDef<u32>
to:
LRNonStreamingLexerDef<DefaultLexeme<u32>, u32>
where `u32` is the `StorageT` of your choice.
One of the additional benefits of this change is that it allows `lrpar` and lexers (e.g. `lrlex`) to be clearly separated: `lrpar` now only defines traits which lexers have to conform to.
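The benefit of a lexeme trait over a lexeme struct can be sketched in plain Rust. The names below (`SketchLexeme`, `DepthLexeme`, `end_offset`) are hypothetical and much simpler than the real `lrpar::Lexeme` trait; the point is only that code written against a trait does not depend on the lexeme's data layout, so a lexer can attach extra data.

```rust
/// Hypothetical, simplified stand-in for a lexeme trait; the real
/// `lrpar::Lexeme` trait has a different and larger set of methods.
pub trait SketchLexeme {
    fn tok_id(&self) -> u32;
    fn start(&self) -> usize;
    fn len(&self) -> usize;
}

/// A lexeme type that carries extra data (here, a nesting depth) alongside
/// the fields a parser needs.
pub struct DepthLexeme {
    pub tok_id: u32,
    pub start: usize,
    pub len: usize,
    pub depth: u8,
}

impl SketchLexeme for DepthLexeme {
    fn tok_id(&self) -> u32 {
        self.tok_id
    }
    fn start(&self) -> usize {
        self.start
    }
    fn len(&self) -> usize {
        self.len
    }
}

/// Code written against the trait works with any lexeme layout.
pub fn end_offset<L: SketchLexeme>(lexeme: &L) -> usize {
    lexeme.start() + lexeme.len()
}
```

Here `end_offset` never sees the `depth` field, yet user code that knows the concrete type can still read it.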
Several of the functions / structs surrounding the compile-time construction of grammars have changed: more details are given below, but in most cases a `build.rs` that looks as follows:
use cfgrammar::yacc::YaccKind;
use lrlex::LexerBuilder;
use lrpar::CTParserBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let lex_rule_ids_map = CTParserBuilder::<u8>::new_with_storaget()
        .yacckind(YaccKind::Grmtools)
        .process_file_in_src("calc.y")?;
    LexerBuilder::new()
        .rule_ids_map(lex_rule_ids_map)
        .process_file_in_src("calc.l")?;
    Ok(())
}
can be changed to:
use cfgrammar::yacc::YaccKind;
use lrlex::CTLexerBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    CTLexerBuilder::new()
        .lrpar_config(|ctp| {
            ctp.yacckind(YaccKind::Grmtools)
                .grammar_in_src_dir("calc.y")
                .unwrap()
        })
        .lexer_in_src_dir("calc.l")?
        .build()?;
    Ok(())
}
In more detail:
- `lrlex`'s `LexerBuilder` has been renamed to `CTLexerBuilder` for symmetry with `lrpar`.
- `CTLexerBuilder` now provides the `lrpar_config` convenience function which removes some of the grottiness involved in tying together an `lrlex` lexer and `lrpar` parser. `lrpar_config` is passed a `CTParserBuilder` instance to which normal `lrpar` compile-time options can be applied.
- `CTParserBuilder::process_file_in_src` is deprecated in favour of `CTParserBuilder::grammar_in_src_dir` and `CTParserBuilder::build`. The latter method consumes the `CTParserBuilder`, returning a `CTParser` which exposes a `token_map` method whose result can be passed to lexers such as `lrlex`.
  The less commonly used `process_file` function is similarly deprecated in favour of the `grammar_path`, `output_path`, and `build` functions.
- `LexerBuilder::process_file_in_src` is deprecated in favour of `LexerBuilder::lexer_in_src_dir` and `LexerBuilder::build`.
  The less commonly used `process_file` function is similarly deprecated in favour of the `lexer_path`, `output_path`, and `build` functions.
- `CTLexerBuilder`'s and `CTParserBuilder`'s `build` methods both consume the builder, producing a `CTLexer` and `CTParser` respectively, which can be queried for additional information.
- The unstable `CTParserBuilder::conflicts` method has moved to `CTParser`. This interface remains unstable and may change without notice.
- Yacc grammars now support the `%parse-param <var>: <type>` declaration. The variable `<var>` is then visible in all action code. Note that `<type>` must implement the `Copy` trait. The generated `parse` function then takes two parameters `(lexer: &..., <var>: <type>)`.
- `lrlex` now exposes a `ct_token_map` function which creates a module with a parser's token IDs, and allows users to call `LRNonStreamingLexer::new` directly. This makes creating simple hand-written lexers much easier (see the new `calc_manual_lex` example to see this in action).
- Optimise `NonStreamingLexer::span_lines_str` from O(n) to O(log n).
- Deprecate `GrammarAST::add_programs` in favour of `GrammarAST::set_programs`.
- Add support for Yacc's `%expect` and Bison's `%expect-rr` declarations. These allow grammar authors to specify how many shift/reduce and reduce/reduce conflicts they expect in their grammar, and to error if either quantity is different, which is a more fine-grained check than the `error_on_conflicts` boolean.
- Generate code with "correct" camel case names to avoid Clippy warnings in code that uses grmtools' output.
- A number of previously public items (functions, structs, struct attributes) have been made private. Most of these were either not externally visible, or only visible if accessed in undocumented ways, but some were incorrectly public.
- Optimise `NonStreamingLexer::line_col` from O(n) to O(log n) (where n is the number of lines in the file).
- Document more clearly the constraints on what `Span`s one can safely ask a lexer to operate on. Note that it is always safe to use a `Span` generated directly by a lexer: the constraints relate to what happens if a user derives a `Span` themselves.
- Suppress Clippy warnings about `unnecessary_wraps`, including in code generated from grammar files.
- Export the `Visibility` enum from `lrlex`.
- Ensure that lrpar rebuilds a grammar if its `visibility` is changed.
- Fix a handful of Clippy warnings.
- The `MF` and `Panic` recoverers (deprecated, and undocumented, since 0.4.3) have been removed. Please change to `RecoveryKind::CPCTPlus` (or, if you don't want error recovery, `RecoveryKind::None`).
- The stategraph is no longer stored in the generated grammar, leading to useful savings in the generated binary size.
- The modules generated for compile-time parsing by lrlex and lrpar have private visibility by default. Changing this previously required a manual alias. The `visibility` function in lrlex and lrpar's compile-time builders allows a different visibility to be set (e.g. `visibility(Visibility::Public)`). Rust has a number of visibility settings and the `Visibility` enums in lrlex and lrpar reflect this.
- `lrlex` now uses a `LexerDef` trait which all lexer definitions must `impl`. This means that if you want to call methods on a concrete lexer definition, you will almost certainly need to import `lrlex::LexerDef`. This opens the possibility that lrlex can seamlessly produce lexers other than `LRNonStreamingLexerDef`s in the future.
- `lrlex::NonStreamingLexerDef` has been renamed to `lrlex::LRNonStreamingLexerDef`; use of the former is deprecated.
- The `lrlex::build_lex` function has been deprecated in favour of `LRNonStreamingLexerDef::from_str`.
- The statetable and other elements were previously included in the user binary with `include_bytes!`, but this could cause problems with relative path names. We now include the statetable and other elements in generated source code to avoid this issue.
- The `Lexer` trait has been broken into two: `Lexer` and `NonStreamingLexer`. The former trait is now only capable of producing `Lexeme`s: the latter is capable of producing substrings of the input and calculating line/column information. This split allows the flexibility to introduce streaming lexers in the future (which will not be able to produce substrings of the input in the same way as a `NonStreamingLexer`).
  Most users will need to replace references to the `Lexer` trait in their code with `NonStreamingLexer`.
- `NonStreamingLexer` takes a lifetime `'input` which allows the input to last longer than the `NonStreamingLexer` itself. `Lexer::span_str` and `Lexer::span_lines_str` had the following definitions:
fn span_str(&self, span: Span) -> &str;
fn span_lines_str(&self, span: Span) -> &str;
As part of `NonStreamingLexer` their definitions are now:
fn span_str(&self, span: Span) -> &'input str;
fn span_lines_str(&self, span: Span) -> &'input str;
This change allows users to throw away the `Lexer` but still keep around structures (e.g. ASTs) which reference the user's input.
rustc infers the `'input` lifetime in some situations but not others, so if you get an error:
error[E0106]: missing lifetime specifier
then it is likely that you need to change a type from `NonStreamingLexer` to `NonStreamingLexer<'input>`.
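The effect of returning `&'input str` rather than `&str` can be sketched with a stripped-down lexer. `SketchLexer` and `first_word` are hypothetical names used for illustration, and offsets are passed as plain `usize` pairs rather than a `Span`; the sketch only shows that a string tied to `'input` can outlive the lexer that produced it.

```rust
/// Hypothetical minimal lexer: it borrows the input for `'input`, and the
/// strings it hands out carry that lifetime rather than the lifetime of
/// `&self`.
pub struct SketchLexer<'input> {
    input: &'input str,
}

impl<'input> SketchLexer<'input> {
    pub fn new(input: &'input str) -> Self {
        SketchLexer { input }
    }

    /// `start` and `end` are byte offsets into the input.
    pub fn span_str(&self, start: usize, end: usize) -> &'input str {
        &self.input[start..end]
    }
}

/// The returned slice outlives the lexer, which is dropped when this
/// function returns.
pub fn first_word<'input>(input: &'input str) -> &'input str {
    let lexer = SketchLexer::new(input);
    let end = input.find(' ').unwrap_or(input.len());
    lexer.span_str(0, end)
}
```

If `span_str` instead returned a `&str` tied to `&self`, `first_word` would not compile, because the result would borrow from the local `lexer`.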
-
Fix two Clippy warnings and suppress two others.
-
Prefer "unmatched" rather than "unknown" when using the "turn lexing errors into parsing errors" trick.
- Deprecate `Lexeme::len`, `Lexeme::start`, and `Lexeme::end`. Each is now replaced by `Lexeme::span().len()` etc. An appropriate warning is generated if the deprecated methods are used.
- Avoid use of the unit return type in action code causing Clippy warnings.
- Document the "turn lexing errors into parsing errors" technique and extend `lrpar/examples/calc_ast` to use it.
- Introduce the concept of a `Span` which records what portion of the user's input something (e.g. a lexeme or production) references. Users can turn a `Span` into a string through the `Lexer::span_str` function. This has several API changes:
  - `lrpar` now exports a `Span` type.
  - `Lexeme`s now have a `fn span(&self) -> Span` function which returns the `Lexeme`'s `Span`.
  - `Lexer::span_str` replaces the `Lexer::lexeme_str` function. Roughly speaking this:
    let s = lexer.lexeme_str(&lexeme);
    becomes:
    let s = lexer.span_str(lexeme.span());
  - `Lexer::line_col` now takes a `Span` rather than a `usize` and, since a `Span` can be over multiple lines, returns `((start line, start column), (end line, end column))`.
  - `Lexer::surrounding_line_str` is removed in favour of `span_lines_str` which takes a `Span` and returns a (possibly multi-line) `&str` of the lines containing that `Span`.
  - The `$span` special variable now returns a `Span` rather than `(usize, usize)`.
  In practice, this means that in many cases where you previously had to use `Lexeme<StorageT>`, you can now use `Span` instead. This has two advantages. First, it simplifies your code. Second, it enables better error reporting, as you can now point the user to a span of text, rather than a single point. See the (new) AST evaluator section of the grmtools book for an example of how code using `Span` looks.
- The `$span` special variable is now enabled at all times and no longer needs to be turned on with `CTBuilder::span_var`. This function has thus been removed.
-
If called as a binary, lrlex now exits with a return code of 1 if it could not lex input. This matches the behaviour of lrpar.
- Module names in generated code can now be optionally configured with `mod_name`. The names default to the same naming scheme as before.
- Fully qualify more names in generated code.
- `lrlex_mod` and `lrpar_mod` now take strings that match the paths of `process_file_in_src`. In other words what was:
...
.process_file_in_src("a/b/grm.y");
...
lrpar_mod!(grm_y);
is now:
...
.process_file_in_src("a/b/grm.y");
...
lrpar_mod!("a/b/grm.y");
and similarly for `lrlex_mod`. This is hopefully easier to remember and also allows projects to have multiple grammar files with the same name.
The
Lexer
API no longer requires mutability. What was:trait Lexer { fn next(&mut self) -> Option<Result<Lexeme<StorageT>, LexError>>; fn all_lexemes(&mut self) -> Result<Vec<Lexeme<StorageT>>, LexError> { ... } ... }
has now been replaced by an iterator over lexemes:
trait Lexer { fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = Result<Lexeme<StorageT>, LexError>> + 'a>; ... }
This enables more ergonomic use of the new zero-copy feature, but does require changing structs which implement this trait.
lrlex
has been adjusted appropriately.In practise, the only impact that most users will notice is that the following idiom:
let (res, errs) = grm_y::parse(&mut lexer);
will produce a warning that the
mut
part of&mut
is no longer needed.
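The shape of an iterator-based lexer that works through `&self` can be sketched with standard-library types only. Everything here is hypothetical: `Lex`, `IterLexer`, and `WordLexer` are simplified stand-ins, and the error type is a plain `String` rather than `LexError`.

```rust
/// Hypothetical, simplified lexeme: just a start offset and a length in bytes.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Lex {
    pub start: usize,
    pub len: usize,
}

/// Hypothetical trait mirroring the shape described above: lexemes are
/// produced through `&self`, so no mutable borrow is required.
pub trait IterLexer {
    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = Result<Lex, String>> + 'a>;
}

/// A lexer that splits its input into whitespace-separated words up front.
pub struct WordLexer {
    lexemes: Vec<Lex>,
}

impl WordLexer {
    pub fn new(input: &str) -> Self {
        let mut lexemes = Vec::new();
        let mut start = None;
        for (i, c) in input.char_indices() {
            match (c.is_whitespace(), start) {
                // A word begins at the first non-whitespace character.
                (false, None) => start = Some(i),
                // A word ends at the first whitespace character after it.
                (true, Some(s)) => {
                    lexemes.push(Lex { start: s, len: i - s });
                    start = None;
                }
                _ => {}
            }
        }
        // Flush a word that runs to the end of the input.
        if let Some(s) = start {
            lexemes.push(Lex { start: s, len: input.len() - s });
        }
        WordLexer { lexemes }
    }
}

impl IterLexer for WordLexer {
    fn iter<'a>(&'a self) -> Box<dyn Iterator<Item = Result<Lex, String>> + 'a> {
        Box::new(self.lexemes.iter().map(|l| Ok::<Lex, String>(*l)))
    }
}
```

Since `iter` takes `&self`, the same lexer can be iterated more than once and shared immutably with code that also holds references into the input.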
- Add support for zero-copying user input when parsing. A special lifetime `'input` is now available in action code and allows users to extract parts of the input without calling `to_owned()` (or equivalent). For example:
Name -> &'input str: 'ID' { $lexer.lexeme_str(&$1.map_err(|_| ())?) } ;
See `lrpar/examples/calc_ast/src/calc.y` for a more detailed example.
- Generated code now uses fully qualified names so that name clashes between user action code and that created by grmtools are less likely.
- Action types can now be fully qualified. In other words this:
R -> A::B: ... ;
means that the rule `R` now has an action type `A::B`.
-
Deprecate the MF recoverer: CPCT+ is now the default and MF is now undocumented. For most people, CPCT+ is good enough, and it's quite a bit easier to understand. In the longer term, MF will probably disappear entirely.
-
License as dual Apache-2.0/MIT (instead of a more complex, and little understood, triple license of Apache-2.0/MIT/UPL-1.0).
- Action code uses `$` as a way of denoting special variables. For example, the pseudo-variable `$2` is replaced with a "real" Rust variable by grmtools. However, this means that `$2` cannot appear in, say, a string without being replaced. This release uses `$$` as an escaping mechanism, so that one can write code such as `"$$1"` in action code; this is rewritten to `"$1"` by grmtools.
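The escaping rule can be illustrated with a small rewriter. This is a sketch under stated assumptions, not grmtools' implementation: `rewrite_action` is a hypothetical function, and the replacement name `arg_N` is invented for the example (the real generated variable names are not specified here).

```rust
/// Hypothetical rewriter (not grmtools' implementation): `$$` becomes a
/// literal `$`, and `$N` (N a run of ASCII digits) becomes `arg_N`.
pub fn rewrite_action(src: &str) -> String {
    let mut out = String::new();
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '$' {
            out.push(c);
            continue;
        }
        match chars.peek().copied() {
            // `$$` is the escape sequence for a single `$`.
            Some('$') => {
                chars.next();
                out.push('$');
            }
            // `$` followed by digits names a pseudo-variable.
            Some(d) if d.is_ascii_digit() => {
                out.push_str("arg_");
                while let Some(d) = chars.peek().copied().filter(|d| d.is_ascii_digit()) {
                    out.push(d);
                    chars.next();
                }
            }
            // Any other `$` is copied through unchanged.
            _ => out.push('$'),
        }
    }
    out
}
```

With this sketch, the action text `"$$1"` is rewritten to `"$1"`, while a bare `$1` outside an escape is replaced by the invented variable name.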
- Newer versions of rustc produce "deprecated" warnings when trait objects are used without the `dyn` keyword. This previously caused a large number of warnings in generated grammar code from `lrpar`. This release ensures that generated grammar code uses the `dyn` keyword when needed, removing such warnings.
- `Lexeme::empty()` has been renamed to `Lexeme::inserted()`. Although rare, there are grammars with empty lexemes that aren't the result of error recovery (e.g. DEDENT tokens in Python). The previous name was misleading in such cases.
- Lexeme insertion is no longer explicitly encoded in the API for a lexeme's end/length. Previously these functions returned `None` if a lexeme had been inserted by error recovery. This has proven to be more effort than it's worth, with variants on the idiom `lexeme.end().unwrap_or_else(|| lexeme.start())` used extensively. These definitions have thus been simplified, changing from:
pub fn end(&self) -> Option<usize>
pub fn len(&self) -> Option<usize>
to:
pub fn end(&self) -> usize
pub fn len(&self) -> usize
- A new pseudo-variable `$span` can be enabled within parser actions if `CTBuilder::span_var(true)` is called. This pseudo-variable has the type `(usize, usize)` where these represent (start, end) offsets in the input and allows users to determine how much input a rule has matched.
- Some dynamic assertions about the correct use of types have been converted to static assertions. In the unlikely event that you try to run grmtools on a platform with unexpected type sizes (which, in practice, probably only means 16 bit machines), this will lead to the problems being determined at compile-time rather than run-time.
- Document lrpar more thoroughly, in particular hiding the inner modules, whose location might change one day in the future: all useful structs (etc.) are explicitly exposed at the module level.
- Have the `process_file` functions in both `LexerBuilder` and `CTParserBuilder` place output into a named file (whereas previously `CTParserBuilder` expected a directory name).
- Rename `offset_line_col` to `line_col` and have the latter return character offsets (whereas before it returned byte offsets). This makes the resulting numbers reported to humans much less confusing when multi-byte UTF-8 characters are used.
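The difference between byte offsets and character offsets can be seen in a short sketch. The function below is hypothetical and is not the lrpar `line_col` API: it assumes 1-based line and column numbers and takes a byte offset as input.

```rust
/// Hypothetical helper (not the lrpar API): convert a byte offset into a
/// 1-based (line, column) pair, counting the column in characters.
///
/// Panics if `byte_off` is not on a character boundary.
pub fn line_col(src: &str, byte_off: usize) -> (usize, usize) {
    let before = &src[..byte_off];
    // Lines are numbered from 1: count the newlines that precede the offset.
    let line = before.matches('\n').count() + 1;
    // The current line starts just after the last preceding newline.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    // Count characters, not bytes, between the line start and the offset.
    let col = before[line_start..].chars().count() + 1;
    (line, col)
}
```

For the input `é = 1`, the `=` sits at byte offset 3 because `é` occupies two bytes in UTF-8, but a human reader sees it in column 3, which is what the character count reports.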
- Add `surrounding_line_str` helper function to lexers. This is helpful when printing out error messages to users.
- Add a comment with rule names when generating grammars at compile-time. Thus if user action code contains an error, it's much easier to relate this to the appropriate point in the `.y` file.
- Documentation fixes.
- Previously users had to specify the `YaccKind` of a grammar and then the `ActionKind` of actions. This is unnecessarily fiddly, so remove `ActionKind` entirely and instead flesh out `YaccKind` to deal with the possible variants. For example `ActionKind::CustomAction` is now, in essence, `YaccKind::Original(YaccOriginalActionKind::UserAction)`. This is a breaking change but one that will make future evolution much easier.
- The `%type` directive in grammars exposed by `YaccKind::Original(YaccOriginalActionKind::UserAction)` has been renamed to `%actiontype` to make it clear what type is being referred to. In general, most people will want to move to the `YaccKind::Grmtools` variant (see below), which doesn't require the `%actiontype` directive.
-
grmtools has moved to the 2018 edition of Rust and thus needs rustc-1.31 or later to compile.
- Add `YaccKind::Grmtools` variant, allowing grammar rules to have different action types. For most practical use cases, this is much better than using `%actiontype`.
- Add `%avoid_insert` directive to bias ranking of repair sequences and make it more likely that parsing can continue.
- Add `-q` switch to `nimbleparse` to suppress printing out the stategraph and conflicts (some grammars have conflicts by design, so being continually reminded of it isn't helpful).
- Fix problem where errors which lead to vast numbers (as in hundreds of thousands) of repair sequences being found could take minutes to sort and rank.
- Add `YaccKind::Original(YaccOriginalActionKind::NoAction)` variant to generate a parser which simply tells the user where errors were found (i.e. no actions are executed, and not even a parse tree is created).
- `lrlex` no longer tries to create Rust-level identifiers for tokens whose names can't be valid Rust identifiers (which led to compile-time syntax errors in the generated Rust code).
- Fix bug where `%epp` strings containing quote marks caused a code-generation failure in compile-time mode.
First stable release.