Lexepars is a concise, lightweight, yet powerful monadic parser combinator library. Monadic means that you always get a conclusive parsing result. Combining simple parsers together allows you to build arbitrarily complex grammars (which are, in turn, parsers themselves). The library can compete with full-featured parser generation suites such as ANTLR while avoiding their wordiness and the overhead of dealing with a custom grammar specification language, multiple tools in the chain, and generated code "bestowed upon you" in your project.

What makes Lexepars so powerful is its separate tokenization and parsing phases. A perfect illustration: it is the only .NET parser combinator library capable of parsing off-side rule languages, where indents are part of the syntax, even ones as tricky as YAML. Check out the YAML Grammar!

Lexepars is written entirely in C# and is fully CLS-compliant, so you can create your grammars right in your code in any .NET language. And don't forget to set up the NuGet symbol server to make debugging a breeze :)

Lexepars is deemed the natural evolution of Patrick Lioi's Parsley, with an emphasis on memory and speed optimizations for very big input files, as well as client-code conciseness and flexibility of usage. It can parse context-sensitive, infinite-lookahead grammars, but it performs best on LL(1) grammars and languages.
First, install NuGet. Then, install the Lexepars package from the Package Manager Console:
```
PM> Install-Package Lexepars
```
It's just a single small assembly, with no strings attached :)
The lexer phase is performed with a prioritized list of regex patterns, and parser grammars are expressed in terms of the tokens produced by the lexer.
Input text being parsed is represented with an `InputText` or, better yet, a `LinedInputText` instance, which tracks the current parsing position:
```csharp
var text = new InputText("some input to parse");
```
The lexer phase is implemented by anything that produces an `IEnumerable<Token>`. The default implementation, `Lexer`, builds the series of tokens when given a prioritized series of `TokenKind` token recognizers.
The most common `TokenKind` implementation is `PatternTokenKind`, which recognizes tokens via regular expressions, but there are many more. `TokenKind`s can be marked as skippable if you want them to be recognized but discarded:
```csharp
var text = new InputText("1 2 3 a b c");
var lexer = new Lexer(new PatternTokenKind("letter", @"[a-z]"),
                      new PatternTokenKind("number", @"[0-9]+"),
                      new PatternTokenKind("whitespace", @"\s+", skippable: true));

Token[] tokens = lexer.Tokenize(text).ToArray();
```
Above, the `tokens` array will contain 6 `Token` objects. Each `Token` contains the lexeme ("1", "a", etc.), the `TokenKind` that matched it, and the `Position` where the lexeme was found.
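For instance, you can walk the recognized tokens and print them out. This is a minimal sketch; the `Kind`, `Lexeme`, and `Position` property names are assumptions inferred from the description above, not confirmed API:

```csharp
// Sketch: dump each recognized token. The property names Kind, Lexeme,
// and Position are assumed from the description above.
foreach (var token in tokens)
    Console.WriteLine($"{token.Kind} '{token.Lexeme}' at {token.Position}");
```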
The collection of `Token`s produced by the lexer phase is wrapped in a `TokenStream`, which allows the rest of the system to traverse the collection of tokens in an immutable fashion.
A parser consumes a `TokenStream` and produces a parsed value based on the lexemes of the parsed tokens:
```csharp
public interface IParser<out TValue> : IGeneralParser
{
    IReply<TValue> Parse(TokenStream tokens);
}
```
When the parsed value is not needed, e.g. for grammar validation purposes, `IGeneralParser` comes into play as a valuable means of optimization:
```csharp
public interface IGeneralParser
{
    IGeneralReply ParseGenerally(TokenStream tokens);
}
```
`IReply<out TValue>` and `IGeneralReply` describe whether or not the parser succeeded, the parsed value (on success), a possibly-empty error message list, and a reference to the `TokenStream` representing the remaining unparsed tokens.
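Driving a parser by hand might then look like the sketch below. The hypothetical `numberParser` rule, the `TokenStream` constructor call, and the `Success`, `Value`, and `ErrorMessages` member names are all assumptions inferred from the descriptions above, not confirmed API:

```csharp
// Sketch: wrap the lexed tokens and run a parser over them.
// numberParser, TokenStream(...), reply.Success, reply.Value and
// reply.ErrorMessages are assumed names based on the prose above.
var stream = new TokenStream(lexer.Tokenize(text));
IReply<string> reply = numberParser.Parse(stream);

if (reply.Success)
    Console.WriteLine($"Parsed: {reply.Value}");
else
    Console.WriteLine($"Failed with: {reply.ErrorMessages}");
```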
Grammars should inherit from `Grammar` to take advantage of the numerous parser primitives and extensions that serve as the building blocks for the rules. Grammars should define each grammar rule in terms of these primitives, ultimately exposing the entry rule as some `Parser<TValue>`. Grammar rule bodies may consist of LINQ queries, which allow you to glue together other grammar rules in sequence, as in the sketch below.
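For instance, a rule that parses a key-value pair might look like this. The `Name`, `Colon`, and `Value` rules are hypothetical placeholders standing in for rules a real grammar would define:

```csharp
// Sketch: a LINQ query glues three hypothetical rules together in
// sequence (a name, then a colon, then a value) and projects the
// results into a single parsed value.
static readonly IParser<KeyValuePair<string, object>> Pair =
    from name in Name
    from colon in Colon
    from value in Value
    select new KeyValuePair<string, object>(name, value);
```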
See the integration tests for a sample JSON grammar.
Finally, we can put all these pieces together to parse some text:
```csharp
var input = "{\"zero\" : 0, \"one\" : 1, \"two\" : 2}";
var lexer = new JsonLexer();
var tokens = lexer.Tokenize(input);
var jobject = (JObject)JsonGrammar.Json.Parse(tokens).Value;
```
Because the library evolves quickly, standalone documentation would be hard to keep up to date. Therefore, much attention is paid to keeping the code either self-explanatory or furnished with sufficient documentation comments, which have the added advantage of being available in the compiled assembly. Here are the main points of interest to go through:
- Core functionality:
- Samples and instructive code
- Core unit tests
- JSON Grammar (also available as a NuGet package) with unit tests
- YAML Grammar (also available as a NuGet package) with unit tests
- Unit test fixtures (I bet you'll need tons of tests for your lexers and grammars!)
- Symbols
Enjoy, and happy coding!