Inspired by only a partial understanding of PEG, I decided to try to implement one myself. The result may not be the best among general-purpose PEG parsers - there are already a lot of them (for example, TatSu is written in Python and generates Python code) - but this is a good way to understand PEG. In the future, I want to use it to replace the current parser implementation in CPython.
- PEG parsers
- PEG parser implementation
- PEG parser generation
- PEG parser visualization
- Left-recursive PEG grammars
- Adding actions to a PEG grammar
- Implementing the rest of the PEG features
- PEG at the Core Developer Sprint
In this section, I lay the groundwork for understanding how the parser works, using a simple handwritten implementation of the toy grammar from the previous article.
(By the way, as an experiment, I do not place links in my text. If you don't understand something, you can just google it. :-)
Typically, a PEG parser uses recursive descent with unlimited backtracking. Here is the toy grammar from the previous article:
```
statement: assignment | expr | if_statement
expr: expr '+' term | expr '-' term | term
term: term '*' atom | term '/' atom | atom
atom: NAME | NUMBER | '(' expr ')'
assignment: target '=' expr
target: NAME
if_statement: 'if' expr ':' statement
```
A super-abstract recursive descent parser for this language would define a function for each grammar rule, handling that rule's alternatives. For example, for `statement`, we would have this function:
```python
def statement():
    if assignment():
        return True
    if expr():
        return True
    if if_statement():
        return True
    return False
```
Of course, this is an oversimplified example: it omits important details, such as what this function receives as input and what the result of calling it is.
Let's start with the arguments. Classic parsers use a separate tokenizer, which splits the input (a text file or a string) into a series of tokens such as keywords, identifiers (names), numbers, and operators. PEG parsers (like other modern parsers such as ANTLR) often combine tokenization and parsing, but for my project I decided to keep a separate tokenizer.
Tokenizing Python is quite complicated, so I don't want to reimplement it with PEG rules. For example, you have to track indentation (which requires a stack inside the tokenizer); the handling of newlines in Python is also interesting (they are significant, except inside matching brackets). The many kinds of string literals add complexity too. In short, I have no complaints about the existing Python tokenizer, so I want to keep it as it is. By the way, CPython has two tokenizers: the internal one used by the parser, written in C, and the one in the standard library, reimplemented in pure Python to match it exactly. The latter will come in handy in my project.
A classic tokenizer usually has a simple interface consisting of a single function, `get_token()`, which returns the next token in the input sequence by analyzing the next group of characters. CPython's `tokenize` module is no exception: its core API is a generator that yields one token at a time. Each token is an object of type `TokenInfo`, which has several fields, the most important of which are the token's type (for example, `NAME`, `NUMBER`, `STRING`) and its string value, i.e. the characters it consists of (for example, `abc`, `42` or `"Hello, world"`). There are also additional fields, for example the token's position in the input stream, which is useful in error messages.
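To make this concrete, here is a minimal sketch of driving the standard-library token generator and printing a few `TokenInfo` fields (the sample source string is arbitrary):

```python
import io
import tokenize
from token import tok_name

# Feed a small source string to the standard-library tokenizer and
# print the type, string value and starting position of each token.
source = "answer = 42\n"
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tok_name[tok.type], repr(tok.string), tok.start)
# The last token printed is ENDMARKER, discussed below.
```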
A special token type is `ENDMARKER`, which indicates that the end of the input has been reached. The generator will crash if you ignore it and try to fetch the next token.
But I digress. How do we implement unlimited backtracking? Backtracking requires being able to remember a position in the source and to re-parse from that point. The tokenizer API does not let us move its pointer, but we can capture the token stream into an array and replay it from there, which is what we will do. (You could also do this with `itertools.tee()`, but that is probably less efficient in our case, judging by the warnings in the documentation.)
I suppose you could just tokenize the entire input into a list first and then use that as the parser's input. But if there is an invalid token near the end of the file (for example, a string with a missing closing quote) and the file also contains a syntax error, you would first get the error message from the tokenizer. I believe this makes for a poor user experience, since the syntax error may well be the root cause of the invalid string. So I have slightly different requirements for the tokenizer: in particular, it should behave as a lazy list.
The core API is very simple. The `Tokenizer` object encapsulates an array of tokens and a position in that array. It has three main methods:

- `get_token()` returns the next token, advancing the position (or reading another token from the source if we are at the end of the buffer);
- `mark()` returns the current position in the buffer;
- `reset(pos)` sets the position in the buffer (the argument must be a value obtained from `mark()`).
We add one helper method, `peek_token()`, which returns the next token without advancing the position in the buffer.
This is what the core of the `Tokenizer` class looks like:
```python
class Tokenizer:
    def __init__(self, tokengen):
        """Call with tokenize.generate_tokens(...)."""
        self.tokengen = tokengen
        self.tokens = []
        self.pos = 0

    def mark(self):
        return self.pos

    def reset(self, pos):
        self.pos = pos

    def get_token(self):
        token = self.peek_token()
        self.pos += 1
        return token

    def peek_token(self):
        if self.pos == len(self.tokens):
            self.tokens.append(next(self.tokengen))
        return self.tokens[self.pos]
```
Some things are omitted here for simplicity (for example, the names of the methods and instance variables really ought to begin with an underscore), but this is just a prototype of the `Tokenizer` API.
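As a quick sanity check, here is a small usage sketch of this prototype, showing how `mark()` and `reset()` give us backtracking over the buffered tokens:

```python
import io
import tokenize

# Wrap the standard-library token generator in our Tokenizer and
# demonstrate saving a position, consuming tokens, and rolling back.
tok = Tokenizer(tokenize.generate_tokens(io.StringIO("a + b\n").readline))

pos = tok.mark()
print(tok.get_token().string)  # 'a'
print(tok.get_token().string)  # '+'
tok.reset(pos)                 # roll back to the saved position
print(tok.get_token().string)  # 'a' again
```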
The parser must also become a class, so that `statement()`, `expr()`, and so on can be implemented as methods. The tokenizer becomes an instance variable, but I don't want the parser methods to call `get_token()` directly. Instead, we give the `Parser` class an `expect()` method, which can succeed or fail just like a parsing method. The argument to `expect()` is the expected token: either a string (for example, `+`) or a token type (for example, `NAME`). The type of the return value is not important yet; I will come back to it after discussing what the parser produces.
Suppose the grammar rule functions returned only `True` or `False`. That is fine for theoretical computer science (where the parser answers the question "Is this a valid string in the language?"), but not for us. Our task is to build an AST. So let's rewrite this code so that each parsing method returns a `Node` object on success or `None` on failure.
The `Node` class can be very simple:
```python
class Node:
    def __init__(self, type, children):
        self.type = type
        self.children = children
```
Here, `type` indicates the type of the AST node (for example, `add` or `if`), and `children` is a list of nodes and tokens (instances of `TokenInfo`). This is enough for a compiler to generate code or to perform other analyses, such as linting or static type checking, although in the future I would like to change the way the AST is represented.
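To illustrate the intended shape of such trees, here is a small hand-built sketch (with made-up token coordinates) of the `add` node for `2 + 3`, using the `Node` class above with raw tokens as leaves:

```python
from tokenize import TokenInfo
from token import NUMBER

# Hand-build the tree the parser would produce for "2 + 3":
# an interior Node with a type string and two TokenInfo leaves.
line = "2 + 3\n"
two = TokenInfo(NUMBER, "2", (1, 0), (1, 1), line)
three = TokenInfo(NUMBER, "3", (1, 4), (1, 5), line)
tree = Node("add", [two, three])
print(tree.type, [child.string for child in tree.children])  # add ['2', '3']
```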
To fit into this scheme, the `expect()` method must return a `TokenInfo` object on success and `None` on failure. To be able to backtrack to earlier tokens, I wrap the tokenizer's `mark()` and `reset()` methods (the API does not change). Here is what we get:
```python
class Parser:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def mark(self):
        return self.tokenizer.mark()

    def reset(self, pos):
        self.tokenizer.reset(pos)

    def expect(self, arg):
        token = self.tokenizer.peek_token()
        if token.type == arg or token.string == arg:
            return self.tokenizer.get_token()
        return None
```
Once again: I omitted some details, but this is already working code.
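Here is a quick illustrative sketch of how `expect()` matches either by token type or by exact string, using the classes defined above:

```python
import io
import tokenize
from token import NAME

# expect() consumes a token if it matches by type or by string,
# and returns None (without consuming) otherwise.
gen = tokenize.generate_tokens(io.StringIO("x = 1\n").readline)
p = Parser(Tokenizer(gen))
print(p.expect(NAME).string)  # 'x'  (matched by type)
print(p.expect("=").string)   # '='  (matched by string)
print(p.expect(NAME))         # None (next token is a NUMBER)
```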
Now I need to introduce an important requirement for the parsing methods: each one must either return a `Node`, leaving the tokenizer positioned after the last token of the grammar rule it recognized, or return `None`, leaving the tokenizer position unchanged. If a parsing method reads several tokens and then fails, it must restore the tokenizer's position. That is what `mark()` and `reset()` are for. Note that `expect()` also obeys this rule.
So, here is a sketch of the actual parser. Here I use the walrus operator from Python 3.8 (`:=`):
```python
class ToyParser(Parser):
    def statement(self):
        if a := self.assignment():
            return a
        if e := self.expr():
            return e
        if i := self.if_statement():
            return i
        return None

    def expr(self):
        if t := self.term():
            pos = self.mark()
            if op := self.expect("+"):
                if e := self.expr():
                    return Node("add", [t, e])
            self.reset(pos)
            if op := self.expect("-"):
                if e := self.expr():
                    return Node("sub", [t, e])
            self.reset(pos)
            return t
        return None

    def term(self):
        # Very similar...
        ...

    def atom(self):
        if token := self.expect(NAME):
            return token
        if token := self.expect(NUMBER):
            return token
        pos = self.mark()
        if self.expect("("):
            if e := self.expr():
                if self.expect(")"):
                    return e
        self.reset(pos)
        return None
```
I omitted the implementation of some methods to give the reader a chance to practice; this really is better than just reading about how such a parser is implemented. In the end, we will generate such code automatically from the grammar. Constants such as `NAME` and `NUMBER` are imported from the `token` module of the standard library. This ties us further to the current implementation of the Python tokenizer; if we want to create a generalized PEG parser, we should look for ways to avoid that.
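For completeness, here is a hypothetical driver that wires the tokenizer and the toy parser together, assuming the omitted methods (`term()`, `assignment()`, `if_statement()`) have been filled in as suggested:

```python
import io
import tokenize

def parse_string(source):
    # Build a Tokenizer over the standard-library token generator
    # and hand it to the toy parser.
    gen = tokenize.generate_tokens(io.StringIO(source).readline)
    return ToyParser(Tokenizer(gen)).statement()

tree = parse_string("1 + 2\n")
# With the sketch above this should yield Node("add", [...]) for the
# expression, or None if the input cannot be parsed.
print(tree and tree.type)
```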
Also note that I cheated a little. The `expr` rule is supposed to be left-recursive, but I wrote the parser right-recursively, because a recursive descent parser cannot handle left-recursive grammar rules. There is a fix for that, but it is still the subject of some research, and I would like to cover it separately. Just keep in mind that this implementation does not correspond 100% to our simplified grammar.
The key things that I want you to understand so far:
- Grammar rules correspond to parser methods, and when a grammar rule refers to another rule, the method calls that rule's method.
- When a sequence of tokens can be interpreted in more than one way, the corresponding parsing methods are tried one after another.
- When a grammar rule refers to a token, the method calls `expect()`.
- If a parsing method successfully recognizes its grammar rule at the current position, it returns the corresponding AST node; if it cannot recognize its grammar rule, it returns `None`.
- A parsing method must explicitly reset the tokenizer position when it abandons parsing after consuming one or more tokens (directly, or indirectly by calling another parsing method that succeeded). This applies not only when one alternative is rejected in order to try the next, but also when the parse is abandoned as a whole.
If all parsing methods obey these rules, then there is no need to wrap each of them in `mark()` and `reset()` calls. This can be proved by induction.
Incidentally, it is tempting to try to get rid of the explicit calls to `mark()` and `reset()` by using a context manager and the `with` statement, but that won't work as is: `reset()` must not be called on success! As a further fix, you could use exceptions for control flow, so that the context manager knows whether the tokenizer should be reset (I think TatSu does something similar). For example, something like this:
```python
    def statement(self):
        with self.alt():
            return self.assignment()
        with self.alt():
            return self.expr()
        with self.alt():
            return self.if_statement()
        raise ParsingFailure
```
In particular, the little ladder of `if` statements in `atom()` that recognizes a parenthesized expression could be written as:
```python
        with self.alt():
            self.expect("(")
            e = self.expr()
            self.expect(")")
            return e
```
But this seems too "magical" to me: when reading this code, you have to remember that every parsing method (including `expect()`) may throw an exception, and that this exception is caught and ignored by the context manager in the `with` statement. That is rather unusual, although feasible (by returning `True` from `__exit__`). Moreover, my ultimate goal is to generate code in C, not Python, and C has no `with` statement to alter the control flow.
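For the curious, here is one possible sketch of such an `alt()` helper (a guess, not TatSu's implementation): a context manager that saves the tokenizer position and, on failure, resets it and swallows the exception, which amounts to returning `True` from `__exit__`:

```python
from contextlib import contextmanager

class ParsingFailure(Exception):
    """Raised by a parsing method (or expect()) when it fails."""

class AltParser(Parser):
    @contextmanager
    def alt(self):
        # Save the tokenizer position before trying an alternative.
        pos = self.mark()
        try:
            yield
        except ParsingFailure:
            # The alternative failed: roll back and suppress the
            # exception so control falls through to the next block.
            self.reset(pos)
```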
In any case, here are a few topics for the following parts:
- generating parser methods from the grammar;
- packrat parsing (memoization);
- EBNF features such as `(x | y)`, `[x y ...]`, `x*`, `x+`;
- tracing (for debugging the parser or the grammar);
- PEG features such as lookahead and cut;
- how to handle left-recursive rules;
- generating C code.
License for this article and cited code: CC BY-NC-SA 4.0