Some years ago, someone asked me whether it would make sense to switch Python to a PEG parser (or to a PEG grammar; I don't recall exactly who asked, or when). I looked into it a bit at the time but didn't come to any conclusion, so I dropped the topic. Recently I've learned more about PEG (Parsing Expression Grammars), and I now think it's an interesting alternative to the home-grown parser generator that I developed 30 years ago when I was just starting to work on Python. I called it "pgen", and it was probably the first piece of code I wrote for Python.
- PEG parsers
- PEG parser implementation
- PEG parser generation
- PEG parser visualization
- Left-recursive PEG grammars
- Adding actions to a PEG grammar
- A meta-grammar for PEG parsers
- Implementing the rest of the PEG features
- PEG at the Core Developer Sprint
The reason I'm currently interested in PEG parsers is that I'm somewhat annoyed by the limitations of pgen. It is built on my own implementation of LL(1) parsing, which makes a number of simplifying assumptions. For example, I didn't like grammar rules that could produce the empty string, so I banned them, which simplified the algorithm for building the parsing tables. I also invented my own EBNF-like grammar notation, which I still really like.
Here are some of the problems with pgen that annoy me. The "1" in the name LL(1) means that it looks at only a single token of lookahead, and this limits our ability to write good grammar rules. For example, a Python statement may be an expression or an assignment (or something else, but the other cases all start with a dedicated keyword, such as `if` or `def`). I would like to describe the syntax using pgen notation. Note that this is just a simplified example, a subset of Python, as is customary when discussing language design. It is a toy grammar that will come in handy for further demonstrations.
```
statement: assignment | expr | if_statement
expr: expr '+' term | expr '-' term | term
term: term '*' atom | term '/' atom | atom
atom: NAME | NUMBER | '(' expr ')'
assignment: target '=' expr
target: NAME
if_statement: 'if' expr ':' statement
```
A few words about notation: `NAME` and `NUMBER` are tokens and are predefined outside the grammar. Quoted strings such as `'+'` or `'if'` are also tokens. (I'll save a longer discussion of tokens for another time.) Grammar rules begin with the rule's name, followed by `:`, and then one or more alternatives separated by `|`.
The problem is that if you describe the grammar this way, pgen will not accept it. That's because some of the rules (`expr` and `term`) are left-recursive, and pgen is not smart enough to handle such rules correctly. This is usually solved by rewriting just those rules, leaving the others unchanged. For example:
```
expr: term ('+' term | '-' term)*
term: atom ('*' atom | '/' atom)*
```
This shows off a few of the EBNF features in pgen: you can enclose alternatives in parentheses and create repetitions by placing `*` after an element. So the rule for `expr` here means "a term, followed by zero or more repetitions of <plus or minus, followed by a term>". This grammar accepts the same language as the first version, but it no longer reflects the intent of the language designer. In particular, it does not show that the operators bind to the left, and that matters when you are trying to generate code.
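To see why left binding matters, here is a quick check using Python's own `ast` module (my illustration, not part of the original argument): the expression `1 - 2 - 3` must be grouped as `(1 - 2) - 3`, and the AST makes that left-leaning structure explicit.

```python
import ast

# 1 - 2 - 3 must evaluate to -4, i.e. (1 - 2) - 3, not 1 - (2 - 3) == 2.
tree = ast.parse("1 - 2 - 3", mode="eval")
print(ast.dump(tree.body))
# Roughly: BinOp(left=BinOp(left=Constant(1), op=Sub(), right=Constant(2)),
#                op=Sub(), right=Constant(3))
```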
But there is another annoying problem in this example (and in Python itself). Because of the single-token lookahead, the parser cannot tell whether it is looking at the beginning of an expression or of an assignment. At the very start of processing a statement, the parser has to decide, based on the first token alone, which alternative it is parsing. (Why? That is simply how pgen's parsing implementation works.) Suppose our program is this:
```
answer = 42
```
This program consists of three tokens: `NAME` (with the value `answer`), `=`, and `NUMBER` (with the value `42`). The only thing we know at the very beginning of parsing is the first `NAME` token. The rule we are trying to parse at that point is `statement` (the start symbol of the grammar). It has three alternatives: `expr`, `assignment`, and `if_statement`. We can immediately rule out `if_statement`, because the lookahead token is not `if`. But both `expr` and `assignment` can start with a `NAME` token, and because of this pgen rejects our grammar as ambiguous.
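You can see exactly this token stream using the standard `tokenize` module (a quick illustration, not something from the original post):

```python
import io
import tokenize

# Tokenize the one-line program; the tokenizer also emits trailing
# NEWLINE and ENDMARKER tokens.
for tok in tokenize.generate_tokens(io.StringIO("answer = 42").readline):
    print(tokenize.tok_name[tok.exact_type], repr(tok.string))
# NAME 'answer'
# EQUAL '='
# NUMBER '42'
# ...
```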
Calling the grammar "ambiguous" is not entirely accurate, since technically the grammar itself is not ambiguous; I can't find a better formulation, so let's leave it at that. How does pgen detect the problem? It computes something called a FIRST set for each grammar rule, and if the FIRST sets of two alternatives at a choice point intersect, it raises an error.
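Here is a minimal sketch of that idea: a fixed-point computation of FIRST sets over the toy grammar. The dictionary encoding and the helper are mine for illustration, not pgen's actual code:

```python
# Each rule maps to a list of alternatives; each alternative is a list of
# symbols. Quoted strings and UPPERCASE names are terminals.
GRAMMAR = {
    "statement": [["assignment"], ["expr"], ["if_statement"]],
    "expr": [["expr", "'+'", "term"], ["expr", "'-'", "term"], ["term"]],
    "term": [["term", "'*'", "atom"], ["term", "'/'", "atom"], ["atom"]],
    "atom": [["NAME"], ["NUMBER"], ["'('", "expr", "')'"]],
    "assignment": [["target", "'='", "expr"]],
    "target": [["NAME"]],
    "if_statement": [["'if'", "expr", "':'", "statement"]],
}

def first_sets(grammar):
    # No rule here can produce the empty string, so the FIRST set of an
    # alternative is determined by its first symbol alone. Iterate to a
    # fixed point because rules refer to each other (and to themselves).
    first = {rule: set() for rule in grammar}
    changed = True
    while changed:
        changed = False
        for rule, alternatives in grammar.items():
            for alt in alternatives:
                head = alt[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[rule]:
                    first[rule] |= new
                    changed = True
    return first

first = first_sets(GRAMMAR)
# NAME is in the FIRST set of two alternatives of `statement`,
# so pgen would reject the grammar:
print(first["expr"] & first["assignment"])  # {'NAME'}
```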
So can't we solve this problem by giving the parser a larger lookahead buffer? For our example a second lookahead token would be enough, since in this grammar the second token of an assignment must be `=`. But in a more realistic language like Python you may need an unbounded buffer, since whatever is to the left of the `=` token can be arbitrarily complex, for example:
```
table[index + 1].name.first = 'Steven'
```
Here we would have to parse ten tokens before reaching the `=`, and I assure you there can be much longer expressions. To solve this in pgen, we changed the grammar to accept some incorrect expressions, adding an extra check in a later pass that raises a SyntaxError if it cannot match up the left- and right-hand sides. For our toy language, it comes down to writing the following:
```
statement: assignment_or_expr | if_statement
assignment_or_expr: expr ['=' expr]
```
Square brackets indicate an optional part. Then, on the next compiler pass (say, when generating bytecode), we check whether the `=` is there or not, and if it is, we check that the left-hand side matches the `target` syntax.
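A minimal sketch of what that second pass might look like for the toy language; the tuple-based node layout and all names here are invented for illustration, not CPython's actual code:

```python
# Parse tree nodes are tuples: ("assign", left, right), ("name", s),
# ("binop", op, left, right), ("number", n), ...
def check_assignment(node):
    """Second pass: reject assignments whose left side is not a target."""
    if node[0] == "assign":
        left = node[1]
        if left[0] != "name":  # in the toy grammar, a target is only a NAME
            raise SyntaxError("cannot assign to expression")

check_assignment(("assign", ("name", "answer"), ("number", 42)))  # fine
try:
    check_assignment(("assign",
                      ("binop", "+", ("name", "a"), ("number", 1)),
                      ("number", 42)))
except SyntaxError as exc:
    print(exc)  # cannot assign to expression
```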
There is a similar problem for function call arguments. We would like to write something like this (again, in a simplified version of Python syntax):
```
call: atom '(' arguments ')'
arguments: arg (',' arg)*
arg: posarg | kwarg
posarg: expr
kwarg: NAME '=' expr
```
But a parser that looks only at the next token cannot tell whether a `NAME` at the beginning of the arguments is the start of a `posarg` (since `expr` can begin with `NAME`) or the start of a `kwarg`. Again, the current Python parser solves this by declaring:
```
arg: expr ['=' expr]
```
and then sorting out the alternatives in a later compiler pass. (We even got this subtly wrong and parsed `foo((a) = 1)` the same way as `foo(a = 1)`. This was fixed only in Python 3.8.)
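You can check the fix directly (again, just my illustration): on Python 3.8 and later this input is rejected at parse time, while older versions accepted it as a keyword argument.

```python
import ast

try:
    ast.parse("foo((a) = 1)")
    print("accepted")            # what Python < 3.8 did
except SyntaxError as exc:
    print("rejected:", exc.msg)  # what Python >= 3.8 does
```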
So how does a PEG parser solve these problems? With an infinite lookahead buffer! The typical implementation is the so-called packrat parser, which not only loads the entire program into memory before parsing it, but also allows the parser to backtrack by an arbitrary number of tokens. Although the term PEG primarily refers to the grammar notation, parsers generated from PEG grammars are typically recursive-descent parsers with unlimited backtracking. The packrat parser makes this efficient by memoizing the rules already parsed at each position.
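Here is a minimal sketch of the packrat memoization idea, assuming a parser object with a token-position index `pos` and a `memo` dictionary; all names are illustrative, not from any real implementation:

```python
import functools

def memoize(method):
    """Cache (rule, start position) -> (result, end position)."""
    @functools.wraps(method)
    def wrapper(self):
        key = (method.__name__, self.pos)
        if key in self.memo:
            result, self.pos = self.memo[key]  # jump to the cached end
            return result
        start = self.pos
        result = method(self)
        if result is None:   # the rule failed: backtrack to where we started
            self.pos = start
        self.memo[key] = (result, self.pos)
        return result
    return wrapper

# Usage: decorate each rule method of a recursive-descent parser, e.g.
#
#     class ToyParser:
#         def __init__(self, tokens):
#             self.tokens, self.pos, self.memo = tokens, 0, {}
#
#         @memoize
#         def statement(self): ...
```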
All this simplifies the algorithm, but of course there is a price: memory. Thirty years ago I had a good reason to use LL(1): memory was expensive. LL(1) parsing (like other technologies such as LALR(1), made famous by YACC) uses a state machine and a stack to build a parse tree efficiently.
Fortunately, computers running CPython have a lot more memory than they did 30 years ago, and keeping an entire file in memory is no longer a problem. For example, the largest non-test file in the stdlib that I could find is `_pydecimal.py`, which is about 223 kB. In a world of gigabytes that is essentially nothing, and it made me take a fresh look at parsers.
But there is one more thing about the current CPython parser that bothers me. Compilers are complex, and the CPython implementation is no exception: although the output of the pgen parser is a parse tree, that tree is not used directly as input to the bytecode generator. It is first converted to an abstract syntax tree (AST), and only then is the AST compiled into bytecode. (It is actually more complicated than that, but we won't go into the details for now.)
Why not compile directly from the parse tree? That is exactly how it used to work, but about 15 years ago we found that it made the compiler overcomplicated, so we introduced a separate AST and a separate phase that transforms the parse tree into the AST. As Python evolved, the AST stayed more stable than the parse tree, which made errors in the compiler less likely.
The AST is also easier to work with for third-party code that wants to inspect Python programs. It can be obtained through the popular `ast` module, which also lets you build nodes from scratch, modify existing ones, and compile the result to bytecode. The latter has enabled a whole cottage industry of language extensions for Python. (The parse tree is also accessible to Python users through the `parser` module, but it is much harder to work with, which is why it is far less popular.)
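A quick illustration of that workflow with the `ast` module (my example, not the post's):

```python
import ast

# Parse source into an AST, inspect it, then compile the AST to bytecode.
tree = ast.parse("x = 40 + 2")
print(ast.dump(tree.body[0]))  # the Assign node

code = compile(tree, "<example>", "exec")
namespace = {}
exec(code, namespace)
print(namespace["x"])          # 42
```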
Now I want to combine these ideas and see whether we can build a new parser for CPython that uses PEG and packrat parsing to construct the AST directly during parsing. That would let us skip the intermediate parse tree entirely, which could save memory even with the infinite token buffer. I am still in the middle of the implementation, but I already have a prototype that can compile a subset of Python into an AST at roughly the same speed as the current CPython parser. It uses more memory, though, and I expect that extending the subset to the full language will slow the PEG parser down. But so far I have not tried any optimizations, so I will keep working on it.
A final advantage of switching to PEG is that it gives us more flexibility for the future evolution of the language. It has been said that the LL(1) restrictions of pgen helped keep Python's grammar simple. That may well be true, but we have plenty of other processes to prevent uncontrolled growth of the language (mainly the PEP process, helped by very strict backward-compatibility requirements and the new governance structure), so I am not worried about that.
There is a lot more I would like to say about PEG and my implementation, but it will have to wait for the next posts, after I clean up the code.
License for this article and cited code: CC BY-NC-SA 4.0