Hello, Habr! I present to you a translation of the article "Learning Parser Combinators With Rust".
This article teaches the basics of parser combinators to people who are already familiar with Rust. No other knowledge is assumed, and everything that is not directly related to Rust, as well as some of the more unexpected aspects of using it here, will be explained. This article will not teach you Rust if you do not already know it, and in that case you are also unlikely to learn parser combinators well from it. If you want to learn Rust, I recommend the book "The Rust Programming Language".
In the life of every programmer, there comes a time when they need a parser.
The novice programmer will ask: "What is a parser?"
The intermediate programmer will say: "That's easy, I'll write a regular expression."
The master programmer will say: "Stand back, I know lex and yacc."
The novice has the right idea.
Not that regular expressions are bad. (But please do not try to write a complex parser as a regular expression.) Nor is there no joy in using powerful tools such as parser and lexer generators that have been honed to perfection over millennia. But learning the basics of parsers from scratch is a pleasure. It is also what you will be missing out on if you panic and reach for regular expressions or parser generators, both of which are only abstractions over the real problem. In the beginner's mind, as a wise man once said, there are many possibilities. In the expert's mind, there is only one.
In this article, we are going to learn how to build a parser from scratch using a technique common in functional programming languages, known as parser combinators. They have the advantage of being surprisingly powerful once you grasp the basic idea behind them, while staying very close to first principles, since the only abstractions here are the ones you create yourself: the main combinators you use here are ones you will have built yourself.
It is strongly recommended that you start a fresh Rust project and type the code snippets into its src/lib.rs as you read (you can paste them straight from the page, but typing them is better, because it guarantees that you read the code in full). All the code you need is introduced in the article in order. Keep in mind that updated versions of functions you wrote earlier are sometimes introduced, and in those cases you should replace the old version with the new one.
The code was written for rustc version 1.34.0 using the 2018 edition of the language. You should be able to use any recent compiler version as long as you stay on the 2018 edition (make sure your Cargo.toml contains edition = "2018"). The project needs no external dependencies.
To run the tests presented in the article, use cargo test, as you would expect.
We are going to write a parser for a simplified version of XML. It looks like this:
<parent-element> <single-element attribute="value" /> </parent-element>
XML elements open with the symbol < and an identifier consisting of a letter followed by any number of letters, numbers and -. This is followed, optionally, by whitespace and a list of attribute pairs: another identifier as defined above, followed by a = and a double quoted string. Finally, there is either a closing /> to signify a single element with no children, or a > to signify that a sequence of child elements follows, eventually followed by a closing tag starting with </, then an identifier that must match that of the opening tag, and a final >.
That is all we will support. There will be no namespaces, no text nodes, and most certainly there will be no schema check. We don’t even have to worry about escaping quotation marks for strings — they start with the first double quotation mark, and they end with the next one, and that’s it.
We are going to parse these elements into a structure that looks like this:
#[derive(Clone, Debug, PartialEq, Eq)] struct Element { name: String, attributes: Vec<(String, String)>, children: Vec<Element>, }
There are no fancy types, just a string for the name (this is the identifier at the beginning of each tag), attributes in the form of pairs of strings (identifier and value) and a list of child elements that look exactly like the parent.
(If you are typing along, be sure to include that derive attribute. You will need it later.)
Well then, it's time to write a parser.
Parsing is the process of obtaining structured data from a stream of information. A parser is what brings out this structure.
In the discipline we are about to explore, a parser, in its simplest form, is a function that takes some input and returns either the parsed output along with the remainder of the input, or an error saying "I could not parse this."
That, it turns out, is also what a parser looks like in its more complicated forms. You can complicate what the input, the output and the error all mean, and you will have to if you want good error messages, but the parser stays the same: something that consumes an input and either produces a parsed output along with whatever is left of the input, or lets you know that it could not parse the input into the output.
Let's write that down as a function type.
Fn(Input) -> Result<(Input, Output), Error>
More specifically, in our case we want to fill in the types so that we get something like this. What we are going to do is convert a string into an Element
structure. At this point we do not want to get into the intricacies of error reporting, so we will just return the part of the input that we could not parse. As a result, our function will look like this:
Fn(&str) -> Result<(&str, Element), &str>
We use a string slice because it is an efficient pointer to a piece of a string, and we can slice it further however we please, "consuming" the input by slicing off the bit we have parsed and returning the remainder along with the result.
It might have been cleaner to use &[u8] (a slice of bytes, corresponding to characters if we limit ourselves to ASCII) as the input type, especially because string slices behave a little differently from most slices: you cannot index them with a single number like input[0]; you have to use a range like input[0..1]. On the other hand, string slices have many methods useful for parsing strings that byte slices lack.
In fact, we are going to rely on those methods rather than on indexing by character, because of Unicode. In UTF-8, and all Rust strings are valid UTF-8, byte indices do not always correspond to individual characters, and it is better for everyone involved to ask the standard library to deal with that for us.
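To make that concrete, here is a small sketch of my own (not from the original article) showing how &str indexing behaves:

fn first_byte_slice(input: &str) -> Option<&str> {
    // `input[0]` does not compile: `str` cannot be indexed with a single usize.
    // A range like `input[0..1]` does work, but it counts bytes, not characters,
    // and it panics if the range does not land on a character boundary.
    // The checked variant, `get`, returns None in those cases instead.
    input.get(0..1)
}

Calling first_byte_slice("hello") gives Some("h"), while first_byte_slice("éh") gives None, because the first character is two bytes long in UTF-8 - a reminder that these ranges are byte offsets, not character indexes.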
Let's try writing a parser that just looks at the first character in a string
and decides whether it is the letter a
.
fn the_letter_a(input: &str) -> Result<(&str, ()), &str> { match input.chars().next() { Some('a') => Ok((&input['a'.len_utf8()..], ())), _ => Err(input), } }
First, let's look at the types of the input and output: we take a string slice as input, and we return a Result of either (&str, ()) or an error with &str. The (&str, ()) pair is the interesting part: as we said, we are supposed to return a tuple containing the result of the parse and the rest of the input. The &str is the remaining input, and the result is just the unit type (), because if this parser succeeds, it can only have one possible result (we found the letter a), and we do not particularly need to return the letter a in this case; we only need to indicate that we found it.
So, let's look at the code of the parser itself. We were not joking when we said we would rely on the standard library to avoid Unicode headaches: we get an iterator over the characters of the string with the chars() method and take the first character from it. This is an item of type char wrapped in an Option, where None means we were trying to get a char out of an empty string.
To make matters worse, a char is not necessarily even what you think of as a character in Unicode. What you think of as a character is most likely what Unicode calls a "grapheme cluster", which can consist of several chars, which in turn represent "scalar values", two steps down from grapheme clusters. However, that way lies madness, and for our purposes we are honestly unlikely to even see chars outside of ASCII, so let's leave it at that.
We match on the Some('a') pattern, which is the specific result we are looking for, and if it matches, we return our success value: Ok((&input['a'.len_utf8()..], ())). That is, we remove the part we just parsed (the 'a') from the string slice and return the rest, along with our parsed value, which is just the empty (). Ever mindful of the Unicode monster, before slicing we ask the standard library for the UTF-8 length of the character 'a' - it is 1, but never forget about the Unicode monster.
If we get any other Some(char), or if we get None, we return an error. As you will recall, our error type is simply the string slice we were given as input, the one that failed to parse. It did not start with an a, so that is our error. It is not a great error, but at least it is marginally better than just "something is wrong somewhere."
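The article does not bother testing this throwaway parser, but if you are typing along, a quick sanity check of my own might look like this:

#[test]
fn the_letter_a_parser() {
    // Consumes the 'a' and returns the rest of the input alongside the unit result.
    assert_eq!(Ok(("bc", ())), the_letter_a("abc"));
    // Anything that does not start with 'a', including the empty string, is an error.
    assert_eq!(Err("xyz"), the_letter_a("xyz"));
    assert_eq!(Err(""), the_letter_a(""));
}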
We do not actually need a parser for the letter a in order to parse XML, but the first thing we will have to do is look for the opening <, so we are going to need something very similar. We will also need to parse >, / and = specifically, so maybe we can write a function that builds a parser for whatever character we want?
Let's not even settle for that: let's write a function that builds a parser for a static string of any length, not just a single character. It is actually easier that way, because a slice of a string slice is already a valid UTF-8 string slice, and we do not have to think about the Unicode monster.
fn match_literal(expected: &'static str) -> impl Fn(&str) -> Result<(&str, ()), &str> { move |input| match input.get(0..expected.len()) { Some(next) if next == expected => { Ok((&input[expected.len()..], ())) } _ => Err(input), } }
Now this looks a little different.
First, let's look at the types. Instead of our function being a parser, it now takes the expected string as an argument and returns something that looks like a parser. It is a function that returns a higher order function. Essentially, we are writing a function that makes a function like our the_letter_a function from before.
So, instead of doing the work in the function body, we return a closure that does the work and that matches our type signature for a parser from before.
The pattern match looks the same, except that we cannot match on our string literal directly, because we do not know what it is, so we use the match guard if next == expected instead. Otherwise it is exactly as before, just inside the body of a closure.
Let's write a test for this to make sure we get it right.
#[test] fn literal_parser() { let parse_joe = match_literal("Hello Joe!"); assert_eq!( Ok(("", ())), parse_joe("Hello Joe!") ); assert_eq!( Ok((" Hello Robert!", ())), parse_joe("Hello Joe! Hello Robert!") ); assert_eq!( Err("Hello Mike!"), parse_joe("Hello Mike!") ); }
First, we build the parser: match_literal("Hello Joe!"). It should consume the string "Hello Joe!" and return the rest of the string, or it should fail and return the whole string.
In the first case, we just feed it the exact string it expects, and we see that it returns an empty string and the () value that means "we parsed the expected string, and you do not really need it returned to you."
In the second, we feed it the string "Hello Joe! Hello Robert!", and we see that it does indeed consume the string "Hello Joe!" and return the remainder of the input: " Hello Robert!" (the leading space and all).
In the third, we feed it some incorrect input, "Hello Mike!", and note that it does reject the input with an error. Not that there is anything wrong with Mike in general, he is just not what this parser was looking for.
So now we have a parser for <, for >, for =, and even for </ and />. We are practically done!
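For instance (this example is mine, not the article's), the single purpose parsers we just gained for free look like this in use:

#[test]
fn xml_punctuation_parsers() {
    // Each call to match_literal builds a fresh parser for one fixed string.
    let open_angle = match_literal("<");
    let self_close = match_literal("/>");
    assert_eq!(Ok(("widget/>", ())), open_angle("<widget/>"));
    assert_eq!(Ok(("", ())), self_close("/>"));
    assert_eq!(Err("widget/>"), self_close("widget/>"));
}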
The next thing after the opening < is the element name. We cannot do that with a simple string comparison. But we could do it with a regular expression…
…but let's restrain ourselves. It would be a regular expression simple enough to replicate in plain code, and we do not need to pull in the regex crate for it. Let's see if we can write our own parser for it using nothing but Rust's standard library.
Recalling the rule for the element name identifier: one alphabetical character, followed by zero or more of either an alphabetical character, a number, or a dash -.
fn identifier(input: &str) -> Result<(&str, String), &str> { let mut matched = String::new(); let mut chars = input.chars(); match chars.next() { Some(next) if next.is_alphabetic() => matched.push(next), _ => return Err(input), } while let Some(next) = chars.next() { if next.is_alphanumeric() || next == '-' { matched.push(next); } else { break; } } let next_index = matched.len(); Ok((&input[next_index..], matched)) }
As always, we look at the type first. This time we are not writing a function that builds a parser, we are just writing the parser itself, like the first time. The notable difference here is that instead of a result type of (), we return a String in the tuple along with the remaining input. This String is going to contain the identifier we just parsed.
With that in mind, first we create an empty String and call it matched. This is going to be our result value. We also get an iterator over the characters in input, which we are going to start pulling apart.
The first step is to see whether there is a letter up front. We pull the first character off the iterator and check whether it is a letter: next.is_alphabetic(). The Rust standard library is of course here to help us with Unicode: this will match letters in any alphabet, not just ASCII. If it is a letter, we push it onto our matched String, and if it is not, clearly we are not looking at an element identifier, so we return immediately with an error.
For the second step, we keep pulling characters off the iterator, pushing them onto the String we are building, until we find one that neither satisfies is_alphanumeric() (which is like is_alphabetic() except that it also includes digits in any alphabet) nor is the dash '-'.
The first time we see something that does not match those criteria, it means we are done parsing, so we break out of the loop and return the String, remembering to slice off the part we consumed from the input. Likewise if the iterator runs out of characters, which means we reached the end of the input.
It is worth noting that we do not return with an error when we see something that is not alphanumeric or a dash. We already have enough to make a valid identifier once we have matched that first letter, and it is perfectly normal for there to be more things to parse in the input string after our identifier, so we simply stop parsing and return our result. It is only if we cannot find even that first letter that we actually return an error, because in that case there was definitely no identifier here.
Remember that the Element
structure is what we are going to parse our XML document into.
struct Element { name: String, attributes: Vec<(String, String)>, children: Vec<Element>, }
In fact, we have just finished the parser for the first part of it, the name field. The String our parser returns goes right in there. It is also the right parser for the first part of every attribute.
Let's check it out.
#[test] fn identifier_parser() { assert_eq!( Ok(("", "i-am-an-identifier".to_string())), identifier("i-am-an-identifier") ); assert_eq!( Ok((" entirely an identifier", "not".to_string())), identifier("not entirely an identifier") ); assert_eq!( Err("!not at all an identifier"), identifier("!not at all an identifier") ); }
We see that in the first case, the string "i-am-an-identifier" is parsed in its entirety, leaving only the empty string. In the second case, the parser returns "not" as the identifier, and the rest of the string comes back as the remaining input. In the third case, the parser fails outright because the first character it finds is not a letter.
Now we can parse the opening <, and we can parse the following identifier, but we need to parse both, in order. So the next step will be to write another parser builder function: one that takes two parsers as input and returns a new parser that parses both of them in order. In other words, a parser combinator, because it combines two parsers into one. Let's see if we can do that.
fn pair<P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Fn(&str) -> Result<(&str, (R1, R2)), &str> where P1: Fn(&str) -> Result<(&str, R1), &str>, P2: Fn(&str) -> Result<(&str, R2), &str>, { move |input| match parser1(input) { Ok((next_input, result1)) => match parser2(next_input) { Ok((final_input, result2)) => Ok((final_input, (result1, result2))), Err(err) => Err(err), }, Err(err) => Err(err), } }
It gets a little trickier here, but you know what to do: start by looking at types.
First of all, we have four type variables: P1, P2, R1 and R2. That is Parser 1, Parser 2, Result 1 and Result 2. P1 and P2 are functions, and you will notice that they follow the well established pattern of parser functions: just like the return value, they take a &str as input and return a Result of a pair of the remaining input and the result, or an error.
But look at the result types of each function: P1 is a parser that produces an R1 if successful, and P2 likewise produces an R2. And the result of the final parser, the one returned from our function, is (R1, R2). So the job of this parser is to first run the parser P1 on the input, keep its result, then run P2 on the input that P1 returned, and, if both of them succeeded, combine the two results into the tuple (R1, R2).
Looking at the code, we see that this is exactly what it does. We start by running the first parser on the input, then the second parser, then we combine the two results into a tuple and return that. If either parser fails, we return immediately with the error it gave.
This way, we should be able to combine our two parsers, match_literal and identifier, to actually parse the first bit of our first XML tag. Let's write a test to see if it works.
#[test] fn pair_combinator() { let tag_opener = pair(match_literal("<"), identifier); assert_eq!( Ok(("/>", ((), "my-first-element".to_string()))), tag_opener("<my-first-element/>") ); assert_eq!(Err("oops"), tag_opener("oops")); assert_eq!(Err("!oops"), tag_opener("<!oops")); }
It seems to work! But look at that result type: ((), String). Obviously we only care about the right hand value, the String. This is going to be the case rather often - some of our parsers only match patterns in the input without producing values, so their output can be safely discarded. To accommodate this pattern, we are going to use our pair combinator to write two other combinators: left, which discards the result of the first parser and returns only the second, and its opposite number, right - the one we would have wanted to use in our test above instead of pair, the one that discards the left hand side of the pair and keeps only our String.
But before we go that far, let's introduce another combinator that will greatly simplify the writing of these two: map
.
This combinator has one job: to change the type of the result. For instance, say you have a parser that returns ((), String) and you want to change it to return just the String, as in our example above.
To do that, we pass it a function that knows how to convert the original type into the new one. In our example, that is simple: |(_left, right)| right. In more general form, it looks like Fn(A) -> B, where A is the original result type of the parser and B is the new one.
fn map<P, F, A, B>(parser: P, map_fn: F) -> impl Fn(&str) -> Result<(&str, B), &str> where P: Fn(&str) -> Result<(&str, A), &str>, F: Fn(A) -> B, { move |input| match parser(input) { Ok((next_input, result)) => Ok((next_input, map_fn(result))), Err(err) => Err(err), } }
So what does the signature say? P is our parser. On success it returns A. F is the function we are going to use to map P into our new parser, which looks just like P except that its result type is B instead of A.
In the code, we run parser(input), and if it succeeds, we take the result and run our function map_fn(result) on it, turning the A into a B, and that is our converted parser done.
Actually, let's indulge ourselves and shorten this function a bit, because this map thing turns out to be a common pattern that Result implements as well:
fn map<P, F, A, B>(parser: P, map_fn: F) -> impl Fn(&str) -> Result<(&str, B), &str> where P: Fn(&str) -> Result<(&str, A), &str>, F: Fn(A) -> B, { move |input| parser(input) .map(|(next_input, result)| (next_input, map_fn(result))) }
This pattern is what is called a "functor" in Haskell and in its mathematical sibling, category theory. If you have a thing with a type A in it, and you have a map function available to which you can pass a function from A to B in order to turn it into the same kind of thing but with a type B in it, that is a functor. You see this in a lot of places in Rust, such as Option, Result, Iterator and even Future, without it being explicitly named as such. And there is a good reason for that: you cannot really express a functor as a generalised thing in Rust's type system, because it lacks higher kinded types, but that is another story, so let's just note that these are functors, and that you only need to look for the map function to spot one.
You may have noticed by now that we keep repeating the shape of the parser type signature: Fn(&str) -> Result<(&str, Output), &str>. You may be getting as tired of reading it written out in full as I am of writing it, so I think it is time to introduce a trait, to make things a little more readable and to let us add some extensibility to our parsers.
But first of all, let's make a type alias for the return type we keep using:
type ParseResult<'a, Output> = Result<(&'a str, Output), &'a str>;
So that now, instead of typing out that monstrosity all the time, we can just write ParseResult<String> or similar. We have added a lifetime there, because the type declaration requires it, but a lot of the time the Rust compiler should be able to infer it for you. As a rule, try leaving the lifetime out and see if rustc complains, then put it in if it does.
The lifetime 'a, in this case, refers specifically to the lifetime of the input.
Now, for the trait. We need to put the lifetime in here as well, and when you use the trait, the lifetime is usually required every time. It is a bit of extra typing, but it beats the previous version.
trait Parser<'a, Output> { fn parse(&self, input: &'a str) -> ParseResult<'a, Output>; }
It has just one method for now: the parse() method, which should look familiar: it is the same shape as the parser function we have been writing.
To make this even easier, we can actually implement this trait for any function that matches the signature of a parser:
impl<'a, F, Output> Parser<'a, Output> for F where F: Fn(&'a str) -> ParseResult<Output>, { fn parse(&self, input: &'a str) -> ParseResult<'a, Output> { self(input) } }
This way, not only can we keep passing around the same functions we have been using as parsers, now fully implementing the Parser trait, we also open up the possibility of using other kinds of types as parsers.
More importantly, though, it saves us from having to type out those function signatures all the time. Let's rewrite the map function to see how it works out.
fn map<'a, P, F, A, B>(parser: P, map_fn: F) -> impl Parser<'a, B> where P: Parser<'a, A>, F: Fn(A) -> B, { move |input| parser.parse(input) .map(|(next_input, result)| (next_input, map_fn(result))) }
One thing to note here in particular: instead of calling the parser directly as a function, we now have to say parser.parse(input), because we do not know whether the type P is a function, we only know that it implements Parser, so we have to stick to the interface Parser provides. Otherwise, the function body looks exactly the same and the types look very similar. There is just the new lifetime 'a adding a bit of noise, but otherwise it is the same function.
Let's give the pair function the same treatment while we are at it:
fn pair<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, (R1, R2)> where P1: Parser<'a, R1>, P2: Parser<'a, R2>, { move |input| match parser1.parse(input) { Ok((next_input, result1)) => match parser2.parse(next_input) { Ok((final_input, result2)) => Ok((final_input, (result1, result2))), Err(err) => Err(err), }, Err(err) => Err(err), } }
Same as before: the only changes here are the tidied up types and having to say parser.parse(input) instead of parser(input).
Actually, while we are at it, let's tidy up pair even further, this time using map as well:
fn pair<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, (R1, R2)> where P1: Parser<'a, R1>, P2: Parser<'a, R2>, { move |input| { parser1.parse(input).and_then(|(next_input, result1)| { parser2.parse(next_input) .map(|(last_input, result2)| (last_input, (result1, result2))) }) } }
The and_then method on Result is similar to map, except that the mapping function does not return the new value to go inside the Result, but a new Result altogether. The code above is identical in effect to the previous version written out with all those match blocks. We will get back to and_then later, but right now, let's get those left and right combinators implemented, now that we have a nice and tidy map to write them with:
fn left<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, R1> where P1: Parser<'a, R1>, P2: Parser<'a, R2>, { map(pair(parser1, parser2), |(left, _right)| left) } fn right<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, R2> where P1: Parser<'a, R1>, P2: Parser<'a, R2>, { map(pair(parser1, parser2), |(_left, right)| right) }
We use the pair combinator to combine the two parsers into a parser for a tuple of their results, and then we use the map combinator to select just the part of the tuple we want to keep.
Before we can test this, we need to update two of our existing functions to use the Parser trait and ParseResult. match_literal is the more involved one:
fn match_literal<'a>(expected: &'static str) -> impl Parser<'a, ()> { move |input: &'a str| match input.get(0..expected.len()) { Some(next) if next == expected => Ok((&input[expected.len()..], ())), _ => Err(input), } }
In addition to changing the return type, we also have to annotate the closure's argument type as &'a str, or rustc gets upset.
For identifier, just change the return type and you are done; the inferred lifetimes take care of the rest:
fn identifier(input: &str) -> ParseResult<String> {
Now let's update the tag opener test from earlier to use right instead of pair, keeping only the String and getting rid of that () cluttering up the result.
#[test] fn right_combinator() { let tag_opener = right(match_literal("<"), identifier); assert_eq!( Ok(("/>", "my-first-element".to_string())), tag_opener.parse("<my-first-element/>") ); assert_eq!(Err("oops"), tag_opener.parse("oops")); assert_eq!(Err("!oops"), tag_opener.parse("<!oops")); }
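The left combinator deserves a quick look too. A hypothetical mirror image of that test, keeping the identifier and discarding a trailing literal, might be:

#[test]
fn left_combinator() {
    // Keeps the identifier on the left and throws away the "!" on the right.
    let greeting = left(identifier, match_literal("!"));
    assert_eq!(Ok(("", "hello".to_string())), greeting.parse("hello!"));
    assert_eq!(Err("?"), greeting.parse("hello?"));
}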
Let's keep going with parsing that element. We have the opening <, and we have the identifier. What's next? That should be our first attribute pair.
No, wait, those attribute pairs are actually optional. We are going to have to learn to deal with things being optional.
And hold on, there is something we have to deal with even before we get to the first optional attribute pair: whitespace.
Between the end of the element name and the start of the first attribute name (if there is one), there is a space. We need to deal with that space.
Even worse, we need to deal with one or more spaces, because <element  attribute="value"/> is valid syntax too, even if it is a bit heavy on the spaces. So this seems like a good time to think about whether we can write a combinator that expresses the idea of one or more of a given parser.
We have already dealt with this in our identifier parser, but there it was all done manually. Not surprisingly, the code for the general idea is not all that different.
fn one_or_more<'a, P, A>(parser: P) -> impl Parser<'a, Vec<A>> where P: Parser<'a, A>, { move |mut input| { let mut result = Vec::new(); if let Ok((next_input, first_item)) = parser.parse(input) { input = next_input; result.push(first_item); } else { return Err(input); } while let Ok((next_input, next_item)) = parser.parse(input) { input = next_input; result.push(next_item); } Ok((input, result)) } }
First of all, the result type of the parser we are combining is A, and the result type of the combined parser is Vec<A> - any number of As.
The code does indeed look very similar to identifier. First, we parse the first element, and if it is not there, we return with an error. Then we parse as many more elements as we can, until the parser fails, at which point we return the vector with the elements we collected.
Looking at that code, how easy would it be to adapt it to the idea of zero or more? We just need to remove the first run of the parser:
fn zero_or_more<'a, P, A>(parser: P) -> impl Parser<'a, Vec<A>> where P: Parser<'a, A>, { move |mut input| { let mut result = Vec::new(); while let Ok((next_input, next_item)) = parser.parse(input) { input = next_input; result.push(next_item); } Ok((input, result)) } }
Let's write a couple of tests to make sure those two work.
#[test] fn one_or_more_combinator() { let parser = one_or_more(match_literal("ha")); assert_eq!(Ok(("", vec![(), (), ()])), parser.parse("hahaha")); assert_eq!(Err("ahah"), parser.parse("ahah")); assert_eq!(Err(""), parser.parse("")); } #[test] fn zero_or_more_combinator() { let parser = zero_or_more(match_literal("ha")); assert_eq!(Ok(("", vec![(), (), ()])), parser.parse("hahaha")); assert_eq!(Ok(("ahah", vec![])), parser.parse("ahah")); assert_eq!(Ok(("", vec![])), parser.parse("")); }
Note the difference between the two: for one_or_more, finding an empty string is an error, because it needs to see at least one case of its sub-parser, but for zero_or_more, an empty string just means the zero case, which is not an error.
At this point, it is tempting to start thinking about ways to generalise these two, because one is an exact copy of the other with just one piece removed. It might be tempting to express one_or_more in terms of zero_or_more with something like this:
fn one_or_more<'a, P, A>(parser: P) -> impl Parser<'a, Vec<A>> where P: Parser<'a, A>, { map(pair(parser, zero_or_more(parser)), |(head, mut tail)| { tail.insert(0, head); tail }) }
Here we run into Rust Problems, and I do not even mean the problem of Vec not having a cons method, although I know every Lisp programmer reading this code was thinking it. No, it is worse than that: it is ownership.
We own that parser, so we cannot go passing it as an argument twice; the compiler will start shouting at you that you are trying to move an already moved value. Can we make our combinators take references instead? No, it turns out we then run into a whole other set of borrow checker problems, and we are not going there right now. And because these parsers are functions, they do not do anything so straightforward as implement Clone, which would have saved the day very tidily, so we are stuck with a constraint: we cannot easily repeat our parsers inside combinators.
That is not necessarily a big problem, though. It means we cannot express one_or_more using combinators, but it turns out those two are usually the only combinators you need that tend to reuse parsers anyway. If you want to get really fancy, though, you could write a combinator that takes a RangeBound in addition to a parser and repeats it according to the range: range(0..) for zero_or_more, range(1..) for one_or_more, range(5..=6) for exactly five or six, whatever tickles your fancy.
Let's leave that as an exercise for the reader. Another exercise might be to find a way around the ownership issue, perhaps by wrapping the parser in an Rc to make it clonable.
For now, we are going to get along just fine with the one_or_more and zero_or_more we already have.
Now we have the building blocks we need to parse that whitespace with one_or_more, and to parse the attribute pairs with zero_or_more.
Actually, hold on a moment. We do not really want to parse the whitespace first and then the attributes. If you think about it, when there are no attributes, the whitespace is optional, and we could run into an immediate > or />. But if there is an attribute, there must be whitespace before it. Lucky for us, if there are several attributes, there must also be whitespace between them, so what we are really looking at is a sequence of zero or more occurrences of one or more whitespace items followed by an attribute.
We need a parser for a single whitespace item first. There are three ways we could go here.
Firstly, we could be silly and use our match_literal parser with a string containing a single space. Why is that silly? Because whitespace is also line breaks, tabs, and a whole lot of strange Unicode characters that render as whitespace. We are going to have to lean on Rust's standard library again, and of course char has an is_whitespace method, just as it had is_alphabetic and is_alphanumeric.
Secondly, we could write a parser that consumes any number of whitespace characters using the is_whitespace predicate, much like we wrote identifier earlier.
Thirdly, we could be clever, and we do like being clever. We could write a parser any_char that returns a single char as long as there is one left in the input, and a combinator pred that takes a parser and a predicate function, and combine the two like this: pred(any_char, |c| c.is_whitespace()). This has the added bonus of making it really easy to write the final parser we will need later: the quoted string for attribute values.
The any_char parser is straightforward, but we have to remember to be mindful of those UTF-8 gotchas.
fn any_char(input: &str) -> ParseResult<char> { match input.chars().next() { Some(next) => Ok((&input[next.len_utf8()..], next)), _ => Err(input), } }
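A quick hypothetical check, not in the original article, that any_char behaves as advertised:

#[test]
fn any_char_parser() {
    // Returns whatever character comes first, along with the rest of the input.
    assert_eq!(Ok(("bc", 'a')), any_char("abc"));
    // Multi-byte characters are sliced off by their full UTF-8 length.
    assert_eq!(Ok(("bc", 'é')), any_char("ébc"));
    // An empty input is an error.
    assert_eq!(Err(""), any_char(""));
}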
And the pred combinator holds nothing we have not seen before either. We run the parser, then we call our predicate function on the value if the parser succeeded, and only if that returns true do we actually return a success; otherwise we return just as much of an error as a failed parse would.
fn pred<'a, P, A, F>(parser: P, predicate: F) -> impl Parser<'a, A> where P: Parser<'a, A>, F: Fn(&A) -> bool, { move |input| { if let Ok((next_input, value)) = parser.parse(input) { if predicate(&value) { return Ok((next_input, value)); } } Err(input) } }
Let's write a quick test to make sure everything is in order:
#[test] fn predicate_combinator() { let parser = pred(any_char, |c| *c == 'o'); assert_eq!(Ok(("mg", 'o')), parser.parse("omg")); assert_eq!(Err("lol"), parser.parse("lol")); }
With these two in place, we can write our whitespace_char parser as a quick one-liner:
fn whitespace_char<'a>() -> impl Parser<'a, char> { pred(any_char, |c| c.is_whitespace()) }
And now that we have whitespace_char, we can also express the idea we were heading towards: one or more whitespace, and its sibling, zero or more whitespace. Let's allow ourselves some brevity and name them space1 and space0 respectively.
fn space1<'a>() -> impl Parser<'a, Vec<char>> { one_or_more(whitespace_char()) } fn space0<'a>() -> impl Parser<'a, Vec<char>> { zero_or_more(whitespace_char()) }
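If you want to convince yourself that these behave the way their names suggest, a hypothetical test of my own might be:

#[test]
fn space_parsers() {
    // space1 demands at least one whitespace character; space0 is happy with none.
    assert_eq!(Ok(("rest", vec![' ', ' '])), space1().parse("  rest"));
    assert_eq!(Err("rest"), space1().parse("rest"));
    assert_eq!(Ok(("rest", vec![])), space0().parse("rest"));
}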
With all that sorted, can we finally parse those attributes now? Yes, we just need to make sure we have parsers for all the individual components of an attribute. We already have identifier for the attribute name (although it is tempting to rewrite it using any_char and pred plus the *_or_more combinators; I will leave that as an exercise). match_literal("=") will do for the = sign. We are short just the quoted string parser, so let's build that. Fortunately, we already have all the combinators we need to do it.
fn quoted_string<'a>() -> impl Parser<'a, String> { map( right( match_literal("\""), left( zero_or_more(pred(any_char, |c| *c != '"')), match_literal("\""), ), ), |chars| chars.into_iter().collect(), ) }
The nesting of combinators is getting slightly annoying at this point, but we are going to resist the urge to refactor everything to fix it, and instead focus on what is going on here.
The outermost combinator is a map, which, because of that annoying nesting, is a terrible place to start if we want to understand this, so let's skip it for now. The first thing is a right, and the first part of the right is the one we are looking for: match_literal("\""). That is our opening quote.
The second part of the right is the rest of the string. It sits inside the left, and we quickly note that the right hand argument of that left, the one we ignore, is the other match_literal("\"") - the closing quote. So the left hand part is our quoted string.
We take advantage of our new pred and any_char here to get a parser that accepts anything but another quote, and we put that inside zero_or_more, so that what we are saying is: one quote, followed by zero or more things that are not another quote, followed by another quote. And, between the right and the left, we discard the quotes from the result value and get our quoted string back.
But wait, that is not a string. Remember what zero_or_more returns? A Vec<A> of the inner parser's A type. For any_char, that is char. What we have, then, is not a string but a Vec<char>. This is where the map comes in: we use it to turn the Vec<char> into a String, leveraging the fact that you can build a String from an Iterator<Item = char>, so we can just call vec_of_chars.into_iter().collect() and, thanks to the power of type inference, we get our String.
Before we go on, let's write a quick test to make sure we got this right, because if we needed that many words to explain it, it is probably not something we should take on faith.
#[test] fn quoted_string_parser() { assert_eq!( Ok(("", "Hello Joe!".to_string())), quoted_string().parse("\"Hello Joe!\"") ); }
Phew, we made it! We can now parse whitespace, identifiers, = signs and quoted strings, and that, at last, is everything we need to parse attributes.
First, let's write a parser for a single attribute pair. We are going to store the attributes as a Vec<(String, String)>, if you recall, so it feels like we want a parser for that (String, String) to feed into our trusty zero_or_more combinator. Let's see if we can build one.
fn attribute_pair<'a>() -> impl Parser<'a, (String, String)> { pair(identifier, right(match_literal("="), quoted_string())) }
Not even breaking a sweat! To summarise: we already have a handy combinator for parsing a tuple of values, pair, so we use it with the identifier parser, which yields a String, and a right combining the = sign, whose value we do not want to keep, with our fresh quoted_string parser, which gets us the other String.
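On its own, a single pair parses like this (a hypothetical check, not one of the article's tests):

#[test]
fn attribute_pair_parser() {
    assert_eq!(
        Ok(("", ("one".to_string(), "1".to_string()))),
        attribute_pair().parse("one=\"1\"")
    );
}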
Now let's combine that with zero_or_more to build the vector - but let's not forget the whitespace in between the pairs.
fn attributes<'a>() -> impl Parser<'a, Vec<(String, String)>> { zero_or_more(right(space1(), attribute_pair())) }
Zero or more occurrences of the following: one or more whitespace characters, then an attribute pair. We use right to discard the whitespace and keep the attribute pair.
Let's test it.
#[test] fn attribute_parser() { assert_eq!( Ok(( "", vec![ ("one".to_string(), "1".to_string()), ("two".to_string(), "2".to_string()) ] )), attributes().parse(" one=\"1\" two=\"2\"") ); }
Tests are green! Ship it!
Actually, no. At this point in the process, my rustc started complaining that my types were getting terribly complicated and that I had to increase the maximum allowed type size to carry on. Chances are you are getting a similar error right about now, and if so, you need to know how to deal with it. Fortunately, in these situations rustc generally gives good advice, so when it tells you to add #![type_length_limit = "…some big number…"] to the top of your file, just do as it says. Actually, just go ahead and make it #![type_length_limit = "16777216"], which will let us carry on a good deal further into the stratosphere of complicated types. Full steam ahead, we are type astronauts now!
At this point, things really seem like they are about to come together, which is something of a relief, as we have been rapidly approaching NP levels of hard. We have two kinds of element to deal with, the single element and the parent element with its children, and those children are themselves elements... zero_or_more of them, right?
Let's hold off on the children for a moment, though, and start with the part the two kinds have in common: the opening <, the element name and the attributes. Let's write a parser for that chunk, with a result type of (String, Vec<(String, String)>).
fn element_start<'a>() -> impl Parser<'a, (String, Vec<(String, String)>)> { right(match_literal("<"), pair(identifier, attributes())) }
With that in place, we can quickly tack on the parser for the single element.
fn single_element<'a>() -> impl Parser<'a, Element> { map( left(element_start(), match_literal("/>")), |(name, attributes)| Element { name, attributes, children: vec![], }, ) }
Hooray, it feels like the goal is within reach - we are actually constructing an Element now!
Let's test this marvel of modern technology.
#[test] fn single_element_parser() { assert_eq!( Ok(( "", Element { name: "div".to_string(), attributes: vec![("class".to_string(), "float".to_string())], children: vec![] } )), single_element().parse("<div class=\"float\"/>") ); }
…and I think we just left the stratosphere in terms of compile time. The first time I tried compiling single_element, it took so long that I started to wonder whether it had hit an infinite loop somewhere. Our poor compiler is clearly struggling under the weight of these deeply nested types, so we are going to have to do something about it before we move on - preferably before compile times, and the wait for our editor's type hints, get completely out of hand.
There is a fix for this, but it needs a little introduction first...
If you have ever tried to write a recursive type in Rust, you may already know the solution to our little problem.
A very simple example of a recursive type is the singly linked list. You can, in principle, express it as an enum like this:
enum List<A> { Cons(A, List<A>), Nil, }
rustc will object, quite sensibly, that your recursive type List<A> has an infinite size, because inside every List::<A>::Cons there is another List<A>, which means it is List<A>s all the way down into infinity. As far as rustc is concerned, we are asking for an infinite list, and we are asking it to be able to allocate an infinite list.
In many languages, an infinite list is not a problem in principle for the type system, and it is not for Rust either. The real problem is that in Rust we need to be able to allocate it, or, rather, we need to be able to determine the size of a type up front when we construct it, and when the type is infinite, that means the size must be infinite too.
The solution is to employ a little indirection. Instead of our List::Cons being an element of A and another list of A, we make it an element of A and a pointer to a list of A. We know the size of a pointer, and it is the same no matter what it points to, so our List::Cons now has a fixed and predictable size no matter how long the list is. And the way to turn an owned thing into a pointer to an owned thing on the heap in Rust is to Box it.
enum List<A> { Cons(A, Box<List<A>>), Nil, }
Another interesting feature of Box is that the type inside it can be abstract. This means that instead of our by now extremely complicated parser function types, we can let the type checker deal with the very succinct Box<dyn Parser<'a, A>> instead.
That sounds great. What is the downside? Well, we might be losing a cycle or two to having to follow that pointer, and it could be that the compiler loses some opportunities to optimise things. But recall Knuth's admonition about premature optimisation: it is going to be fine. You can afford those cycles. You are here to learn about parser combinators, not to learn about hand written hyperspecialised SIMD parsers (although those are exciting in their own right).
So let's get on with implementing Parser for a boxed parser function, in addition to the bare functions we have used so far.
struct BoxedParser<'a, Output> { parser: Box<dyn Parser<'a, Output> + 'a>, } impl<'a, Output> BoxedParser<'a, Output> { fn new<P>(parser: P) -> Self where P: Parser<'a, Output> + 'a, { BoxedParser { parser: Box::new(parser), } } } impl<'a, Output> Parser<'a, Output> for BoxedParser<'a, Output> { fn parse(&self, input: &'a str) -> ParseResult<'a, Output> { self.parser.parse(input) } }
For the sake of propriety, we create a new type, BoxedParser, to hold our box. To create a new BoxedParser from any other kind of parser (including another BoxedParser, even if that would be pointless), we provide a function BoxedParser::new(parser), which does nothing more than put the parser in a Box inside our new type. Finally, we implement Parser for it, so that it can be used interchangeably as a parser.
Keep in mind that the Box costs us a heap allocation when we create the parser, and that calling parse on a BoxedParser goes through dynamic dispatch to the Parser inside rather than a direct call, which might cost us a few nanoseconds here and there. So we are not going to box everything indiscriminately; we will only reach for it where it helps keep the types, and with them the compile times and our collective sanity, under control. Premature optimisation and all that.
But wait, this presents an opportunity to fix another thing that was starting to become a bit of a bother.
Remember a couple of parsers ago, when we noted that the nesting of our combinators was starting to get a little annoying? Recall quoted_string:
fn quoted_string<'a>() -> impl Parser<'a, String> { map( right( match_literal("\""), left( zero_or_more(pred(any_char, |c| *c != '"')), match_literal("\""), ), ), |chars| chars.into_iter().collect(), ) }
It would read a lot more nicely if we could make those combinators methods on the Parser trait instead. What if we could just chain them onto our parsers directly?
The snag is that our combinators return impl Trait types, and you cannot use impl Trait in the return position of a trait method.
…but now we have BoxedParser. We cannot declare a trait method that returns impl Parser<'a, A>, but we most certainly can declare one that returns BoxedParser<'a, A>.
Best of all, we can declare these on the trait itself with default implementations, so we do not have to reimplement them for every type that implements Parser.
Let's try it out with map, extending our Parser trait like this:
trait Parser<'a, Output> { fn parse(&self, input: &'a str) -> ParseResult<'a, Output>; fn map<F, NewOutput>(self, map_fn: F) -> BoxedParser<'a, NewOutput> where Self: Sized + 'a, Output: 'a, NewOutput: 'a, F: Fn(Output) -> NewOutput + 'a, { BoxedParser::new(map(self, map_fn)) } }
That is a lot of 'as, but, alas, they are all necessary. Luckily, we can reuse our old combinator functions unchanged, and, as an added bonus, not only do we get a nicer syntax for applying them, we also get rid of the exploding impl Trait types by automatically boxing them away.
Now we can spruce up the quoted_string parser slightly:
fn quoted_string<'a>() -> impl Parser<'a, String> { right( match_literal("\""), left( zero_or_more(pred(any_char, |c| *c != '"')), match_literal("\""), ), ) .map(|chars| chars.into_iter().collect()) }
It instantly reads much more nicely, with the .map() clearly applying to the result of the right().
We might be tempted to give pair, left and right the same treatment, but in the case of those three I think the code reads more easily with them as functions, because they mirror the structure of pair's output type. If you disagree, it is entirely possible to add them to the trait just as we did with map, and you are very welcome to try it as an exercise.
Another prime candidate, though, is pred. Let's add a definition for it to the Parser trait:
fn pred<F>(self, pred_fn: F) -> BoxedParser<'a, Output> where Self: Sized + 'a, Output: 'a, F: Fn(&Output) -> bool + 'a, { BoxedParser::new(pred(self, pred_fn)) }
This lets us rewrite the line in quoted_string that uses pred like so:
zero_or_more(any_char.pred(|c| *c != '"')),
I think this reads a little better, and I think we will leave the zero_or_more as it is - it reads like "zero or more of any_char with the following predicate applied", and that sounds about right to me. Then again, you could also go ahead and move zero_or_more and one_or_more into the trait if you prefer.
Besides this quoted_string rewrite, let's also give the map in single_element the same treatment:
fn single_element<'a>() -> impl Parser<'a, Element> { left(element_start(), match_literal("/>")).map(|(name, attributes)| Element { name, attributes, children: vec![], }) }
Let's run element_start and our earlier tests through the compiler again and see whether things have improved... and they have: the code reads better, and compile times look a good deal healthier too.
With map and pred now available as neatly boxed methods, we are ready for the last big push: giving our elements children!
First, let's write a parser for the opening tag of a parent element. It is almost identical to single_element, except that it ends with a > rather than a />. It is also followed by zero or more children and a closing tag, but first we need to parse the opening tag itself, so let's get that done.
fn open_element<'a>() -> impl Parser<'a, Element> { left(element_start(), match_literal(">")).map(|(name, attributes)| Element { name, attributes, children: vec![], }) }
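A quick hypothetical check that the opening tag parser does what we expect with an attribute-free tag:

#[test]
fn open_element_parser() {
    assert_eq!(
        Ok((
            "",
            Element {
                name: "div".to_string(),
                attributes: vec![],
                children: vec![]
            }
        )),
        open_element().parse("<div>")
    );
}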
Now, how do we get those children? They are going to be either single elements or parent elements themselves, and there are zero or more of them, so we have our trusty zero_or_more combinator, but what do we feed it? One thing we have not dealt with yet is a multiple choice parser: something that parses either a single element or a parent element.
To get there, we need a combinator that tries two parsers in order: if the first parser succeeds, we are done, we return its result and that is that. If it fails, instead of returning an error, we try the second parser on the same input. If that succeeds, great, and if it does not, we return the error too, since that means both of our parsers failed, and that is an overall failure.
fn either<'a, P1, P2, A>(parser1: P1, parser2: P2) -> impl Parser<'a, A> where P1: Parser<'a, A>, P2: Parser<'a, A>, { move |input| match parser1.parse(input) { ok @ Ok(_) => ok, Err(_) => parser2.parse(input), } }
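Before we put it to work, here is a small hypothetical check of either on its own:

#[test]
fn either_combinator() {
    // Tries the first parser, and falls back to the second only if the first fails.
    let parser = either(match_literal("ha"), match_literal("ho"));
    assert_eq!(Ok(("", ())), parser.parse("ha"));
    assert_eq!(Ok(("", ())), parser.parse("ho"));
    assert_eq!(Err("hi"), parser.parse("hi"));
}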
This lets us declare a parser element that matches either a single element or a parent element (for now we will just use open_element to stand in for the latter, and deal with the children once we have got that far).
fn element<'a>() -> impl Parser<'a, Element> { either(single_element(), open_element()) }
Now let's add a parser for the closing tag. It has the interesting property of having to match the opening tag, which means the parser has to know what the opening tag's name is. But that is what function arguments are for, yes?
fn close_element<'a>(expected_name: String) -> impl Parser<'a, String> { right(match_literal("</"), left(identifier, match_literal(">"))) .pred(move |name| name == &expected_name) }
That pred combinator is turning out to be really useful, is it not?
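It is not much work to convince ourselves it does the job; a hypothetical test of my own:

#[test]
fn close_element_parser() {
    // The closing tag only parses if its name matches the one we were given.
    assert_eq!(
        Ok(("", "top".to_string())),
        close_element("top".to_string()).parse("</top>")
    );
    assert_eq!(
        Err("</bottom>"),
        close_element("top".to_string()).parse("</bottom>")
    );
}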
And now, let's put it all together for the full parent element parser, children and all:
fn parent_element<'a>() -> impl Parser<'a, Element> { pair( open_element(), left(zero_or_more(element()), close_element(…oops)), ) }
Oops. How do we pass that argument to close_element now? I think we are short one final combinator.
We are so close now. Once we have solved this last problem and got parent_element working, we should be able to replace the open_element placeholder in the element parser with the new parent_element, and that is it: a fully working XML parser.
Remember how I said we would get back to and_then later? Well, later is here. The combinator we need is, in fact, and_then: we need something that takes a parser, and a function that takes the result of that parser and returns a new parser, which we then run. It is a bit like pair, except that instead of just collecting both results in a tuple, we thread them through a function. It is also exactly how and_then works on Result and Option, except that it is easier to follow there, because Result and Option do not really do anything, they are just things that hold some data (or not, as the case may be).
So let's try writing an implementation for it.
fn and_then<'a, P, F, A, B, NextP>(parser: P, f: F) -> impl Parser<'a, B> where P: Parser<'a, A>, NextP: Parser<'a, B>, F: Fn(A) -> NextP, { move |input| match parser.parse(input) { Ok((next_input, result)) => f(result).parse(next_input), Err(err) => Err(err), } }
Looking at the types: there are a lot of type variables, but we know P, our input parser, which has a result type of A. Our function F is the crucial difference: where map had a function from A to B, and_then takes a function from A to a new parser NextP, which has a result type of B. The final result type is B, so we can assume that our NextP parser is what produces that final result.
The code is a little less complicated: we start by running our input parser, and if it fails, it fails and we are done, but if it succeeds, we call our function f on the result (of type A), and what comes out of f(result) is a new parser, with a result type of B. We run this new parser on the next bit of input and return its result directly. If it fails, it fails there, and if it succeeds, we have our B.
One more time: we first run our parser P, and if it succeeds, we call the function f with its result to get our next parser NextP, which we then run, and the result of that is the final result.
Let's also add it straight to the Parser trait, because this, like map, is definitely going to read better with the method syntax.
fn and_then<F, NextParser, NewOutput>(self, f: F) -> BoxedParser<'a, NewOutput> where Self: Sized + 'a, Output: 'a, NewOutput: 'a, NextParser: Parser<'a, NewOutput> + 'a, F: Fn(Output) -> NextParser + 'a, { BoxedParser::new(and_then(self, f)) }
OK, so what is it good for?
First of all, we can almost implement pair using it:
fn pair<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, (R1, R2)> where P1: Parser<'a, R1> + 'a, P2: Parser<'a, R2> + 'a, R1: 'a + Clone, R2: 'a, { parser1.and_then(move |result1| parser2.map(move |result2| (result1.clone(), result2))) }
It looks very neat, but there is a problem: parser2.map() consumes parser2 to create the wrapped parser, and the closure is a Fn, not a FnOnce, so it is not allowed to consume parser2, only to take a reference to it. Rust Problems, in other words. In a higher level language, where these things are not an issue, this would have been a really neat way to define pair.
What we can do even in Rust, though, is use that function to lazily generate the right version of our close_element parser - in other words, to get that argument passed into it.
Recall our failed attempt:
fn parent_element<'a>() -> impl Parser<'a, Element> { pair( open_element(), left(zero_or_more(element()), close_element(…oops)), ) }
Using and_then, we can now make this work by building the right version of close_element on the spot.
fn parent_element<'a>() -> impl Parser<'a, Element> { open_element().and_then(|el| { left(zero_or_more(element()), close_element(el.name.clone())).map(move |children| { let mut el = el.clone(); el.children = children; el }) }) }
It looks a bit more complicated now, because the and_then has to go on open_element(), which is where we learn the name that goes into close_element. This means that the rest of the parser after open_element has to be constructed inside the and_then closure. Moreover, because that closure is now the sole recipient of the Element result from open_element, the parser we return also has to carry that information forward.
The inner closure, which we map over the child parser, has a reference to the Element (el) from the outer closure. We have to clone() it because we are in a Fn, and so only have a reference to it. We take the result of the inner parser (our Vec<Element> of children), put it into our cloned Element, and return that as the final result.
All we need to do now is go back to our element parser and change open_element to parent_element, so that it parses the whole element structure instead of just its beginning, and I believe we are done!
Remember how we talked about the map pattern being called a "functor" over on planet Haskell?
The and_then pattern is another thing you see a lot in Rust, generally in the same places as map. It is called flat_map on Iterator, but it is the same pattern as the rest.
The fancy word for it is "monad". If you have a thing Thing<A>, and you have an and_then function available to which you can pass a function from A to Thing<B>, so that you now have a new Thing<B> instead, that is a monad.
The function might get called instantly, as with Option<A>: we already know whether it is a Some(A) or a None, so we apply the function directly if it is a Some(A), giving us a Some(B).
It might also be called lazily. For instance, if you have a Future<A> that is still waiting to resolve, instead of and_then immediately calling the function to create a Future<B>, it creates a new Future<B> that contains both the Future<A> and the function, and then waits for Future<A> to finish. When it does, it calls the function with the result of the Future<A>, and Bob's your uncle [1], you get your Future<B> back. In other words, in the case of a Future you can think of the function you pass to and_then as a callback function, because it gets called with the result of the original future when it completes. It is also a little more interesting than that, because it returns a new Future, which may or may not already have resolved, so it is a way of chaining futures together.
As with functors, though, Rust's type system cannot currently express monads, so let's only note that this pattern is called a monad, that, rather disappointingly, it has nothing at all to do with burritos, contrary to what they say on the internets, and move on.
One last thing: we should be able to parse some XML now, except that our parser is not very forgiving of whitespace. Arbitrary whitespace should be allowed between tags, so that we are free to put line breaks and indentation between them (and whitespace should in principle be allowed between identifiers and literals, like < div / >, but let's skip that).
We should be able to knock together a quick combinator for that with no effort at this point.
fn whitespace_wrap<'a, P, A>(parser: P) -> impl Parser<'a, A> where P: Parser<'a, A>, { right(space0(), left(parser, space0())) }
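A quick hypothetical check that whitespace on either side really is ignored:

#[test]
fn whitespace_wrap_parser() {
    let parser = whitespace_wrap(match_literal("hi"));
    assert_eq!(Ok(("", ())), parser.parse("  hi  "));
    assert_eq!(Ok(("", ())), parser.parse("hi"));
}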
If we wrap element in whitespace_wrap, it will ignore all leading and trailing whitespace around element, which means we are free to use as many line breaks and as much indentation as we like.
fn element<'a>() -> impl Parser<'a, Element> { whitespace_wrap(either(single_element(), parent_element())) }
And with that, I believe we are finally there! Let's write an integration test to celebrate.
#[test] fn xml_parser() { let doc = r#" <top label="Top"> <semi-bottom label="Bottom"/> <middle> <bottom label="Another bottom"/> </middle> </top>"#; let parsed_doc = Element { name: "top".to_string(), attributes: vec![("label".to_string(), "Top".to_string())], children: vec![ Element { name: "semi-bottom".to_string(), attributes: vec![("label".to_string(), "Bottom".to_string())], children: vec![], }, Element { name: "middle".to_string(), attributes: vec![], children: vec![Element { name: "bottom".to_string(), attributes: vec![("label".to_string(), "Another bottom".to_string())], children: vec![], }], }, ], }; assert_eq!(Ok(("", parsed_doc)), element().parse(doc)); }
And one more test, just to make sure we got it right: a mismatched closing tag should be rejected:
#[test] fn mismatched_closing_tag() { let doc = r#" <top> <bottom/> </middle>"#; assert_eq!(Err("</middle>"), element().parse(doc)); }
There is one glaring omission the keen eyed will have noticed by now: error reporting. Our parser just hands back the bit of input it could not parse, which tells you roughly where things went wrong but not much else. A production grade parser library would do better: it would track line and column numbers, attach errors to the exact place where the parse failed, and produce messages a human being has some chance of understanding. That is all very doable, but it is beyond the scope of this article.
The main point stands, though: you now know how parser combinators work, because you have just built a set of them from nothing but the standard library! That knowledge carries over directly to the mature parser combinator libraries out there, which work on exactly the same principles [2].
So, armed with this knowledge, go forth and write parsers!
If you want to use this style of parsing in production code, the pom crate is very much in the spirit of the parsers we have just built, and a good place to start.
The most popular parser combinator library for Rust is nom - so well regarded that pom, mentioned above, is named in its honour - and it is well worth a look.
combine is another parser combinator library for Rust, also well worth checking out.
Moving beyond Rust, if you can read Haskell, the gold standard of parser combinator libraries is Parsec.
Finally, my first introduction to parser combinators was the book "Programming in Haskell" by Graham Hutton, which is a great read and has the happy side effect of also teaching you Haskell.
This work by Bodil Stokke is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ .
1: He is not really your uncle.
2: Please do not be that person at parties.