Hello, Habr! I present to you a translation of the article "Learning Parser Combinators With Rust".
This article teaches the basics of parser combinators to people who are already familiar with Rust. No other knowledge is assumed, and everything that is not directly related to Rust, as well as some of the more unexpected aspects of using it here, will be explained. This article will not teach you Rust if you do not already know it, and in that case you are also unlikely to learn parser combinators well from it. If you want to learn Rust, I recommend the book "The Rust Programming Language".
In the life of every programmer, there comes a time when they need a parser.
The novice programmer will ask: "What is a parser?"
The intermediate programmer will say: "That's easy, I'll write a regular expression."
The master programmer will say: "Stand back, I know lex and yacc."
The novice has the right idea.
Not that regular expressions are bad. (But please do not try to write a complex parser as a regular expression.) Nor is there no joy in using powerful tools such as parser and lexer generators that have been honed to perfection over millennia. But learning the basics of parsers from scratch is a pleasure. It is also what you will be missing out on if you panic and reach for regular expressions or parser generators, both of which are only abstractions over the real problem. In the beginner's mind, as a wise man once said, there are many possibilities. In the expert's mind, there is only one.
In this article, we are going to learn how to build a parser from scratch using a technique common in functional programming languages, known as parser combinators. They have the advantage of being surprisingly powerful once you grasp the basic idea behind them, while staying very close to first principles, since the only abstractions here are the ones you create yourself: the main combinators you use here are ones you will have built yourself.
It is strongly recommended that you start a fresh Rust project and type the code snippets into its src/lib.rs as you read (you can paste them straight from the page, but typing them is better, because it guarantees that you read the code in full). All the code you need is introduced in the article in order. Keep in mind that updated versions of functions you wrote earlier are sometimes introduced, and in those cases you should replace the old version with the new one.
The code was written for rustc version 1.34.0 using the 2018 edition of the language. You should be able to use any recent compiler version as long as you stay on the 2018 edition (make sure your Cargo.toml contains edition = "2018"). The project needs no external dependencies.
To run the tests presented in the article, use cargo test, as you would expect.
We are going to write a parser for a simplified version of XML. It looks like this:
<parent-element> <single-element attribute="value" /> </parent-element>
XML elements open with the symbol < and an identifier consisting of a letter followed by any number of letters, numbers and -. This is followed, optionally, by whitespace and a list of attribute pairs: another identifier as defined above, followed by a = and a double quoted string. Finally, there is either a closing /> to signify a single element with no children, or a > to signify that a sequence of child elements follows, eventually followed by a closing tag starting with </, then an identifier that must match that of the opening tag, and a final >.
That is all we will support. There will be no namespaces, no text nodes, and most certainly there will be no schema check. We don’t even have to worry about escaping quotation marks for strings — they start with the first double quotation mark, and they end with the next one, and that’s it.
We are going to parse these elements into a structure that looks like this:
#[derive(Clone, Debug, PartialEq, Eq)] struct Element { name: String, attributes: Vec<(String, String)>, children: Vec<Element>, }
There are no fancy types, just a string for the name (this is the identifier at the beginning of each tag), attributes in the form of pairs of strings (identifier and value) and a list of child elements that look exactly like the parent.
(If you are typing along, be sure to include that derive attribute. You will need it later.)
Well then, it's time to write a parser.
Parsing is the process of obtaining structured data from a stream of information. A parser is what brings out this structure.
In the discipline we are about to explore, a parser, in its simplest form, is a function that takes some input and returns either the parsed output along with the remainder of the input, or an error saying "I could not parse this."
That, it turns out, is also what a parser looks like in its more complicated forms. You can complicate what the input, the output and the error all mean, and you will have to if you want good error messages, but the parser stays the same: something that consumes an input and either produces a parsed output along with whatever is left of the input, or lets you know that it could not parse the input into the output.
Let's write that down as a function type.
Fn(Input) -> Result<(Input, Output), Error>
More specifically, in our case we want to fill in the types so that we get something like this. What we are going to do is convert a string into an Element
structure. At this point we do not want to get into the intricacies of error reporting, so we will just return the part of the input that we could not parse. As a result, our function will look like this:
Fn(&str) -> Result<(&str, Element), &str>
We use a string slice because it is an efficient pointer to a piece of a string, and we can slice it further however we please, "consuming" the input by slicing off the bit we have parsed and returning the remainder along with the result.
It might have been cleaner to use &[u8] (a slice of bytes, corresponding to characters if we limit ourselves to ASCII) as the input type, especially because string slices behave a little differently from most slices: you cannot index them with a single number like input[0]; you have to use a range like input[0..1]. On the other hand, string slices have many methods useful for parsing strings that byte slices lack.
In fact, we are going to rely on those methods rather than on indexing by character, because of Unicode. In UTF-8, and all Rust strings are valid UTF-8, byte indices do not always correspond to individual characters, and it is better for everyone involved to ask the standard library to deal with that for us.
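To make that concrete, here is a small sketch of my own (not from the original article) showing how &str indexing behaves:

fn first_byte_slice(input: &str) -> Option<&str> {
    // `input[0]` does not compile: `str` cannot be indexed with a single usize.
    // A range like `input[0..1]` does work, but it counts bytes, not characters,
    // and it panics if the range does not land on a character boundary.
    // The checked variant, `get`, returns None in those cases instead.
    input.get(0..1)
}

Calling first_byte_slice("hello") gives Some("h"), while first_byte_slice("éh") gives None, because the first character is two bytes long in UTF-8 - a reminder that these ranges are byte offsets, not character indexes.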
Let's try writing a parser that just looks at the first character in a string
and decides whether it is the letter a
.
fn the_letter_a(input: &str) -> Result<(&str, ()), &str> { match input.chars().next() { Some('a') => Ok((&input['a'.len_utf8()..], ())), _ => Err(input), } }
First, let's look at the types of the input and output: we take a string slice as input, and we return a Result of either (&str, ()) or an error with &str. The (&str, ()) pair is the interesting part: as we said, we are supposed to return a tuple containing the result of the parse and the rest of the input. The &str is the remaining input, and the result is just the unit type (), because if this parser succeeds, it can only have one possible result (we found the letter a), and we do not particularly need to return the letter a in this case; we only need to indicate that we found it.
So, let's look at the code of the parser itself. We were not joking when we said we would rely on the standard library to avoid Unicode headaches: we get an iterator over the characters of the string with the chars() method and take the first character from it. This is an item of type char wrapped in an Option, where None means we were trying to get a char out of an empty string.
To make matters worse, a char is not necessarily even what you think of as a character in Unicode. What you think of as a character is most likely what Unicode calls a "grapheme cluster", which can consist of several chars, which in turn represent "scalar values", two steps down from grapheme clusters. However, that way lies madness, and for our purposes we are honestly unlikely to even see chars outside of ASCII, so let's leave it at that.
We match on the Some('a') pattern, which is the specific result we are looking for, and if it matches, we return our success value: Ok((&input['a'.len_utf8()..], ())). That is, we remove the part we just parsed (the 'a') from the string slice and return the rest, along with our parsed value, which is just the empty (). Ever mindful of the Unicode monster, before slicing we ask the standard library for the UTF-8 length of the character 'a' - it is 1, but never forget about the Unicode monster.
If we get any other Some(char), or if we get None, we return an error. As you will recall, our error type is simply the string slice we were given as input, the one that failed to parse. It did not start with an a, so that is our error. It is not a great error, but at least it is marginally better than just "something is wrong somewhere."
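The article does not bother testing this throwaway parser, but if you are typing along, a quick sanity check of my own might look like this:

#[test]
fn the_letter_a_parser() {
    // Consumes the 'a' and returns the rest of the input alongside the unit result.
    assert_eq!(Ok(("bc", ())), the_letter_a("abc"));
    // Anything that does not start with 'a', including the empty string, is an error.
    assert_eq!(Err("xyz"), the_letter_a("xyz"));
    assert_eq!(Err(""), the_letter_a(""));
}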
We do not actually need a parser for the letter a in order to parse XML, but the first thing we will have to do is look for the opening <, so we are going to need something very similar. We will also need to parse >, / and = specifically, so maybe we can write a function that builds a parser for whatever character we want?
Let's not even settle for that: let's write a function that builds a parser for a static string of any length, not just a single character. It is actually easier that way, because a slice of a string slice is already a valid UTF-8 string slice, and we do not have to think about the Unicode monster.
fn match_literal(expected: &'static str) -> impl Fn(&str) -> Result<(&str, ()), &str> { move |input| match input.get(0..expected.len()) { Some(next) if next == expected => { Ok((&input[expected.len()..], ())) } _ => Err(input), } }
Now this looks a little different.
First, let's look at the types. Instead of our function being a parser, it now takes the expected string as an argument and returns something that looks like a parser. It is a function that returns a higher order function. Essentially, we are writing a function that makes a function like our the_letter_a function from before.
So, instead of doing the work in the function body, we return a closure that does the work and that matches our type signature for a parser from before.
The pattern match looks the same, except that we cannot match on our string literal directly, because we do not know what it is, so we use the match guard if next == expected instead. Otherwise it is exactly as before, just inside the body of a closure.
Let's write a test for this to make sure we get it right.
#[test] fn literal_parser() { let parse_joe = match_literal("Hello Joe!"); assert_eq!( Ok(("", ())), parse_joe("Hello Joe!") ); assert_eq!( Ok((" Hello Robert!", ())), parse_joe("Hello Joe! Hello Robert!") ); assert_eq!( Err("Hello Mike!"), parse_joe("Hello Mike!") ); }
First, we build the parser: match_literal("Hello Joe!"). It should consume the string "Hello Joe!" and return the rest of the string, or it should fail and return the whole string.
In the first case, we just feed it the exact string it expects, and we see that it returns an empty string and the () value that means "we parsed the expected string, and you do not really need it returned to you."
In the second, we feed it the string "Hello Joe! Hello Robert!", and we see that it does indeed consume the string "Hello Joe!" and return the remainder of the input: " Hello Robert!" (the leading space and all).
In the third, we feed it some incorrect input, "Hello Mike!", and note that it does reject the input with an error. Not that there is anything wrong with Mike in general, he is just not what this parser was looking for.
So now we have a parser for <, for >, for =, and even for </ and />. We are practically done!
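For instance (this example is mine, not the article's), the single purpose parsers we just gained for free look like this in use:

#[test]
fn xml_punctuation_parsers() {
    // Each call to match_literal builds a fresh parser for one fixed string.
    let open_angle = match_literal("<");
    let self_close = match_literal("/>");
    assert_eq!(Ok(("widget/>", ())), open_angle("<widget/>"));
    assert_eq!(Ok(("", ())), self_close("/>"));
    assert_eq!(Err("widget/>"), self_close("widget/>"));
}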
The next thing after the opening < is the element name. We cannot do that with a simple string comparison. But we could do it with a regular expression…
…but let's restrain ourselves. It would be a regular expression simple enough to replicate in plain code, and we do not need to pull in the regex crate for it. Let's see if we can write our own parser for it using nothing but Rust's standard library.
Recalling the rule for the element name identifier: one alphabetical character, followed by zero or more of either an alphabetical character, a number, or a dash -.
fn identifier(input: &str) -> Result<(&str, String), &str> { let mut matched = String::new(); let mut chars = input.chars(); match chars.next() { Some(next) if next.is_alphabetic() => matched.push(next), _ => return Err(input), } while let Some(next) = chars.next() { if next.is_alphanumeric() || next == '-' { matched.push(next); } else { break; } } let next_index = matched.len(); Ok((&input[next_index..], matched)) }
As always, we look at the type first. This time we are not writing a function that builds a parser, we are just writing the parser itself, like the first time. The notable difference here is that instead of a result type of (), we return a String in the tuple along with the remaining input. This String is going to contain the identifier we just parsed.
With that in mind, first we create an empty String and call it matched. This is going to be our result value. We also get an iterator over the characters in input, which we are going to start pulling apart.
The first step is to see whether there is a letter up front. We pull the first character off the iterator and check whether it is a letter: next.is_alphabetic(). The Rust standard library is of course here to help us with Unicode: this will match letters in any alphabet, not just ASCII. If it is a letter, we push it onto our matched String, and if it is not, clearly we are not looking at an element identifier, so we return immediately with an error.
For the second step, we keep pulling characters off the iterator, pushing them onto the String we are building, until we find one that neither satisfies is_alphanumeric() (which is like is_alphabetic() except that it also includes digits in any alphabet) nor is the dash '-'.
The first time we see something that does not match those criteria, it means we are done parsing, so we break out of the loop and return the String, remembering to slice off the part we consumed from the input. Likewise if the iterator runs out of characters, which means we reached the end of the input.
It is worth noting that we do not return with an error when we see something that is not alphanumeric or a dash. We already have enough to make a valid identifier once we have matched that first letter, and it is perfectly normal for there to be more things to parse in the input string after our identifier, so we simply stop parsing and return our result. It is only if we cannot find even that first letter that we actually return an error, because in that case there was definitely no identifier here.
Remember that the Element
structure is what we are going to parse our XML document into.
struct Element { name: String, attributes: Vec<(String, String)>, children: Vec<Element>, }
In fact, we have just finished the parser for the first part of it, the name field. The String our parser returns goes right in there. It is also the right parser for the first part of every attribute.
Let's check it out.
#[test] fn identifier_parser() { assert_eq!( Ok(("", "i-am-an-identifier".to_string())), identifier("i-am-an-identifier") ); assert_eq!( Ok((" entirely an identifier", "not".to_string())), identifier("not entirely an identifier") ); assert_eq!( Err("!not at all an identifier"), identifier("!not at all an identifier") ); }
We see that in the first case, the string "i-am-an-identifier" is parsed in its entirety, leaving only the empty string. In the second case, the parser returns "not" as the identifier, and the rest of the string comes back as the remaining input. In the third case, the parser fails outright because the first character it finds is not a letter.
Now we can parse the opening <, and we can parse the following identifier, but we need to parse both, in order. So the next step will be to write another parser builder function: one that takes two parsers as input and returns a new parser that parses both of them in order. In other words, a parser combinator, because it combines two parsers into one. Let's see if we can do that.
fn pair<P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Fn(&str) -> Result<(&str, (R1, R2)), &str> where P1: Fn(&str) -> Result<(&str, R1), &str>, P2: Fn(&str) -> Result<(&str, R2), &str>, { move |input| match parser1(input) { Ok((next_input, result1)) => match parser2(next_input) { Ok((final_input, result2)) => Ok((final_input, (result1, result2))), Err(err) => Err(err), }, Err(err) => Err(err), } }
It gets a little trickier here, but you know what to do: start by looking at types.
First of all, we have four type variables: P1, P2, R1 and R2. That is Parser 1, Parser 2, Result 1 and Result 2. P1 and P2 are functions, and you will notice that they follow the well established pattern of parser functions: just like the return value, they take a &str as input and return a Result of a pair of the remaining input and the result, or an error.
But look at the result types of each function: P1 is a parser that produces an R1 if successful, and P2 likewise produces an R2. And the result of the final parser, the one returned from our function, is (R1, R2). So the job of this parser is to first run the parser P1 on the input, keep its result, then run P2 on the input that P1 returned, and, if both of them succeeded, combine the two results into the tuple (R1, R2).
Looking at the code, we see that this is exactly what it does. We start by running the first parser on the input, then the second parser, then we combine the two results into a tuple and return that. If either parser fails, we return immediately with the error it gave.
This way, we should be able to combine our two parsers, match_literal and identifier, to actually parse the first bit of our first XML tag. Let's write a test to see if it works.
#[test] fn pair_combinator() { let tag_opener = pair(match_literal("<"), identifier); assert_eq!( Ok(("/>", ((), "my-first-element".to_string()))), tag_opener("<my-first-element/>") ); assert_eq!(Err("oops"), tag_opener("oops")); assert_eq!(Err("!oops"), tag_opener("<!oops")); }
It seems to work! But look at that result type: ((), String). Obviously we only care about the right hand value, the String. This is going to be the case rather often - some of our parsers only match patterns in the input without producing values, so their output can be safely discarded. To accommodate this pattern, we are going to use our pair combinator to write two other combinators: left, which discards the result of the first parser and returns only the second, and its opposite number, right - the one we would have wanted to use in our test above instead of pair, the one that discards the left hand side of the pair and keeps only our String.
But before we go that far, let's introduce another combinator that will greatly simplify the writing of these two: map
.
This combinator has one job: to change the type of the result. For instance, say you have a parser that returns ((), String) and you want to change it to return just the String, as in our example above.
To do that, we pass it a function that knows how to convert the original type into the new one. In our example, that is simple: |(_left, right)| right. In more general form, it looks like Fn(A) -> B, where A is the original result type of the parser and B is the new one.
fn map<P, F, A, B>(parser: P, map_fn: F) -> impl Fn(&str) -> Result<(&str, B), &str> where P: Fn(&str) -> Result<(&str, A), &str>, F: Fn(A) -> B, { move |input| match parser(input) { Ok((next_input, result)) => Ok((next_input, map_fn(result))), Err(err) => Err(err), } }
So what does the signature say? P is our parser. On success it returns A. F is the function we are going to use to map P into our new parser, which looks just like P except that its result type is B instead of A.
In the code, we run parser(input), and if it succeeds, we take the result and run our function map_fn(result) on it, turning the A into a B, and that is our converted parser done.
Actually, let's indulge ourselves and shorten this function a bit, because this map thing turns out to be a common pattern that Result implements as well:
fn map<P, F, A, B>(parser: P, map_fn: F) -> impl Fn(&str) -> Result<(&str, B), &str> where P: Fn(&str) -> Result<(&str, A), &str>, F: Fn(A) -> B, { move |input| parser(input) .map(|(next_input, result)| (next_input, map_fn(result))) }
This pattern is what is called a "functor" in Haskell and in its mathematical sibling, category theory. If you have a thing with a type A in it, and you have a map function available to which you can pass a function from A to B in order to turn it into the same kind of thing but with a type B in it, that is a functor. You see this in a lot of places in Rust, such as Option, Result, Iterator and even Future, without it being explicitly named as such. And there is a good reason for that: you cannot really express a functor as a generalised thing in Rust's type system, because it lacks higher kinded types, but that is another story, so let's just note that these are functors, and that you only need to look for the map function to spot one.
You may have noticed by now that we keep repeating the shape of the parser type signature: Fn(&str) -> Result<(&str, Output), &str>. You may be getting as tired of reading it written out in full as I am of writing it, so I think it is time to introduce a trait, to make things a little more readable and to let us add some extensibility to our parsers.
But first of all, let's make a type alias for the return type we keep using:
type ParseResult<'a, Output> = Result<(&'a str, Output), &'a str>;
So that now, instead of typing out that monstrosity all the time, we can just write ParseResult<String> or similar. We have added a lifetime there, because the type declaration requires it, but a lot of the time the Rust compiler should be able to infer it for you. As a rule, try leaving the lifetime out and see if rustc complains, then put it in if it does.
The lifetime 'a, in this case, refers specifically to the lifetime of the input.
Now, for the trait. We need to put the lifetime in here as well, and when you use the trait, the lifetime is usually required every time. It is a bit of extra typing, but it beats the previous version.
trait Parser<'a, Output> { fn parse(&self, input: &'a str) -> ParseResult<'a, Output>; }
It has just one method for now: the parse() method, which should look familiar: it is the same shape as the parser function we have been writing.
To make this even easier, we can actually implement this trait for any function that matches the signature of a parser:
impl<'a, F, Output> Parser<'a, Output> for F where F: Fn(&'a str) -> ParseResult<Output>, { fn parse(&self, input: &'a str) -> ParseResult<'a, Output> { self(input) } }
This way, not only can we keep passing around the same functions we have been using as parsers, now fully implementing the Parser trait, we also open up the possibility of using other kinds of types as parsers.
More importantly, though, it saves us from having to type out those function signatures all the time. Let's rewrite the map function to see how it works out.
fn map<'a, P, F, A, B>(parser: P, map_fn: F) -> impl Parser<'a, B> where P: Parser<'a, A>, F: Fn(A) -> B, { move |input| parser.parse(input) .map(|(next_input, result)| (next_input, map_fn(result))) }
One thing to note here in particular: instead of calling the parser directly as a function, we now have to say parser.parse(input), because we do not know whether the type P is a function, we only know that it implements Parser, so we have to stick to the interface Parser provides. Otherwise, the function body looks exactly the same and the types look very similar. There is just the new lifetime 'a adding a bit of noise, but otherwise it is the same function.
Let's give the pair function the same treatment while we are at it:
fn pair<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, (R1, R2)> where P1: Parser<'a, R1>, P2: Parser<'a, R2>, { move |input| match parser1.parse(input) { Ok((next_input, result1)) => match parser2.parse(next_input) { Ok((final_input, result2)) => Ok((final_input, (result1, result2))), Err(err) => Err(err), }, Err(err) => Err(err), } }
Same as before: the only changes here are the tidied up types and having to say parser.parse(input) instead of parser(input).
Actually, while we are at it, let's tidy up pair even further, this time using map as well:
fn pair<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, (R1, R2)> where P1: Parser<'a, R1>, P2: Parser<'a, R2>, { move |input| { parser1.parse(input).and_then(|(next_input, result1)| { parser2.parse(next_input) .map(|(last_input, result2)| (last_input, (result1, result2))) }) } }
The and_then method on Result is similar to map, except that the mapping function does not return the new value to go inside the Result, but a new Result altogether. The code above is identical in effect to the previous version written out with all those match blocks. We will get back to and_then later, but right now, let's get those left and right combinators implemented, now that we have a nice and tidy map to write them with:
fn left<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, R1> where P1: Parser<'a, R1>, P2: Parser<'a, R2>, { map(pair(parser1, parser2), |(left, _right)| left) } fn right<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, R2> where P1: Parser<'a, R1>, P2: Parser<'a, R2>, { map(pair(parser1, parser2), |(_left, right)| right) }
We use the pair combinator to combine the two parsers into a parser for a tuple of their results, and then we use the map combinator to select just the part of the tuple we want to keep.
Before we can test this, we need to update two of our existing functions to use the Parser trait and ParseResult. match_literal is the more involved one:
fn match_literal<'a>(expected: &'static str) -> impl Parser<'a, ()> { move |input: &'a str| match input.get(0..expected.len()) { Some(next) if next == expected => Ok((&input[expected.len()..], ())), _ => Err(input), } }
In addition to changing the return type, we also have to annotate the closure's argument type as &'a str, or rustc gets upset.
For identifier, just change the return type and you are done; the inferred lifetimes take care of the rest:
fn identifier(input: &str) -> ParseResult<String> {
Now let's update the tag opener test from earlier to use right instead of pair, keeping only the String and getting rid of that () cluttering up the result.
#[test] fn right_combinator() { let tag_opener = right(match_literal("<"), identifier); assert_eq!( Ok(("/>", "my-first-element".to_string())), tag_opener.parse("<my-first-element/>") ); assert_eq!(Err("oops"), tag_opener.parse("oops")); assert_eq!(Err("!oops"), tag_opener.parse("<!oops")); }
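The left combinator deserves a quick look too. A hypothetical mirror image of that test, keeping the identifier and discarding a trailing literal, might be:

#[test]
fn left_combinator() {
    // Keeps the identifier on the left and throws away the "!" on the right.
    let greeting = left(identifier, match_literal("!"));
    assert_eq!(Ok(("", "hello".to_string())), greeting.parse("hello!"));
    assert_eq!(Err("?"), greeting.parse("hello?"));
}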
Let's keep going with parsing that element. We have the opening <, and we have the identifier. What's next? That should be our first attribute pair.
No, wait, those attribute pairs are actually optional. We are going to have to learn to deal with things being optional.
And hold on, there is something we have to deal with even before we get to the first optional attribute pair: whitespace.
Between the end of the element name and the start of the first attribute name (if there is one), there is a space. We need to deal with that space.
Even worse, we need to deal with one or more spaces, because <element  attribute="value"/> is valid syntax too, even if it is a bit heavy on the spaces. So this seems like a good time to think about whether we can write a combinator that expresses the idea of one or more of a given parser.
We have already dealt with this in our identifier parser, but there it was all done manually. Not surprisingly, the code for the general idea is not all that different.
fn one_or_more<'a, P, A>(parser: P) -> impl Parser<'a, Vec<A>> where P: Parser<'a, A>, { move |mut input| { let mut result = Vec::new(); if let Ok((next_input, first_item)) = parser.parse(input) { input = next_input; result.push(first_item); } else { return Err(input); } while let Ok((next_input, next_item)) = parser.parse(input) { input = next_input; result.push(next_item); } Ok((input, result)) } }
First of all, the result type of the parser we are combining is A, and the result type of the combined parser is Vec<A> - any number of As.
The code does indeed look very similar to identifier. First, we parse the first element, and if it is not there, we return with an error. Then we parse as many more elements as we can, until the parser fails, at which point we return the vector with the elements we collected.
Looking at that code, how easy would it be to adapt it to the idea of zero or more? We just need to remove the first run of the parser:
fn zero_or_more<'a, P, A>(parser: P) -> impl Parser<'a, Vec<A>> where P: Parser<'a, A>, { move |mut input| { let mut result = Vec::new(); while let Ok((next_input, next_item)) = parser.parse(input) { input = next_input; result.push(next_item); } Ok((input, result)) } }
Let's write a couple of tests to make sure those two work.
#[test] fn one_or_more_combinator() { let parser = one_or_more(match_literal("ha")); assert_eq!(Ok(("", vec![(), (), ()])), parser.parse("hahaha")); assert_eq!(Err("ahah"), parser.parse("ahah")); assert_eq!(Err(""), parser.parse("")); } #[test] fn zero_or_more_combinator() { let parser = zero_or_more(match_literal("ha")); assert_eq!(Ok(("", vec![(), (), ()])), parser.parse("hahaha")); assert_eq!(Ok(("ahah", vec![])), parser.parse("ahah")); assert_eq!(Ok(("", vec![])), parser.parse("")); }
Note the difference between the two: for one_or_more, finding an empty string is an error, because it needs to see at least one case of its sub-parser, but for zero_or_more, an empty string just means the zero case, which is not an error.
At this point, it is tempting to start thinking about ways to generalise these two, because one is an exact copy of the other with just one piece removed. It might be tempting to express one_or_more in terms of zero_or_more with something like this:
fn one_or_more<'a, P, A>(parser: P) -> impl Parser<'a, Vec<A>> where P: Parser<'a, A>, { map(pair(parser, zero_or_more(parser)), |(head, mut tail)| { tail.insert(0, head); tail }) }
Here we run into Rust Problems, and I do not even mean the problem of Vec not having a cons method, although I know every Lisp programmer reading this code was thinking it. No, it is worse than that: it is ownership.
We own that parser, so we cannot go passing it as an argument twice; the compiler will start shouting at you that you are trying to move an already moved value. Can we make our combinators take references instead? No, it turns out we then run into a whole other set of borrow checker problems, and we are not going there right now. And because these parsers are functions, they do not do anything so straightforward as implement Clone, which would have saved the day very tidily, so we are stuck with a constraint: we cannot easily repeat our parsers inside combinators.
That is not necessarily a big problem, though. It means we cannot express one_or_more using combinators, but it turns out those two are usually the only combinators you need that tend to reuse parsers anyway. If you want to get really fancy, though, you could write a combinator that takes a RangeBound in addition to a parser and repeats it according to the range: range(0..) for zero_or_more, range(1..) for one_or_more, range(5..=6) for exactly five or six, whatever tickles your fancy.
Let's leave that as an exercise for the reader. Another exercise might be to find a way around the ownership issue, perhaps by wrapping the parser in an Rc to make it clonable.
For now, we are going to get along just fine with the one_or_more and zero_or_more we already have.
Now we have the building blocks we need to parse that whitespace with one_or_more, and to parse the attribute pairs with zero_or_more.
Actually, hold on a moment. We do not really want to parse the whitespace first and then the attributes. If you think about it, when there are no attributes, the whitespace is optional, and we could run into an immediate > or />. But if there is an attribute, there must be whitespace before it. Lucky for us, if there are several attributes, there must also be whitespace between them, so what we are really looking at is a sequence of zero or more occurrences of one or more whitespace items followed by an attribute.
We need a parser for a single whitespace item first. There are three ways we could go here.
Firstly, we could be silly and use our match_literal parser with a string containing a single space. Why is that silly? Because whitespace is also line breaks, tabs, and a whole lot of strange Unicode characters that render as whitespace. We are going to have to lean on Rust's standard library again, and of course char has an is_whitespace method, just as it had is_alphabetic and is_alphanumeric.
Secondly, we could write a parser that consumes any number of whitespace characters using the is_whitespace predicate, much like we wrote identifier earlier.
Thirdly, we could be clever, and we do like being clever. We could write a parser any_char that returns a single char as long as there is one left in the input, and a combinator pred that takes a parser and a predicate function, and combine the two like this: pred(any_char, |c| c.is_whitespace()). This has the added bonus of making it really easy to write the final parser we will need later: the quoted string for attribute values.
The any_char parser is straightforward, but we have to remember to be mindful of those UTF-8 gotchas.
fn any_char(input: &str) -> ParseResult<char> { match input.chars().next() { Some(next) => Ok((&input[next.len_utf8()..], next)), _ => Err(input), } }
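A quick hypothetical check, not in the original article, that any_char behaves as advertised:

#[test]
fn any_char_parser() {
    // Returns whatever character comes first, along with the rest of the input.
    assert_eq!(Ok(("bc", 'a')), any_char("abc"));
    // Multi-byte characters are sliced off by their full UTF-8 length.
    assert_eq!(Ok(("bc", 'é')), any_char("ébc"));
    // An empty input is an error.
    assert_eq!(Err(""), any_char(""));
}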
And the pred combinator holds nothing we have not seen before either. We run the parser, then we call our predicate function on the value if the parser succeeded, and only if that returns true do we actually return a success; otherwise we return just as much of an error as a failed parse would.
fn pred<'a, P, A, F>(parser: P, predicate: F) -> impl Parser<'a, A> where P: Parser<'a, A>, F: Fn(&A) -> bool, { move |input| { if let Ok((next_input, value)) = parser.parse(input) { if predicate(&value) { return Ok((next_input, value)); } } Err(input) } }
Let's write a quick test to make sure everything is in order:
#[test] fn predicate_combinator() { let parser = pred(any_char, |c| *c == 'o'); assert_eq!(Ok(("mg", 'o')), parser.parse("omg")); assert_eq!(Err("lol"), parser.parse("lol")); }
With these two in place, we can write our whitespace_char parser as a quick one-liner:
fn whitespace_char<'a>() -> impl Parser<'a, char> { pred(any_char, |c| c.is_whitespace()) }
And now that we have whitespace_char, we can also express the idea we were heading towards: one or more whitespace, and its sibling, zero or more whitespace. Let's allow ourselves some brevity and name them space1 and space0 respectively.
fn space1<'a>() -> impl Parser<'a, Vec<char>> { one_or_more(whitespace_char()) } fn space0<'a>() -> impl Parser<'a, Vec<char>> { zero_or_more(whitespace_char()) }
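If you want to convince yourself that these behave the way their names suggest, a hypothetical test of my own might be:

#[test]
fn space_parsers() {
    // space1 demands at least one whitespace character; space0 is happy with none.
    assert_eq!(Ok(("rest", vec![' ', ' '])), space1().parse("  rest"));
    assert_eq!(Err("rest"), space1().parse("rest"));
    assert_eq!(Ok(("rest", vec![])), space0().parse("rest"));
}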
With all that sorted, can we finally parse those attributes now? Yes, we just need to make sure we have parsers for all the individual components of an attribute. We already have identifier for the attribute name (although it is tempting to rewrite it using any_char and pred plus the *_or_more combinators; I will leave that as an exercise). match_literal("=") will do for the = sign. We are short just the quoted string parser, so let's build that. Fortunately, we already have all the combinators we need to do it.
fn quoted_string<'a>() -> impl Parser<'a, String> { map( right( match_literal("\""), left( zero_or_more(pred(any_char, |c| *c != '"')), match_literal("\""), ), ), |chars| chars.into_iter().collect(), ) }
The nesting of combinators is getting slightly annoying at this point, but we are going to resist the urge to refactor everything to fix it, and instead focus on what is going on here.
The outermost combinator is a map, which, because of that annoying nesting, is a terrible place to start if we want to understand this, so let's skip it for now. The first thing is a right, and the first part of the right is the one we are looking for: match_literal("\""). That is our opening quote.
The second part of the right is the rest of the string. It sits inside the left, and we quickly note that the right hand argument of that left, the one we ignore, is the other match_literal("\"") - the closing quote. So the left hand part is our quoted string.
We take advantage of our new pred and any_char here to get a parser that accepts anything but another quote, and we put that inside zero_or_more, so that what we are saying is: one quote, followed by zero or more things that are not another quote, followed by another quote. And, between the right and the left, we discard the quotes from the result value and get our quoted string back.
But wait, that is not a string. Remember what zero_or_more returns? A Vec<A> of the inner parser's A type. For any_char, that is char. What we have, then, is not a string but a Vec<char>. This is where the map comes in: we use it to turn the Vec<char> into a String, leveraging the fact that you can build a String from an Iterator<Item = char>, so we can just call vec_of_chars.into_iter().collect() and, thanks to the power of type inference, we get our String.
Before we go on, let's write a quick test to make sure we got this right, because if we needed that many words to explain it, it is probably not something we should take on faith.
#[test] fn quoted_string_parser() { assert_eq!( Ok(("", "Hello Joe!".to_string())), quoted_string().parse("\"Hello Joe!\"") ); }
Phew, we made it! We can now parse whitespace, identifiers, = signs and quoted strings, and that, at last, is everything we need to parse attributes.
First, let's write a parser for a single attribute pair. We are going to store the attributes as a Vec<(String, String)>, if you recall, so it feels like we want a parser for that (String, String) to feed into our trusty zero_or_more combinator. Let's see if we can build one.
fn attribute_pair<'a>() -> impl Parser<'a, (String, String)> { pair(identifier, right(match_literal("="), quoted_string())) }
Not even breaking a sweat! To summarise: we already have a handy combinator for parsing a tuple of values, pair, so we use it with the identifier parser, which yields a String, and a right combining the = sign, whose value we do not want to keep, with our fresh quoted_string parser, which gets us the other String.
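On its own, a single pair parses like this (a hypothetical check, not one of the article's tests):

#[test]
fn attribute_pair_parser() {
    assert_eq!(
        Ok(("", ("one".to_string(), "1".to_string()))),
        attribute_pair().parse("one=\"1\"")
    );
}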
Now let's combine that with zero_or_more to build the vector - but let's not forget the whitespace in between the pairs.
fn attributes<'a>() -> impl Parser<'a, Vec<(String, String)>> { zero_or_more(right(space1(), attribute_pair())) }
Zero or more occurrences of the following: one or more whitespace characters, then an attribute pair. We use right to discard the whitespace and keep the attribute pair.
Let's test it.
#[test] fn attribute_parser() { assert_eq!( Ok(( "", vec![ ("one".to_string(), "1".to_string()), ("two".to_string(), "2".to_string()) ] )), attributes().parse(" one=\"1\" two=\"2\"") ); }
Tests are green! Ship it!
Actually, no. At this point in the process, my rustc started complaining that my types were getting terribly complicated and that I had to increase the maximum allowed type size to carry on. Chances are you are getting a similar error right about now, and if so, you need to know how to deal with it. Fortunately, in these situations rustc generally gives good advice, so when it tells you to add #![type_length_limit = "…some big number…"] to the top of your file, just do as it says. Actually, just go ahead and make it #![type_length_limit = "16777216"], which will let us carry on a good deal further into the stratosphere of complicated types. Full steam ahead, we are type astronauts now!
At this point, things really seem like they are about to come together, which is something of a relief, as we have been rapidly approaching NP levels of hard. We have two kinds of element to deal with, the single element and the parent element with its children, and those children are themselves elements... zero_or_more of them, right?
Let's hold off on the children for a moment, though, and start with the part the two kinds have in common: the opening <, the element name and the attributes. Let's write a parser for that chunk, with a result type of (String, Vec<(String, String)>).
fn element_start<'a>() -> impl Parser<'a, (String, Vec<(String, String)>)> { right(match_literal("<"), pair(identifier, attributes())) }
With that in place, we can quickly tack on the parser for the single element.
fn single_element<'a>() -> impl Parser<'a, Element> { map( left(element_start(), match_literal("/>")), |(name, attributes)| Element { name, attributes, children: vec![], }, ) }
Hooray, it feels like the goal is within reach - we are actually constructing an Element now!
Let's test this marvel of modern technology.
#[test] fn single_element_parser() { assert_eq!( Ok(( "", Element { name: "div".to_string(), attributes: vec![("class".to_string(), "float".to_string())], children: vec![] } )), single_element().parse("<div class=\"float\"/>") ); }
…and I think we just left the stratosphere in terms of compile time. The first time I tried compiling single_element, it took so long that I started to wonder whether it had hit an infinite loop somewhere. Our poor compiler is clearly struggling under the weight of these deeply nested types, so we are going to have to do something about it before we move on - preferably before compile times, and the wait for our editor's type hints, get completely out of hand.
There is a fix for this, but it needs a little introduction first...
If you have ever tried to write a recursive type in Rust, you may already know the solution to our little problem.
A very simple example of a recursive type is the singly linked list. You can, in principle, express it as an enum like this:
enum List<A> { Cons(A, List<A>), Nil, }
rustc will object, quite sensibly, that your recursive type List<A> has an infinite size, because inside every List::<A>::Cons there is another List<A>, which means it is List<A>s all the way down into infinity. As far as rustc is concerned, we are asking for an infinite list, and we are asking it to be able to allocate an infinite list.
In many languages, an infinite list is not a problem in principle for the type system, and it is not for Rust either. The real problem is that in Rust we need to be able to allocate it, or, rather, we need to be able to determine the size of a type up front when we construct it, and when the type is infinite, that means the size must be infinite too.
The solution is to employ a little indirection. Instead of our List::Cons being an element of A and another list of A, we make it an element of A and a pointer to a list of A. We know the size of a pointer, and it is the same no matter what it points to, so our List::Cons now has a fixed and predictable size no matter how long the list is. And the way to turn an owned thing into a pointer to an owned thing on the heap in Rust is to Box it.
enum List<A> { Cons(A, Box<List<A>>), Nil, }
Another interesting feature of Box is that the type inside it can be abstract. This means that instead of our by now extremely complicated parser function types, we can let the type checker deal with the very succinct Box<dyn Parser<'a, A>> instead.
That sounds great. What is the downside? Well, we might be losing a cycle or two to having to follow that pointer, and it could be that the compiler loses some opportunities to optimise things. But recall Knuth's admonition about premature optimisation: it is going to be fine. You can afford those cycles. You are here to learn about parser combinators, not to learn about hand written hyperspecialised SIMD parsers (although those are exciting in their own right).
So let's get on with implementing Parser for a boxed parser function, in addition to the bare functions we have used so far.
struct BoxedParser<'a, Output> { parser: Box<dyn Parser<'a, Output> + 'a>, } impl<'a, Output> BoxedParser<'a, Output> { fn new<P>(parser: P) -> Self where P: Parser<'a, Output> + 'a, { BoxedParser { parser: Box::new(parser), } } } impl<'a, Output> Parser<'a, Output> for BoxedParser<'a, Output> { fn parse(&self, input: &'a str) -> ParseResult<'a, Output> { self.parser.parse(input) } }
For the sake of propriety, we create a new type, BoxedParser, to hold our box. To create a new BoxedParser from any other kind of parser (including another BoxedParser, even if that would be pointless), we provide a function BoxedParser::new(parser), which does nothing more than put the parser in a Box inside our new type. Finally, we implement Parser for it, so that it can be used interchangeably as a parser.
Keep in mind that the Box costs us a heap allocation when we create the parser, and that calling parse on a BoxedParser goes through dynamic dispatch to the Parser inside rather than a direct call, which might cost us a few nanoseconds here and there. So we are not going to box everything indiscriminately; we will only reach for it where it helps keep the types, and with them the compile times and our collective sanity, under control. Premature optimisation and all that.
But wait, this presents an opportunity to fix another thing that was starting to become a bit of a bother.
Remember a couple of parsers ago, when we noted that the nesting of our combinators was starting to get a little annoying? Recall quoted_string:
fn quoted_string<'a>() -> impl Parser<'a, String> { map( right( match_literal("\""), left( zero_or_more(pred(any_char, |c| *c != '"')), match_literal("\""), ), ), |chars| chars.into_iter().collect(), ) }
It would read a lot more nicely if we could make those combinators methods on the Parser trait instead. What if we could just chain them onto our parsers directly?
The snag is that our combinators return impl Trait types, and you cannot use impl Trait in the return position of a trait method.
…but now we have BoxedParser. We cannot declare a trait method that returns impl Parser<'a, A>, but we most certainly can declare one that returns BoxedParser<'a, A>.
Best of all, we can declare these on the trait itself with default implementations, so we do not have to reimplement them for every type that implements Parser.
Let's try it out with map, extending our Parser trait like this:
trait Parser<'a, Output> { fn parse(&self, input: &'a str) -> ParseResult<'a, Output>; fn map<F, NewOutput>(self, map_fn: F) -> BoxedParser<'a, NewOutput> where Self: Sized + 'a, Output: 'a, NewOutput: 'a, F: Fn(Output) -> NewOutput + 'a, { BoxedParser::new(map(self, map_fn)) } }
That is a lot of 'as, but, alas, they are all necessary. Luckily, we can reuse our old combinator functions unchanged, and, as an added bonus, not only do we get a nicer syntax for applying them, we also get rid of the exploding impl Trait types by automatically boxing them away.
Now we can spruce up the quoted_string parser slightly:
fn quoted_string<'a>() -> impl Parser<'a, String> { right( match_literal("\""), left( zero_or_more(pred(any_char, |c| *c != '"')), match_literal("\""), ), ) .map(|chars| chars.into_iter().collect()) }
It instantly reads much more nicely, with the .map() clearly applying to the result of the right().
We might be tempted to give pair, left and right the same treatment, but in the case of those three I think the code reads more easily with them as functions, because they mirror the structure of pair's output type. If you disagree, it is entirely possible to add them to the trait just as we did with map, and you are very welcome to try it as an exercise.
Another prime candidate, though, is pred. Let's add a definition for it to the Parser trait:
fn pred<F>(self, pred_fn: F) -> BoxedParser<'a, Output> where Self: Sized + 'a, Output: 'a, F: Fn(&Output) -> bool + 'a, { BoxedParser::new(pred(self, pred_fn)) }
This lets us rewrite the line in quoted_string that uses pred like so:
zero_or_more(any_char.pred(|c| *c != '"')),
I think this reads a little better, and I think we will leave the zero_or_more as it is - it reads like "zero or more of any_char with the following predicate applied", and that sounds about right to me. Then again, you could also go ahead and move zero_or_more and one_or_more into the trait if you prefer.
Besides this quoted_string rewrite, let's also give the map in single_element the same treatment:
fn single_element<'a>() -> impl Parser<'a, Element> { left(element_start(), match_literal("/>")).map(|(name, attributes)| Element { name, attributes, children: vec![], }) }
Let's run element_start and our earlier tests through the compiler again and see whether things have improved... and they have: the code reads better, and compile times look a good deal healthier too.
With map and pred now available as neatly boxed methods, we are ready for the last big push: giving our elements children!
First, let's write a parser for the opening tag of a parent element. It is almost identical to single_element, except that it ends with a > rather than a />. It is also followed by zero or more children and a closing tag, but first we need to parse the opening tag itself, so let's get that done.
fn open_element<'a>() -> impl Parser<'a, Element> { left(element_start(), match_literal(">")).map(|(name, attributes)| Element { name, attributes, children: vec![], }) }
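A quick hypothetical check that the opening tag parser does what we expect with an attribute-free tag:

#[test]
fn open_element_parser() {
    assert_eq!(
        Ok((
            "",
            Element {
                name: "div".to_string(),
                attributes: vec![],
                children: vec![]
            }
        )),
        open_element().parse("<div>")
    );
}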
Now, how do we get those children? They are going to be either single elements or parent elements themselves, and there are zero or more of them, so we have our trusty zero_or_more combinator, but what do we feed it? One thing we have not dealt with yet is a multiple choice parser: something that parses either a single element or a parent element.
To get there, we need a combinator that tries two parsers in order: if the first parser succeeds, we are done, we return its result and that is that. If it fails, instead of returning an error, we try the second parser on the same input. If that succeeds, great, and if it does not, we return the error too, since that means both of our parsers failed, and that is an overall failure.
fn either<'a, P1, P2, A>(parser1: P1, parser2: P2) -> impl Parser<'a, A> where P1: Parser<'a, A>, P2: Parser<'a, A>, { move |input| match parser1.parse(input) { ok @ Ok(_) => ok, Err(_) => parser2.parse(input), } }
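Before we put it to work, here is a small hypothetical check of either on its own:

#[test]
fn either_combinator() {
    // Tries the first parser, and falls back to the second only if the first fails.
    let parser = either(match_literal("ha"), match_literal("ho"));
    assert_eq!(Ok(("", ())), parser.parse("ha"));
    assert_eq!(Ok(("", ())), parser.parse("ho"));
    assert_eq!(Err("hi"), parser.parse("hi"));
}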
This lets us declare a parser element that matches either a single element or a parent element (for now we will just use open_element to stand in for the latter, and deal with the children once we have got that far).
fn element<'a>() -> impl Parser<'a, Element> { either(single_element(), open_element()) }
Now let's add a parser for the closing tag. It has the interesting property of having to match the opening tag, which means the parser has to know what the opening tag's name is. But that is what function arguments are for, yes?
fn close_element<'a>(expected_name: String) -> impl Parser<'a, String> { right(match_literal("</"), left(identifier, match_literal(">"))) .pred(move |name| name == &expected_name) }
That pred combinator is turning out to be really useful, is it not?
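It is not much work to convince ourselves it does the job; a hypothetical test of my own:

#[test]
fn close_element_parser() {
    // The closing tag only parses if its name matches the one we were given.
    assert_eq!(
        Ok(("", "top".to_string())),
        close_element("top".to_string()).parse("</top>")
    );
    assert_eq!(
        Err("</bottom>"),
        close_element("top".to_string()).parse("</bottom>")
    );
}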
And now, let's put it all together for the full parent element parser, children and all:
fn parent_element<'a>() -> impl Parser<'a, Element> { pair( open_element(), left(zero_or_more(element()), close_element(…oops)), ) }
Oops. How do we pass that argument to close_element now? I think we are short one final combinator.
We are so close now. Once we have solved this last problem and got parent_element working, we should be able to replace the open_element placeholder in the element parser with the new parent_element, and that is it: a fully working XML parser.
Remember how I said we would get back to and_then later? Well, later is here. The combinator we need is, in fact, and_then: we need something that takes a parser, and a function that takes the result of that parser and returns a new parser, which we then run. It is a bit like pair, except that instead of just collecting both results in a tuple, we thread them through a function. It is also exactly how and_then works on Result and Option, except that it is easier to follow there, because Result and Option do not really do anything, they are just things that hold some data (or not, as the case may be).
So let's try writing an implementation for it.
fn and_then<'a, P, F, A, B, NextP>(parser: P, f: F) -> impl Parser<'a, B> where P: Parser<'a, A>, NextP: Parser<'a, B>, F: Fn(A) -> NextP, { move |input| match parser.parse(input) { Ok((next_input, result)) => f(result).parse(next_input), Err(err) => Err(err), } }
Looking at the types: there are a lot of type variables, but we know P, our input parser, which has a result type of A. Our function F is the crucial difference: where map had a function from A to B, and_then takes a function from A to a new parser NextP, which has a result type of B. The final result type is B, so we can assume that our NextP parser is what produces that final result.
The code is a little less complicated: we start by running our input parser, and if it fails, it fails and we are done, but if it succeeds, we call our function f on the result (of type A), and what comes out of f(result) is a new parser, with a result type of B. We run this new parser on the next bit of input and return its result directly. If it fails, it fails there, and if it succeeds, we have our B.
One more time: we first run our parser P, and if it succeeds, we call the function f with its result to get our next parser NextP, which we then run, and the result of that is the final result.
Let's also add it straight to the Parser trait, because this, like map, is definitely going to read better with the method syntax.
fn and_then<F, NextParser, NewOutput>(self, f: F) -> BoxedParser<'a, NewOutput> where Self: Sized + 'a, Output: 'a, NewOutput: 'a, NextParser: Parser<'a, NewOutput> + 'a, F: Fn(Output) -> NextParser + 'a, { BoxedParser::new(and_then(self, f)) }
OK, so what is it good for?
First of all, we can almost implement pair using it:
fn pair<'a, P1, P2, R1, R2>(parser1: P1, parser2: P2) -> impl Parser<'a, (R1, R2)> where P1: Parser<'a, R1> + 'a, P2: Parser<'a, R2> + 'a, R1: 'a + Clone, R2: 'a, { parser1.and_then(move |result1| parser2.map(move |result2| (result1.clone(), result2))) }
It looks very neat, but there is a problem: parser2.map() consumes parser2 to create the wrapped parser, and the closure is a Fn, not a FnOnce, so it is not allowed to consume parser2, only to take a reference to it. Rust Problems, in other words. In a higher level language, where these things are not an issue, this would have been a really neat way to define pair.
What we can do even in Rust, though, is use that function to lazily generate the right version of our close_element parser - in other words, to get that argument passed into it.
Recall our failed attempt:
fn parent_element<'a>() -> impl Parser<'a, Element> { pair( open_element(), left(zero_or_more(element()), close_element(…oops)), ) }
Using and_then, we can now make this work by building the right version of close_element on the spot.
fn parent_element<'a>() -> impl Parser<'a, Element> { open_element().and_then(|el| { left(zero_or_more(element()), close_element(el.name.clone())).map(move |children| { let mut el = el.clone(); el.children = children; el }) }) }
It looks a bit more complicated now, because the and_then has to go on open_element(), which is where we learn the name that goes into close_element. This means that the rest of the parser after open_element has to be constructed inside the and_then closure. Moreover, because that closure is now the sole recipient of the Element result from open_element, the parser we return also has to carry that information forward.
The inner closure, which we map over the child parser, has a reference to the Element (el) from the outer closure. We have to clone() it because we are in a Fn, and so only have a reference to it. We take the result of the inner parser (our Vec<Element> of children), put it into our cloned Element, and return that as the final result.
All we need to do now is go back to our element parser and change open_element to parent_element, so that it parses the whole element structure instead of just its beginning, and I believe we are done!
Remember how we talked about the map pattern being called a "functor" over on planet Haskell?
The and_then pattern is another thing you see a lot in Rust, generally in the same places as map. It is called flat_map on Iterator, but it is the same pattern as the rest.
The fancy word for it is "monad". If you have a thing Thing<A>, and you have an and_then function available to which you can pass a function from A to Thing<B>, so that you now have a new Thing<B> instead, that is a monad.
The function might get called instantly, as with Option<A>: we already know whether it is a Some(A) or a None, so we apply the function directly if it is a Some(A), giving us a Some(B).
It might also be called lazily. For instance, if you have a Future<A> that is still waiting to resolve, instead of and_then immediately calling the function to create a Future<B>, it creates a new Future<B> that contains both the Future<A> and the function, and then waits for Future<A> to finish. When it does, it calls the function with the result of the Future<A>, and Bob's your uncle [1], you get your Future<B> back. In other words, in the case of a Future you can think of the function you pass to and_then as a callback function, because it gets called with the result of the original future when it completes. It is also a little more interesting than that, because it returns a new Future, which may or may not already have resolved, so it is a way of chaining futures together.
As with functors, though, Rust's type system cannot currently express monads, so let's only note that this pattern is called a monad, that, rather disappointingly, it has nothing at all to do with burritos, contrary to what they say on the internets, and move on.
One last thing: we should be able to parse some XML now, except that our parser is not very forgiving of whitespace. Arbitrary whitespace should be allowed between tags, so that we are free to put line breaks and indentation between them (and whitespace should in principle be allowed between identifiers and literals, like < div / >, but let's skip that).
We should be able to knock together a quick combinator for that with no effort at this point.
fn whitespace_wrap<'a, P, A>(parser: P) -> impl Parser<'a, A> where P: Parser<'a, A>, { right(space0(), left(parser, space0())) }
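A quick hypothetical check that whitespace on either side really is ignored:

#[test]
fn whitespace_wrap_parser() {
    let parser = whitespace_wrap(match_literal("hi"));
    assert_eq!(Ok(("", ())), parser.parse("  hi  "));
    assert_eq!(Ok(("", ())), parser.parse("hi"));
}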
If we wrap element in whitespace_wrap, it will ignore all leading and trailing whitespace around element, which means we are free to use as many line breaks and as much indentation as we like.
fn element<'a>() -> impl Parser<'a, Element> { whitespace_wrap(either(single_element(), parent_element())) }
And with that, I believe we are finally there! Let's write an integration test to celebrate.
#[test] fn xml_parser() { let doc = r#" <top label="Top"> <semi-bottom label="Bottom"/> <middle> <bottom label="Another bottom"/> </middle> </top>"#; let parsed_doc = Element { name: "top".to_string(), attributes: vec![("label".to_string(), "Top".to_string())], children: vec![ Element { name: "semi-bottom".to_string(), attributes: vec![("label".to_string(), "Bottom".to_string())], children: vec![], }, Element { name: "middle".to_string(), attributes: vec![], children: vec![Element { name: "bottom".to_string(), attributes: vec![("label".to_string(), "Another bottom".to_string())], children: vec![], }], }, ], }; assert_eq!(Ok(("", parsed_doc)), element().parse(doc)); }
And one more test, just to make sure we got it right: a mismatched closing tag should be rejected:
#[test] fn mismatched_closing_tag() { let doc = r#" <top> <bottom/> </middle>"#; assert_eq!(Err("</middle>"), element().parse(doc)); }
There is one glaring omission the keen eyed will have noticed by now: error reporting. Our parser just hands back the bit of input it could not parse, which tells you roughly where things went wrong but not much else. A production grade parser library would do better: it would track line and column numbers, attach errors to the exact place where the parse failed, and produce messages a human being has some chance of understanding. That is all very doable, but it is beyond the scope of this article.
The main point stands, though: you now know how parser combinators work, because you have just built a set of them from nothing but the standard library! That knowledge carries over directly to the mature parser combinator libraries out there, which work on exactly the same principles [2].
So, armed with this knowledge, go forth and write parsers!
If you want to use this style of parsing in production code, the pom crate is very much in the spirit of the parsers we have just built, and a good place to start.
The most popular parser combinator library for Rust is nom - so well regarded that pom, mentioned above, is named in its honour - and it is well worth a look.
combine is another parser combinator library for Rust, also well worth checking out.
Moving beyond Rust, if you can read Haskell, the gold standard of parser combinator libraries is Parsec.
Finally, my first introduction to parser combinators was the book "Programming in Haskell" by Graham Hutton, which is a great read and has the happy side effect of also teaching you Haskell.
This work by Bodil Stokke is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ .
1: He is not really your uncle.
2: Please do not be that person at parties.