Combine 1.0.0 is released! This finally brings combine into stable semver territory, meaning you can now use it without worrying about breakage when updating to a newer version.

Rather than just posting an announcement with the few breaking changes that happened between the announcement of the beta and now, I figured I could write about how to create a parser from scratch using combine.

A simple INI parser

The INI file format is an informal standard for configuration files. As INI files are quite simple to parse they make a good introductory example while still demonstrating the flexibility of parser combinators.

language=rust

[section]
name=combine ; Comment
type=LL(1)

From the wiki page and the example text we can see that we need parsers for three major elements: properties, sections and whitespace (including comments). We'll tackle each one separately in the order in which they are needed.

Parsing properties

Starting with properties, we can see that we need to parse the property name, followed by an equals sign, followed by the value which reaches until the end of the line. In combine the easiest way of running parsers in sequence is to use tuples, which lets us write the property parser like so.

(
    many1(satisfy(|c| c != '=' && c != '[' && c != ';')),
    token('='),
    many1(satisfy(|c| c != '\n' && c != ';'))
)
    .map(|(key, _, value)| (key, value))

There are four different predefined parsers and combinators doing work in this snippet (a small example after the list runs the key parser on its own).

  • satisfy takes a function of type (Item) -> bool and creates a parser which only accepts items for which the predicate returns true.
  • many1 is a function which takes a parser and creates a new parser which applies the given parser one or more times (hence the 1 suffix).
  • token creates a parser which accepts the given token.
  • map transforms the result of a successful parse into a new result. We use it here to return only a tuple of the key and value, skipping the result of the token('=') parser.
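
Before moving on it can be instructive to run one of these building blocks on its own. The snippet below is only a small sketch on a made-up input, assuming the same imports as the rest of the examples, which runs just the key parser and keeps the parsed value while discarding the remaining input.

//Running only the key parser from the property parser above.
//`parse` returns the parsed value together with the remaining input
//so we use `map` to keep just the value.
let result = many1(satisfy(|c| c != '=' && c != '[' && c != ';'))
    .parse("name=value")
    .map(|t| t.0);
//The parser stops at the first '=' without consuming it
assert_eq!(result, Ok(String::from("name")));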

All put together we have constructed a property parser in just a few lines which we can pass around however we want. Well, almost. To create a parser with combine the Parser trait is used, and each of the parsers mentioned above is its own type which implements the Parser trait. This technique is great for ensuring that the compiler never loses any information when it comes to optimizing the code. It does however come at the cost of some rather large types.

//Type of the property parser
Map<(Many<_, Satisfy<_, [closure@examples\ini.rs:18:19: 18:31]>>, Token<_>, Many<_, Satisfy<_, [closure@examples\ini.rs:18:60: 18:73]>>), [closure@examples\ini.rs:19:14: 19:44]>

Even if we could return closure types from functions, that is still a long type we don't really want to write out. We could box the parser, but that requires an allocation each time the parser is created, an allocation we do not actually need. Instead we can do one better and wrap the parser inside a function which directly calls the parser.

///Wrapping the parser in a box is easy but needs an allocation
fn property<'a, I>() -> Box<Parser<Input=I, Output=(String, String)> + 'a>
where I: Stream<Item=char> + 'a {
    Box::new((
        many1(satisfy(|c| c != '=' && c != '[' && c != ';')),
        token('='),
        many1(satisfy(|c| c != '\n' && c != ';'))
    )
        .map(|(key, _, value)| (key, value)))
}

///By hiding the parser inside a function we do not need to allocate a box each time we need the parser
fn property<I>(input: State<I>) -> ParseResult<(String, String), I>
where I: Stream<Item=char> {
    (
        many1(satisfy(|c| c != '=' && c != '[' && c != ';')),
        token('='),
        many1(satisfy(|c| c != '\n' && c != ';'))
    )
        .map(|(key, _, value)| (key, value))
        .parse_state(input)
}
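
To convince ourselves that the function version behaves the same we can wrap it with the parser function and run it directly. This is just a quick sketch on a made-up input line, keeping only the parsed value as in the snippet above.

//The `property` function wrapped with `parser` and run on a single line
let result = parser(property)
    .parse("language=rust\n")
    .map(|t| t.0);
assert_eq!(result, Ok((String::from("language"), String::from("rust"))));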

Parsing whitespace and comments

Parsing whitespace is done in a similar way to the property parser, though we need a few more parsers to write it. skip_many parses in the same way as the many parser except that it ignores the result, which makes it slightly more efficient for cases like this where we do not care about the result. Next we use the or method to create a parser which parses either whitespace (one or more whitespace characters, using skip_many1 together with the predefined space parser) or a comment.

fn whitespace<I>(input: State<I>) -> ParseResult<(), I>
where I: Stream<Item=char> {
    let comment = (token(';'), skip_many(satisfy(|c| c != '\n')))
        .map(|_| ());
    //Wrap the whitespace-or-comment parser in `skip_many` so that it skips alternating whitespace and comments
    skip_many(skip_many1(space()).or(comment))
        .parse_state(input)
}
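
Since the whitespace parser only returns () the interesting part is how much input it skips. As a small sketch on some made-up input we can put it in front of the property parser from before and check that both the comment and the blank line are skipped.

//Skip a comment and a blank line, then parse the property that follows
let result = (parser(whitespace), parser(property))
    .parse("; a comment\n\nkey=value")
    .map(|t| (t.0).1);
assert_eq!(result, Ok((String::from("key"), String::from("value"))));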

Parsing sections

The last parser we need before putting it all together is the parser for sections. A section starts with its name enclosed in brackets, so we can use the between parser here, which ignores the result of the enclosing parsers and only returns the result of what is in between. After the name we want to skip any whitespace between the name and the properties in the section, so here we need to use the parser function which turns the ordinary whitespace function into a parser. We use the same parser function to wrap the property and properties functions as well.

fn section<I>(input: State<I>) -> ParseResult<(String, HashMap<String, String>), I>
where I: Stream<Item=char> {
    (
        between(token('['), token(']'), many(satisfy(|c| c != ']'))),
        parser(whitespace),
        parser(properties)
    )
        .map(|(name, _, properties)| (name, properties))
        .parse_state(input)
}

fn properties<I>(input: State<I>) -> ParseResult<HashMap<String, String>, I>
where I: Stream<Item=char> {
    //After each property we skip any whitespace that follows it
    many(parser(property).skip(parser(whitespace)))
        .parse_state(input)
}
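
With section, properties and whitespace in place we can already parse a single section. Again this is only a sketch on some made-up input, assuming the same imports as the rest of the post (HashMap in particular).

//Parse a single section containing one property
let result = parser(section)
    .parse("[section]\nname=combine\n")
    .map(|t| t.0);

let mut expected_section = HashMap::new();
expected_section.insert(String::from("name"), String::from("combine"));
assert_eq!(result, Ok((String::from("section"), expected_section)));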

Putting it all together

Finally we can put it all together into the complete INI parser. As an INI file can start with properties which do not belong to a named section, we create a small struct which contains the global properties as well as the ones contained in sections. We then reuse the properties parser from the section parser to parse any properties which appear before the first section. Lastly we skip any whitespace before the first property or section.

#[derive(PartialEq, Debug)]
pub struct Ini {
    pub global: HashMap<String, String>,
    pub sections: HashMap<String, HashMap<String, String>>
}

fn ini<I>(input: State<I>) -> ParseResult<Ini, I>
where I: Stream<Item=char> {
    (parser(whitespace), parser(properties), many(parser(section)))
        .map(|(_, global, sections)| Ini { global: global, sections: sections })
        .parse_state(input)
}

All that is left is to try it with some data.

let text = r#"
language=rust

[section]
name=combine; Comment
type=LL(1)

"#;
let mut expected = Ini {
    global: HashMap::new(),
    sections: HashMap::new()
};
expected.global.insert(String::from("language"), String::from("rust"));

let mut section = HashMap::new();
section.insert(String::from("name"), String::from("combine"));
section.insert(String::from("type"), String::from("LL(1)"));
expected.sections.insert(String::from("section"), section);

let result = parser(ini)
    .parse(text)
    .map(|t| t.0);
assert_eq!(result, Ok(expected));

It's always a good idea to test error cases as well. This parser is quite liberal in what it accepts, so creating good error cases is a bit tricky, but we can test unclosed sections. If we parse “[error” we would expect a readable error message.

Parse error at line: 1, column: 7
Unexpected 'end of input'
Expected ']'

Which is pretty good, but if we wanted to be a bit clearer we might want to say that we expected a section at this point. To do this we can use the expected combinator to attach an error message which is used when the parser fails.

fn section<I>(input: State<I>) -> ParseResult<(String, HashMap<String, String>), I>
where I: Stream<Item=char> {
    (
        between(token('['), token(']'), many(satisfy(|c| c != ']'))),
        parser(whitespace),
        parser(properties)
    )
        .map(|(name, _, properties)| (name, properties))
        .expected("section")
        .parse_state(input)
}

Parse error at line: 1, column: 7
Unexpected 'end of input'
Expected 'section'

Which gives us an error message which may be a bit easier to understand.

The final parser

Below is the parser in its entirety. It could use some improvements, but as the INI format isn't exactly specified I prefer to keep it rather minimal, which should make it easy to extend to handle more specific needs. You can find the complete source with tests here.

#[derive(PartialEq, Debug)]
pub struct Ini {
    pub global: HashMap<String, String>,
    pub sections: HashMap<String, HashMap<String, String>>
}

fn property<I>(input: State<I>) -> ParseResult<(String, String), I>
where I: Stream<Item=char> {
    (
        many1(satisfy(|c| c != '=' && c != '[' && c != ';')),
        token('='),
        many1(satisfy(|c| c != '\n' && c != ';'))
    )
        .map(|(key, _, value)| (key, value))
        .expected("property")
        .parse_state(input)
}

fn whitespace<I>(input: State<I>) -> ParseResult<(), I>
where I: Stream<Item=char> {
    let comment = (token(';'), skip_many(satisfy(|c| c != '\n')))
        .map(|_| ());
    skip_many(skip_many1(space()).or(comment))
        .parse_state(input)
}

fn properties<I>(input: State<I>) -> ParseResult<HashMap<String, String>, I>
where I: Stream<Item=char> {
    many(parser(property).skip(parser(whitespace)))
        .parse_state(input)
}

fn section<I>(input: State<I>) -> ParseResult<(String, HashMap<String, String>), I>
where I: Stream<Item=char> {
    (
        between(token('['), token(']'), many(satisfy(|c| c != ']'))),
        parser(whitespace),
        parser(properties)
    )
        .map(|(name, _, properties)| (name, properties))
        .expected("section")
        .parse_state(input)
}

fn ini<I>(input: State<I>) -> ParseResult<Ini, I>
where I: Stream<Item=char> {
    (parser(whitespace), parser(properties), many(parser(section)))
        .map(|(_, global, sections)| Ini { global: global, sections: sections })
        .parse_state(input)
}

What's next

Hopefully this gives a good idea of how to go about building parsers using combine. Though we only parse a string here, this parser could equally well be used to parse any other input stream, such as iterators. If you want to see more examples of parsing using combine they can be found in the benches and tests directories of combine.

Though combine has reached 1.0.0, that does not mean that it is done. I already have two more features lying in wait: zero copy parsing and support for non-cloneable iterators. These should get merged in under an experimental feature pretty soon so that they can get some testing before they are stabilized.

That's all for now. Depending on the interest I may write more about combine or the [programming language][languager] I am writing (also in Rust), in which you can find a much more advanced example of how to use combine.