## ParserCombinator.jl

A parser combinator library for Julia, started in May 2015.

# ParserCombinator

A parser combinator library for Julia, similar to those in other languages, like Haskell's Parsec or Python's pyparsing. It can parse any iterable type, not just strings (although the regexp matchers do need strings, of course).

ParserCombinator's main advantage is its flexible design, which separates the matchers from the evaluation strategy. This makes it easy to "plug in" memoization, or debug traces, or to restrict backtracking in a similar way to Parsec - all while using the same grammar.

It also contains pre-built parsers for Graph Modelling Language and DOT.

## Example

```julia
using ParserCombinator

# the AST nodes we will construct, with evaluation via calc()

abstract type Node end
Base.:(==)(n1::Node, n2::Node) = n1.val == n2.val
calc(n::Float64) = n
struct Inv<:Node val end
calc(i::Inv) = 1.0/calc(i.val)
struct Prd<:Node val end
calc(p::Prd) = Base.prod(map(calc, p.val))
struct Neg<:Node val end
calc(n::Neg) = -calc(n.val)
struct Sum<:Node val end
calc(s::Sum) = Base.sum(map(calc, s.val))

# the grammar (the combinators!)

sum = Delayed()
val = E"(" + sum + E")" | PFloat64()

neg = Delayed()       # allow multiple (or no) negations (eg ---3)
neg.matcher = val | (E"-" + neg > Neg)

mul = E"*" + neg
div = E"/" + neg > Inv
prd = neg + (mul | div)[0:end] |> Prd

add = E"+" + prd
sub = E"-" + prd > Neg
sum.matcher = prd + (add | sub)[0:end] |> Sum

all = sum + Eos()

# and test

# this prints 2.5
calc(parse_one("1+2*3/4", all)[1])

# this prints [Sum([Prd([1.0]),Prd([2.0])])]
parse_one("1+2", all)
```

Some explanation of the above:

• I used rather a lot of "syntactic sugar". You can use a more verbose, "parser combinator" style if you prefer. For example, `Seq(...)` instead of `+`, or `App(...)` instead of `>`.

• The matcher `E"xyz"` matches and then discards the string `"xyz"`.

• Every matcher returns a list of matched values. This can be an empty list if the match succeeded but matched nothing.

• The operator `+` matches the expressions to either side and appends the resulting lists. Similarly, `|` matches one of two alternatives.

• The operator `|>` calls the function to the right, passing in the results from the matchers on the left.

• `>` is similar to `|>` but interpolates the arguments (ie uses `...`). So instead of passing a list of values, it calls the function with multiple arguments.

• `Delayed()` lets you define a loop in the grammar.

• The syntax `[0:end]` is a greedy repeat of the matcher to the left. An alternative would be `Star(...)`, while `[3:4]` would match only 3 or 4 values.
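
As a rough illustration of the first point above, the sketch below writes one small matcher twice, once with the sugar and once in the verbose style. It uses only names that appear in this README (`Seq`, `App`, `Drop`, `Equal`, `Pattern`, `Star`); the variable names are just for this example, and it assumes the two forms build equivalent matchers, as the notes above suggest.

```julia
using ParserCombinator

# sugared: match and discard "a", then collect the remaining characters
# and pass them (as separate arguments) to string()
sugared = E"a" + p"."[0:end] > string

# verbose: the same grammar written with explicit combinators
verbose = App(Seq(Drop(Equal("a")), Star(Pattern(r"."))), string)

sugared_result = parse_one("abc", sugared)
verbose_result = parse_one("abc", verbose)

# both styles should give the same parse
println(sugared_result == verbose_result)
```

The verbose form is longer, but it makes the grouping explicit, which can help when operator precedence is getting in the way.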

And it supports packrat parsing too (more exactly, it can memoize results to avoid repeating matches).

Still, for large parsing tasks (eg parsing source code for a compiler) it would probably be better to use a wrapper around an external parser generator, like ANTLR.

Note: There's an issue with the Compat library which means the code above (the assignment to `Delayed.matcher`) doesn't work with 0.3. See calc.jl for the uglier, hopefully temporary, 0.3 version.

## Install

`julia> Pkg.add("ParserCombinator")`

## Manual

### Evaluation

Once you have a grammar (see below) you can evaluate it against some input in various ways:

• `parse_one()` - a simple, recursive descent parser with backtracking, but no memoization. Returns a single result or throws a `ParserException`.

• `parse_all()` - a packrat parser, with memoization, that returns an iterator (evaluated lazily) over all possible parses of the input.

• `parse_lines()` - a parser in which the source is parsed line by line. Pre-0.4.0 Julia copies strings that are passed to regex, so this reduces memory use when using regular expressions.

• `parse_try()` - similar to Haskell's Parsec, with backtracking only inside the `Try()` matcher. See Controlling Memory Use below.

• `parse_dbg()` - as `parse_one()`, but also prints a trace of evaluation for all of the matchers that are children of a `Trace()` matcher. Can also be used with other matchers via the keyword `delegate`; for example `parse_dbg(...; delegate=Cache)` will provide tracing of the packrat parser (`parse_all()` above). See Debugging below.

These are all implemented by providing different `Config` subtypes. For more information see Design, types.jl and parsers.jl.
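
To make the separation between grammar and evaluation a little more concrete, here is a minimal sketch that runs one grammar through two of the functions above. The grammar is the `Repeat()` matcher described later in this manual; the variable names are only for illustration.

```julia
using ParserCombinator

# one grammar: between zero and three single characters
grammar = Repeat(p".", 0, 3)

# parse_one: the first successful parse only
first_parse = parse_one("abc", grammar)

# parse_all: a lazy iterator over every possible parse (collected here)
all_parses = collect(parse_all("abc", grammar))

println(first_parse)
println(length(all_parses))
```

The grammar itself is unchanged between the two calls; only the evaluation strategy differs.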

### Basic Matchers

In what follows, remember that the power of parser combinators comes from how you combine these. They can all be nested, refer to each other, etc etc.

#### Equality

```julia
julia> parse_one("abc", Equal("ab"))
1-element Array{Any,1}:
"ab"

julia> parse_one("abc", Equal("abx"))
ERROR: ParserCombinator.ParserException("cannot parse")
```

This is so common that there's a corresponding string literal (it's "e" for `Equal()`, the corresponding matcher).

```julia
julia> parse_one("abc", e"ab")
1-element Array{Any,1}:
"ab"
```

#### Sequences

Matchers return lists of values. Multiple matchers can return lists of lists, or the results can be "flattened" a level (usually more useful):

```julia
julia> parse_one("abc", Series(Equal("a"), Equal("b")))
2-element Array{Any,1}:
"a"
"b"

julia> parse_one("abc", Series(Equal("a"), Equal("b"); flatten=false))
2-element Array{Any,1}:
Any["a"]
Any["b"]

julia> parse_one("abc", Seq(Equal("a"), Equal("b")))
2-element Array{Any,1}:
"a"
"b"

julia> parse_one("abc", And(Equal("a"), Equal("b")))
2-element Array{Any,1}:
Any["a"]
Any["b"]

julia> parse_one("abc", e"a" + e"b")
2-element Array{Any,1}:
"a"
"b"

julia> parse_one("abc", e"a" & e"b")
2-element Array{Any,1}:
Any["a"]
Any["b"]
```

Where `Series()` is implemented as `Seq()` or `And()`, depending on the value of `flatten` (default `true`).

Warning - The sugared syntax has to follow Julia's standard operator precedence. In Julia, `+` and `|` share the same precedence and group from left to right, `&` binds more tightly than both, and `>` and `|>` bind less tightly. This means that

`   matcher1 | matcher2 + matcher3`

is almost always an error because it means:

`   (matcher1 | matcher2) + matcher3`

while what was intended was:

`   matcher1 | (matcher2 + matcher3)`

If in doubt, add parentheses.

#### Empty Values

Often, you want to match something but then discard it. An empty (or discarded) value is an empty list. This may help explain why I said flattening lists was useful above.

```julia
julia> parse_one("abc", And(Drop(Equal("a")), Equal("b")))
2-element Array{Any,1}:
Any[]
Any["b"]

julia> parse_one("abc", Seq(Drop(Equal("a")), Equal("b")))
1-element Array{Any,1}:
"b"

julia> parse_one("abc", ~e"a" + e"b")
1-element Array{Any,1}:
"b"

julia> parse_one("abc", E"a" + e"b")
1-element Array{Any,1}:
"b"
```

Note the `~` (tilde / home directory) and capital `E` in the last two examples, respectively.

#### Alternates

```julia
julia> parse_one("abc", Alt(e"x", e"a"))
1-element Array{Any,1}:
"a"

julia> parse_one("abc", e"x" | e"a")
1-element Array{Any,1}:
"a"
```

Warning - The sugared syntax has to follow Julia's standard operator precedence. In Julia, `+` and `|` share the same precedence and group from left to right, `&` binds more tightly than both, and `>` and `|>` bind less tightly. This means that

`   matcher1 | matcher2 + matcher3`

is almost always an error because it means:

`   (matcher1 | matcher2) + matcher3`

while what was intended was:

`   matcher1 | (matcher2 + matcher3)`

If in doubt, add parentheses.

#### Regular Expressions

```julia
julia> parse_one("abc", Pattern(r".b."))
1-element Array{Any,1}:
"abc"

julia> parse_one("abc", p".b.")
1-element Array{Any,1}:
"abc"

julia> parse_one("abc", P"." + p"b.")
1-element Array{Any,1}:
"bc"
```

As with equality, a capital prefix to the string literal ("p" for "pattern" by the way) implies that the value is dropped.

Note that regular expressions do not backtrack. A typical, greedy, regular expression will match as much of the input as possible, every time that it is used. Backtracking only exists within the library matchers (which can duplicate regular expression functionality, when needed).
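
A small sketch of what this means in practice, using only matchers shown in this manual. The variable names are illustrative, and the exception type is the one `parse_one()` is described as throwing.

```julia
using ParserCombinator

# The greedy regexp p".*" consumes all of "abc" and cannot give any back,
# so the trailing e"c" has nothing left to match and the parse fails.
regex_failed = try
    parse_one("abc", p".*" + e"c")
    false
catch err
    err isa ParserCombinator.ParserException
end

# Repeat() is a library matcher, so it can backtrack: it releases the
# final "c", which e"c" then matches.
backtracked = parse_one("abc", Repeat(p".") + e"c")

println(regex_failed)
println(backtracked)
```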

#### Repetition

```julia
julia> parse_one("abc", Repeat(p"."))
3-element Array{Any,1}:
"a"
"b"
"c"

julia> parse_one("abc", Repeat(p".", 2))
2-element Array{Any,1}:
"a"
"b"

julia> collect(parse_all("abc", Repeat(p".", 2, 3)))
2-element Array{Any,1}:
Any["a","b","c"]
Any["a","b"]

julia> parse_one("abc", Repeat(p".", 2; flatten=false))
2-element Array{Any,1}:
Any["a"]
Any["b"]

julia> collect(parse_all("abc", Repeat(p".", 0, 3)))
4-element Array{Any,1}:
Any["a","b","c"]
Any["a","b"]
Any["a"]
Any[]

julia> collect(parse_all("abc", Repeat(p".", 0, 3; greedy=false)))
4-element Array{Any,1}:
Any[]
Any["a"]
Any["a","b"]
Any["a","b","c"]
```

You can also use `Depth()` and `Breadth()` for greedy and non-greedy repeats directly (but `Repeat()` is more readable, I think).

The sugared version looks like this:

```julia
julia> parse_one("abc", p"."[1:2])
2-element Array{Any,1}:
"a"
"b"

julia> parse_one("abc", p"."[1:2,:?])
1-element Array{Any,1}:
"a"

julia> parse_one("abc", p"."[1:2,:&])
2-element Array{Any,1}:
Any["a"]
Any["b"]

julia> parse_one("abc", p"."[1:2,:&,:?])
1-element Array{Any,1}:
Any["a"]
```

Where the `:?` symbol is equivalent to `greedy=false` and `:&` to `flatten=false` (compare with `+` and `&` above).

There are also some well-known special cases:

```julia
julia> collect(parse_all("abc", Plus(p".")))
3-element Array{Any,1}:
Any["a","b","c"]
Any["a","b"]
Any["a"]

julia> collect(parse_all("abc", Star(p".")))
4-element Array{Any,1}:
Any["a","b","c"]
Any["a","b"]
Any["a"]
Any[]
```

#### Full Match

To ensure that all the input is matched, add `Eos()` to the end of the grammar.

```julia
julia> parse_one("abc", Equal("abc") + Eos())
1-element Array{Any,1}:
"abc"

julia> parse_one("abc", Equal("ab") + Eos())
ERROR: ParserCombinator.ParserException("cannot parse")
```

#### Transforms

Use `App()` or `>` to pass the current results to a function (or datatype constructor) as individual values.

```julia
julia> parse_one("abc", App(Star(p"."), tuple))
1-element Array{Any,1}:
("a","b","c")

julia> parse_one("abc", Star(p".") > string)
1-element Array{Any,1}:
"abc"
```

The action of `Appl()` and `|>` is similar, but everything is passed as a single argument (a list).

```julia
julia> struct Node children end

julia> parse_one("abc", Appl(Star(p"."), Node))
1-element Array{Any,1}:
Node(Any["a","b","c"])

julia> parse_one("abc", Star(p".") |> x -> map(uppercase, x))
3-element Array{Any,1}:
"A"
"B"
"C"
```

#### Lookahead And Negation

Sometimes you can't write a clean grammar that just consumes data: you need to check ahead to avoid something. Or you need to check ahead to make sure something works a certain way.

```julia
julia> parse_one("12c", Lookahead(p"\d") + PInt() + Dot())
2-element Array{Any,1}:
12
'c'

julia> parse_one("12c", Not(Lookahead(p"[a-z]")) + PInt() + Dot())
2-element Array{Any,1}:
12
'c'
```

More generally, `Not()` replaces any match with failure, and failure with an empty match (ie the empty list).

### Other

#### Backtracking

By default, matchers will backtrack as necessary.

In some (unusual) cases, it is useful to disable backtracking. For example, see PCRE's "possessive" matching. This can be done here on a case-by-case basis by adding `backtrack=false` to `Repeat()`, `Alternatives()` and `Series()`, or by appending `!` to the matchers that those functions generate: `Depth!`, `Breadth!`, `Alt!`, `Seq!` and `And!`.

For example,

`collect(parse_all("123abc", Seq!(p"\d"[0:end], p"[a-z]"[0:end])))`

will give just a single match, because `Seq!` (with trailing `!`) does not backtrack the `Repeat()` child matchers.

However, since regular expressions do not backtrack, it would have been simpler, and faster, to write the above as

`collect(parse_all("123abc", p"\d+[a-z]+"))`

Using `backtrack=false` only disables backtracking of the direct children of those matchers. To disable all backtracking, the change must be made to every matcher in the grammar. For example, in theory, the following two grammars have different backtracking behaviour:

```julia
Series(Repeat(e"a", 0, 3), e"b"; backtrack=false)
Series(Repeat(e"a", 0, 3; backtrack=false), e"b"; backtrack=false)
```

(although, in practice, they are identical, in this contrived example, because `e"a"` doesn't backtrack anyway).

This makes a grammar more efficient, but also more specific. It can reduce the memory consumed by the parser, but does not guarantee that resources will be released - see the next section for a better approach to reducing memory use.
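
For comparison, this sketch runs the `Seq!` example from this section next to the ordinary, backtracking `Seq`. The variable names are illustrative only; the claim that `Seq!` yields a single match is the one made above.

```julia
using ParserCombinator

digit_run  = p"\d"[0:end]
letter_run = p"[a-z]"[0:end]

# default Seq: the repeats may backtrack, so parse_all finds several parses
with_backtracking = collect(parse_all("123abc", Seq(digit_run, letter_run)))

# Seq! does not backtrack its children, so only the greedy parse is returned
without_backtracking = collect(parse_all("123abc", Seq!(digit_run, letter_run)))

println(length(with_backtracking))
println(length(without_backtracking))
```

Both start with the same greedy parse; the difference is in how many alternatives remain available afterwards.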

#### Controlling Memory Use

Haskell's Parsec, if I understand correctly, does not backtrack by default. More exactly, it does not allow input that has been consumed (matched) to be read again. This reduces memory consumption (at least when parsing files, since read data can be discarded), but only accepts LL(1) grammars.

To allow parsing of a wider range of grammars, Parsec then introduces the `Try` combinator, which enables backtracking in some (generally small) portion of the grammar.

The same approach can be used with this library, using `parse_try`.

``````
test1.txt:
abcdefghijklmnopqrstuvwxyz
0123456789
``````

```julia
open("test1.txt", "r") do io
    # this throws an exception because it requires backtracking
    parse_try(io, p"[a-z]"[0:end] + e"m" > string)
end

open("test1.txt", "r") do io
    # this (with Try(...)) works fine
    parse_try(io, Try(p"[a-z]"[0:end] + e"m" > string))
end
```

Without backtracking, error messages using the `Error()` matcher are much more useful (this is why Parsec can provide good error messages):

```julia
julia> try
         parse_try("?", Alt!(p"[a-z]", p"\d", Error("not letter or number")))
       catch x
         println(x.msg)
       end
not letter or number at (1,1)
?
^
```

where the `(1,1)` is line number and column - so this failed on the first character of the first line.

Finally, note that this is implemented at the source level, by restricting what text is visible to the matchers. Matchers that could backtrack will still make the attempt. So you should also disable backtracking in the matchers, where you do not need it, for an efficient grammar.

#### Spaces - Pre And Post-Fixes

The lack of a lexer can complicate the handling of whitespace when using parser combinators. This library includes the ability to add arbitrary matchers before or after named matchers in the grammar - something that can be useful for matching and discarding whitespace.

For example,

```julia
spc = Drop(Star(Space()))

@with_pre spc begin

    sum = Delayed()
    val = E"(" + spc + sum + spc + E")" | PFloat64()

    neg = Delayed()             # allow multiple negations (eg ---3)
    neg.matcher = val | (E"-" + neg > Neg)

    mul = E"*" + neg
    div = E"/" + neg > Inv
    prd = neg + (mul | div)[0:end] |> Prd

    add = E"+" + prd
    sub = E"-" + prd > Neg
    sum.matcher = prd + (add | sub)[0:end] |> Sum

    all = sum + spc + Eos()

end
```

extends the parser given earlier to discard whitespace between numbers and symbols. The automatic addition of `spc` as a prefix to named matchers (those assigned to a variable: `sum`, `val`, `neg`, etc) means that it only needs to be added explicitly in a few places.

#### Locating Errors

Sometimes it is useful to report to the user where the input text is "wrong". For a recursive descent parser one useful indicator is the maximum depth reached in the source.

This can be retrieved using the `Debug` config. Here is a simple example that delegates to `NoCache` (the default config for `parse_one()`):

```julia
grammar = p"\d+" + Eos()
source = "123abc"
debug, task = make(Debug, source, grammar; delegate=NoCache)
once(task)   # this does the parsing and throws an exception
# the debug config now contains max_iter
println(source[debug.max_iter:end])   # show the error location "abc"
```

This is a little complex because I don't pre-define a function for this case (cf `parse_one()`). Please email me if you think I should (currently it's unclear what features to support directly, and what to leave for "advanced" users).

An alternative approach to error messages is to use `parse_try()` with the `Error()` matcher - see Controlling Memory Use above.

#### Coding Style

Don't go reinventing regexps. The built-in regexp engine is way, way more efficient than this library could ever be. So call out to regexps liberally. The `p"..."` syntax makes it easy.

But don't use regular expressions if you need to backtrack what is being matched.

Drop stuff you don't need.

Transform things into containers so that your result has nice types. Look at how the example works.
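
As a minimal sketch of that advice: the `Setting` type, its fields, and the `key=value` input format below are hypothetical, made up for this illustration; the matchers and the `>` transform are the ones described in this manual.

```julia
using ParserCombinator

# hypothetical container type for a "key=value" setting
struct Setting
    key::String
    value::Int
end

# a lowercase word, a discarded "=", then an integer; `>` passes the two
# surviving values to the Setting constructor as separate arguments
setting = p"[a-z]+" + E"=" + PInt() > Setting

parsed = parse_one("width=80", setting + Eos())
println(parsed[1].key, " ", parsed[1].value)
```

The result is a list containing one typed `Setting`, rather than a loose list of strings and numbers that the caller would have to unpick.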

Understand the format you are parsing. What motivated the person who designed the format? Compare the GML and DOT parsers - they return different results because the format authors cared about different things. GML is an elegant, general data format, while DOT is a sequential description - a program, almost - that encodes graph layouts.

#### Adding Matchers

First, are you sure you need to add a matcher? You can do a lot with transforms.

If you do, here are some places to get started:

• `Equal()` in matchers.jl is a great example for something that does a simple, single thing, and returns success or failure.

• Most matchers that call to a sub-matcher can be implemented as transforms. But if you insist, there's an example in case.jl.

• If you want to write complex, stateful matchers then I'm afraid you're going to have to learn from the code for `Repeat()` and `Series()`. Enjoy!

#### Debugging

Debugging a grammar can be a frustrating experience - there are times when it really helps to have a simple view "inside" what is happening. This is supported by `parse_dbg` which will print a record of all messages (execute and response - see design) for matchers inside a `Trace()` matcher.

In addition, if the grammar is defined inside a `@with_names` macro, the symbols used to identify various parts of the grammar (the variable names) are displayed when appropriate.

Here's a full example (you can view less by applying `Trace()` to only the matchers you care about):

```julia
@with_names begin

    sum = Delayed()
    val = E"(" + sum + E")" | PFloat64()

    neg = Delayed()             # allow multiple negations (eg ---3)
    neg.matcher = val | (E"-" + neg > Neg)

    mul = E"*" + neg
    div = E"/" + neg > Inv
    prd = neg + (mul | div)[0:end] |> Prd

    add = E"+" + prd
    sub = E"-" + prd > Neg
    sum.matcher = prd + (add | sub)[0:end] |> Sum

    all = sum + Eos()
end

parse_dbg("1+2*3/4", Trace(all))
```

which gives:

``````
1:1+2*3/4    00 Trace->all
1:1+2*3/4    01  all->sum
1:1+2*3/4    02   Transform->Seq
1:1+2*3/4    03    Seq->prd
1:1+2*3/4    04     prd->Seq
1:1+2*3/4    05      Seq->neg
1:1+2*3/4    06       Alt->Seq
1:1+2*3/4    07        Seq->Drop
1:1+2*3/4    08         Drop->Equal
:           08         Drop<-!!!
:           07        Seq<-!!!
:           06       Alt<-!!!
1:1+2*3/4    06       Alt->Transform
1:1+2*3/4    07        Transform->Pattern
2:+2*3/4     07        Transform<-{"1"}
2:+2*3/4     06       Alt<-{1.0}
2:+2*3/4     05      Seq<-{1.0}
2:+2*3/4     05      Seq->Depth
2:+2*3/4     06       Depth->Alt
2:+2*3/4     07        Alt->mul
2:+2*3/4     08         mul->Drop
2:+2*3/4     09          Drop->Equal
:           09          Drop<-!!!
:           08         mul<-!!!
:           07        Alt<-!!!
2:+2*3/4     07        Alt->div
2:+2*3/4     08         div->Seq
2:+2*3/4     09          Seq->Drop
2:+2*3/4     10 Drop->Equal
:           10 Drop<-!!!
:           09          Seq<-!!!
:           08         div<-!!!
:           07        Alt<-!!!
:           06       Depth<-!!!
2:+2*3/4     05      Seq<-{}
2:+2*3/4     04     prd<-{1.0}
2:+2*3/4     03    Seq<-{Prd({1.0})}
2:+2*3/4     03    Seq->Depth
2:+2*3/4     04     Depth->Alt
2:+2*3/4     05      Alt->add
2:+2*3/4     06       add->Drop
2:+2*3/4     07        Drop->Equal
3:2*3/4      07        Drop<-{"+"}
3:2*3/4      06       add<-{}
3:2*3/4      06       add->prd
3:2*3/4      07        prd->Seq
3:2*3/4      08         Seq->neg
3:2*3/4      09          Alt->Seq
3:2*3/4      10 Seq->Drop
3:2*3/4      11  Drop->Equal
:           11  Drop<-!!!
:           10 Seq<-!!!
:           09          Alt<-!!!
3:2*3/4      09          Alt->Transform
3:2*3/4      10 Transform->Pattern
4:*3/4       10 Transform<-{"2"}
4:*3/4       09          Alt<-{2.0}
4:*3/4       08         Seq<-{2.0}
4:*3/4       08         Seq->Depth
4:*3/4       09          Depth->Alt
4:*3/4       10 Alt->mul
4:*3/4       11  mul->Drop
4:*3/4       12   Drop->Equal
5:3/4        12   Drop<-{"*"}
5:3/4        11  mul<-{}
5:3/4        11  mul->neg
5:3/4        12   Alt->Seq
5:3/4        13    Seq->Drop
5:3/4        14     Drop->Equal
:           14     Drop<-!!!
:           13    Seq<-!!!
:           12   Alt<-!!!
5:3/4        12   Alt->Transform
5:3/4        13    Transform->Pattern
6:/4         13    Transform<-{"3"}
6:/4         12   Alt<-{3.0}
6:/4         11  mul<-{3.0}
6:/4         10 Alt<-{3.0}
6:/4         09          Depth<-{3.0}
6:/4         09          Depth->Alt
6:/4         10 Alt->mul
6:/4         11  mul->Drop
6:/4         12   Drop->Equal
:           12   Drop<-!!!
:           11  mul<-!!!
:           10 Alt<-!!!
6:/4         10 Alt->div
6:/4         11  div->Seq
6:/4         12   Seq->Drop
6:/4         13    Drop->Equal
7:4          13    Drop<-{"/"}
7:4          12   Seq<-{}
7:4          12   Seq->neg
7:4          13    Alt->Seq
7:4          14     Seq->Drop
7:4          15      Drop->Equal
:           15      Drop<-!!!
:           14     Seq<-!!!
:           13    Alt<-!!!
7:4          13    Alt->Transform
7:4          14     Transform->Pattern
8:           14     Transform<-{"4"}
8:           13    Alt<-{4.0}
8:           12   Seq<-{4.0}
8:           11  div<-{4.0}
8:           10 Alt<-{Inv(4.0)}
8:           09          Depth<-{Inv(4.0)}
8:           09          Depth->Alt
8:           10 Alt->mul
8:           11  mul->Drop
8:           12   Drop->Equal
:           12   Drop<-!!!
:           11  mul<-!!!
:           10 Alt<-!!!
8:           10 Alt->div
8:           11  div->Seq
8:           12   Seq->Drop
8:           13    Drop->Equal
:           13    Drop<-!!!
:           12   Seq<-!!!
:           11  div<-!!!
:           10 Alt<-!!!
:           09          Depth<-!!!
8:           08         Seq<-{3.0,Inv(4.0)}
8:           07        prd<-{2.0,3.0,Inv(4.0)}
8:           06       add<-{Prd({2.0,3.0,Inv(4.0)})}
8:           05      Alt<-{Prd({2.0,3.0,Inv(4.0)})}
8:           04     Depth<-{Prd({2.0,3.0,Inv(4.0)})}
8:           04     Depth->Alt
8:           05      Alt->add
8:           06       add->Drop
8:           07        Drop->Equal
:           07        Drop<-!!!
:           06       add<-!!!
:           05      Alt<-!!!
8:           05      Alt->sub
8:           06       sub->Seq
8:           07        Seq->Drop
8:           08         Drop->Equal
:           08         Drop<-!!!
:           07        Seq<-!!!
:           06       sub<-!!!
:           05      Alt<-!!!
:           04     Depth<-!!!
8:           03    Seq<-{Prd({2.0,3.0,Inv(4.0)})}
8:           02   Transform<-{Prd({1.0}),Prd({2.0,3.0,Inv(4.0)})}
8:           01  all<-{Sum({Prd({1.0}),Prd({2.0,3.0,Inv(4.0)})})}
8:           01  all->Eos
8:           01  all<-{}
8:           00 Trace<-{Sum({Prd({1.0}),Prd({2.0,3.0,Inv(4.0)})})}
``````

Some things to note here:

• The number on the left is the current iterator, followed by the source at the current offset.

• The second column of numbers is the depth (relative to `Trace()`). The indentation of the messages to the right reflects this, but "wraps" every 10 levels.

• The message flow shows execute as `->` and response as `<-`. Matcher names are replaced by variable names (eg `sum`) where appropriate.

• This functionality is implemented as a separate parser `Config` instance, so has no performance penalty when not used. See debug.jl for more details.

Finally, printing a matcher gives a useful tree view of the grammar. Loops are elided with `...`:

`println(all)`

gives

``````
all
+-[1]:sum
| `-TransSuccess
|   +-Seq
|   | +-[1]:prd
|   | | +-Seq
|   | | | +-[1]:neg
|   | | | | `-Alt
|   | | | |   +-[1]:Seq
|   | | | |   | +-[1]:Drop
|   | | | |   | | `-Equal
|   | | | |   | |   `-"("
|   | | | |   | +-[2]:sum...
|   | | | |   | `-[3]:Drop
|   | | | |   |   `-Equal
|   | | | |   |     `-")"
|   | | | |   +-[2]:TransSuccess
|   | | | |   | +-Pattern
|   | | | |   | | `-r"-?(\d*\.?\d+|\d+\.\d*)([eE]\d+)?"
|   | | | |   | `-f
|   | | | |   `-[3]:TransSuccess
|   | | | |     +-Seq
|   | | | |     | +-[1]:Drop
|   | | | |     | | `-Equal
|   | | | |     | |   `-"-"
|   | | | |     | `-[2]:neg...
|   | | | |     `-f
|   | | | `-[2]:Depth
|   | | |   +-Alt
|   | | |   | +-[1]:mul
|   | | |   | | +-[1]:Drop
|   | | |   | | | `-Equal
|   | | |   | | |   `-"*"
|   | | |   | | `-[2]:neg...
|   | | |   | `-[2]:div
|   | | |   |   +-Seq
|   | | |   |   | +-[1]:Drop
|   | | |   |   | | `-Equal
|   | | |   |   | |   `-"/"
|   | | |   |   | `-[2]:neg...
|   | | |   |   `-f
|   | | |   +-lo=0
|   | | |   +-hi=9223372036854775807
|   | | |   `-flatten=true
|   | | `-f
|   | `-[2]:Depth
|   |   +-Alt
|   |   | +-[1]:add
|   |   | | +-[1]:Drop
|   |   | | | `-Equal
|   |   | | |   `-"+"
|   |   | | `-[2]:prd...
|   |   | `-[2]:sub
|   |   |   +-Seq
|   |   |   | +-[1]:Drop
|   |   |   | | `-Equal
|   |   |   | |   `-"-"
|   |   |   | `-[2]:prd...
|   |   |   `-f
|   |   +-lo=0
|   |   +-hi=9223372036854775807
|   |   `-flatten=true
|   `-f
`-[2]:Eos
``````

Also, `parse_XXX(...., debug=true)` will show a stack trace from within the main parse loop (which gives more information on the source of any error).

For more details, I'm afraid your best bet is the source code:

• types.jl introduces the types used throughout the code

• matchers.jl defines things like `Seq` and `Repeat`

• sugar.jl adds `+`, `[...]` etc

• extras.jl has parsers for Int, Float, etc

• parsers.jl has more info on creating the `parse_one` and `parse_all` functions

• transforms.jl defines how results can be manipulated

• tests.jl has a pile of one-liner tests that might be useful

• debug.jl shows how to enable debug mode

• case.jl has an example of a user-defined combinator

## Parsers

### Graph Modelling Language

GML describes a graph using a general dict / list format (something like JSON).

• `parse_raw` returns lists and tuples that directly match the GML structure.

• `parse_dict` places the same data in nested dicts and vectors. The keys are symbols, so you access a file using the syntax `dict[:field]`.

`parse_dict()` has two important keyword arguments: `lists` is a list of keys that should be stored as lists (default is `:graph, :node, :edge`); `unsafe` should be set to `true` if multiple values for other keys should be discarded (default `false`). The underlying issue is that it is not clear from the file format which keys are lists, so the user must specify them; by default an error is thrown if this information is incomplete, but `unsafe` can be set if a user doesn't care about those attributes.

Note that the parser does not conform fully to the specifications: ISO 8859-1 entities are not decoded (the parser should accept UTF 8); integers and floats are 64bit; strings can be any length; no check is made for required fields.

For example, to print node IDs and edge connections in a graph

```julia
using ParserCombinator.Parsers.GML

my_graph = "graph [
node [id 1]
node [id 2]
node [id 3]
edge [source 1 target 2]
edge [source 2 target 3]
edge [source 3 target 1]
]"

root = parse_dict(my_graph)

for graph in root[:graph]  # there could be multiple graphs
    for node in graph[:node]
        println("node $(node[:id])")
    end
    for edge in graph[:edge]
        println("edge $(edge[:source]) - $(edge[:target])")
    end
end
```

giving

``````
node 1
node 2
node 3
edge 1 - 2
edge 2 - 3
edge 3 - 1
``````

### DOT

DOT describes a graph using a complex format that resembles a program (with mutable state) more than a specification (see comments in source).

• `parse_dot` returns a list of structured AST (see the types in DOT.jl), one per graph in the file. It has one keyword argument, `debug`, which takes a `Bool` and enables the usual debugging output.

• `nodes(g::Graph)` extracts a set of node names from the structured AST.

• `edges(g::Graph)` extracts a set of edge names (node name pairs) from the structured AST.

For example, to print node IDs and edge connections in a graph

```julia
using ParserCombinator.Parsers.DOT

my_graph = "graph {
1 -- 2
2 -- 3
3 -- 1
}"

root = parse_dot(my_graph)

for node in nodes(root)
    println("node $(node)")
end
for (node1, node2) in edges(root)
    println("edge $(node1) - $(node2)")
end
```

giving

``````
node 2
node 3
node 1
edge 2 - 3
edge 1 - 3
edge 1 - 2
``````

Nodes and edges are unordered (returned as a `Set`). The graph specification is undirected (cf `digraph {...}`) and so the order of nodes in an edge is in canonical (sorted) form.

## Design

For a longer discussion of the design of ParserCombinator.jl, please see this blog post, also available here.

### Overview

Parser combinators were first written (afaik) in functional languages where tail calls do not consume stack. Also, packrat parsers are easiest to implement in lazy languages, since shared, cached results are "free".

Julia has neither tail recursion optimisation nor lazy evaluation.

On the other hand, tail call optimisation is not much use when you want to support backtracking or combine results from child parsers. And it is possible to implement combinators for repeated matches using iteration rather than recursion.

In short, life is complicated. Different parser features have different costs and any particular implementation choice needs to be defended with detailed analysis. Ideally we want an approach that supports features with low overhead by default, but which can be extended to accommodate more expensive features when necessary.

This library defines the grammar in a static graph, which is then "interpreted" using an explicit trampoline (described in more detail below). The main advantages are:

• Describing the grammar in a static graph of types, rather than mutually recursive functions, gives better integration with Julia's method dispatch. So, for example, we can overload operators like `+` to sequence matchers, or use macros to modify the grammar at compile time. And the "execution" of the grammar is simple, using dispatch on the graph nodes.

• The semantics of the parser can be modified by changing the trampoline implementation (which can also be done by method dispatch on a "configuration" type). This allows, for example, the choice of whether to use memoization to be separated from the grammar itself.

• State is explicitly identified and encapsulated, simplifying both backtracking (resumption from the current state) and memoization.

The main disadvantages are:

• Defining new combinators is more complex. The behaviour of a matcher is defined as a group of methods that correspond to transitions in a state machine. On the other hand, with dispatch on the grammar and state nodes, the implementation remains idiomatic and compact.

• Although the "feel" and "end result" of the library are similar to other parser combinator libraries (the grammar types handled are as expected, for example), one could argue that the matchers are not "real" combinators (what is written by the user is a graph of types, not a set of recursive functions, even if the final execution logic is equivalent).

### Trampoline Protocol

A matcher is invoked by a call to

`execute(k::Config, m::Matcher, s::State, i) :: Message`

where `k` must include, at a minimum, the field `k.source` that follows the iterator protocol when used with `i`. So, for example, `next(k.source, i)` returns the next value from the source, plus a new iter.

The initial call (ie the first time a given value of `i` is used, before any backtracking) will have `s` equal to `CLEAN`.

A matcher returns a `Message` which indicates to the trampoline how processing should continue:

• `Failure` indicates that the match has failed and probably (depending on parent matcher and configuration) triggers backtracking. There is a single instance of the type, `FAILURE`.

• `Success` indicates that the match succeeded, and so contains a result (of type `Value`, which is a type alias for `Vector{Any}`) together with the updated iter `i` and any state that the matcher will need to look for further matches (this can be `DIRTY`, which is used globally to indicate that all further matches will fail).

• `Execute` which results in a "nested" call to a child matcher's `execute` method (as above).

The `FAILURE` and `Success` messages are processed by the trampoline and (typically, although a trampoline implementation may also use cached values) result in calls to

```julia
failure(k::Config, m::Matcher, s::State) :: Message

success(k::Config, m::Matcher, s::State, t::State, i, r::Value) :: Message
```

where the parent matcher (`m`) can do any clean-up work, resulting in a new `Message`.

Note that the child's state, `t`, is returned to the parent. It is the responsibility of the parent to save this (in its own state) if it wants to re-call the child.
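
To give a feel for the message flow, here is a deliberately tiny, self-contained toy trampoline. It is not the library's code: every name in it (`ToyMatcher`, `Lit`, `Both`, `Call`, `Ok`, `Fail`, `toy_execute`, `toy_success`, `toy_failure`, `toy_parse`) is hypothetical, it has no `Config`, and it ignores child state and backtracking. It only shows how execute / success / failure messages can be driven by an explicit stack rather than by recursive calls.

```julia
# Toy sketch only; all names here are hypothetical, not the library's API.
abstract type ToyMatcher end
struct Lit <: ToyMatcher; text::String; end                   # a literal
struct Both <: ToyMatcher; a::ToyMatcher; b::ToyMatcher; end  # a, then b

abstract type ToyMessage end
struct Fail <: ToyMessage end
struct Ok <: ToyMessage; i::Int; result::Vector{Any}; end
struct Call <: ToyMessage        # "please run this child, then call me back"
    parent::ToyMatcher; pstate; child::ToyMatcher; i::Int
end

# a leaf matcher replies immediately with Ok or Fail
function toy_execute(src, m::Lit, s, i)
    if startswith(SubString(src, i), m.text)
        Ok(i + ncodeunits(m.text), Any[m.text])
    else
        Fail()
    end
end

# a parent matcher asks the trampoline to run its first child
toy_execute(src, m::Both, s, i) = Call(m, (:first, Any[]), m.a, i)

# on success the parent either moves to its second child or finishes
function toy_success(src, m::Both, s, i, r)
    stage, sofar = s
    stage == :first ? Call(m, (:second, r), m.b, i) : Ok(i, vcat(sofar, r))
end
toy_failure(src, m::Both, s) = Fail()

# the trampoline: a loop plus an explicit stack of waiting parents
function toy_parse(src, root)
    stack = Tuple{ToyMatcher,Any}[]
    msg = toy_execute(src, root, nothing, 1)
    while true
        if msg isa Call
            push!(stack, (msg.parent, msg.pstate))
            msg = toy_execute(src, msg.child, nothing, msg.i)
        elseif isempty(stack)
            return msg isa Ok ? msg.result : nothing
        else
            parent, pstate = pop!(stack)
            msg = msg isa Ok ? toy_success(src, parent, pstate, msg.i, msg.result) :
                               toy_failure(src, parent, pstate)
        end
    end
end

toy_grammar = Both(Lit("a"), Both(Lit("b"), Lit("c")))
println(toy_parse("abc", toy_grammar))
println(toy_parse("abx", toy_grammar))
```

Because the loop owns the stack, swapping in a different loop (one that caches results, or prints each message) changes the parser's behaviour without touching the matchers, which is the point made in the overview.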

### Source (Input Text) Protocol

The source text is read using the standard Julia iterator protocol, extended with several methods defined in sources.jl.

The use of iterators means that `Dot()` returns characters, not strings. But in practice that matcher is rarely used (particularly since, with strings, you can use regular expressions - `p"pattern"` for example), and you can construct a string from multiple characters using `> string`.
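
A short sketch of that last point, using `Star()` and `Dot()` as described earlier (the variable name is illustrative):

```julia
using ParserCombinator

# Dot() yields one Char per match; `> string` joins them into a String
joined = parse_one("abc", Star(Dot()) > string)
println(joined)
```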

## Releases

1.7.0 - 2015-10-13 - Added DOT parser.

1.6.0 - 2015-07-26 - Changed from `s"` to `e"`; added support for fast regex patch.

1.5.0 - 2015-07-25 - Clarified source handling; improved GML speed.

1.4.0 - 2015-07-18 - Added GML parser; related parse_try fixes.

1.3.0 - 2015-06-27 - Added parse_try.

1.2.0 - 2015-06-28 - Trampoline side rewritten; more execution modes.

1.1.0 - 2015-06-07 - Fixed calc example; debug mode; much rewriting.

1.0.0 - ~2015-06-03 - More or less feature complete.