Define a Parsing Expression Grammar via a macro and abuse of Julia syntax.
- Rules:
@rule name = expression
- Choice: infix
,
- Sequence: infix
&
- Positive lookahead: prefix
+
- Negative lookahead: prefix
-
- Option (zero or one time): postfix
[:?]
(≡[0:1]
)- (note that [?] won't work in Julia >= 1.0 per JuliaLang/julia#22712)
- Zero or more times: postfix
[*]
(≡[0:end]
) - One or more times: postfix
[+]
(≡[1:end]
) - Exactly
m
times: postfix[m]
(≡[m:m]
) (where m is an integer) - Between
m
andn
times inclusive: postfix[m:n]
- At most
n
times: postfix[0:n]
- At least
m
times: postfix[m:end]
- Terminals:
r"regex"
,"string"
- Extra regex flags:
p
is for punctuation, and eats whitespace (\s*
) after the match;w
is for word, and impliesp
, but also makes sure match boundaries are word boundaries (\b
);h
modifiesp
andw
to eat only horizontal whitespace (\h
). Values passed to semantics functions exclude eaten whitespace.
- Extra regex flags:
- Semantics:
expression |> unary_function
(like ParserCombinator)- or
expression > nary_function
to interpolate args. - Returning the special singleton value
PEG.Failure()
from a semantics function causes the parsing expression it's attached to to fail (returnnothing
instead of a tuple). Returningnothing
from a semantics function is not special; it just makes the first part of the tuplenothing
. See the parsing function signature below.
- or
Put another way:
using PEG
@rule grammar = "using PEG\n" & rule[*]
@rule rule = r"@rule"p & nonterminal & r"="p & choice
@rule choice = seq & (r","p & seq)[*]
@rule seq = item & (r"&"p & item)[*] & (r"\|?>"p & julia_function)[:?]
@rule item = lookahead , counted
@rule lookahead = r"\("p & (r"[+-]"p) & seq & r"\)"p
@rule counted = single & (count)[:?]
@rule count = range , r"\["p & (":?" , r"[\*\+]"p) & r"]"p
@rule range = r"\["p & integer & (r":"p & (integer , r"end"w))[:?] & r"]"p
@rule integer = r"\d+"w
@rule single = parens , terminal , nonterminal
@rule parens = r"\("p & choice & r"\)"p
@rule nonterminal = r"\pL\w+"w
@rule terminal = regex , string & r"\s*"
@rule regex = r"\br" & string & r"[himpswx]*\s*"
@rule string = r"\"(\\.|[^\"])*\""
@rule julia_function = # left as an exercise ;)
Each rule defines a parsing function with the following signature:
nonterminal(input::T, cache=PEG.Cache()) where T <: AbstractString
::Union{Nothing,Tuple{Any,SubString}}
The Any
part of the return value is the abstract syntax tree, while the
SubString
is the remaining input after the parsed portion. If parsing fails,
nothing
is returned.
While you can use rules defined in this way directly, it might be more
convenient to use the functions parse_next(rule, input; whole=false)
or
parse_whole(rule, input)
. See their documentation for more information.
Call PEG.setdebug!()
to have debugging information printed during parsing.
Call PEG.setdebug!(false)
to turn it off again.
Versus ParserCombinator
PEG...
- is simpler/less featureful. PEG does not:
- backtrack, except within regexen and to try the next choice (
,
). That is, repetition[]
is always greedy and possessive (to use PCRE terminology). - have
Empty(x)
/@e_str
. Use semantics functions to discard values. - have
Dot()
. User"."
. - have
Eos()
. Useparse_whole
. - parse streams. Use
open(x->parse_whole(rule, read(x, String)), args...)
. - include parsers for two random languages.
- backtrack, except within regexen and to try the next choice (
- has nicer syntax:
- Operator precedence makes sense. Tight to loose, the operators are: postfix
[]
(whatever is in the brackets), prefix+
/-
, infix&
,|>
/>
,,
. - PC's
Plus(x)
andStar(x)
become actual plusses and stars:x[+]
andx[*]
. And PEG hasx[:?]
too. Equal
/@e_str
andPattern
/@p_str
are unneccessary, just use bare strings and regexen (with extra flags).
- Operator precedence makes sense. Tight to loose, the operators are: postfix
- does not require mutual recursion loops to be broken with
Delayed()
. - does not have special types for matchers/rules, and does not require a trampoline to "interpret" them. They're just plain functions you can call directly.
Versus StringParserPEG
PEG...
- is about half the size, though they're both pretty small.
- can produce parsers that are much faster and use less memory (see benchmark below).
- does not put the whole grammar in a string and force you to deal with multiple levels of escaping if the language you're parsing involves backslashes or quotes.
- allows regex flags (and adds new ones).
- does not have syntax for suppressing values like
-(rule)
. Use semantics functions to discard values. - has semantics functions that are easier to deal with:
- They only take the list of children, no other arguments.
- Positions of children in that list are consistent; a rule like
foo bar[*] baz
that matches zerobar
s will pass child values like[foo, [], baz]
, not[foo, baz]
. - They can return
nothing
as a value without causing the parse to fail. - Function expressions are evaluated in the scope of the caller of
@rule
, not in thePEG
module, so you don't have to prefix named functions defined outside the grammar rule itself withMain.
or whatever. - The sequence operator
&
binds tighter than the semantics binding operators|>
/>
, so you don't have to use as many parentheses.
I wrote the same PEG grammar for parsing JSON using both this PEG package and StringParserPEG. I made no real effort to optimize the grammars. I used BenchmarkTools to test them against each other and against the hand-written (and presumably more optimized) JSON parser from JSON.jl. I used a 10,000-line JSON file from Chevrotain, and tested each of the three parsers repeatedly parsing that file (as a string) for 60 seconds. See json-benchmark.jl for the code I used, including both the grammars and the benchmarking code. Here are some of the results:
parser | mean parsing time (ms) | memory estimate (MiB) |
---|---|---|
JSON | 3.262 | 1.49 |
PEG | 849.936 | 62.53 |
StringParserPEG | 4635.000 | 11630.00 |
Take all of this with a grain of salt. This isn't a great test because (a) these grammars don't really exercise some of the more interesting features of the PEG formalism, and (b) I put a lot less time and effort into these specific PEG grammars than presumably went into JSON.jl (they're less than 15 lines of code each!). But it does seem clear that while this PEG package isn't as efficient as a hand-written, optimized parser, it is much more efficient than StringParserPEG.
PEG 0.2 works with julia 0.6, while PEG 1.0 works with julia 1.x (and julia 0.7). Julia 1.0 has a number of differences from julia 0.7 that required some changes to PEG, which will in turn require some minor syntactic changes to any grammars written with PEG 0.2 if you want to use them with PEG 1.0/julia 1.x.
- change
>>
to|>
- change
>>>
to>
- change
|
to,
- Note that this also makes it so you don't have to put parens around your lambda expressions.
- change
[?]
to[:?]
There are some other changes outside the grammar syntax as well:
- If you were doing this to parse a stream as previously suggested:
open(x->parse_whole(rule, readstring(x)), args...)
- now you should do this instead:
open(x->parse_whole(rule, read(x, String)), args...)
- change
Void
toNothing
- change
ParseError
toMeta.ParseError
Also note that PEG.Failure
is now an immutable type (struct
). That
shouldn't really matter because it has no fields, but it is still technically a
visible change; isbits(PEG.Failure())
is now true
where before it was
false
.
This package is not "unmaintained" or "outdated", it's just simple enough that maintenance isn't often required. So don't be discouraged if the last change to the code was several years ago.
This version of PEG works with Julia 0.7 and 1.x.