StringParserPEG is a string parsing library for Parsing Expression Grammars (PEG) in Julia. It separated from PEGParser.jl (written mainly by Abe Schneider) in 2016, which in turn was inspired by pyparsing, parsimonious, boost::spirit, as well as several others. The separation followed a rework of the design undertaken by Henry Schurkus to base the parsing mechanism on strings instead of macros. The string-based design disentangles the functionality of the library from Julia's internal macro parsing mechanism and therefore makes it more robust against changes in the Julia base code (the discussion leading to the separation of the packages can be found here). The redesign also brought a change of the API for easier and less error-prone use. Below we describe the new design, which takes grammar declarations in the form of (multiline) strings.
A parser takes a string and a grammar specification to turn the former into a computable structure. StringParserPEG does this by first parsing the string into an Abstract Syntax Tree (AST) (`parse(grammar, string)`) and then transforming this AST into the required structure (`transform(function, AST)`).
A grammar can be defined as:
Grammar("""
rule1 => ...
rule2 => ...
...
""")
where the following rules can be used:
- References to other rules: `theotherrule`
- Terminals: `'a'` (must match literally)
- Or: `rule1 | rule2 | rule3` (the first rule that matches wins)
- And: `rule1 & rule2 & rule3` (the rules are matched one after the other; `&` groups more strongly than `|`)
- Grouping: `(rule1 & rule2) | (rule3 & rule4)`
- Optional: `?(rule1)` (is matched if possible, but counts as matched either way)
- Negation: `!(rule)` (is matched if `rule` is not, consumes nothing)
- One or more: `+(rule1)` (is matched as often as possible, but has to be matched at least once)
- Zero or more: `*(rule1)` (is matched as often as possible, but counts as matched even if never matched)
- Regular expressions: `r([a-zA-Z]+)r` (matches whatever the regex between `r(` and `)r` matches)
- Suppression: `-(rule1)` (the rule has to be matched but yields no node to the AST)
- Semantic action: `rule { expr }` (uses `expr` to create the node instead of the default `no_action`; see below for more information)
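To make these building blocks concrete, here is a small sketch of a grammar that combines several of them (the rule names and the grammar itself are made up for illustration; only the constructs listed above are used):

```julia
using StringParserPEG

# Hypothetical grammar for a comma-separated list of words such as
# "foo, bar, baz": suppression (-) hides whitespace and the separating
# commas, *(...) allows any number of further items, ?(...) tolerates a
# trailing comma, and the words themselves are matched by a regex.
listgrammar = Grammar("""
  start => (word & *(item) & ?(comma))
  item => (-(comma) & word)
  comma => (-(space) & ',')
  word => (-(space) & r([a-zA-Z]+)r)
  space => r([ \\t\\n\\r]*)r
""")

(ast, pos, err) = parse(listgrammar, "foo, bar, baz")
```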
The argument to `Grammar()` is a String, where line ends or semicolons (`;`) can be used to separate rules.
All grammars by default use `start` as the starting rule. You can specify a different starting rule in the `parse` function if you desire.
Note: All these examples and more can be found in the examples folder of StringParserPEG.
Let's start by creating a simple calculator that can take two numbers and an operator to give a result.
We first define the grammar:
```julia
calc1 = Grammar("""
  start => (number & op & number)
  op => plus | minus
  number => (-(space) & r([0-9]+)r)
  plus => (-(space) & '+')
  minus => (-(space) & '-')
  space => r([ \\t\\n\\r]*)r
""")
```
The starting rule is composed of two other rules: `number` and `op`. For this calculator, we only allow `+` and `-`.
The `number` rule matches one or more digits (0 to 9). You'll note that a (suppressed) `space` appears in front of all terminals; this is because PEGs don't handle spaces automatically.
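Since `space` is matched (and suppressed) in front of every terminal, the same grammar should also accept input containing whitespace; a quick check (using `parse`, described next):

```julia
# The leading spaces are consumed by the suppressed `space` rule inside
# `number` and `plus`, so this parses just like "4+5".
(ast, pos, err) = parse(calc1, " 4 + 5")
```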
`parse(grammar, string)` constructs the AST of `string` according to `grammar`.
Now we can run this grammar with some input:
```julia
(ast, pos, err) = parse(calc1, "4+5")
ast
```
resulting in the following output:
```
node() {StringParserPEG.AndRule}
  1: node() {StringParserPEG.AndRule}
    1: node() {'4',StringParserPEG.RegexRule}
  2: node() {StringParserPEG.AndRule}
    1: node() {'+',StringParserPEG.Terminal}
  3: node() {StringParserPEG.AndRule}
    1: node() {'5',StringParserPEG.RegexRule}
```
Finally, one transforms the AST into the desired data structure by first defining an appropriately overloaded actuator function and then calling it recursively on the AST via `transform(function, ast)`.
We now have the desired AST for "4+5". For our calculator we do not want to put everything into a data structure, but rather fold it all up directly into the final result.
For the transformation an actuator function needs to be defined which dispatches on the names of the nodes. So we first need to give names to the parsed nodes:
```julia
calc1 = Grammar("""
  start => (number & op & number) {"start"}
  op => (plus | minus) {"op"}
  number => (-(space) & r([0-9]+)r) {"number"}
  plus => (-(space) & '+') {"plus"}
  minus => (-(space) & '-') {"minus"}
  space => r([ \\t\\n\\r]*)r
""")
```
leading to the following AST:
```
node(start) {StringParserPEG.AndRule}
  1: node(number) {StringParserPEG.AndRule}
    1: node() {'4',StringParserPEG.RegexRule}
  2: node(plus) {StringParserPEG.AndRule}
    1: node() {'+',StringParserPEG.Terminal}
  3: node(number) {StringParserPEG.AndRule}
    1: node() {'5',StringParserPEG.RegexRule}
```
We can now define the actuator function as
```julia
toresult(node, children, ::MatchRule{:default}) = node.value
toresult(node, children, ::MatchRule{:number}) = parse(Int, node.value)
toresult(node, children, ::MatchRule{:plus}) = +
toresult(node, children, ::MatchRule{:minus}) = -
toresult(node, children, ::MatchRule{:start}) = children[2](children[1], children[3])
```
and recursively apply it to our AST with `transform(toresult, ast)` to obtain the correct result, `9`.
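Putting the pieces together, the whole first example can be run as a single script (a sketch assembled from the snippets above):

```julia
using StringParserPEG

# Grammar with named nodes (see above).
calc1 = Grammar("""
  start => (number & op & number) {"start"}
  op => (plus | minus) {"op"}
  number => (-(space) & r([0-9]+)r) {"number"}
  plus => (-(space) & '+') {"plus"}
  minus => (-(space) & '-') {"minus"}
  space => r([ \\t\\n\\r]*)r
""")

# Actuator function dispatching on the node names.
toresult(node, children, ::MatchRule{:default}) = node.value
toresult(node, children, ::MatchRule{:number}) = parse(Int, node.value)
toresult(node, children, ::MatchRule{:plus}) = +
toresult(node, children, ::MatchRule{:minus}) = -
toresult(node, children, ::MatchRule{:start}) = children[2](children[1], children[3])

(ast, pos, err) = parse(calc1, "4+5")
transform(toresult, ast)   # should yield 9
```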
One does not always want to create every node as a basic `Node` type. Actions allow one to operate on parts of the AST during its very construction. An action is specified by `{ action }` following any rule. Generally, a function (anonymous or explicit) has to be specified, which takes the arguments `(rule, value, firstpos, lastpos, childnodes)` and may return anything that nodes higher up in the AST can work with.
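For instance, an anonymous action can already convert matched digits to an integer while the node is being built (a sketch; as in the calculator example further below, it assumes a node's `value` field holds the matched text):

```julia
# Hypothetical one-rule grammar: the action receives
# (rule, value, firstpos, lastpos, childnodes) and returns an Int,
# which then takes the place of the node in the AST.
g = Grammar("""start => ('n' & r([0-9]+)r) {(r,v,f,l,c) -> parse(Int, c[2].value)}""")
(n, pos, err) = parse(g, "n42")   # n should now simply be 42
```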
As a shorthand, just specifying a name as a string, e.g. `"name"`, results in a normal node, but with the specified name set. This is how we did the naming in example 1 above. As a side note: the action `liftchild` just takes the child of the node and returns it on the current level. This is the default action for `|`-rules -- whichever child gets matched is returned in place of the `|`-rule, as if we had explicitly specified `myOrRule = (rule1 | rule2) {liftchild}`.
Actions always apply to the single token preceding them, so in
- `rule1 {action} & rule2` the action applies to `rule1`,
- `rule1 & rule2 {action}` the action applies to `rule2`,
- `(rule1 & rule2) {action}` the action applies to the `&`-rule joining `rule1` and `rule2`.

For another example, in
- `*(rule) {action}` the action applies to the `*`-rule,
- `*(rule {action})` the action applies to `rule`.
As our calculator is really very simple we could have - instead of first building a named AST and then transforming it - directly parsed the string into the final result by means of actions:
```julia
calc2 = Grammar("""
  start => (number & op & number){(r,v,f,l,c) -> c[2](c[1],c[3])}
  op => plus | minus
  number => (-(space) & r([0-9]+)r) {(r,v,f,l,c) -> parse(Int,c[1].value)}
  plus => (-(space) & '+'){(a...) -> +}
  minus => (-(space) & '-'){(a...) -> -}
  space => r([ \\t\\n\\r]*)r
""")
```
which would have directly resulted in `9` when calling `parse(calc2, "4+5")`.
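So with `calc2` the result of the parse is already the number itself; a quick check (same tuple shape as returned by `parse` above):

```julia
(result, pos, err) = parse(calc2, "4+5")
result   # 9
```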
Note that if you wish to define functions outside the grammar definition and call them by name, you have to do so in the scope of `StringParserPEG`, e.g.
```julia
using StringParserPEG
@eval StringParserPEG begin
    foo(r,v,f,l,c) = "foo"
end
Grammar("start => 'f' {foo}")
```
Actually, the best example of how to parse things can be found in the source code itself. In `grammarparsing.jl` we give the grammar used to parse the grammar specifications written by the user. While it is not actually live code, its consistency with what really happens is ensured by having it be a test in the test suite. Look there if you ever wonder about any specifics of grammar specification.
Negation is easily misunderstood as "matching anything that isn't `rule`". That's not quite correct, since `!()` doesn't actually consume anything (it isn't matching, so how much would you assume it consumes anyway? ;)). To understand what negation does, consider a simple example.
We want to match any string between '<<' and '>>'. Simple enough?
```julia
g = Grammar("""start => '<<' & *(char) & '>>'
               char => r(.)r""")
parse(g, "<< a << b < c >>")
```
errors, because `*(char)` consumes everything after the initial '<<' -- including the final '>>' -- so the explicit '>>' in `:start` cannot be matched. The solution:
```julia
g = Grammar("""start => '<<' & *(char) & '>>'
               char => !('>>') & r(.)r""")
```
which only matches `char` until '>>'.
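With the amended `char` rule the same input now parses, since the repetition stops as soon as the lookahead sees '>>':

```julia
# The negation acts as a guard: `char` no longer swallows the closing '>>'.
(ast, pos, err) = parse(g, "<< a << b < c >>")
```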
- The entry point to the library is of course the file `StringParserPEG.jl`, which handles all `import`/`export`ing and includes the other files in order.
  - `rules.jl` defines `Rule` and all its subtypes. These typically consist of a `name` (which, when constructed with the default constructor, is simply `""`), a type-specific `value` and an `action`.
  - `grammar.jl` defines the `Grammar` type as a dictionary mapping symbols to rules.
  - `comparison.jl` defines comparison functions so that it is possible to check, for example, whether two grammars are the same.
  - `standardactions.jl` defines some utility actions which are often needed, e.g. the above-mentioned `liftchild`.

  After these files are read, it is possible to specify any grammar in the most Julian way: by manually stacking constructors into one another.

  - `node.jl` defines the `Node` type which makes up any AST. An AST is actually just a top-level node and all its (recursive) children.
  - `parse.jl` defines the generic `parse` function and the `parse_newcachekey` functions, which are specialized for each `Rule` subtype to handle the recursive parsing of a string by a specified grammar.
  - `transform.jl` defines the `transform` function mentioned above.

  After these files are additionally read, it is possible to also parse and transform a string into a data structure according to some grammar built by the manual stacking process discussed above.

- Since all functionality is now in principle available, `grammarparsing.jl` defines a grammar to parse grammars (built by the stacking process) to allow the end user to simply specify his or her grammar as a string. Note that some grammar functionality is still only available by direct construction; this is because the consistent definition of such a grammar-grammar becomes exponentially more difficult with the number of grammar features.
- `standardrules.jl` defines a grammar `standardrules`, consisting only of commonly used rules like `space`, `float`, etc., so that they do not have to be defined by the end user every single time. Instead, the end user can simply join these rules into his or her definition by constructing the grammar as `Grammar("...", standardrules)` (see the sketch below).
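As a sketch of that last point (assuming `standardrules` is exported and, as stated, contains rules named `space` and `float`), a user grammar can reuse those rules without redefining them:

```julia
using StringParserPEG

# Hypothetical example: `space` and `float` are taken from `standardrules`
# rather than being defined in this grammar.
g = Grammar("start => (-(space) & float)", standardrules)
(ast, pos, err) = parse(g, " 3.14")
```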