ReadableRegex

A package for people who don't want to learn or read regexes

ReadableRegex.jl gives you a syntax that is much easier to write and understand than the rather cryptic standard Regex. The syntax is as close as possible to a natural language description of the Regex. Especially for nested and grouped expressions with special characters it is orders of magnitudes easier to understand than a simple regex string. You can also compile the function expression before using it with the @compile macro for lower runtime costs.

Here's an example:

Quickly, what does this regex do?

regex = r"[\+-]?(?:\d*\.)?\d+"

Compare with this:

regex = maybe(["-", "+"]) *
            maybe(zero_or_more(DIGIT) * ".") *
            one_or_more(DIGIT)

Both of these match all kinds of floating point numbers like these:

"1 2.0 .3 -.4 -5 60 700 800.9 +9000"

But to understand the Regex you have to mentally parse it into the second version. Why not avoid that effort?

ReadableRegex also escapes reserved regex characters like "." by default, saving you a lot of backslashes.

How does it work?

ReadableRegex uses a simple intermediary type RegexString which holds a string representation of a regex, the same thing you'd normally write yourself.

These RegexStrings can be concatenated together with the * operator. You can convert them into a Regex by calling Regex(rs::RegexString), or use them directly with match and eachmatch.

Let's say we wanted to extract all the possible integers from the list above. One way could be this:

julia> reg = @compile look_for(
                maybe(char_in("+-")) * one_or_more(DIGIT),
                not_after = ".",
                not_before = NON_SEPARATOR)

r"(?:(?<!\.)(?:(?:[+\-])?(?:\d)+))(?!\P{Z})"
# you do not want to read or write that...

julia> eachmatch(reg, "1 2.0 .3 -.4 -5 60 700 800.9 +9000") .|> println;

RegexMatch("1")
RegexMatch("-5") 
RegexMatch("60") 
RegexMatch("700")
RegexMatch("+9000")

Note that the regex string representation is currently not as parsimonious as possible, as non-capturing groups are used more freely than strictly needed to keep the logic simple. This should not affect performance, though.

Regex building blocks

Function	Purpose
`exactly(n, target)`	Match exactly `n` repetitions of `target`
`maybe(target)`	Match zero or one repetitions of `target`
`zero_or_more(target)`	Match zero to infinity repetitions of `target`
`one_or_more(target)`	Match one or more repetitions of `target`
`at_least(n, target)`	Match at least `n` repetitions of `target`
`between(low, high, target)`	Match between `low` and `high` repetitions of `target`
`lazy_` or `possessive_` quantifier functions (e.g., `lazy_zero_or_more`)	Lazy (as few matches as possible) or possessive (as many matches as possible without backtracking) versions of the above quantifier functions. Defaults are greedy (as many matches as possible, with backtracking).
`look_for(target; [after, before, not_after, not_before])`	Match `target` only if it is `before`, `after`, `not_after`, or `not_before` other matches. You can set multiple keywords at once.
`either(targets...)`	Match either the first, or the second, etc... target in `targets`, in order of appearance.
`capture(target; [as])`	Create a numbered capture group that you can back reference, or name it optionally with the `as` keyword.
`reference(i::Int)`	Back reference to capture group number `i` (1 based counting)
`reference(name)`	Back reference to the capture group named `name`
`char_in(args...)`	Match any character in the set given by the arguments. A string like `"abc"` matches each of its characters, in this case a, b or c. It can also be a range like `'a':'z'`, or a category constant like `WHITESPACE` or `UPPERCASE`.
`char_not_in(args...)`	Match any character not in the set given by the arguments. Arguments can be the same as in `char_in(args...)`.

Creating RegexStrings manually

If you need to, you can create a RegexString using its constructor with a properly escaped string.

r = RegexString("\\w")

Alternatively, there is the @rs_str macro which allows you to use normal regex syntax, without special escaping rules.

r = rs"\w"

This can also be useful if you want to start a chain of concatenated parts where the multiplication operator needs one of the two first elements to be a RegexString.

r = rs"c" * ["at", "ool", "ute"]

Conversions

Some constructs from Base Julia are useful to express building blocks of regular expression. You can define Base.convert(::Type{RegexString}, obj) to use these directly in ReadableRegex expressions. All building block functions call convert on their inputs, and so does the multiplication operator *.

Some predefined examples:

`String` and `Char`

Strings and char_in are escaped when converted, so you can use .+[]^$ etc. without escaping them manually.

one_or_more("+")
maybe("[text in brackets]")

`StepRange{Char, Int}`

Char ranges can be used directly and match any char within the range.

one_or_more('a':'z')
between(1, 4, '🌑':'🌘')

`AbstractVector` and `Tuple`

Using an AbstractVector or a Tuple is the same as calling either(vec_or_tup...). That means you can use any element that is itself convertible to a RegexString.

exactly(3, ['a':'z', "ha"])
maybe(('x', 'Y'))

Compilation

You can use the @compile macro to build a finished Regex out of the following expression. This might help to avoid some runtime costs. You can only use literals and the ReadableRegex constants in these expressions.

julia> @allocated BEGIN * maybe('a':'z') * exactly(1, ["hello", "hi", "what's up"]) * END
2528

julia> @allocated @compile BEGIN * maybe('a':'z') * exactly(1, ["hello", "hi", "what's up"]) * END
0

Constants

These constants reference commonly used character classes:

Constant	Regex
`WORD`	`\w`
`NON_WORD`	`\W`
`DIGIT`	`\d`
`NON_DIGIT`	`\D`
`WHITESPACE`	`\s`
`NON_WHITESPACE`	`\S`
`NON_LINEBREAK`	`.`
`ANY`	`[\s\S]`
`BEGIN`	`^`
`END`	`$`
`WORDBOUND`	`\b`
`NON_WORDBOUND`	`\B`

These constants access unicode categories. For brevity, the negated variants are not listed. They all start with NON_, like NON_LETTER, NON_PUNCTUATION, etc.

Unicode Constant	Regex
Letters
`LETTER`	`\p{L}`
`UPPERCASE`	`\p{Lu}`
`LOWERCASE`	`\p{Ll}`
`TITLECASE`	`\p{Lt}`
`MODIFIER_LETTER`	`\p{Lm}`
`OTHER_LETTER`	`\p{Lo}`
Marks
`MARK`	`\p{M}`
`NONSPACING_MARK`	`\p{Mn}`
`SPACING_COMBINING_MARK`	`\p{Mc}`
`ENCLOSING_MARK`	`\p{Me}`
Numbers
`NUMBER`	`\p{N}`
`DEC_DIGIT_NUMBER`	`\p{Nd}`
`LETTER_NUMBER`	`\p{Nl}`
`OTHER_NUMBER`	`\p{No}`
Symbols
`SYMBOL`	`\p{S}`
`MATH_SYMBOL`	`\p{Sm}`
`CURRENCY`	`\p{Sc}`
`MODIFIER_SYMBOL`	`\p{Sk}`
`OTHER_SYMBOL`	`\p{So}`
Punctuation
`PUNCTUATION`	`\p{P}`
`CONNECTOR_PUNCT`	`\p{Pc}`
`DASH_PUNCT`	`\p{Pd}`
`OPEN_PUNCT`	`\p{Ps}`
`CLOSE_PUNCT`	`\p{Pe}`
`INITIAL_PUNCT`	`\p{Pi}`
`FINAL_PUNCT`	`\p{Pf}`
`OTHER_PUNCT`	`\p{Po}`
Separators
`SEPARATOR`	`\p{Z}`
`SPACE_SEP`	`\p{Zs}`
`LINE_SEP`	`\p{Zl}`
`PARAGRAPH_SEP`	`\p{Zp}`
Other
`OTHER`	`\p{C}`
`CONTROL`	`\p{Cc}`
`FORMAT`	`\p{Cf}`
`SURROGATE`	`\p{Cs}`
`PRIVATE_USE`	`\p{Co}`
`UNASSIGNED`	`\p{Cn}`

ReadableRegex.jl

ReadableRegex

A package for people who don't want to learn or read regexes

Quickly, what does this regex do?

Compare with this:

How does it work?

Regex building blocks

Creating RegexStrings manually

Conversions

`String` and `Char`

`StepRange{Char, Int}`

`AbstractVector` and `Tuple`

Compilation

Constants

Required Packages

Used By Packages

ReadableRegex

A package for people who don't want to learn or read regexes

Quickly, what does this regex do?

Compare with this:

How does it work?

Regex building blocks

Creating RegexStrings manually

Conversions

String and Char

StepRange{Char, Int}

AbstractVector and Tuple

Compilation

Constants

Required Packages

Used By Packages

Julia Packages

`String` and `Char`

`StepRange{Char, Int}`

`AbstractVector` and `Tuple`