Julia implementation of Byte Pair Encoding for NLP
Author chengchingwen
Started in December 2018



Pure Julia implementation of the Byte Pair Encoding (BPE) method.

The design is inspired by the original Python package subword-nmt and the byte-level BPE used in OpenAI GPT-2. BytePairEncoding.jl supports different tokenization methods (with the help of WordTokenizers.jl). You can simply set the tokenizer and then learn the BPE map with it.
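
To make "learning the BPE map" concrete, here is a minimal sketch of the classic merge-learning loop written against Base Julia only. It is an illustration of the general technique, not the code BytePairEncoding.jl actually runs: the function names `count_pairs`, `merge_pair`, and `learn_bpe` are hypothetical, and the tie-breaking rule is an assumption made so the result is deterministic. The end-of-word marker `</w>` mirrors the `endsym` shown in the REPL output further down.

```julia
# Illustrative BPE merge learning (hypothetical helpers, not the package API).

# Count adjacent symbol pairs, weighted by how often each word occurs.
function count_pairs(vocab::Dict{Vector{String},Int})
    pairs = Dict{Tuple{String,String},Int}()
    for (symbols, freq) in vocab
        for i in 1:length(symbols)-1
            pair = (symbols[i], symbols[i+1])
            pairs[pair] = get(pairs, pair, 0) + freq
        end
    end
    return pairs
end

# Replace every occurrence of `pair` in `symbols` with the concatenated symbol.
function merge_pair(symbols::Vector{String}, pair::Tuple{String,String})
    merged = String[]
    i = 1
    while i <= length(symbols)
        if i < length(symbols) && (symbols[i], symbols[i+1]) == pair
            push!(merged, symbols[i] * symbols[i+1])
            i += 2
        else
            push!(merged, symbols[i])
            i += 1
        end
    end
    return merged
end

# Learn up to `n_merge` merge rules from a word => frequency table.
function learn_bpe(word_freqs::Dict{String,Int}, n_merge::Int)
    vocab = Dict{Vector{String},Int}()
    for (word, freq) in word_freqs
        # Split into single characters and append the end-of-word marker.
        symbols = [string.(collect(word)); "</w>"]
        vocab[symbols] = get(vocab, symbols, 0) + freq
    end
    merges = Tuple{String,String}[]
    for _ in 1:n_merge
        pairs = count_pairs(vocab)
        isempty(pairs) && break
        # Most frequent pair first; ties broken lexicographically (an assumption).
        ranked = sort!(collect(pairs); by = p -> (-p.second, p.first))
        best = first(ranked).first
        push!(merges, best)
        new_vocab = Dict{Vector{String},Int}()
        for (symbols, freq) in vocab
            key = merge_pair(symbols, best)
            new_vocab[key] = get(new_vocab, key, 0) + freq
        end
        vocab = new_vocab
    end
    return merges
end

merges = learn_bpe(Dict("low" => 5, "lower" => 2, "lowest" => 3), 3)
```

The returned `merges` vector is the ordered list of merge rules; applying them in order to a new word reproduces the segmentation. A real implementation would also apply a tokenizer to raw text first, which is the part the package lets you configure.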


In the Julia REPL:

]add BytePairEncoding


julia> using BytePairEncoding, WordTokenizers

# using the bpe from openai gpt
julia> bpe = Bpe(""))
GenericBPE{String}(n_merge=40000, endsym=</w>, oldendsym=</w>, input_transform=tokenize)

# reset the input transform so the text is lowercased before tokenization
julia> bpe = GenericBPE(bpe; input_transform = tokenize ∘ lowercase)
GenericBPE{String}(n_merge=40000, endsym=</w>, oldendsym=</w>, input_transform=ComposedFunction{typeof(tokenize), typeof(lowercase)}(WordTokenizers.tokenize, lowercase))

# segment the sentence
julia> bpe("Peter Piper picked a peck of pickled peppers")
8-element Vector{String}:

# using the byte level bpe from openai gpt2
julia> bbpe = ByteLevelBPE(""))
GenericBPE{String}(n_merge=50000, input_transform=gpt2_tokenizer, codemap=BytePairEncoding.CodeMap(StepRange{Char, Int64}['\0':1:' ', '\x7f':1:' ', '\uad':1:'\uad'], StepRange{Char, Int64}['Ā':1:'Ġ', 'ġ':1:'ł', 'Ń':1:'Ń']))

# segment the sentence
julia> bbpe("This is a 😺")
5-element Vector{String}:

# to see the original input, set an output_transform that unmaps the code points
julia> decoded_bbpe = GenericBPE(bbpe; output_transform = BytePairEncoding.UnMap(bbpe.codemap))
GenericBPE{String}(n_merge=50000, input_transform=gpt2_tokenizer, output_transform=BytePairEncoding.UnMap(BytePairEncoding.CodeMap(StepRange{Char, Int64}['\0':1:' ', '\x7f':1:' ', '\uad':1:'\uad'], StepRange{Char, Int64}['Ā':1:'Ġ', 'ġ':1:'ł', 'Ń':1:'Ń'])), codemap=BytePairEncoding.CodeMap(StepRange{Char, Int64}['\0':1:' ', '\x7f':1:' ', '\uad':1:'\uad'], StepRange{Char, Int64}['Ā':1:'Ġ', 'ġ':1:'ł', 'Ń':1:'Ń']))

julia> decoded_bbpe("This is a 😺")
5-element Vector{String}:
 "This"
 " is"
 " a"
 " \xf0\x9f\x98"
 "\xba"

julia> join(ans)
"This is a 😺"
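
The partial escape sequence in the decoded output appears because a byte-level BPE works on UTF-8 bytes, so a token boundary can fall inside a multi-byte character. The sketch below reproduces that effect with Base Julia only; it does not call BytePairEncoding.jl, and the 3 + 1 byte split is simply chosen to match the token shown above.

```julia
# Inspect the UTF-8 bytes behind the emoji using only Base Julia.
emoji = "😺"
bytes = collect(codeunits(emoji))   # the four UTF-8 code units of the emoji

# Split inside the character, as a byte-level token boundary might.
head = String(bytes[1:3])           # first three bytes: incomplete on their own
tail = String(bytes[4:4])           # the remaining byte

# Neither piece has to be valid UTF-8, but joining them restores the character.
head_is_valid = isvalid(head)
roundtrip_ok = (head * tail == emoji)
```

Here `head_is_valid` is `false` while `roundtrip_ok` is `true`, which is why the individual tokens look broken but `join` recovers the original sentence.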