BytePairEncoding.jl

Julia implementation of Byte Pair Encoding for NLP
Author chengchingwen
Popularity
3 Stars
Updated Last
6 Months Ago
Started In
December 2018

BytePairEncoding.jl

Build status codecov

Pure Julia implementation of the Byte Pair Encoding (BPE) method.

The design is inspired by the original python package subword-nmt and the byte-level bpe use in openai-gpt2. BytePairEncoding.jl support different tokenize method(with the help of WordTokenizers.jl). You can simply set the tokenizer and then learn the BPE map with it.

Installation

In the Julia REPL:

]add BytePairEncoding

Usage

julia> using BytePairEncoding, WordTokenizers

# using the bpe from openai gpt
julia> bpe = Bpe(Base.download("https://huggingface.co/openai-gpt/resolve/main/merges.txt"))
GenericBPE{String}(n_merge=40000, endsym=</w>, oldendsym=</w>, input_transform=tokenize)

# reset the tokenize method to do lowercase before tokenization
julia> bpe = GenericBPE(bpe; input_transform = tokenizelowercase)
GenericBPE{String}(n_merge=40000, endsym=</w>, oldendsym=</w>, input_transform=ComposedFunction{typeof(tokenize), typeof(lowercase)}(WordTokenizers.tokenize, lowercase))

# segment the sentence
julia> bpe("Peter Piper picked a peck of pickled peppers")
8-element Vector{String}:
 "peter</w>"
 "piper</w>"
 "picked</w>"
 "a</w>"
 "peck</w>"
 "of</w>"
 "pickled</w>"
 "peppers</w>"

# using the byte level bpe from openai gpt2
julia> bbpe = ByteLevelBPE(Base.download("https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt"))
GenericBPE{String}(n_merge=50000, input_transform=gpt2_tokenizer, codemap=BytePairEncoding.CodeMap(StepRange{Char,
Int64}['\0':1:' ', '\x7f':1:' ', '\uad':1:'\uad'], StepRange{Char, Int64}['Ā':1:'Ġ', 'ġ':1:'ł', 'Ń':1:'Ń']))

# segment the sentence
julia> bbpe("This is a 😺")
5-element Vector{String}:
 "This"
 "Ġis"
 "Ġa"
 "ĠðŁĺ"
 "º"

# to see the origin input, set the output_transform method that unmap the codepoint
julia> decoded_bbpe = GenericBPE(bbpe; output_transform = BytePairEncoding.UnMap(bbpe.codemap))
GenericBPE{String}(n_merge=50000, input_transform=gpt2_tokenizer, output_transform=BytePairEncoding.UnMap(BytePairEncoding.CodeMap(StepRange{Char, Int64}['\0':1:' ', '\x7f':1:' ', '\uad':1:'\uad'], StepRange{Char, Int64}['Ā':1:'Ġ', 'ġ':1:'ł', 'Ń':1:'Ń'])), codemap=BytePairEncoding.CodeMap(StepRange{Char, Int64}['\0':1:' ', '\x7f':1:' ', '\uad':1:'\uad'], StepRange{Char, Int64}['Ā':1:'Ġ', 'ġ':1:'ł', 'Ń':1:'Ń']))

julia> decoded_bbpe("This is a 😺")
5-element Vector{String}:
 "This"
 " is"
 " a"
 " \xf0\x9f\x98"
 "\xba"

julia> join(ans)
"This is a 😺"