WikiText.jl provides an interface to the WikiText Long Term Dependency Language Modeling dataset.
WikiText exports the following 4 types, corresponding to the 4 available datasets:
WikiText2, WikiText103, WikiText2Raw, WikiText103Raw
WikiText also exports the following 3 functions:
trainfile, validationfile, testfile
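Each of these functions takes a dataset instance and returns the path of the corresponding file as a string, so the files can be read with ordinary Julia I/O. As a rough sketch, the helper below counts whitespace-separated tokens in such a file. Note that `count_tokens` is a hypothetical name used only for illustration; it is not part of WikiText.jl.

```julia
# Hypothetical helper (not provided by WikiText.jl): count the
# whitespace-separated tokens in a text file, one line at a time,
# so the whole file never has to be held in memory.
function count_tokens(path::AbstractString)
    n = 0
    for line in eachline(path)
        # split with no delimiter splits on whitespace and drops empty strings
        n += length(split(line))
    end
    return n
end

# Intended usage, given a dataset instance `corpus`:
#   count_tokens(trainfile(corpus))
```

Streaming with `eachline` is a deliberate choice here, since the larger datasets may be too big to load comfortably as a single string.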
Downloading and unzipping the datasets will happen automatically (with your approval) when you access them for the first time, courtesy of DataDeps.jl.
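In non-interactive settings such as scripts or CI there is nobody to answer the approval prompt. DataDeps.jl reads the `DATADEPS_ALWAYS_ACCEPT` environment variable for this case. A minimal sketch, assuming the variable is set before a dataset is accessed for the first time:

```julia
# Assumption: this line runs before the first dataset access.
# It tells DataDeps.jl to accept downloads without prompting, which
# also means accepting the dataset's terms on your behalf.
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
```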
julia> ]add WikiText
julia> using WikiText
julia> corpus = WikiText2v1()
julia> trainfile(corpus)
"/path/to/wiki.train.tokens"
julia> validationfile(corpus)
"/path/to/wiki.valid.tokens"
julia> testfile(corpus)
"/path/to/wiki.test.tokens"