Stop words are the words in a negative dictionary that are filtered out before or after processing natural language data (text) because they carry little meaning on their own. This Julia package contains a collection of stop words for multiple languages. The data is sourced from https://github.com/stopwords-iso/stopwords-iso. Currently, this package supports 57 languages, identified by their ISO 639-3 codes:
afr ara ben bre bul cat ces dan deu ell eng epo est eus fas fin fra gle glg guj hau hbs heb hin hun hye ita jpn kor kur lat lav lit mar msa nld nor pol por ron rus slk slv som sot spa swa swe tgl tha tur ukr urd vie yor zho zul
import Pkg; Pkg.add("StopWords")
The stopwords variable is the only exported symbol of this package. It can be regarded as a lazy dictionary of stop words for multiple languages. You can access the stop words for a given language using the language name or its ISO 639 code. For example, to get the stop words for English, you can use stopwords["eng"], stopwords["en"], or stopwords["English"].
julia> using StopWords
julia> stopwords["eng"]
Set{String} with 1298 elements:
"nu"
"youd"
"whoever"
"shouldn"
"null"
"everywhere"
⋮
julia> stopwords["eng"] === stopwords["en"] === stopwords["English"]
true
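Because each lookup returns a Set{String}, membership tests are cheap, which makes the result convenient for filtering tokens. The following is a minimal sketch rather than part of the package: remove_stopwords is a hypothetical helper, and demo_stops is a small stand-in set so the example runs without StopWords installed. With the package loaded, you would pass stopwords["eng"] instead.

```julia
# Hypothetical helper (not part of StopWords): keep only the tokens
# that are not in `stops`. `stops` can be any collection supporting `in`,
# such as the Set{String} returned by stopwords["eng"].
function remove_stopwords(tokens, stops)
    return [t for t in tokens if !(lowercase(t) in stops)]
end

# Stand-in set (an assumption for illustration) so the sketch is self-contained.
# With StopWords loaded, use stopwords["eng"] here instead.
demo_stops = Set(["the", "over", "a"])

# Split on whitespace and convert the substrings to String.
tokens = String.(split("The quick brown fox jumps over the lazy dog"))

content_words = remove_stopwords(tokens, demo_stops)
println(content_words)
```

The helper lowercases each token before the lookup because the entries shown above are lowercase. Whether every entry in the real lists is lowercase is an assumption.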
You can also get the stop words for multiple languages at once.
julia> stopwords[["eng", "fra"]]
Set{String} with 1922 elements:
"nu"
"youd"
"ont"
"pfut"
"whoever"
"shouldn"
"enfin"
"tac"
⋮
julia> stopwords[["eng", "fra"]] === stopwords[("eng", "fra")] == stopwords["eng"] ∪ stopwords["fra"]
true
You can also get the stop words for all languages at once.
julia> stopwords[:] === stopwords[] === stopwords[StopWords.supported_languages()]
true
The StopWords.supported_languages() function returns a set of all the languages currently supported by the package. To check whether a specific language is supported, use the haskey function. To check several languages at once, pass a list to haskey; as the examples below show, it returns true only if every language in the list is supported.
julia> haskey(stopwords, "eng")
true
julia> haskey(stopwords, ["English", "fra"])
true
julia> haskey(stopwords, ["English", "foo"])
false
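A haskey check can guard a lookup when the language code comes from user input. The sketch below is only an illustration: stopwords_or_empty is a hypothetical helper, and demo_table is a plain Dict standing in for stopwords so the code runs without the package. The source does not say what indexing with an unsupported language does, so the guard avoids relying on that behavior.

```julia
# Hypothetical helper (not part of StopWords): return the stop words for
# `lang`, or an empty set when `lang` is not supported.
# `table` is anything that supports haskey and indexing, such as `stopwords`.
function stopwords_or_empty(table, lang)
    return haskey(table, lang) ? table[lang] : Set{String}()
end

# Stand-in table (an assumption for illustration); with StopWords loaded,
# pass `stopwords` as the first argument instead.
demo_table = Dict("eng" => Set(["the", "a"]))

known = stopwords_or_empty(demo_table, "eng")    # the stored set
unknown = stopwords_or_empty(demo_table, "foo")  # empty Set{String}
```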