Julia port of SymSpell, an extremely fast spelling correction and fuzzy search algorithm.
    using SymSpellChecker

    d = SymSpell()
    push!(d, "hello")
    push!(d, "world")

    d["wrold"]  # ["world"]
Dictionaries can be created as follows:

    using SymSpellChecker

    # Loading from file
    d = SymSpell("assets/frequency_dictionary_en_30_000.txt")

    # Manual update
    d = SymSpell()
    push!(d, "hello", 100)
    push!(d, "world", 50)
The third argument of the push! function is the word frequency, which lookup later uses to sort results from highest to lowest frequency.
The SymSpell constructor accepts the following arguments:
- max_dictionary_edit_distance: maximum allowed search distance. Higher values require considerably more memory. Default value is 2.
- prefix_length: prefix length used to generate candidates. Higher values correspond to higher memory requirements but shorter search times. Default value is 5.
- count_threshold: words with frequencies below this threshold do not appear in search results.
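To illustrate why both max_dictionary_edit_distance and prefix_length affect memory, here is a minimal sketch of the delete-based candidate generation that SymSpell-style indexes rely on. The function name `delete_variants` and its keyword names are hypothetical, chosen only to mirror the constructor arguments; the package's internal implementation may differ.

```julia
# Hypothetical helper, not part of SymSpellChecker's API.
# Returns every string reachable from the first `prefix_length` characters of
# `word` by deleting at most `max_distance` characters.
function delete_variants(word::AbstractString; max_distance::Int = 2, prefix_length::Int = 5)
    prefix = String(first(word, prefix_length))  # only the prefix is expanded
    variants = Set([prefix])
    frontier = [prefix]
    for _ in 1:max_distance
        next_frontier = String[]
        for candidate in frontier
            chars = collect(candidate)
            for i in eachindex(chars)
                shorter = String(deleteat!(copy(chars), i))  # drop the i-th character
                if !(shorter in variants)
                    push!(variants, shorter)
                    push!(next_frontier, shorter)
                end
            end
        end
        frontier = next_frontier
    end
    return variants
end

# A dictionary word and a misspelled query meet on shared delete variants.
shared = intersect(delete_variants("world"; max_distance = 1),
                   delete_variants("wrold"; max_distance = 1))
```

In this sketch, `shared` contains "wold" and "wrld", which is how a misspelled query can be matched to a dictionary word without comparing it against every entry. Raising either argument increases the number of variants stored per word.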
Words can be looked up as follows:

    lookup(d, "wrold")  # [SuggestItem("world", 1, 50)]
In this result, 1 is the Damerau-Levenshtein distance between the query and the suggested word, and 50 is the frequency of the word in the current dictionary.
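For readers unfamiliar with the metric, the following is a minimal sketch of the optimal string alignment variant of the Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and swaps of adjacent characters. The function name `osa_distance` is hypothetical, and the distance routine actually used by the package may be implemented differently.

```julia
# Optimal string alignment (restricted Damerau-Levenshtein) distance.
# Illustrative only; not taken from SymSpellChecker.
function osa_distance(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    # dist[i + 1, j + 1] is the distance between the first i characters of `a`
    # and the first j characters of `b`.
    dist = zeros(Int, m + 1, n + 1)
    dist[:, 1] .= 0:m
    dist[1, :] .= 0:n
    for i in 1:m, j in 1:n
        cost = s[i] == t[j] ? 0 : 1
        dist[i + 1, j + 1] = min(dist[i, j + 1] + 1,   # deletion
                                 dist[i + 1, j] + 1,   # insertion
                                 dist[i, j] + cost)    # substitution
        if i > 1 && j > 1 && s[i] == t[j - 1] && s[i - 1] == t[j]
            # swap of two adjacent characters
            dist[i + 1, j + 1] = min(dist[i + 1, j + 1], dist[i - 1, j - 1] + 1)
        end
    end
    return dist[m + 1, n + 1]
end

osa_distance("wrold", "world")  # 1: a single swap of "r" and "o"
```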
One can extract only the words from the results:

    term.(lookup(d, "wrold"))  # ["world"]
There is also a more convenient form:

    d["wrold"]  # ["world"]
Search arguments can be passed either to the lookup function or set globally with the set_options!(d::SymSpell; kwargs...) command.
    set_options!(d, include_unknown = true, verbosity = "closest")
    d["wrold"]  # ["wrold", "world"]

    # this is equivalent to
    term.(lookup(d, "wrold", include_unknown = true, verbosity = "closest"))
The following arguments are supported:
- include_unknown: whether to include the original word in the results if it meets the search criteria
- ignore_token: skip words in lookup that contain the given token string or regexp
- transfer_casing: when this option is set to true, results try to mimic the casing of the original word; for example, d["Wrold"] returns ["World"]
- max_edit_distance: maximum allowed distance for search. By default equals to the
- verbosity: selects the type of search result. Three levels of verbosity exist:
    - "top": only a single suggestion is returned, the one with the lowest distance and highest frequency
    - "closest": all words with the lowest distance are returned
    - "all": all words within given
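The verbosity levels can be pictured as a filter applied to a ranked candidate list. The sketch below is not the package's code: `Suggestion` is a hypothetical stand-in for SuggestItem, `select_suggestions` is an invented name, the frequencies are made up, and "all" is assumed to mean every candidate that already passed the distance limit.

```julia
# Hypothetical stand-in for the package's SuggestItem: term, distance, frequency.
struct Suggestion
    term::String
    distance::Int
    count::Int
end

# Rank by ascending distance, then descending frequency, and trim by verbosity.
function select_suggestions(candidates::Vector{Suggestion}, verbosity::AbstractString)
    isempty(candidates) && return Suggestion[]
    ranked = sort(candidates; by = s -> (s.distance, -s.count))
    if verbosity == "top"
        return ranked[1:1]
    elseif verbosity == "closest"
        best = ranked[1].distance
        return filter(s -> s.distance == best, ranked)
    elseif verbosity == "all"
        return ranked  # assumed: candidates were already limited by distance
    else
        throw(ArgumentError("unknown verbosity: $verbosity"))
    end
end

# Made-up candidates for the query "wrold".
candidates = [Suggestion("word", 2, 80), Suggestion("world", 1, 50), Suggestion("wold", 1, 5)]

top = [s.term for s in select_suggestions(candidates, "top")]          # ["world"]
closest = [s.term for s in select_suggestions(candidates, "closest")]  # ["world", "wold"]
```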