The package is registered in the General registry and so can be installed at the REPL with `] add StringDistances`.
The available distances are:
- Edit Distances
- Q-gram distances compare the set of all substrings of length `q` in each string.
- Distance "modifiers" that can be applied to any distance:
  - `Winkler` diminishes the distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score, but it can be defined for any string distance.
  - `Partial` returns the minimum distance between the shorter string and substrings of the longer string.
  - `TokenSort` adjusts for differences in word order by reordering words alphabetically.
  - `TokenSet` adjusts for differences in word order and word count by comparing the intersection of two strings with each string.
  - `TokenMax` combines scores using the base distance and modifiers such as `TokenSet`, with penalty terms depending on string lengths. This is a good distance for matching strings composed of multiple words, such as addresses. `TokenMax(Levenshtein())` corresponds to the distance defined in fuzzywuzzy.
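To make two of the ideas above concrete, here is a minimal sketch in plain Julia of how q-grams can be extracted and how words can be reordered alphabetically before comparison. The helper names `qgrams` and `token_sort` are hypothetical and used only for illustration; this is not the package's implementation.

```julia
# Collect the set of all substrings of length q (q-grams) in a string.
# Characters are collected first so that multi-byte characters are handled.
function qgrams(s::AbstractString, q::Integer)
    chars = collect(s)
    return Set(String(chars[i:i+q-1]) for i in 1:(length(chars) - q + 1))
end

# Reorder the whitespace-separated words of a string alphabetically.
token_sort(s::AbstractString) = join(sort(split(s)), " ")

# The two strings share the 2-grams "ma" and "ar".
shared = intersect(qgrams("martha", 2), qgrams("marhta", 2))

# After sorting, strings that differ only in word order become equal.
same_words = token_sort("york new city") == token_sort("city new york")
```

A q-gram distance would then compare the two sets (or their counts), and a token-sorting modifier would pass the reordered strings to the base distance.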
You can always compute a certain distance between two strings using the following syntax:
    evaluate(dist, s1, s2)
    dist(s1, s2)
For instance, with the Levenshtein distance:
    evaluate(Levenshtein(), "martha", "marhta")
    Levenshtein()("martha", "marhta")
You can also compute a distance between two iterators:
    evaluate(Levenshtein(), [1, 5, 6], [1, 6, 5]) #> 2
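To illustrate why the same distance applies to both strings and arbitrary iterators, the following is a minimal sketch of the textbook Levenshtein dynamic program written in plain Julia. The function name `levenshtein_sketch` is hypothetical, and the package's own implementation may differ (for instance, it may stop early or skip common prefixes).

```julia
# Levenshtein distance between two sequences, using a single-row dynamic program.
# Both inputs are collected into vectors, so strings and other iterables work alike.
function levenshtein_sketch(a, b)
    x, y = collect(a), collect(b)
    m, n = length(x), length(y)
    # row[j + 1] holds the distance between the current prefix of x and y[1:j]
    row = collect(0:n)
    for i in 1:m
        diag = row[1]          # distance between x[1:i-1] and the empty prefix of y
        row[1] = i             # distance between x[1:i] and the empty prefix of y
        for j in 1:n
            above = row[j + 1]                 # distance for x[1:i-1] and y[1:j]
            cost = x[i] == y[j] ? 0 : 1
            row[j + 1] = min(above + 1,        # deletion
                             row[j] + 1,       # insertion
                             diag + cost)      # substitution or match
            diag = above
        end
    end
    return row[n + 1]
end

levenshtein_sketch("martha", "marhta")      # 2
levenshtein_sketch([1, 5, 6], [1, 6, 5])    # 2
```

Only element equality (`==`) is required of the inputs, which is why integers work as well as characters.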
`compare` is defined as 1 minus the normalized distance between two strings. It always returns a `Float64` between 0 and 1: a value of 0 means completely different and a value of 1 means completely similar.
    evaluate(Levenshtein(), "martha", "martha") #> 0
    compare("martha", "martha", Levenshtein()) #> 1.0
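The relation between a raw distance and a similarity score can be sketched as follows. This assumes an edit distance normalized by the length of the longer input, which is a common convention; other distances may be normalized differently, and the helper name `similarity_sketch` is hypothetical.

```julia
# Turn a raw edit distance d into a similarity score in [0, 1].
# Assumption: the distance is normalized by the length of the longer input.
function similarity_sketch(d::Real, len1::Integer, len2::Integer)
    longest = max(len1, len2)
    longest == 0 && return 1.0     # two empty inputs are treated as identical
    return 1.0 - d / longest
end

similarity_sketch(0, 6, 6)   # "martha" vs "martha", distance 0: 1.0
similarity_sketch(2, 6, 6)   # "martha" vs "marhta", Levenshtein distance 2: about 0.67
```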
`findmax` returns the value and index of the element in `itr` with the highest similarity score with `s`. Its syntax is:
    findmax(s, itr, dist::StringDistance; min_score = 0.0)
`findall` returns the indices of all elements in `itr` with a similarity score with `s` higher than a minimum value (which defaults to 0.8). Its syntax is:
    findall(s, itr, dist::StringDistance; min_score = 0.8)
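The behavior of these two functions can be sketched in plain Julia as a linear scan over the candidates. Everything here is illustrative: `best_match`, `all_matches`, and the toy similarity `char_overlap` are hypothetical names, returning `(nothing, nothing)` when no candidate reaches the threshold is an assumption, and so is the use of `>=` for the threshold test.

```julia
# Hypothetical similarity in [0, 1]: share of distinct characters two strings have in common.
function char_overlap(a::AbstractString, b::AbstractString)
    sa, sb = Set(a), Set(b)
    isempty(sa) && isempty(sb) && return 1.0
    return length(intersect(sa, sb)) / length(union(sa, sb))
end

# Element of `items` most similar to `s` and its index, or (nothing, nothing)
# when no element reaches `min_score` (assumed behavior). Ties keep the first element.
function best_match(s, items, sim; min_score = 0.0)
    best_score, best_index = -Inf, 0
    for (i, item) in enumerate(items)
        score = sim(s, item)
        if score >= min_score && score > best_score
            best_score, best_index = score, i
        end
    end
    best_index == 0 && return (nothing, nothing)
    return (items[best_index], best_index)
end

# Indices of all elements of `items` whose similarity with `s` is at least `min_score`.
all_matches(s, items, sim; min_score = 0.8) =
    [i for (i, item) in enumerate(items) if sim(s, item) >= min_score]

names = ["martha", "marhta", "bob"]
best_match("martha", names, char_overlap)     # ("martha", 1)
all_matches("martha", names, char_overlap)    # [1, 2]
```

Passing `min_score` down to the distance itself lets an implementation abandon a candidate as soon as it cannot reach the threshold, which is one way such searches can be sped up.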
`findmax` and `findall` are particularly optimized for `DamerauLevenshtein` distances (as well as their modifications via distance modifiers).