StringDistances.jl

String Distances in Julia
Popularity
60 Stars
Updated Last
5 Months Ago
Started In
October 2015

Build Status Coverage Status

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add StringDistances.

Supported Distances

The available distances are:

  • Edit Distances
  • Q-gram distances compare the set of all substrings of length q in each string.
  • Distance "modifiers" that can be applied to any distance:
    • Winkler diminishes the distance of strings with common prefixes. The Winkler adjustment was originally defined for the Jaro similarity score but it can be defined for any string distance.
    • Partial returns the minimum distance between the shorter string and substrings of the longer string.
    • TokenSort adjusts for differences in word orders by reording words alphabetically.
    • TokenSet adjusts for differences in word orders and word numbers by comparing the intersection of two strings with each string.
    • TokenMax combines scores using the base distance, the Partial, TokenSort and TokenSet modifiers, with penalty terms depending on string lengths. This is a good distance to match strings composed of multiple words, like addresses. TokenMax(Levenshtein()) corresponds to the distance defined in fuzzywuzzy

Basic Use

Evaluate

You can always compute a certain distance between two strings using the following syntax:

evaluate(dist, s1, s2)
dist(s1, s2)

For instance, with the Levenshtein distance,

evaluate(Levenshtein(), "martha", "marhta")
Levenshtein()("martha", "marhta")

You can also compute a distance between two iterators:

evaluate(Levenshtein(), [1, 5, 6], [1, 6, 5])
2

Compare

The function compare is defined as 1 minus the normalized distance between two strings. It always returns a Float64 between 0 and 1: a value of 0 means completely different and a value of 1 means completely similar.

evaluate(Levenshtein(),  "martha", "martha")
#> 0
compare("martha", "martha", Levenshtein())
#> 1.0

Find

  • findmax returns the value and index of the element in itr with the highest similarity score with s. Its syntax is:

     findmax(s, itr, dist::StringDistance; min_score = 0.0)
  • findall returns the indices of all elements in itr with a similarity score with s higher than a minimum value (default to 0.8). Its syntax is:

     findall(s, itr, dist::StringDistance; min_score = 0.8)

The functions findmax and findall are particularly optimized for Levenshtein and DamerauLevenshtein distances (as well as their modifications via Partial, TokenSort, TokenSet, or TokenMax).

References

Required Packages