The FuzzyEmbeddingMatch module is designed to facilitate fuzzy string matching by leveraging embeddings. It primarily consists of structures and functions to embed strings, calculate similarities between these embeddings, and find the best or all matches within a set of candidates. Key components include EmbeddedString, MatchCandidate, bestmatch, and allmatches.
This module uses memoization for embedding strings to reduce API calls.
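As a rough illustration of the memoization idea, the sketch below caches embeddings in a Dict keyed by the input string. It is not the package's actual implementation: fake_embed, cached_embed, EMBED_CACHE, and CALL_COUNT are hypothetical names, and the toy letter-frequency "embedding" merely stands in for a real API call.

```julia
# Hypothetical stand-in for a remote embedding call; counts how often it runs.
const CALL_COUNT = Ref(0)

function fake_embed(s::AbstractString)
    CALL_COUNT[] += 1
    # Toy "embedding": letter frequencies for a-z (not a real model output)
    counts = zeros(Float64, 26)
    for c in lowercase(s)
        if 'a' <= c <= 'z'
            counts[c - 'a' + 1] += 1
        end
    end
    return counts
end

# Cache keyed by the input string (assumed layout, for illustration only)
const EMBED_CACHE = Dict{String, Vector{Float64}}()

# Return the cached embedding if present; otherwise compute and store it.
cached_embed(s::AbstractString) = get!(() -> fake_embed(s), EMBED_CACHE, String(s))

cached_embed("Example string")
cached_embed("Example string")  # second call is served from the cache
println(CALL_COUNT[])           # prints 1: fake_embed ran only once
```

With a cache like this, repeated lookups of the same string cost one embedding call rather than one per lookup.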
You can install this package with
import Pkg
Pkg.add("FuzzyEmbeddingMatch")
or, from the REPL:
] add FuzzyEmbeddingMatch
To begin, make sure that your environment variable OPENAI_API_KEY is set. If you do not have the environment variable set at the system level, you can add it with
ENV["OPENAI_API_KEY"] = "........" # Replace this with your key
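If you want to confirm the key is visible to Julia before making any embedding calls, a small check along these lines may help. The helper name has_api_key is hypothetical and not part of the package.

```julia
# Hypothetical helper: true when the key exists and is non-empty.
# Accepts any dictionary-like object so it can be tried without touching ENV.
has_api_key(env = ENV) = !isempty(get(env, "OPENAI_API_KEY", ""))

if !has_api_key()
    @warn "OPENAI_API_KEY is not set; embedding calls are likely to fail."
end
```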
EmbeddedString: Represents a string with its associated embedding.
MatchCandidate: A candidate for matching, containing two strings, their embeddings, and a similarity score.
embed: Embeds a string using aiembed from PromptingTools.jl.
corpus: Generates a corpus of embedded strings.
getembeddings: Returns embeddings for a vector of strings.
cosinesimilarity: Calculates cosine similarity between two embeddings.
allmatches: Finds all matches for a given string in a list of candidates.
bestmatch: Finds the best match for a given string in a list of candidates.
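For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below shows the calculation on plain vectors; cosine_sim is a hypothetical name, and the package's own cosinesimilarity may differ in signature and details.

```julia
using LinearAlgebra  # standard library: dot, norm

# Cosine similarity: dot product divided by the product of the vector norms.
function cosine_sim(a::AbstractVector{<:Real}, b::AbstractVector{<:Real})
    length(a) == length(b) || throw(DimensionMismatch("embeddings must have equal length"))
    return dot(a, b) / (norm(a) * norm(b))
end

println(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
println(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

Scores range from -1 to 1, with values near 1 meaning the embeddings point in nearly the same direction. On real embeddings, floating-point rounding can leave an identical pair slightly below 1.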
using FuzzyEmbeddingMatch

# Example strings and candidates
thing = "Example string"
candidates = ["Sample text", "Example string", "Another example"]
# Finding all matches
matches = allmatches(thing, candidates)
# Output the matches
for match in matches
    println(match)
end
Output:
MatchCandidate("Example string", "Sample text", 0.9022957888579418)
MatchCandidate("Example string", "Example string", 0.9999999999999998)
MatchCandidate("Example string", "Another example", 0.8847227646389876)
# Example string and candidates
thing = "Example string"
candidates = ["Sample text", "Example string", "Another example"]
# Finding the best match
best_match = bestmatch(thing, candidates)
# Output the best match
println("Best match: ", best_match)
Output:
Best match: MatchCandidate("Example string", "Example string", 0.9999999999999998)
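Conceptually, the best match is the candidate with the highest similarity score among all matches. The sketch below illustrates that relationship using the scores from the allmatches output above. ToyMatch and its field names are hypothetical stand-ins, since the actual fields of MatchCandidate are not shown here, and the package may compute bestmatch differently.

```julia
# Hypothetical stand-in for MatchCandidate; the field names are assumptions.
struct ToyMatch
    thing::String
    candidate::String
    score::Float64
end

# Scores copied from the allmatches output shown earlier
matches = [
    ToyMatch("Example string", "Sample text", 0.9022957888579418),
    ToyMatch("Example string", "Example string", 0.9999999999999998),
    ToyMatch("Example string", "Another example", 0.8847227646389876),
]

# Best match: the entry with the highest similarity score
best = matches[argmax([m.score for m in matches])]
println(best.candidate)

# Ranked list, most similar first
ranked = sort(matches; by = m -> m.score, rev = true)
println([m.candidate for m in ranked])
```

Sorting by score is also a simple way to pick the top few candidates instead of a single winner.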