Given a collection of free-text documents and a set of keywords, we want to know: which keywords have fuzzy matches in which documents?
There are two types of errors:
- false positive: we flag a match when it does not exist
- false negative: we miss a match that should exist
Here, we primarily wish to minimize false negatives (ideally without incurring too many false positives either!). That is, we wish to err on the side of having too many matches: in some cases where we have many haystacks and few needles, we'd rather have to manually filter out some pins (false positives) than miss a needle altogether.
In particular, we want to be robust to the following types of mistakes in the documents, which could cause us to miss a match:
- word conjunctions (a missing space separating two words)
- hyphenated words conjoined without a hyphen
- hyphenated words/phrase joined with spaces instead of hyphens
- erroneously redacted terms (e.g. "Darwin's Frog" being redacted as suspected PHI, since "Darwin" could be flagged as a name, whereas in this case it is part of the name of an animal)
We will also pay attention to one source of false positives:
- Search terms that are subsets of other general terms (not the intended term class). E.g. "ant" could show up as part of "anteater", but that would not be a correct match if we're looking for the little insects.
- ...(1) by using fuzzy matching via an edit distance (i.e. a string matches another substring if it is the same up to `n` edits, where we have to choose `n`). Edit distances are supplied by StringDistances.jl.
- ...(2) by not requiring matches to occur between word boundaries but just between substrings (i.e. if the query is "cobra" and "kingcobra" shows up in the document, then we would have a perfect match with the substring "cobra" found in the document)
- ...(3) and (4) by first replacing hyphens with spaces, and then augmenting our search terms by taking any terms with spaces or hyphens and generating a list of terms with each possible choice of (spaces, no spaces). For example, the query "crab-eating macaque" is augmented to the query ("crab eating macaque" OR "crabeating macaque" OR "crab eatingmacaque" OR "crabeatingmacaque").
- ...(5) Here, we allow a global list of replacements (`KeywordSearch.AUTOMATIC_REPLACEMENTS`) to manually undo erroneous redactions.
- ...(6) Here, our solution to (2) has gotten us into trouble. What we can do is add spaces to our term: e.g., instead of searching for "ant" we can search for " ant " and require an exact match. This is accomplished by e.g. `word_boundary(Query("ant"))` in the language of KeywordSearch.jl.
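The edit distance behind (1) is worth seeing in full. KeywordSearch.jl relies on StringDistances.jl's optimized implementation; the `damerau_levenshtein` helper below is a self-contained sketch of the textbook recurrence (illustrative only, not package API), showing exactly which operations are counted:

```julia
# Illustrative (restricted) Damerau-Levenshtein distance: counts insertions,
# deletions, substitutions, and adjacent transpositions. This is a sketch for
# exposition; KeywordSearch.jl uses StringDistances.jl instead.
function damerau_levenshtein(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)  # work on character vectors
    m, n = length(s), length(t)
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] .= 0:m                 # deleting i characters costs i
    d[1, :] .= 0:n                 # inserting j characters costs j
    for i in 1:m, j in 1:n
        cost = s[i] == t[j] ? 0 : 1
        d[i + 1, j + 1] = min(d[i, j + 1] + 1,  # deletion
                              d[i + 1, j] + 1,  # insertion
                              d[i, j] + cost)   # substitution (or match)
        if i > 1 && j > 1 && s[i] == t[j - 1] && s[i - 1] == t[j]
            # adjacent transposition, e.g. "teh" -> "the"
            d[i + 1, j + 1] = min(d[i + 1, j + 1], d[i - 1, j - 1] + 1)
        end
    end
    return d[m + 1, n + 1]
end

damerau_levenshtein("crabeating macaque", "crabeating macacue")  # returns 1
```

A single substitution ("q" for "c") separates the two strings above, so they match under any edit budget `n >= 1`.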
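The augmentation in (3) and (4) can be pictured as a small recursion over the word gaps left after hyphen normalization. The real entry point is `augment`; the `space_variants` helper below is a hypothetical re-derivation for illustration:

```julia
# Sketch of the augmentation step from (3) and (4): normalize hyphens to
# spaces, then emit every (space, no space) choice at each word gap.
# `space_variants` is illustrative, not KeywordSearch.jl API.
function space_variants(term::AbstractString)
    words = split(replace(term, '-' => ' '))
    isempty(words) && return String[]
    variants = [String(words[1])]
    for w in words[2:end]
        variants = vcat([v * " " * w for v in variants],  # keep the space
                        [v * w for v in variants])        # drop the space
    end
    return variants
end

space_variants("crab-eating macaque")
# 4-element Vector{String}:
#  "crab eating macaque"
#  "crabeating macaque"
#  "crab eatingmacaque"
#  "crabeatingmacaque"
```

With `k` word gaps this produces `2^k` variants, which is why augmentation is applied per search term rather than to the (much longer) documents.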
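Step (5) amounts to running a fixed replacement table over the document text before matching. A minimal sketch follows; the replacement pair shown is made up for illustration, not the actual contents of `KeywordSearch.AUTOMATIC_REPLACEMENTS`:

```julia
# Sketch of undoing an erroneous redaction with a global replacement list.
# The pair below is illustrative; in KeywordSearch.jl the global list lives
# in KeywordSearch.AUTOMATIC_REPLACEMENTS.
replacements = ["[NAME]'s Frog" => "Darwin's Frog"]

undo_redactions(text) = foldl((t, r) -> replace(t, r), replacements; init = text)

undo_redactions("We spotted a [NAME]'s Frog by the stream.")
# "We spotted a Darwin's Frog by the stream."
```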
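The trade-off between (2) and (6) can be seen with plain substring checks: bare containment happily matches inside longer words, while space-padding (conceptually what `word_boundary(Query("ant"))` does) restores whole-word semantics. The `matches_anywhere`/`matches_word` helpers below are hypothetical sketches, not package API:

```julia
# (2): bare substring matching finds the query inside longer tokens...
matches_anywhere(term, text) = occursin(term, text)

# ...(6): padding both sides with spaces (after collapsing whitespace)
# approximates whole-word matching. This sketch ignores punctuation.
function matches_word(term, text)
    padded = " " * join(split(text), " ") * " "
    return occursin(" " * term * " ", padded)
end

matches_anywhere("cobra", "The kingcobra fled")  # true: substring of "kingcobra"
matches_word("ant", "The anteater slept")        # false: "ant" is not a whole word
matches_word("ant", "An ant crossed the path")   # true
```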
```julia
julia> using KeywordSearch, UUIDs

julia> document = Document("""
       The crabeating macacue ate a crab.
       """, (; document_uuid = uuid4()))
Document starting with "The crabeating macacue…". Metadata: (document_uuid = UUID("a703302c-eeda-46ba-8755-940a7db86b63"),)

julia> query = augment(FuzzyQuery("crab-eating macaque"))
Or
├─ FuzzyQuery("crab eating macaque", DamerauLevenshtein(), 2)
├─ FuzzyQuery("crabeating macaque", DamerauLevenshtein(), 2)
├─ FuzzyQuery("crab eatingmacaque", DamerauLevenshtein(), 2)
└─ FuzzyQuery("crabeatingmacaque", DamerauLevenshtein(), 2)

julia> m = match(query, document)
QueryMatch with distance 1 at indices 5:22.

julia> explain(m)
The query "crabeating macaque" matched the text "The crabeating macacue ate a crab \n " with distance 1.
```
Currently, only the `StringDistances.DamerauLevenshtein` distance measure is supported.