Phonetics.jl

A phonetics encoding library for Julia
Author Betawolf
Popularity
10 Stars
Updated Last
2 Years Ago
Started In
February 2016

Phonetics

Build Status

This Julia library implements some widely-used phonetic coding schemes, including:

  • Soundex
  • Fuzzy Soundex
  • Phonex
  • Phonix
  • The New York State Identification and Intelligence System (NYSIIS)
  • The Census Modified Statistics Canada procedure
  • The Match Rating Approach
  • Lein
  • Caverphone
  • Roger Root
  • Metaphone
  • Double Metaphone

Phonetic coding schemes are used to transform strings, particularly names, into representations which reflect how they might be pronounced or perceived to have been pronounced. In essence, they map words to codes which should be resilient to spelling variation.

For example:

using Phonetics

soundex("Smythe")
#"s530"

soundex("Smith")
#"s530"

soundex("Gail")
#"g400"

soundex("Gayle")
#"g400"

As well as the encoding schemes themselves, this package includes some comparison functionality for phonetic codes. Note that this is not edit-distance-style string comparison, but a measure of the phonetic similarity of the two strings according to a particular coding system (except for editex, which is both).

using Phonetics

#Fuzzy Soundex method of comparing codes
code_similarity("Kristina", "Christina")
#1.0

code_similarity("Kristina", "Kristian")
#0.6

#Can use different coding methods to make the comparison, with different results
code_similarity("Kristian", "Kristina", phonix)
#1.0

#The match rating approach quantifies similarity as an integer.
match_rating("Smith", "Smythe")
#5

match_rating("Smith", "Bobby")
#2

#This is compared automatically to a threshold value if a binary judgement is required.
meets_match_rating("Smith", "Smythe")
#true

meets_match_rating("Smith", "Bobby")
#false

editex("Hello", "Hullo")
#1

editex("Hellophant", "Hullo")
#10

There is a function code_match, which performs phonetic matching on an array of strings (such as might be the collect(keys()) to a Dict), so that sound-alike options are selected and returned.

#some strings
helpstrings = ["Halp", "Elf", "Hulk", "Heelp","Half", "Abba", "Any"]

code_match("Help", helpstrings)
#["Halp", "Heelp", "Half"] <- matches under the default coding scheme, fuzzy_soundex.

code_match("Help", helpstrings, double_metaphone)
#["Halp", "Heelp"] <- More reasonable matches from the better algorithm.

#You can also set a permissive value to get more matches within a system.
code_match("Help", helpstrings, fuzzy_soundex, 0.5)
#["Halp", "Elf", "Hulk", "Heelp", "Half"]

code_match("Help", helpstrings, fuzzy_soundex, 0.1)
#["Halp", "Elf", "Hulk", "Heelp","Half", "Abba", "Any"] <- lower values are more permissive

There is also a clustering function code_cluster which groups strings which are similar according to their phonetic codes.

strings = ["Sing", "Sink", "Song", "Sunk", "Sinking", "Singing", "Single"]

code_cluster(strings)
#3-element Array{Array{T,1},1}:
# ["Sing","Sink","Song","Sunk","Sinking","Singing"]
# ["Sinking","Singing"]
# ["Single"]
# ^ - note that clusters can be fairly wide, and that by default items appear in multiple clusters
# The number of clusters is organically suited to the diversity of the input set.

#clusters can be made to use only exact matches by setting both thresholds to 1.
code_cluster(strings, phonix, 1, 1)
#3-element Array{Array{T,1},1}:
# ["Sing","Sink","Song","Sunk"]
# ["Sinking","Singing"]
# ["Single"]

Finally, there are some functions for measuring more general phonetic qualities of strings.

#count the number of syllables in a word
syllable_count("Syllable")
# 3

#also works with sentences (and other languages, if rules are defined).
syllable_count("Mary had a little lamb")
# 7

#estimate how long it would take to say something
spoken_length("Mary had a little lamb")
#2.52064687529
#in seconds

#also in some other languages
spoken_length("Mary hatte ein kleines lamm", "de")
#2.10911496886