-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for fuzzy Soundex #12
Comments
Hi @howardjp, I have an implementation of this fuzzy soundex I'd be happy to go into this package if you want it. I've translated it from the following: https://yomguithereal.github.io/talisman/phonetics/ Currently uses the I've also read the paper and unless I'm missing something, it aligns with it. On another note. Not sure if you'd see scope for them to be in #' @aliases fuzzy_soundex
#'
#'
#' @title Implementation of the "Fuzzy Soundex" algorithm.
#'
#'
#' @description Implementation of the "Fuzzy Soundex" algorithm. Function taking a a vector of names and computing its Fuzzy Soundex code.
#'
#'
#' @param word String or vector of strings to encode.
#'
#'
#' @return Returns a vector of encoded strings.
#'
#'
#' @references Article: Holmes, David and M. Catherine McCabe. "Improving Precision and Recall for Soundex Retrieval."
#' \url{http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf} \cr
#' Code based on: \url{https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/fuzzy-soundex.js}
#' Licence: \url{https://github.com/Yomguithereal/talisman/blob/master/LICENSE.txt}.
#'
#'
#' @author Kyle Haynes, \email{kyle@@kylehaynes.com.au}.
#'
#'
#' @examples
#' fuzzy_soundex(c("Holmes", "Hollmes", "David", "Daved", "Catherine", "Kathryn"))
#'
#'
#' @export
fuzzy_soundex <- function(word){
# ---- Define constants ----
# The code has been structured in such a way as to follow the following implementation of the algorithm:
# https://github.com/Yomguithereal/talisman/blob/master/src/phonetics/fuzzy-soundex.js
# Define 'translation'.
translation <- c("ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"0193017~07745501769301~7~9")
# Define sets.
set1 <- c("CS", "CZ", "TS", "TZ")
set2 <- c("HR", "WR")
set3 <- c("KN", "NG")
set4 <- c("^H|^W|^Y")
# Define 'rules'.
rules_from <- c("CA", "CC", "CK", "CE", "CHL", "CL", "CHR", "CR", "CI", "CO", "CU", "CY", "DG", "GH", "MAC", "MC", "NST", "PF", "PH", "SCH", "TIO", "TIA", "TCH")
rules_to <- c("KA", "KK", "KK", "SE", "KL", "KL", "KR", "KR", "SI", "KO", "KU", "SY", "GG", "HH", "MK", "MK", "NSS", "FF", "FF", "SSS", "SIO", "SIO", "CHH")
# ---- Checks ----
if(!is.vector(word) || is.na(word)){
stop("Input must be a vector.")
}
# ---- Code ----
# Coerce NA's to "".
if(any(is.na(word))){
na_logical <- is.na(word)
word[is.na(word)] <- ""
}
## // Deburring the string & dropping any non-alphabetical character
## name = deburr(name)
## .toUpperCase()
## .replace(/[^A-Z]/g, '');
word <- iconv(word, to = 'ASCII//TRANSLIT')
word <- toupper(word)
word <- gsub("[^A-Z]", "", word, perl = TRUE)
## if (SET1.has(firstTwoLetters))
## name = 'SS' + rest;
## else if (firstTwoLetters === 'GN')
## name = 'NN' + rest;
## else if (SET2.has(firstTwoLetters))
## name = 'RR' + rest;
## else if (firstTwoLetters === 'HW')
## name = 'WW' + rest;
## else if (SET3.has(firstTwoLetters))
## name = 'NN' + rest;
word <- gsub(paste0("^", paste0(set1, collapse = "|^")), "SS", word, perl = TRUE)
word <- gsub("^GN", "NN", word, perl = TRUE)
word <- gsub(paste0("^", paste0(set2, collapse = "|^")), "RR", word, perl = TRUE)
word <- gsub("^HW", "WW", word, perl = TRUE)
word <- gsub(paste0("^", paste0(set3, collapse = "|^")), "NN", word, perl = TRUE)
## // Applying some substitutions for endings
## const lastTwoLetters = name.slice(-2),
## initial = name.slice(0, -2);
## if (lastTwoLetters === 'CH')
## name = initial + 'KK';
## else if (lastTwoLetters === 'NT')
## name = initial + 'TT';
## else if (lastTwoLetters === 'RT')
## name = initial + 'RR';
## else if (name.slice(-3) === 'RDT')
## name = name.slice(0, -3) + 'RR';
word <- gsub("(.{1,})CH$", "\\1KK", word, perl = TRUE)
word <- gsub("(.{1,})NT$", "\\1TT", word, perl = TRUE)
word <- gsub("(.{1,})RT$", "\\1RR", word, perl = TRUE)
word <- gsub("(.{1,})RDT$", "\\1RR", word, perl = TRUE)
## // Applying the rules
## for (let i = 0, l = RULES.length; i < l; i++)
## name = name.replace(...RULES[i]);
# As gsub is already vectorised, a simple for loop will suffice.
# for(i in 1:length(rules)){
# word <- gsub(rules[[i]][1], rules[[i]][2], word, perl = TRUE)
# }
word <- stringi::stri_replace_all_fixed(word, rules_from, rules_to, vectorize_all = FALSE)
## // Caching the first letter
## const firstLetter = name[0];
first_character <- substring(word, 1, 1)
## // Translating
## let code = '';
## for (let i = 0, l = name.length; i < l; i++)
## code += TRANSLATION[name[i]] || name[i];
word <- chartr(translation[1], translation[2] , word)
## // Removing hyphens
## code = code.replace(/-/g, '');
word <- gsub("~", "", word, fixed = TRUE)
## // Squeezing the code
## code = squeeze(code);
# Code can be found here: https://github.com/Yomguithereal/talisman/blob/master/src/helpers/index.js
## /**
## * Function squeezing the given sequence by dropping consecutive duplicates.
## *
## * Note: the name was actually chosen to mimic Ruby's naming since I did not
## * find any equivalent in other standard libraries.
## *
## * @param {mixed} target - The sequence to squeeze.
## * @return {array} - The resulting sequence.
## */
## export function squeeze(target) {
## const isString = typeof target === 'string',
## sequence = seq(target),
## squeezed = [sequence[0]];
##
## for (let i = 1, l = sequence.length; i < l; i++) {
## if (sequence[i] !== sequence[i - 1])
## squeezed.push(sequence[i]);
## }
##
## return isString ? squeezed.join('') : squeezed;
## }
word <- gsub("([0-9])\\1+", "\\1", word, perl = TRUE)
## // Dealing with some initials
## if (SET4.has(code[0]))
## code = firstLetter + code;
## else
## code = firstLetter + code.slice(1);
word[grepl(set4, word)] <- paste0(first_character[grepl(set4, word)], word[grepl(set4, word)])
word[!grepl(set4, word)] <- paste0(first_character[!grepl(set4, word)], substring(word[!grepl(set4, word)], 2))
## // Dropping vowels
## code = code.replace(/0/g, '');
word <- gsub("0", "", word, perl = TRUE)
# Finally, if any NA's were passed, set them back to NA.
if(exists("na_logical")){
word[na_logical] <- NA
}
# Return hashed word(s).
return(word)
} |
@KyleHaynes, this seems pretty reasonable. The big question is, do you have test cases? |
@howardjp, good point, let me get back to you (just drowning with work atm). |
If it's any help, here's a set of testcases I use for fuzzy soundex, formatted like your CSV files. |
Awesome sauce!
I am out of town right now, and will look at this later this week or early
next!
…On Sun, Aug 4, 2019 at 5:51 PM Chris Little ***@***.***> wrote:
If it's any help, here's a set of testcases I use for fuzzy soundex,
formatted like your CSV files.
fuzzy-soundex.csv.txt
<https://github.com/howardjp/phonics/files/3465362/fuzzy-soundex.csv.txt>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABGG2T4LJDNDHRKDUM7LBTQC5FNTANCNFSM4CZB6N6A>
.
|
@chrislit - thanks for this I've run against my function and from the 27 tests strings I get two mismatches (same word) It's an issue with my function and not the test data. I'll look into this further require(data.table)
require(stringi)
csv <- "name,hash
Kristen,K6935
Krissy,K6900
Christen,K6935
peter,P3600
pete,P3000
pedro,P3600
stephen,S3150
steve,S3100
smith,S5300
smythe,S5300
gail,G4000
gayle,G4000
christine,K6935
christina,K6935
kristina,K6935
Wight,W3000
Hardt,H6000
Knight,N3000
Czech,S7000
Tsech,S7000
gnomic,N5900
Wright,R3000
Hrothgar,R3760
Hwaet,W3000
Grant,G6300
Hart,H6000
Hardt,H6000"
dt <- fread(csv)
dt[, "hash_2" := fuzzy_soundex(name)]
# Currently my function doesn't pad trailing 0's. Add this:
dt[, "hash_2" := stri_pad_right(hash_2, 5, 0)]
all.equal(dt$hash, dt$hash_2)
dt[hash != hash_2]
# name hash hash_2
# 1: Hardt H6000 R0000
# 2: Hardt H6000 R0000 |
Fixed code (Fixed in my original comment). Re-running previous test now returns no differences. Re padding out strings, could make this argument driven like the I (try) to conform to snake_case, happy to adjust the name of the function to be whatever (to fit in with the rest of the phonics library) As for the dependency of Also, moving forward, I'll create a fork and make changes there. |
Your approach to maxCodeLen is fine...
As for stringi, I am not sure. I mean, I don't mind the dependency. You
gotta do what you gotta do, amirite? As a user, I can't imagine it is
really objectionable.
…On Tue, Aug 6, 2019 at 2:53 AM Kyle Haynes ***@***.***> wrote:
Fixed code (Fixed in my original comment).
Re-running previous test now returns no differences.
Re padding out strings, could make this argument driven like the
phonics::soundex argument: maxCodeLen = 4L. Could set the length,
otherwise, if NULL no padding at all?
I (try) to conform to snake_case, happy to adjust the name of the function
to be whatever (to fit in with the rest of the phonics library)
As for the dependency of stringi let me know if you want adjust to base R.
Also, moving forward, I'll create a fork and make changes there.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#12>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AABGG2WRMO4GCZN5YI64QODQDENX5ANCNFSM4CZB6N6A>
.
|
More information on fuzzy Soundex available from http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf
The text was updated successfully, but these errors were encountered: