Are Austria and Australia really two different countries?Okay, I may have gotten carried away with that last one, but you get the idea.
String distance measuresWhat we want is some function that measures how similar two strings are, but is robust to small changes.
This problem is as common as it sounds: scientists have been coming up with solutions to it for a long while.
Jaccard Distance: a first approachOne of the most intuitive ones is the Jaccard distance.
It can be generalized to a distance measure for any two sets.
It is measured with the following formula:That is, how many elements are on either set, but not shared by both, divided by the total count of distinct elements.
For instance, given the strings “Albert” and “Alberto”, it will report a similarity of 85.
7%, since they share 6 letters out of a total of 7.
However, this is not a measure specifically tailored for strings.
It will fail in many use-cases, since it doesn’t really take ordering into account.
For example two anagrams, like “rail safety” and “fairy tales”, will always have a 100% match, even if those strings are quite different.
Levenshtein DistanceInvented by the Russian Scientist Vladimir Levenshtein in the ’60s, this measure is a bit more intuitive: it counts how many substitutions are needed, given a string u, to transform it into v.
For this method, a substitution is defined as:Erasing a character.
Replacing a character with another one.
The minimum amount of these operations that need to be done to u in order to turn it into v, correspond to the Levenshtein distance between those two strings.
It can be obtained recursively with this formula:Where i and j are indexes to the last character of the substring we’ll be comparing.
The second term in the last expression is equal to 1 if those characters are different, and 0 if they’re the same.
This is the measure the FuzzyWuzzy library uses.
Using FuzzyWuzzy in PythonTo obtain the similarity ratio between two strings, all we have to do is this:from fuzzywuzzy import fuzzsimilarity = fuzz.
ratio("hello","world")You probably noticed I said ratio.
The ratio method will always return a number between 0 and 100 (yeah, I’d have preferred it to be between 0 and 1, or call it a percentage, but to each their own).
It can be shown that the Levenshtein distance is at most the length of the longest string: replace all characters in the shorter one with the first part of the longer one, and then add the remaining ones.
That’s how we can normalize the distance to return a ratio, so that the number won’t fluctuate enormously given inputs with different sizes.
This solves some of the previously mentioned problems:fuzz.
ratio("Albert Thompson", "Albert G.
ratio("The Lord of the Rings II: The Two Towers", "The Lord of the Rings 2: the 2 Towers") #88%Even if it may bring a few new ones:#88% for two different countriesfuzz.
ratio("Austria","Australia")#57% but it's the same countryfuzz.
ratio("Czechia","Czech Republic")Other FuzzyWuzzy methodsThe FuzzyWuzzy library provides us not only with the vanilla Levenshtein distance, but also with a few other methods we can make use of.
partial_ratioThe partial_ratio method calculates the FuzzyWuzzy ratio for all substrings of the longer string with the length of the shorter one, and then returns the highest match.
partial_ratio("abc","a") == min([fuzz.
ratio( char, "a") for char in "abc"])This has some interesting effects:fuzz.
partial_ratio("Thomas and His Friends", "Thomas") #100%fuzz.
partial_ratio("Batman vs Superman", "Batman") #100%Effectively, the partial_ratio method could be a fuzzy replacement to the contains string method, just as the regular ratio may replace the equals method.
However, it will fail for strings that are similar, but whose words appear in a different order.
Even a slight order change will break it.
#72% with basically the same ideafuzz.
partial_ratio("Peanut Butter and Jelly", "Jelly and Peanut Butter") #86% with a random (carefully selected) stringfuzz.
partial_ratio("Peanut Butter and Jelly", "Otter and Hell")token_sort_ratioThe Token Sort Ratio divides both strings into words, then joins those again alphanumerically, before calling the regular ratio on them.
partial_ratio("Batman vs Superman", "Superman vs Batman") #100%fuzz.
partial_ratio("a b c", "c b a") #100%token_set_ratioThe Token Set Ratio separates each string into words, turns both lists into sets (discarding repeated words) and then sorts those before doing the ratio.
This way not only do we rule out shared words, we also account for repetitions.
token_set_ratio("fun","fun fun fun") #100%fuzz.
token_set_ratio("Lord the Rings of", "Lord of the Rings") #100%ConclusionsThe FuzzyWuzzy library can be a very useful tool to have under your belt.
Both for customer’s names matching, or acting as a poor man’s word embedding, it can save you a lot of trouble or help with your Machine Learning model’s feature engineering.
However, since it requires preprocessing (like turning both strings to lowercase) and doesn’t take synonyms into account, it may not be the best solution for cases where actual NLP or Clustering methods may be needed.
I hope you’ve found this article helpful, and let me know if you find another use for FuzzyWuzzy in your job!Follow me on Twitter or Medium to stay up to date with more Python tutorials, tips and tricks.
If you found this article useful, please consider supporting my site by helping me pay for its hosting.
Your donation will be of great help.
Originally published at www.
tech on April 15, 2019.