Python Tutorial: A Name Lookup Table for Fuzzy Name Data Sets

Check out this helpful six minutes video (from betterExplained.

com) which freshes up the various concepts.

Permutations/Combinations explainedPython provides in its standard library itertools already the function permutations and combinations which we want to inspect a little bit further.

Let’s apply it on our name sample to understand its mechanism.

First the combinations functionAs well as the permutations function:As you can see within the permutation the order of a name component plays a role.

There is a tuple for ('peter','alfred') as well as one for ('alfred','peter').

Within the combination the order of a name component is irrelevant.

For us, the order plays not a role, 'peter escher' is treated as'escher peter' We anyway sort the name components before we apply the double methaphone algorithm.

Therefore we use the combination function.

Always think about “pin code” as a crib for “permutations”.

Any pin code lock mechanism is based on the number of permutations of a given set of “digits” out of 10 elements: “1234” doesn’t unlock the screen which is locked with “4321”.

ATM — Photo by Mirza Babic on UnsplashBuild Up a Lookup Directory For Name CombinationsWe now build up a lookup directory class which allows us to store name component combinations and compare them to a person name which we got from the Twitter API.

Adding a person to the lookup directoryThe method add_person_to_lookup_table calculates all combinations of the provided name tuple, generates for each combination the associated double metaphone key and stores the key together with the unique person id in the lookup directory.

The sequence diagram of the method is shown below, let’s look at the various helper methods which are required.

Method: generate_combinationsAs a first step, we generate all combinations of name tuple.

The following method calculates all combinations of a name tuple.

For example for a 3 element tuple, the array built up consists ofthe 3 element tuple itselfall tuples of the combination with 2 out of 3 elementsall tuples of the combination with 1 out of 3 elementsFor example, the below 3 name tuple are resulting in the following list of combinations.

Method: add_combinations_to_directoryIn this method, we build up the directory by adding each tuple to our lookup directory.

Our lookup directory is made of a tuple of two dictionaries, which are used store the key, value pairs.

Our keys are the double metaphone tuple created from a name, and the value is the unique person identifier.

__lookup_dict = ({}, {})The method is shown belowOn code line 3 we create a normalized name string out of a name component tuple.

The method looks as followsThe result is a lower case string of the sorted concatenated tuple name elements:The string is used to generate the double metaphone tuple, which are our keys for the dictionaries.

On code line 5 and 10, we check if we have already an entry with the same key in the dictionary.

If not, we add the new key, value pair.

If there is already an entry, we check first if we have already the person_id stored.

If not we add the person_id to the value array.

Method: add_person_to_lookup_directoryFinally, we have the method ready, which allows us to add name tuples to the lookup directory.

As an example, we add the following three persons to our lookup table.

As we can see in the output, our entry for Peter with the key 'PTR' has an array of three-person identifiers.

Matching a Name in our Lookup DirectoryOur lookup directory is now ready and potentially filled with the person name data, which we retrieved via our Government API.

What is missing the match_name method which allows us to lookup a name which we retrieved via the Twitter API.

We are going to tackle this now.

Our first step is to collect all lookup directory entries – which match our searched name tuple — into a result list.

In code line 3 we generate all name component combinations via our existing method and the is iterating over all combinationsIn code line 5 and 6 we prepare the key tuple of a combination tuple for the lookup.

In code line 7 we check if our key exists.

As you can see we do the check only with the first entry of double metaphone tuple, which is stored in the first entry of the lookup directory.

We leave the full implementation based on the ranking feature of the double metaphone tuple as an exercise.

if metaphone_tuple[0] in self.

__lookup_dict[0]:Running the match_name method on our sample loaded lookup directory produces the following output:As we can see, we have two tuples which point to one person_id and one tuple peter which points to 3 persons (Obviously surnames is re-used by multiple persons).

The two tuples pointing to one person have the same id 'A123'.

That means our match identified exactly one person.

Would we have single person tuples in our result which are pointing to different persons would mean our match isn't unique.

So we enhance our method to do this uniqueness check, as well (code line 12–20):Do we have in our match list one or multiple tuples which always point to one single person?If yes we a found a unique record otherwise we return NoneLet’s verify the algorithm on our test sample (as explained above all the content is also available as an interactive Jupyter notebook)So we are ready to apply the new lookup strategy to our program.

Refactor Our Existing ClassesGovAPI class extensionWe enhance our abstract GovAPI class by integrating a NameLookupDirectory class instance.

class GovAPI(ABC): def __init__(self): self.

_members = [] self.

_nameLookupDirectory = NameLookupDirectory()We enhance add_person_record with the code sequence to build up our lookup directory (code line 22–29)We also add a new method for the match_name check, which will be called when we try to merge table records.

def match_name(self, name_tuple): return self.


match_name(name_tuple)The belowcalculate_name_matching method is not required anymore.

SocialMediaAnalyzer class refactoringIn this class we have to refactor the calculate_name_matching method as well, we now call for the matching, thematch_name method of the govAPI class instance (line 5).

If we have a match (7–14), we retrieve the full person record from the govAPI class.

Remember that thecalculate_name_matching method gets called via the Panda apply method on each row record and as a result, the row will be complemented with the additional new columns:panda_data = { 'ScreenName' : self.

__col_screen_name, 'Name': self.

__col_name, 'Description': self.

__col_description, "FollowersCount": self.

__col_followers_count, "FriendsCount": self.

__col_friends_count, "Party": self.

__col_party }df = DataFrame(panda_data, index=self.

__labels)df = df.


__calculate_name_matching, axis=1)When we execute our program once again, our table Twitter retrieved table looks like this:In col_match1 we list the govAPI unique id and in col_match2 our result tuple list, which was analyzed.



For the Twitter Name ‘Christian Levrat’ we found three entries in our lookup table:‘christian leverat ’which maps to 1 person id (1150)‘christian’ which maps to 5 person id‘leverat’ which maps to 1 person id (11509Our matching algorithm had a positive match because both entries are pointing to the same person id.

Assess our Algorithm for False PositivesLet’s check out algorithm for false positives.

What are false positives?A false positive is where you receive a positive result for a test, when you should have received a negative results.

It’s sometimes called a “false alarm” or “false positive error.

” It’s usually used in the medical field, but it can also apply to other areas (like software testing).

(Ref)False positive means there are problems with the accuracy of our algorithm.

False Positive 1When we browse through our Twitter table, we encounter the records below.

They are both pointing to the same person id.

When checking the govAPI table for the record, we get back the following record “74 Christoph Eymann”, which hasn’t a Twitter account and therefore cannot be found in the Twitter table.

What went wrong:“Christophe Darbellay” as well as “Christoph Mörgeli” were in the past politicians of the Swiss council and therefore not part of the govAPI list which we filtered for active members only.

“Christophe” as well as “Christoph” are converted to the same double metaphone string and are matching to the govAPI record 74 of “Christoph Leymann”.

Due to the fact that the govAPI list has only one person with a surname “Christoph” our algorithm returns a false positive for any person with a surname “Christoph(e)” and match it to “Christoph Leymann”.

Would the govAPI list two persons with the surname “Christoph” the match record would point to two persons id and wouldn’t unique anymore.

Our algorithm wouldn’t result in this case a false positive.

Solution:Well, we have to readjust our algorithm and make it more strict.

That means we change the condition of our name tuple generator that weonly allow name component tuples which have at least the last name as an element.

So we readjust our method and ask for the position of the last name in the tuple by the caller, when he adds a person to the directory.

def add_person_to_lookup_directory(self, person_id, name_tuple, last_name_pos): tuples = self.

generate_combinations(name_tuple) self.

add_combinations_to_directory(tuples, person_id, name_tuple[last_name_pos])In the add_combinations_to_directory method we now add only tuples which contain the last_name (line 3).

Rerunning our program results in the following matching statistics.

Matching via Lookup DirectoryThis is actually not better as in our first attempt (refer to the following tutorial), but we got a more robust matching algorithm.

It looks like that the used Twitter politician list isn’t really up-to-date in context active members of the federal assembly.

However, that’s something for the next lesson, where we want to finalize the topic of data matching and move on.

The source code you can find in the corresponding Github project, a list of all other articles here.

Happy coding then.


. More details

Leave a Reply