Cleaning Location Data with GeoPy
John DeJesus
Mar 4
Python has over 120,000 libraries at the time of this post.
As I dive deeper into Python for my Data Science needs, I find myself installing every library I consider useful and/or interesting.
One of these libraries was GeoPy.
As its docs state, GeoPy is a simple library designed to geocode location data and get location coordinates using outside geocoders and data sources.
The perfect solution to our data cleaning problem.
What problem…and with who?
I am currently involved in a three-man project to assist Eva Murray and Andy Kriebel with their MakeoverMonday Project.
If you don’t know what that is, you should definitely check it out and jump in (particularly if you like data visualization).
Eva or Andy will send out a visualization with its data and an article each Sunday.
Participants then take that data and create a visualization (viz for short) of their own using any visualization tool of their choice.
The participants then share a link and image of their viz on Twitter and post on Data.
Lastly, Eva and Andy then provide feedback on those visualizations through a live webinar.
My awesome teammates on this project are Robert Crocker and Mehsam Raza Hemani.
Part of my and Mehsam’s end of the project is collecting the tweets that house the images and links of the participants’ vizzes.
Mehsam collected the last two years’ worth of tweets, while I use tweepy to obtain the recent tweets.
From these tweets, I pull various useful pieces of information such as the participants’ locations (if available).
One of the problems with the location data pulled from the tweets (aside from strange format issues) is that there are inconsistencies with how they are produced.
These inconsistencies made it difficult for Rob to use Tableau to plot the locations of our participants…
(Image: Rob’s Tableau window showing the geographical detection issues.)
So the team suggested changing the locations to latitude and longitude coordinates.
Rob then asked if the GeoPy library might help.
This was when I was glad I downloaded that random library days before.
Now you will see the implementation.
If you would just like to check out the code it is available in our GitHub Project Repo.
The Cleaning Code

    # Import Libraries and Classes
    import pandas as pd
    import numpy as np
    from geopy.geocoders import Nominatim
    from geopy.exc import GeocoderTimedOut

GeoPy offers various geocoders.
Some of these, however, require getting API keys.
I decided to go with Nominatim since that was the easiest for me to use and didn’t need an API key.
We will also import the GeocoderTimedOut error for a function we will be using to geocode our locations.
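
    # Load Data
    df = pd.read_csv('….csv', encoding='latin1')

    # Function used to geocode locations and override timeout error
    def do_geocode(address):
        # Newer GeoPy releases require a custom user_agent; this value is a placeholder
        geolocator = Nominatim(user_agent='makeovermonday-location-cleaning')
        try:
            return geolocator.geocode(address, exactly_one=True)
        except GeocoderTimedOut:
            # Try the same address again after a timeout
            return do_geocode(address)

    # Creating Geocoded Location column (missing locations stay as None)
    df['GeocodedLocation'] = df['Location'].apply(lambda x: do_geocode(x) if pd.notna(x) else None)

The data loaded at the time had a little over 1,100 entries (3 weeks’ worth).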
Using the function above we can geocode each of the locations (this is a function I modified from Stack Overflow).
The geocoder has the parameter of exactly_one=True since we only want one address returned.
Without this function (as I experienced), the geocoder will time out partway through the geocoding process.
The function above handles the timeout by catching the error and calling itself again until the request completes.
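One thing to keep in mind is that a retry with no limit can keep going forever if the geocoding service stays unavailable. Below is a minimal sketch of a bounded retry, not the code used in the project. The names geocode_with_retry, retry_on, max_attempts and wait_seconds are hypothetical, and the geocoder here is a stand-in so the example runs offline. With GeoPy you would presumably pass the geocoder's geocode method and retry_on=GeocoderTimedOut.

```python
import time

def geocode_with_retry(geocode, address, retry_on=Exception,
                       max_attempts=3, wait_seconds=1.0):
    """Call geocode(address), retrying a limited number of times on retry_on errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return geocode(address)
        except retry_on:
            if attempt == max_attempts:
                return None  # give up and treat the location as not geocoded
            time.sleep(wait_seconds)  # short pause before the next attempt

# Stand-in geocoder (an assumption for the demo): fails twice, then answers.
calls = {"count": 0}

def flaky_geocode(address):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return (40.7128, -74.0060)

result = geocode_with_retry(flaky_geocode, "New York",
                            retry_on=TimeoutError, wait_seconds=0)
print(result, calls["count"])
```

Returning None after the last attempt fits the rest of the cleaning code, which already treats None as "no location available".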
To store the geocoded locations, a column called ‘GeocodedLocation’ was created using the apply method and a lambda function to apply our do_geocode function.
Some tweets did not have a location for the participant, so the lambda returns None for those, which keeps the coordinate creation simple.
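Since many participants share the same location text, one possible refinement is to geocode each distinct location only once and reuse the result, which means fewer requests to the geocoder. The sketch below is an illustration under that assumption, not what the project code does. geocode_unique and fake_geocode are hypothetical names, and the fake geocoder stands in for do_geocode so the example runs without a network call.

```python
import pandas as pd

def geocode_unique(locations, geocode):
    """Geocode each distinct, non-missing location once and map results back."""
    cache = {}
    for value in locations.dropna().unique():
        cache[value] = geocode(value)
    # Rows with a missing location are left without a geocoded result
    return locations.map(lambda value: cache.get(value) if pd.notna(value) else None)

# Stand-in geocoder (an assumption for the demo) that records each request.
calls = []

def fake_geocode(address):
    calls.append(address)
    return f"geocoded:{address}"

locations = pd.Series(["London", None, "London", "Paris"])
result = geocode_unique(locations, fake_geocode)
print(len(calls))
```

In this toy example, four rows need only two geocoding requests.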

    # Create the Latitude Column
    lat = []
    for i in df['GeocodedLocation']:
        if i is None:
            lat.append(None)
        else:
            lat.…
    ….astype('float')

    # Create the Longitude Column
    long = []
    for i in df['GeocodedLocation']:
        if i is None:
            long.append(None)
        else:
            long.…
    ….astype('float')

Next, the latitude and longitude columns are created using loops.
Again, we append None for missing locations so that Tableau can recognize the data in these columns as a location type.
The data types of each coordinate are converted to floats so this does not have to be done manually in Tableau.
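The coordinate step described above can be sketched as follows. This is a hedged illustration rather than the exact project code: the column names Latitude and Longitude, the helper coordinate_column, and the FakeLocation class are assumptions. FakeLocation stands in for GeoPy's location result, which exposes latitude and longitude attributes.

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class FakeLocation:
    """Stand-in for a geocoded result with .latitude and .longitude attributes."""
    address: str
    latitude: float
    longitude: float

df = pd.DataFrame({
    "GeocodedLocation": [
        FakeLocation("New York, USA", 40.7128, -74.0060),
        None,  # tweet without a usable location
        FakeLocation("London, UK", 51.5074, -0.1278),
    ]
})

def coordinate_column(geocoded, attribute):
    """Build a float column from one attribute, keeping missing rows empty."""
    values = [getattr(loc, attribute) if loc is not None else None for loc in geocoded]
    return pd.Series(values, index=geocoded.index, dtype="float64")

# Column names here are illustrative assumptions
df["Latitude"] = coordinate_column(df["GeocodedLocation"], "latitude")
df["Longitude"] = coordinate_column(df["GeocodedLocation"], "longitude")
print(df[["Latitude", "Longitude"]])
```

Building the Series with a float dtype turns the None entries into NaN, so the columns are numeric without a separate conversion step.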

    # Drop GeocodedLocation Column
    df = df.drop(['GeocodedLocation'], axis=1)

    # Export Data to a csv
    df.to_csv('….csv', index=False)

We will then drop the ‘GeocodedLocation’ column since it would be redundant to keep now that we have latitude and longitude coordinate columns.
Then the data is exported to a csv so the Tableau visualization can happen.
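To see what the exported file looks like for rows without coordinates, here is a small sketch. The sample rows and the column names Latitude and Longitude are made up for illustration. Calling to_csv without a file path returns the CSV text, which makes it easy to inspect.

```python
import pandas as pd

# Made-up sample rows; the second one has no location at all
df = pd.DataFrame({
    "Location": ["New York", None],
    "GeocodedLocation": ["<geocoded result>", None],
    "Latitude": [40.7128, None],
    "Longitude": [-74.0060, None],
})

# Drop the intermediate column, then export without the index
df = df.drop(columns=["GeocodedLocation"])
csv_text = df.to_csv(index=False)
print(csv_text)
```

Missing values are written as empty fields, which a tool like Tableau would typically read as nulls rather than as bad coordinates.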
The Results of Cleaning
Now that we have our latitude and longitude coordinates, we were able to go from Tableau having a map that detected almost none of the locations:
(Image: Before the GeoPy changes.)
to all the locations so that Rob can work his Tableau magic!
(Image: After the GeoPy changes!)
A lot better, huh?
You can check out a draft dashboard that Rob made at his Tableau Public Profile.
Acknowledgments
Thanks again to Rob for resurfacing the existence of GeoPy, and to him and Mehsam for looking into it (and for being great teammates).
Thanks also to the creators of GeoPy for making the library that solved our problem.
One more thing
You should definitely participate in MakeoverMonday.
It is a great chance to practice and improve your data visualization skills.
You also get the added benefits of getting your viz reviewed and being part of an amazingly supportive community.
Below are all the links on MakeoverMonday:
1. MakeoverMonday Website (the main site to see how to participate in the project here)
2. #MakeoverMonday Twitter Feed (ignore the other random ones on makeup and such…)
3. World (where the data sets for MakeoverMonday are housed along with viz postings)
Looking forward to seeing you participate :)
Until next time,
John DeJesus
Originally published at www.