Quick and Dirty Data Gathering with Python

Ritvik Kharkar, Apr 10

I was walking to the grocery store the other day and couldn’t believe my eyes.

Another Starbucks had opened up seemingly overnight on the block next to my house.

And … this new Starbucks was literally right between two blocks which each had their own Starbucks!

At the rate new Starbucks were popping up, I found myself wondering: “How many Starbucks could there be in Los Angeles County?”

A quick Google search didn’t give me exactly what I wanted.

Either the result was just in the City of LA (not the whole county) or the estimate was outdated by 5+ years.

I knew what I had to do: gather the data myself!

I decided to start by visiting the Starbucks Store Locator, hoping that it might give me the information I needed.

(Image: Starbucks Store Locator)

I got … sort of … what I wanted.

I found out that by typing a zip code into the search bar, I got a list of 100 Starbucks stores centered at that zip code.

By looking at the URL at the top of the page, I found out which API call (fancy name for the command sent to the Starbucks website) I could use to access this list of 100 stores centered at zip code 90815:

https://www.starbucks.com/store-locator?place=90815

Based on that, I came up with a plan for how to get information on all the Starbucks stores in LA County:

1. Get a list of all zip codes in LA County
2. For each zip code, call the above API and parse through the returned 100 Starbucks stores
3. Remove duplicates from the big list of stores (since there might be a lot of overlap between one API call and the next)
4. Remove any stores which do not lie in LA County (since some zip codes lie right on the border of LA County and might include stores from neighboring counties)

Sounds reasonable? Let’s get started!

First I wanted to provide the list of Python libraries that will be needed for this project:

Get a list of all zip codes in LA County

This one was pretty easy and you can get this list from many different sources.

I chose to get it straight from the county.

I loaded these zip codes (one per line) into a text file called laZips.txt.
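Reading that file back into Python could look roughly like the sketch below. The function name load_zip_codes is made up for illustration, and the only assumption about the file is the one stated above: one zip code per line.

```python
def load_zip_codes(path):
    """Read one zip code per line, skipping blank lines and repeats."""
    zip_codes = []
    seen = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            code = line.strip()
            # Keep zip codes as strings so leading zeros are never lost.
            if code and code not in seen:
                seen.add(code)
                zip_codes.append(code)
    return zip_codes
```

Calling load_zip_codes("laZips.txt") would then give a list of zip code strings in file order.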

Call API for each zip code

Next, we want to call the Starbucks Store Locator API for each zip code we have gathered:

(Code listing: Scraping all Starbucks stores in LA County)

You may have noticed the magical ‘processResponse’ function, which takes the contents returned by Starbucks and converts that text into information about each of the 100 stores.

In reality, this is just a bunch of text processing and I encourage you to check it out in the full code.
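For readers who want the overall shape of that loop, here is a minimal sketch using only the standard library. It is not the code from the full project: build_url, fetch_text and gather_stores are names invented here, and the parsing step is passed in as a function because processResponse itself lives in the full code. The site may also respond differently to scripted requests than to a browser, so treat the fetching part as an assumption.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# URL pattern taken from the store locator address shown earlier.
BASE_URL = "https://www.starbucks.com/store-locator"


def build_url(zip_code):
    """Build the store locator URL for one zip code."""
    return BASE_URL + "?" + urlencode({"place": zip_code})


def fetch_text(url):
    """Download a page and return its body as text (assumed fetch step)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def gather_stores(zip_codes, process_response, fetch=fetch_text):
    """Call the locator once per zip code and pool every parsed store.

    process_response is whatever function turns the returned text into a
    list of stores; fetch can be swapped out, which makes testing easy.
    """
    all_stores = []
    for zip_code in zip_codes:
        text = fetch(build_url(zip_code))
        all_stores.extend(process_response(text))
    return all_stores
```

Passing the fetch step in as an argument means the loop can be tried out with a fake downloader before pointing it at the real site.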

At this point, the variable allStores is a list of store information, each of which looks like this:

Remove duplicate stores

One issue is that our master list of Starbucks stores, allStores, is going to include (potentially many) duplicates.

Not too big of a deal! Let’s loop through and remove any stores where we have already encountered the store id.

At this point, the variable laStores contains no duplicate stores.
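A minimal sketch of that de-duplication step is below. It assumes each store is a dictionary with its store id under the key "id"; that key name is a guess for illustration, so adjust it to whatever the parsed store info actually uses.

```python
def remove_duplicates(all_stores, key="id"):
    """Keep the first occurrence of each store id, preserving order."""
    seen_ids = set()
    unique_stores = []
    for store in all_stores:
        store_id = store[key]
        if store_id not in seen_ids:
            seen_ids.add(store_id)
            unique_stores.append(store)
    return unique_stores
```

Tracking the ids in a set keeps each lookup fast, which matters when every zip code contributes up to 100 stores.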

Remove stores outside of LA County

We do have one more issue with our list.

It is possible that a zip code is on the outskirts of LA County, so the 100 Starbucks stores centered at this zip code will include stores in a neighboring county.

How do we fix this?

One option would be to use the ‘City’ field in our store info and match that up with some list of cities in LA County that we gather.

This might work, but I worry about cases where there are slight discrepancies in city name between our data and the list of city names we pull from the web.

Let’s go with a more direct approach here.

That is, we will use a GeoJSON file of LA County (basically a JSON file which accurately defines a complex shape such as LA County).

I’ll post this file on my GitHub.
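To give a feel for what checking a store against such a shape involves, here is a sketch of a point-in-polygon test written in plain Python with the ray-casting method. This is not necessarily how the full code does it (a geometry library would also work), and the store keys "latitude" and "longitude" are assumed names. The one firm detail is that GeoJSON stores coordinates as [longitude, latitude].

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: is (lon, lat) inside a closed ring of [lon, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Only edges that straddle the point's latitude can be crossed.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_polygon(lon, lat, polygon):
    """A GeoJSON polygon is [exterior_ring, hole_ring, ...]."""
    exterior, holes = polygon[0], polygon[1:]
    if not point_in_ring(lon, lat, exterior):
        return False
    return not any(point_in_ring(lon, lat, hole) for hole in holes)


def point_in_geometry(lon, lat, geometry):
    """Handle the two GeoJSON geometry types a county outline would use."""
    if geometry["type"] == "Polygon":
        return point_in_polygon(lon, lat, geometry["coordinates"])
    if geometry["type"] == "MultiPolygon":
        return any(point_in_polygon(lon, lat, polygon)
                   for polygon in geometry["coordinates"])
    raise ValueError("unsupported geometry type: " + geometry["type"])


def keep_stores_inside(stores, geometry):
    """Keep stores whose (assumed) longitude/latitude fall inside the shape."""
    return [store for store in stores
            if point_in_geometry(store["longitude"], store["latitude"], geometry)]
```

The geometry argument would be the "geometry" member of a feature loaded from the GeoJSON file with json.load. Points lying exactly on the boundary can go either way with this method, which should rarely matter for store locations.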

Before doing the data cleanup, let’s make a map to make sure that we actually do have stores outside of LA County to worry about.

Yea … clearly we have stores outside LA County (highlighted in blue).

Let’s get rid of ‘em!

Let’s regenerate the map using keepLAStores instead:

Yay! Problem fixed.

(I’ll be writing a tutorial on how to create maps like these soon!)

Whew! Now that we did all that work, let’s store our data in a CSV in case we want to use it again in the future:

So … how many Starbucks are there in LA County?

728.

All code and necessary files used in this project can be found here.

Happy data gathering!