Scraping US Census Data via CenPy

Wrangling Census Data without the API

Ashley White · Jan 28

After taking on a team project to examine the economic impact (via wage loss) of natural disasters on a specific locale, our plan was to use a Seasonal Auto-Regressive Integrated Moving Average (SARIMA) forecast to estimate what historical employment, unemployment, and wage levels would have been sans a natural disaster.

By comparing our predictions to the actual effects post-disaster, we could approximate the economic impact of that event.

Given the time crunch, I immediately panicked thinking about the possibility of having to navigate a clunky SQL interface or manually download disparate csvs for each state to pull the information I needed.

Furthermore, I wasn't even quite sure of the metrics or time frame that I would ultimately need.

With all this uncertainty, I figured my best option would be to leverage an API or webscraper to get the data I would need.

However, my API & BeautifulSoup skills still need some work, so I began to search for an alternative.

Fortunately, the Python community is full of creative developers who have created libraries and wrappers to more easily interact with data like CenPy.

According to the creators, “CenPy (sen – pie) is a package that exposes APIs from the US Census Bureau and makes it easy to pull down and work with Census data in Pandas.” While I will explore a couple of features that I used, I would encourage you all to check out their GitHub and introductory notebook for more information.

Getting Started

The only imports I used for this exercise were CenPy and Pandas:

```python
import pandas as pd
import cenpy as cen
```

By calling explorer.available, you can access a list of the identifiers of all the APIs that cenpy knows about. Fortunately, I knew which API/data series I would want to access, but if you wanted to see all the available APIs, you can do so via the dictionary output or convert it to a Pandas dataframe for easier readability.

```python
# Call list of available datasets; verbose=True includes the dataset title
datasets = list(cen.explorer.available(verbose=True).items())

# Convert dictionary to dataframe
datasets = pd.DataFrame(datasets, columns=['code', 'dataset name'])
```

code is important here, because it allows you to specify the database to which you'd like to establish a connection.
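To illustrate the dictionary-to-dataframe step without hitting the Census servers, here is a minimal sketch using a toy stand-in dictionary; the codes and titles below are illustrative placeholders, not the real output of explorer.available:

```python
import pandas as pd

# Toy stand-in for the {code: title} dictionary that
# cen.explorer.available(verbose=True) would return
available = {
    'QWISA': 'Quarterly Workforce Indicators: Sex by Age',
    'ACSSF': 'American Community Survey Summary File',
}

datasets = pd.DataFrame(list(available.items()), columns=['code', 'dataset name'])

# Filter the titles to locate a series of interest
qwi_rows = datasets[datasets['dataset name'].str.contains('Workforce')]
print(qwi_rows['code'].tolist())
```

The same `str.contains` filter works on the real dataframe when you don't know a series' exact code up front.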

The connection class allows you to construct a query string and make requests from the Census server.

The result is then parsed into JSON and returned.

I began my initial queries with the Quarterly Workforce Indicators (QWI), a set of 32 economic indicators including employment, job creation/destruction, wages, hires, and other measures of employment flows.

```python
qwi_connection = cen.base.Connection('QWISA')
```

Constructing the Query

According to the Census API documentation, there are a few required parameters you must supply when calling their API: Endpoint, Indicator, Geography, and Time.

Default values for other categorical variables are assumed if not specified.
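As a sketch of the kind of request string cenpy assembles on your behalf, the required parameters can be combined into a URL with the standard library. This only builds the string (it makes no network call), and the key parameter is omitted:

```python
from urllib.parse import urlencode

# Minimum required parameters, mirroring the Census QWI example query
params = {'get': 'Emp', 'for': 'state:02', 'year': '2012', 'quarter': '1'}

# safe=':' keeps the FIPS separator readable instead of percent-encoding it
url = 'https://api.census.gov/data/timeseries/qwi/sa?' + urlencode(params, safe=':')
print(url)
```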

Here is an example of the minimum requirements for a fully-formed basic query:

```
api.census.gov/data/timeseries/qwi/sa?get=Emp&for=state:02&year=2012&quarter=1&key=[userkey]
```

I started by setting some default parameters:

```python
# Specify all counties
g_unit = 'county:*'

# Start with one state (chosen arbitrarily)
g_filter = {'state': '01'}

# Specify time period
time = 'from 2003-Q1 to 2018-Q1'

# Use .varslike to pull in all indicator names
cols = qwi_connection.varslike('Emp')        # Employment
hir = qwi_connection.varslike('HirA')        # Hiring
earns = qwi_connection.varslike('Earn')      # Earnings
payroll = qwi_connection.varslike('Pay')     # Payroll
firm = qwi_connection.varslike('Frm')        # Firm job stats
sep = qwi_connection.varslike('sep')         # Separations

# Extend cols to add the additional variables
cols.extend(hir)
cols.extend(earns)
cols.extend(payroll)
cols.extend(firm)
cols.extend(sep)
```

Looping over all States

I wanted to construct a loop that would allow me to iterate over all states and concatenate all the results into one master dataframe.

Rather than starting from a blank dataframe, I created a ‘master’ dataframe from the first query and appended each subsequent state to it, which ensured the column order would match and the new queries would join appropriately.
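The append pattern looks like this with toy frames standing in for per-state query results (the column names here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the first state's query result
master = pd.DataFrame({'state': ['01'], 'Emp': [1000]})

# A later state's result with the same columns in the same order
df = pd.DataFrame({'state': ['02'], 'Emp': [2000]})

# Because the columns match, concat stacks the rows cleanly
master = pd.concat([master, df], ignore_index=True)
print(master['state'].tolist())
```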

```python
# Create the first query / dataframe (with state 01)
master = qwi_connection.query(cols=cols, time=time, geo_filter=g_filter, geo_unit=g_unit)
```

I then leveraged this state FIPS codes dictionary to create a list of state codes over which to iterate:

```python
state_codes = {
    'WA': '53', 'DE': '10', 'DC': '11', 'WI': '55', 'WV': '54', 'HI': '15',
    'FL': '12', 'WY': '56', 'NJ': '34', 'NM': '35', 'TX': '48', 'LA': '22',
    'NC': '37', 'ND': '38', 'NE': '31', 'TN': '47', 'NY': '36', 'PA': '42',
    'AK': '02', 'NV': '32', 'NH': '33', 'VA': '51', 'CO': '08', 'CA': '06',
    'AL': '01', 'AR': '05', 'VT': '50', 'IL': '17', 'GA': '13', 'IN': '18',
    'IA': '19', 'MA': '25', 'AZ': '04', 'ID': '16', 'CT': '09', 'ME': '23',
    'MD': '24', 'OK': '40', 'OH': '39', 'UT': '49', 'MO': '29', 'MN': '27',
    'MI': '26', 'RI': '44', 'KS': '20', 'MT': '30', 'MS': '28', 'SC': '45',
    'KY': '21', 'OR': '41', 'SD': '46'
}

# Extract numerical state codes
states = list(state_codes.values())
```

Finally, I created a for loop to iterate over all state codes and joined the results with my existing master dataframe (note the requests import, needed to catch HTTP errors from failed queries):

```python
import requests

for s in states:
    print(f'Scraping {s}')
    try:
        # Iterate over state 's'
        g_filter = {'state': s}
        df = qwi_connection.query(cols=cols, time=time, geo_filter=g_filter, geo_unit=g_unit)
        # Concat new df with master df
        master = pd.concat([master, df])
    except requests.exceptions.HTTPError:
        pass
```

Please note you can include up to 50 variables in a single API query and can make up to 500 queries per IP address per day.

More than 500 queries per IP address per day requires that you register for a Census key.

That key will be part of the data request URL string that you specify when you establish the connection.
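Given the 50-variable ceiling, one defensive pattern (my own sketch, not part of cenpy) is to split a long column list into API-sized chunks and query each chunk separately:

```python
def chunk_columns(cols, size=50):
    """Split a variable list into chunks the API will accept."""
    return [cols[i:i + size] for i in range(0, len(cols), size)]

# Illustrative variable names only
all_vars = [f'VAR{i}' for i in range(120)]
groups = chunk_columns(all_vars)
print([len(g) for g in groups])  # [50, 50, 20]
```

Each chunk would then be queried in its own request, and the resulting dataframes merged on the geography and time columns.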

I found that my query of all the counties of all the states exceeded the overall IP limit, even after using my API key, and it timed me out of the system.

This forced me to split up my query into two parts over two days.

I hope this has been a helpful primer for you, and that you are able to explore all the possibilities of CenPy!
