Keeping Pandas DataFrames clean when importing JSONKarl LoreyBlockedUnblockFollowFollowingMar 16Photo by Samuel Zeller on UnsplashAt First Momentum, I do a lot of data analysis to find the most promising young startups.
As a first step, you always have to import the desired data into a Pandas DataFrame and do some preprocessing, for example by importing JSON data from some API.
When importing JSON data, you usually have a lot of temporary columns in your DataFrame which can get messy quickly.
To deal with temporary columns, I built a custom Context Manager that keeps track of all imported columns and deletes them when you’re done.
This way, your code stays lean and you don’t have to remove temporary columns yourself.
In this short article, I will show you how.
You can find the code on GitHub.
As an example, I will use the actual code I use for importing data from the API of our CRM named Hubspot.
What I retrieve is a list of companies stored as a list of Python dictionaries.
To import a list of dictionaries (data) into a Pandas DataFrame you basically do:from pandas.
json import json_normalizedf = json_normalize(data)The json_normalize function generates a clean DataFrame based on the given list of dictionaries, the data parameter, and normalizes the hierarchy so you get clean column names.
This is especially useful for nested dictionaries.
Ugly: Keeping imported columnsThe problem with json_normalize is that you usually only want a subset of the imported columns, mostly with different names or some kind of pre-processing, too.
So you might be tempted to do something like this:from pandas.
json import json_normalizedf = json_normalize(data)df['company_id'] = df['companyId']df['location'] = df['properties.
value']df['name'] = df['properties.
value']df['domain'] = df['properties.
This works, but keeps all the imported columns inplace and might take a lot of storage.
So what can you do?Ugly: Dropping columns manuallySo after importing, you want to get rid of all temporary columns from the import.
To do this, you have to either select the columns you want or drop all columns you don’t want.
In both cases, you have to somehow keep track of the temporary columns or the ones you want to keep.
To deal with this, one solution would be to prefix temporary columns and delete them afterwards:from pandas.
json import json_normalizedf = json_normalize(data)// make temporary columnsdf.
columns = ['temp_' + c for c in df.
columns]// pre-processing, basic calculations, etc.
df['company_id'] = df['temp_companyId']df['location'] = df['temp_properties.
value']df['name'] = df['temp_properties.
value']df['domain'] = df['temp_properties.
Afterwards, you would then select all desired columns or drop all undesired columns.
drop([c for c in df.
columns if c.
startswith('temp_')], axis=1, inplace=True)// ordf = df[[c for c in df.
columns if not c.
startswith('temp_')]]While this works, it feels bloated and inefficient.
You have to prefix all the value names in the code which results in bloated column names.
You also have to keep track of column names you want in the end or the used prefix in different places.
Just imagine you have to change the prefix temp_ one day or make the code work with a different prefix.
Clean and easy: using a Context ManagerAfter having used the above methods for some time, it struck me that Python Context Managers might be a cleaner solution.
You might know them from their most popular application with open() as file:.
If not, please take a few minutes to read more about them.
To make things short: They basically ensure that something, usually a cleanup, is executed in each exit scenario, whether it is a usual exit like a return or an exception.
I thought I might use this to build a clean solution that keeps track and gets rid of temporary columns.
So I built a Context Manager that deals with temporary columns when importing JSON data so I don't have to.
You can basically use it like this:with DataFrameFromDict(companies) as df: // imported dict now in df, same result as json_normalize df['company_id'] = df['companyId'] df['location'] = df['properties.
value'] df['name'] = df['properties.
value'] df['domain'] = df['properties.
value']// after context exits, df contains company_id, location, name, and domain// but no more temporary columnsprint(df)The benefit: You don’t have to keep track anymore and the context manager handles the deletion of all temporary columns.
How it worksYou can just copy and paste the following snippet to get going, I’ll explain how it works below:class DataFrameFromDict(object): """ Temporarily imports data frame columns and deletes them afterwards.
""" def __init__(self, data): self.
df = json_normalize(data) self.
columns = list(self.
values) def __enter__(self): return self.
df def __exit__(self, exc_type, exc_val, exc_tb): self.
drop([c for c in self.
columns], axis=1, inplace=True)When opening the context, __init__ and __enter__ get called.
They create the DataFrame and remember all imported and thus temporary column names.
When the context is exited, __exit__ makes sure to drop all previously created columns and leaves only the newly created columns behind.
Hope this helps you to create a clean pre-processing pipeline.
Let me know what you think.
This post originally appeared on my blog.
Further reading:json_normalizePython Context Managers.. More details