How to use Pandas the RIGHT way to speed up your code

George Seif
May 21

The Pandas library has been a heavenly gift to the Data Science community.
Ask any Data Scientist how they like to handle their datasets in Python and they’ll undoubtedly talk about Pandas.
Pandas is the epitome of what a great programming library should be like: easy, intuitive, and far-ranging in its capabilities.
Yet performing thousands or even millions of calculations on a Pandas dataframe, a regular duty for Data Scientists, still remains a challenge.
You can’t just throw your data in, write a Python for-loop, and expect your data to be processed in a reasonable amount of time.
Pandas is designed for vectorized operations that work on entire rows or columns in one shot — looping through each cell, row, or column is simply not the way the library was designed to be used.
As such, when working with Pandas you should think in terms of matrix operations, which are highly parallelizable.
This guide will teach you how to use Pandas the way it was designed to be used and to think in terms of matrix operations.
Along the way, I’ll show you a few practical time-saving tips and tricks that’ll get your Pandas code running way faster than those dreaded Python for-loops!

Our setup

Throughout this tutorial we’re going to use the classic Iris Flowers dataset.
Let’s get the ball rolling by loading up the dataset with seaborn and printing out the first 5 rows.
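A minimal sketch of this step might look like the following. It assumes seaborn is installed and can download (or has already cached) its copy of the Iris dataset. The `load_iris` helper and its `offline` flag are hypothetical names added only so the sketch also runs without a network connection, using the first five rows of the dataset typed in by hand.

```python
import pandas as pd


def load_iris(offline=False):
    """Return the Iris dataset as a DataFrame.

    `load_iris` and `offline` are hypothetical helpers for this sketch.
    """
    if not offline:
        # Assumes seaborn is installed; load_dataset fetches the data
        # (and caches it) the first time it is called.
        import seaborn as sns
        return sns.load_dataset("iris")
    # Offline stand-in: the first five rows of the Iris dataset.
    return pd.DataFrame({
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
        "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
        "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
        "species": ["setosa"] * 5,
    })


# Switch to load_iris() to get all 150 rows from seaborn.
data = load_iris(offline=True)
print(data.head())
```

With a network connection, `load_iris()` returns the full dataset with the same five columns.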
Awesome!

Now let’s establish a baseline to measure our speed against with a Python for-loop.
We’ll set up a calculation to be performed on our dataset by looping through each row and then measure the speed of the whole operation.
This will give us a baseline to see just how much our new optimisations help us out.
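Here is a rough sketch of what such a baseline can look like. The small `data` frame is a stand-in for the Iris data, and the thresholds inside `compute_class()` (2 and 5) are assumptions chosen for illustration; only the function name and the idea of classifying by petal length come from this article.

```python
import time

import pandas as pd

# Stand-in for the Iris data: only the petal_length column matters here.
data = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1, 1.5, 5.9]})


def compute_class(petal_length):
    """Map a petal length to a class number (thresholds are assumed)."""
    if petal_length <= 2:
        return 1
    elif petal_length <= 5:
        return 2
    else:
        return 3


start = time.perf_counter()

class_list = []
for i in range(len(data)):
    # Look each row up by position, then pull out the one value we need.
    petal_length = data.iloc[i]["petal_length"]
    class_list.append(compute_class(petal_length))

elapsed = time.perf_counter() - start
print(f"for-loop: {elapsed:.6f} seconds")
print(class_list)
```

The exact timing will vary from machine to machine and with the size of the DataFrame.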
In the above code, we created a basic function that selects the class of the flower based on the petal length using an If-Else statement.
We wrote a for-loop that applies the function to each row by looping through the dataframe and then measured the total run-time of the loop.
On my machine, which has an i7-8700k, the loop took an average of 0.01345 seconds over 5 runs.

Looping with .iterrows()

The easiest yet very worthwhile speedup we can do right off the bat is to use Pandas’s built-in .iterrows() function.
When we wrote our for-loop in the previous section we were using the range() function.
Yet when we are looping over a large range of values in Python, generators tend to be much faster.
You can read more about how generators work and make things faster in this article here.
The .iterrows() function from Pandas implements a generator function internally which will yield a row of the DataFrame on each iteration.
More precisely, .iterrows() yields pairs (tuples) of (index, Series) for each row in the DataFrame.
This is effectively the same as using something like enumerate() in raw Python but runs much, much faster.

Below we’ve modified the code to use .iterrows() instead of a regular for-loop.
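A sketch of the modified loop, under the same assumptions as before (a small stand-in `data` frame and illustrative thresholds in `compute_class()`):

```python
import time

import pandas as pd

# Stand-in for the Iris data: only the petal_length column matters here.
data = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1, 1.5, 5.9]})


def compute_class(petal_length):
    """Map a petal length to a class number (thresholds are assumed)."""
    if petal_length <= 2:
        return 1
    elif petal_length <= 5:
        return 2
    else:
        return 3


start = time.perf_counter()

class_list = []
# Each iteration yields an (index, Series) pair for one row.
for index, row in data.iterrows():
    class_list.append(compute_class(row["petal_length"]))

elapsed = time.perf_counter() - start
print(f".iterrows(): {elapsed:.6f} seconds")
print(class_list)
```

The loop body no longer looks rows up by position; the generator hands us each row in turn.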
On the exact same machine I used for testing in the previous section, the average run-time was 0.005892 seconds, a 2.28X speedup!

Dropping loops completely with .apply()

The .iterrows() function got us a great boost in speed, but we’re far from finished.
Always remember that when using a library designed for vector operations, there’s probably a way to do things most efficiently without for-loops at all.
The Pandas function that offers us this capability is the .apply() function.
The .apply() function takes another function as its input and applies it along an axis of a DataFrame (rows, columns, etc).
In such cases where we are passing functions, a lambda is often convenient to package everything together.
In the code below, we have completely replaced our for-loop with .apply() and a lambda function to package our desired calculations.
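One way this can be written, again as a sketch with a stand-in `data` frame and assumed thresholds in `compute_class()`:

```python
import time

import pandas as pd

# Stand-in for the Iris data: only the petal_length column matters here.
data = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1, 1.5, 5.9]})


def compute_class(petal_length):
    """Map a petal length to a class number (thresholds are assumed)."""
    if petal_length <= 2:
        return 1
    elif petal_length <= 5:
        return 2
    else:
        return 3


start = time.perf_counter()

# axis=1 passes each row to the lambda; the result is a Series of class numbers.
class_list = data.apply(lambda row: compute_class(row["petal_length"]), axis=1)

elapsed = time.perf_counter() - start
print(f".apply(): {elapsed:.6f} seconds")
print(class_list.tolist())
```

Note that the result is a Pandas Series rather than a plain Python list, so it can be assigned straight back to the DataFrame as a new column.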
On my machine, the average run time of this code is 0.0020897 seconds, a 6.44X speed-up over our original for-loop.

The reason that .apply() is so much faster is that it internally tries to loop over Cython iterators.
If your function happens to be well-optimised for Cython, .apply() will get you an even bigger speed-up.
As a bonus, using the built-in functions results in much cleaner and more readable code.

The final cut

Previously I mentioned that if you are using a library that’s designed for vectorized operations, you should always look for a way to do any calculations without for-loops.
Similarly, many libraries designed in this way, including Pandas, will have convenient built-in functions that perform the exact calculations you’re looking for — but way faster.
The .cut() function from Pandas takes as input a set of bins which define each range of our If-Else and a set of labels which define which value to return for each range.
It then performs the exact same operation we wrote manually with the compute_class() function.
Check out the code below to see how .cut() works.
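A sketch of the idea, using the same stand-in `data` frame as a substitute for the Iris data. The bin edges (0, 2, 5, 100) are assumptions chosen to mirror the illustrative thresholds in `compute_class()`, with 100 as an arbitrary upper bound well above any petal length.

```python
import time

import pandas as pd

# Stand-in for the Iris data: only the petal_length column matters here.
data = pd.DataFrame({"petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1, 1.5, 5.9]})


def compute_class(petal_length):
    """Manual If-Else version, kept only to check the result (thresholds assumed)."""
    if petal_length <= 2:
        return 1
    elif petal_length <= 5:
        return 2
    else:
        return 3


start = time.perf_counter()

# Bins are [0, 2], (2, 5], (5, 100]; each bin maps to the label in the same position.
class_list = pd.cut(
    x=data["petal_length"],
    bins=[0, 2, 5, 100],
    include_lowest=True,
    labels=[1, 2, 3],
).astype(int)

elapsed = time.perf_counter() - start
print(f"pd.cut(): {elapsed:.6f} seconds")

# Sanity check against the manual function.
expected = data["petal_length"].apply(compute_class)
print((class_list == expected).all())
```

The whole column is binned in one call, with no Python-level loop over rows.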
We’ve again gotten the sweet sweet bonus of cleaner and more readable code.
In the end, the .cut() function runs in an average of 0.001423 seconds, a whopping 9.45X speed-up over the original for-loop!