Pandas Challenge #2 — Explain the 1,000x Speed Difference when taking the MeanTed PetrouBlockedUnblockFollowFollowingApr 16In this challenge, your goal is to explain why taking the mean of the following DataFrame is more than 1,000x faster when setting the parameter numeric_only to True.
All Pandas Challenges may be found in the Dunder Data Github repository.
Master Data Analysis with PythonIf you’re interested in completely mastering data analysis using python:Get the Complete Master Data Analysis with Python Bundle (2 books, 1200 pages, 500 exercises, 50+ hours of video)Sign-up for the FREE 5-hour Intro to Pandas classVideo Available!A video tutorial of me completing this challenge is available on YouTube.
The ChallengeThe bikes dataset below has about 50,000 rows.
Calling the mean method on the entire DataFrame returns the mean of all the numeric columns.
If we set the parameter numeric_only to True, the exact same result is returned.
But, using the second option results in a speed difference of more than 1,000 times, reducing the completion time from over 40 seconds down to around 15 milliseconds.
The challenge is to explain why this speed difference exists despite each of these operations returning the exact same result.
The solution is fairly nuanced and requires a deep understanding of pandas.
>>> import pandas as pd>>> bikes = pd.
shape(50089, 19)Taking the meanCalling the mean method with the defaults is extremely slow.
239627e+03dtype: float64Setting the parameter numeric_only to True makes a huge difference even though the returned result is the same.
239627e+03dtype: float64Timing each operationThere is over 1,000x difference in performance — from 40 seconds to 15 ms>>> %timeit -n 1 -r 1 bikes.
6 s ± 0 ns per loop (mean ± std.
of 1 run, 1 loop each)>>> %timeit -n 1 -r 1 bikes.
mean(numeric_only=True)19 ms ± 0 ns per loop (mean ± std.
of 1 run, 1 loop each)SolutionThe solution relies on a thorough understanding of the object data type.
DataFrame columns that are of the object data type may contain any Python object.
Object columns may be composed of integers, floats, strings, lists, dictionaries, other DataFrames, or any other object.
Typically, columns with the object data type contain only strings, but this isn’t guaranteed.
The object data type is the most flexible and it is this flexibility that causes the tremendous slowdown above.
Integers as objectsLet’s create a Series of integers and calculate some summary statistics on it.
Note that the data type is formally a 64-bit integer after creation.
>>> s_int = pd.
Series([10, 99, -123, 88])>>> s_int0 101 992 -1233 88dtype: int64Verify the data type.
dtype>>> dtype('int64')Let’s calculate the sum and mean for the Series with integer data type.
5Within pandas, the astype method may be used to change the data type of a Series.
Let's change this Series so that its data type is object.
>>> s_obj = s_int.
astype('object')>>> s_obj0 101 992 -1233 88dtype: objectBoth the sum and mean method work for columns with object data type and return the same results from above.
5Not a good ideaTypically, you would never want to convert a Series of integers to object as you would ruin the optimizations granted to you through the numpy library.
A Series that has a data type of ‘int64’ has its data stored internally as a numpy array which stores its data directly in contiguously allocated memory using a C integer array.
By converting a Series to the object data type, each integer is no longer stored as a C integer but as a Python integer object (a much bulkier object).
Let’s verify this by retrieving the type of an individual value in each numpy array.
values)intOperations on object arrays are slowChanging the data type of a column of integers to object will have no impact on the result for several methods, but performance will decline enormously.
Below, a numpy array of 1 million integers is created.
It is then summed as both an integer data type and as an object with the object being 60x slower.
>>> import numpy as np>>> a_int = np.
randint(0, 10, 1000000)>>> a_obj = a_int.
astype('object')>>> a_intarray([3, 4, 8, .
, 8, 3, 2])>>> a_objarray([3, 4, 8, .
, 8, 3, 2], dtype=object)>>> %timeit -n 5 a_int.
sum()523 µs ± 76.
3 µs per loop (mean ± std.
of 7 runs, 5 loops each)>>> %timeit -n 5 a_obj.
4 ms ± 920 µs per loop (mean ± std.
of 7 runs, 5 loops each)Strings can be added togetherOne interesting property of strings in Python is that they can be concatenated together with the plus operator.
>>> string1 = 'mac'>>> string2 = 'hine'>>> string1 + string2'machine'String SeriesWhenever you have a column of data in pandas that contains strings, its data type will always be object.
There is no specific string data type in pandas.
Let’s create a Series of strings and verify that its data type is indeed object.
>>> s = pd.
Series(['The', 'quick', 'brown', 'fox'])>>> s0 The1 quick2 brown3 foxdtype: objectSumming a Series of stringsThe sum Series method simply adds together every value in the Series.
Because addition is a valid operation with strings in Python, the method completes on our current Series.
sum()'Thequickbrownfox'Taking the mean of a string SeriesTaking the mean of a Series of strings is meaningless and pandas will raise an error.
Let's attempt this and make a note of the error message.
TypeError: Could not convert Thequickbrownfox to numericSumming and then dividingThe error message reads ‘Could not convert Thequickbrownfox to numeric’.
This implies that pandas has taken the time to compute the sum first before trying to divide by the total length of the Series.
Why is there no error with the bikes DataFrame?You might be wondering why our bikes DataFrame did not raise an error when taking the mean, but the above Series did.
DataFrames have a concept called nuisance columns, which are columns where a calculation is unable to be computed.
These nuisance columns are silently (without an error or warning) dropped from the result.
Only columns where the operation is successful are returned.
Taking the mean of a DataFrame with columns that don’t have a mean is valid.
For instance, we can turn our string Series into a one column DataFrame with the to_frame method and then compute the mean.
Notice that there is no error here as there was above when computed on a Series with the same data.
Instead, an empty Series is returned as the one column in the DataFrame is a nuisance column.
>>> df = s.
mean()Series(, dtype: float64)Explaining what happens during the challenge problemWhen taking the mean of the bikes DataFrame above, pandas first sums every single column regardless of its data type.
Once the sum is complete, then it divides by the number of rows to get the mean of that column.
For columns of strings, it is only at this stage where the division happens that pandas is unable to compute a mean and declares it a nuisance column.
Concatenating strings is extraordinarily more expensive than adding integers or floats and since every single value in a string column is first concatenated together with the call to mean explains why the operation is so terribly slow.
There are 50,000 rows in the bikes DataFrame resulting in 50,000 string concatenations before the division occurs.
Setting the numeric_only parameter to True informs pandas to not even attempt to sum the object data type columns, which is why we see the huge gap in performance when it is used even though the result is the same.
Why can’t pandas skip columns of strings?It does seem that the logical thing to do is for pandas to skip columns where the mean is not a valid option such as columns with strings in them.
Since object columns can contain any Python object, it could be possible that the mean is a valid operation as we saw in our first Series from above.
pandas does not make any assumptions about the data contained in the object column.
It just follows its procedure for calculating the mean, which is summing the column and then dividing by the length.
If at any point an error occurs, the column is declared a nuisance and dropped from the result.
Can’t pandas build a special case for this?Yes, it would still be possible for pandas to inspect values when taking the mean of an object column and if it is a data type that does not have a mean raise the error immediately at that point.
An actual string data typeA better solution for pandas would be to have a data type just for strings.
It’s quite unfortunate that we are stuck with the infinitely flexible object data type which can never guarantee the data type of an individual value.
Many errors could be avoided and performance increased (as in this challenge) if there were a string data type.
Learn MoreIf you’ve made it this far, then you might have enjoyed this piece.
If you’d like to learn more and support my work, please consider purchasing the Complete Master Data Analysis with Python Bundle.
I believe it to be the most comprehensive set of tutorials to introduce someone to doing data analysis with Python.