In reality this is often not the case, especially when you run a Pandas apply function, which can take ages to finish. However, alternatives do exist that can speed up the process, and I will share them in this article.
Table of Contents
- Reasons for low performance of Pandas DataFrame.apply()
- Option 1: Dask Library
- Option 2: Swifter Library
- Option 3: Vectorise when possible
- Conclusions
- References

Reasons for low performance of Pandas DataFrame.apply()
DataFrame.apply() is unfortunately still limited to working with a single core.
That means the whole process is carried out by a single core, without utilising any of the other cores of your machine.
Naturally, the first question that pops up is: how can we use all the cores of our machine/cluster to run Pandas DataFrame.apply() in parallel?

Option 1: Dask Library
A quick look at the Dask website turns up the following quote:
“Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love and is a flexible library for parallel computing in Python”
At the same time, Dask DataFrame mimics Pandas. The simplest way is to use Dask’s map_partitions.
First you need to install Dask:

```
pip install dask
```

and also to import the following:

```python
import multiprocessing

import dask.dataframe as dd
import numpy as np
import pandas as pd
```

Below we run a script comparing the performance of Dask’s map_partitions against DataFrame.apply().
The command is pretty simple: the apply statement is wrapped in a map_partitions call, there is a compute() at the end, and npartitions has to be initialised.
Important things to notice:
- The execution time was halved using Dask’s map_partitions.
- dd.from_pandas(..., npartitions=multiprocessing.cpu_count()) is used to divide the pd.DataFrame into chunks.
In simple terms, the npartitions property is the number of Pandas dataframes that compose a single Dask dataframe.
Generally you want a few times more partitions than you have cores.
This affects performance in two main ways.
1) If you don’t have enough partitions then you may not be able to use all of your cores effectively.
For example, if your dask.dataframe has only one partition, then only one core can operate at a time. 2) If you have too many partitions, then the scheduler may incur a lot of overhead deciding where to compute each task.
Every task takes up a few hundred microseconds in the scheduler.
map_partitions is simply applying that lambda function to each partition.
compute() is telling Dask to process everything that came before and deliver the end product to me.
Many distributed libraries like Dask or Spark implement ‘lazy evaluation’, or creating a list of tasks and only executing when prompted to do so.
It is very important to set scheduler='processes'; otherwise the computation time will increase dramatically, as shown below. Improper settings of the Dask parameters can increase the execution time.
Dask comes with several available schedulers:
- “threads”: a scheduler backed by a thread pool
- “processes”: a scheduler backed by a process pool (the preferred option on local machines, as it uses all CPUs)
- “single-threaded” (aka “sync”): a synchronous scheduler, good for debugging

Option 2: Swifter Library
Swifter advertises itself as:
“A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.”
First you will need to pip install the library as follows:

```
pip install swifter
```

It works as a plugin for pandas, allowing you to reuse the apply function, so it is very easy to use, as shown below, and very fast. Surprisingly, it runs very fast, and the reason is that the function we apply can be vectorised.
Swifter is able to detect that automatically.
However, in cases where the function we apply cannot be vectorised, it automatically decides whether to go with the Pandas apply function or with Dask (without the user having to choose the number of partitions).
Should we use parallel processing (which has some overhead), or a simple pandas apply (which only utilises one CPU but has no overhead)? In simple terms, swifter uses pandas apply when it is faster for small data sets, and converges to dask parallel processing when that is faster for large data sets. In this manner, the user doesn’t have to think about which method to use, regardless of the size of the data set.
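A minimal sketch of the swifter call, mirroring the article’s y*(x**2+1) example (the data is an assumption, and a plain-pandas fallback is included in case swifter is not installed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(1_000), "y": np.random.rand(1_000)})

def func(row):
    return row["y"] * (row["x"] ** 2 + 1)

try:
    import swifter  # noqa: F401 -- importing registers the .swifter accessor
    # Just insert `.swifter` between the DataFrame and `.apply`
    result = df.swifter.apply(func, axis=1)
except ImportError:
    # Fallback so the sketch still runs without swifter installed
    result = df.apply(func, axis=1)
```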
Option 3: Vectorise when possible
Personally speaking, if you think the function you apply can be vectorised, you should vectorise it (in our case y*(x**2+1) is trivially vectorised, but there are plenty of things that are impossible to vectorise).
It is important to always write code with a vector mindset (broadcasting) as opposed to a scalar one. The serious time savings come from avoiding for loops and performing operations across the entire array.
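For the article’s function, the vectorised (broadcast) version is a one-liner over whole columns (toy data assumed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(1_000), "y": np.random.rand(1_000)})

# Scalar mindset: a Python-level loop over rows via apply -- slow
slow = df.apply(lambda row: row["y"] * (row["x"] ** 2 + 1), axis=1)

# Vector mindset: one broadcast expression over the entire columns -- fast
fast = df["y"] * (df["x"] ** 2 + 1)
```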
As expected, it is faster than any other option, and its performance is pretty similar to swifter’s. The difference is attributed to the fact that swifter has some overhead time to identify whether the function can be vectorised.
Conclusions
Rather than thinking of how to get more computational power, we should think about how to use the hardware we do have as efficiently as possible.
In this article, we walked through three different options for speeding up the Pandas apply function by taking full advantage of our computational power.
Thanks for reading, and I am looking forward to hearing your questions :) Stay tuned, and Happy Machine Learning.
The complete Jupyter Notebook can be found on my GitHub page.

References
https://stackoverflow.com/questions/45545110/how-do-you-parallelize-apply-on-pandas-dataframes-making-use-of-all-cores-on-o

Originally published at https://gdcoder.com on April 30, 2019.