Get faster pandas with Modin, even on your laptops.
Scaling Interactive Pandas Workflows with Modin.
Parul PandeyBlockedUnblockFollowFollowingMar 27SourceScale your pandas workflows by changing one line of codePandas is a library which needs no introduction in the field of data science.
It provides high-performance, easy-to-use data structures and data analysis tools.
However, when working with excessively large amounts of data, Pandas on a single core becomes insufficient and people have to resort to different distributed systems to increase their performance.
The tradeoff for improved performance, however, comes with a steep learning curve.
Essentially users probably just want Pandas to run faster and aren’t looking to optimize their workflows for their particular hardware setup.
This means people want to use the same Pandas script for their 10KB dataset as their 10TB dataset.
Modin offers to provide a solution by optimizing pandas so that Data Scientists spend their time extracting value from their data than on tools that extract data.
ModinModin is an early-stage project at UC Berkeley’s RISELab designed to facilitate the use of distributed computing for Data Science.
It is a multiprocess Dataframe library with an identical API to pandas that allows users to speed up their Pandas workflows.
Modin accelerates Pandas queries by 4x on an 8-core machine, only requiring users to change a single line of code in their notebooks.
The system has been designed for existing Pandas users who would like their programs to run faster and scale better without significant code changes.
The ultimate goal of this work is to be able to use Pandas in a cloud setting.
InstallationModin is completely open-source and can be found on GitHub: https://github.
com/modin-project/modinModin can be installed from PyPI:pip install modinFor Windows, one of the dependencies is Ray.
Ray is not yet supported natively on Windows, so in order to install it, one needs to use the WSL(Windows Subsystem for Linux).
How Modin speeds up the executionOn a LaptopConsider a 4 core modern laptop with a dataframe that fits comfortably in it.
While pandas use only one of the CPUs core, modin, on the other hand, uses all of them.
Utilisation of cores in pandas vs modinEssentially what modin does is that it simply increases the utilisation of all cores of the CPU thereby giving a better performance.
On a Large MachineOn large machine usefulness of modin becomes much more pronounced.
Let’s pretend there is some server or some pretty powerful machine.
So pandas will still utilise a single core and again modin will use all of them.
Here is a performance comparison of read_csv with pandas and modin on a 144 core computer.
sourceThere is a nice linear scaling in pandas but that’s because it’s still only using one core.
It may be hard to see the green bars because they’re so low in modin.
Typically 2 gigabytes takes about 2 seconds and 18 gigabytes take approximately less than 18 seconds.
ArchitectureLet’s have a look into the architecture of Modin.
DataFrame PartitioningThe partitioning schema partitions along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and the number of rows supported.
SourceSystem ArchitectureModin is separated into different layers.
:Pandas API is exposed at the topmost layerNext layer houses the Query Compiler which receives queries from the pandas API layer and performs certain optimizations.
At the last layer is the Partition Manager and is responsible for the data layout and shuffling, partitioning, and serializing the tasks that get sent to each partition.
a general architecture for modinImplementing pandas API in ModinThe pandas API is massive and that is a reason probably why it has such a wide range of use cases.
pandas APIWith such a lot of operations at hand, modin followed a data-driven approach.
This means the creators of modin took a look at what people mostly use in pandas.
They went to Kaggle and did a massive scrape of all the notebooks and scripts present there and ultimately figured out the most popular pandas’ methods which are as follows:pd.
read_CSV is by far the most used method in pandas followed by, pd.
Therefore, at modin, they started implementing things and optimizing them in the order of their popularity:Currently, modin supports about 71% of the pandas API.
This represents about 93% of usage based on the study.
RayModin uses Ray to provide an effortless way to speed up the pandas’ notebooks, scripts, and libraries.
Ray is a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications.
The same code can be run on a single machine to achieve efficient multiprocessing, and it can be used on a cluster for large computations.
You can find Ray on GitHub: github.
UsageImportingModin wraps pandas and transparently distributes the data and computation, accelerating the pandas’ workflows with one line of code change.
Users continue to use previous pandas notebooks while experiencing a considerable speedup from Modin, even on a single machine.
Only a modification of the import statement is needed, wherein one needs to import modin.
pandas rather than simple pandas.
import numpy as npimport modin.
pandas as pdLet’s build a toy dataset using Numpy consisting of random integers.
Notice, we don’t have to specify partitioning here.
ata = np.
randint(0,100,size = (2**16, 2**4))df = pd.
DataFrame(data)df = df.
add_prefix("Col:")When we print out the type, it is a Modin dataframe.
DataFrameif we were to print out the first 5 lines with the head command, it renders an HTML table just like pandas would.
head()ComparisonsModin manages the data partitioning and shuffling so that users can focus on extracting value from the data.
The following code was run on a 2013 4-core iMac with 32GB RAM.
read_csvread_csv is by far the most used pandas’ operation.
Let’s do a quick comparison when we use read_csv in pandas vs modin.
pandas%%timeimport pandas pandas_csv_data = pandas.
csv")—————————————————————–CPU times: user 26.
3 s, sys: 3.
14 s, total: 29.
4sWall time: 29.
5 sModin%%timemodin_csv_data = pd.
csv")—————————————————————–CPU times: user 76.
7 ms, sys: 5.
08 ms, total: 81.
8 msWall time: 7.
6 sWith Modin, read_csv performs up to 4x faster on a 4-core machine just by changing the import statementdf.
groupbypandas groupby is extremely well written and is extremely fast.
But even with that, modin outperforms pandas.
pandas%%timeimport pandas_ = pandas_csv_data.
sum()—————————————————————–CPU times: user 5.
98 s, sys: 1.
77 s, total: 7.
75 sWall time: 7.
74 smodin%%timeresults = modin_csv_data.
sum()—————————————————————–CPU times: user 3.
18 s, sys: 42.
2 ms, total: 3.
23 sWall time: 7.
3 sThe groupbyDefault to pandas implementationIn case one wants to use a pandas API that hasn’t been yet implemented or optimized, one can actually default to pandas.
This makes the system usable for notebooks that use operations not yet implemented in Modin even though there will be a drop in performance as it will use the pandas API now.
When defaulting to pandas, you will see a warning:dot_df = df.
T)It returns a distributed Modin DataFrame once the computation is complete.
DataFrameConclusionsModin is still in its early stages and appears to be a very promising add on to the pandas.
Modin handles all the partitioning and shuffling for the user so that we can essentially focus on our workflows.
Modin’s basic goal is to enable the users to use the same tools on small data as well as big data without having to worry about changing the API to suit different data sizes.