The Unreasonable Effectiveness of Method Chaining in Pandas

By Adiamaan Keerthi, Jan 27

How method chaining improves the readability of code, how to write custom pipes with lambda functions for maximum flexibility, and some code formatting tips to wrap up.

Image by Texturex

Introduction

Method chaining is a programming style in which multiple methods are called in sequence, with each call performing an action on the same object and returning it.

It eliminates the cognitive burden of naming variables at each intermediate step.

Fluent interface, a way of designing object-oriented APIs, relies on method cascading (aka method chaining).

This is akin to piping in Unix systems.

Method chaining substantially increases the readability of the code.

If you don’t believe me, let’s ask Jack and Jill.

Let’s try to tell the story of Jack and Jill using both nested function calls and method chaining.

[Code comparison: Nested Calls vs. Method Chaining]

One obvious advantage of method chaining is that it reads top-down, with arguments placed next to their function, unlike nested calls, where matching each function call to its arguments is demanding.
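As a rough illustration of the Jack and Jill contrast, here is a minimal sketch. The helper functions and the Story class are hypothetical names invented for this example; they are not part of pandas or of the original post.

```python
# Hypothetical helpers: each takes the story so far and returns a longer one.
def went_up(story, place):
    return f"{story} went up the {place}"


def to_fetch(story, item):
    return f"{story} to fetch {item}"


def fell_down(story):
    return f"{story}, fell down"


def broke(story, what):
    return f"{story} and broke his {what}"


# Nested calls: read from the inside out, with arguments far from their function.
nested = broke(fell_down(to_fetch(went_up("Jack", "hill"), "a pail of water")), "crown")


class Story:
    """Hypothetical fluent wrapper: every method returns self so calls can chain."""

    def __init__(self, text):
        self.text = text

    def went_up(self, place):
        self.text = went_up(self.text, place)
        return self

    def to_fetch(self, item):
        self.text = to_fetch(self.text, item)
        return self

    def fell_down(self):
        self.text = fell_down(self.text)
        return self

    def broke(self, what):
        self.text = broke(self.text, what)
        return self


# Method chaining: read top-down, each argument next to its method.
chained = (
    Story("Jack")
    .went_up("hill")
    .to_fetch("a pail of water")
    .fell_down()
    .broke("crown")
    .text
)

print(chained)
```

Both versions build the same sentence; only the reading order differs.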

Pandas provides several methods suited to method chaining: assign adds a new column to the data frame, rename renames a column, query filters a data frame, and so on.

Let’s look at the pedagogical wine data set.

It contains the chemical composition of 178 wines.

Wine dataset

The code below starts by renaming color intensity to its shorter form, ci.

It then creates a new column, color filter, based on the values of hue and ci.

It then keeps the wines that have an alcohol content of more than 14 and pass the color filter.

In the end, it sorts the data frame based on alcohol content and displays the columns that we are interested in.

If the same were to be done without method chaining, a new data frame would have to be created at each step.
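The steps described above can be sketched as follows. This is not the original code: the data is a made-up stand-in for the wine data set, and the rule behind the color filter (hue above 1.0 and ci above 5.0) is an assumption, since the text does not state it. The second half shows the same work done with one intermediate data frame per step.

```python
import pandas as pd

# Tiny stand-in for the wine data; the values are made up for illustration.
wine = pd.DataFrame(
    {
        "alcohol": [14.2, 13.2, 14.4, 14.8, 12.9],
        "hue": [1.04, 1.05, 0.86, 1.25, 0.70],
        "color_intensity": [5.6, 4.4, 7.8, 5.2, 9.0],
    }
)

# Chained version: one expression, no intermediate names.
chained = (
    wine
    .rename(columns={"color_intensity": "ci"})
    # Assumed rule for the color filter: hue above 1.0 and ci above 5.0.
    .assign(color_filter=lambda df: ((df["hue"] > 1.0) & (df["ci"] > 5.0)).astype(int))
    .query("alcohol > 14 and color_filter == 1")
    .sort_values("alcohol")
    .loc[:, ["alcohol", "ci", "hue"]]
)

# Step-by-step version: a new data frame is named at every step.
renamed = wine.rename(columns={"color_intensity": "ci"})
flagged = renamed.copy()
flagged["color_filter"] = ((flagged["hue"] > 1.0) & (flagged["ci"] > 5.0)).astype(int)
filtered = flagged[(flagged["alcohol"] > 14) & (flagged["color_filter"] == 1)]
ordered = filtered.sort_values("alcohol")
stepwise = ordered[["alcohol", "ci", "hue"]]

print(chained)
```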

The boon and bane of Method Chaining:

Image by Sally Dasouki

One advantage that R has over Python is its tidyverse collection of packages, with their rich set of chainable functions.

Coupled with magrittr, you can often find a function that does what you want within a pipe.

Pandas, on the other hand, doesn’t have a comprehensive list of methods to use in method chaining.

But to make up for it, pandas introduced the pipe method in version 0.16.2.

Pipe enables user-defined functions in method chains.
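A minimal sketch of what that looks like, assuming a hypothetical helper named add_ratio and made-up data. DataFrame.pipe passes the data frame as the first argument to the function, followed by any extra arguments.

```python
import pandas as pd


def add_ratio(df, numerator, denominator, name):
    """Hypothetical user-defined step: takes a data frame, returns a new one."""
    return df.assign(**{name: df[numerator] / df[denominator]})


# Made-up values for illustration.
wine = pd.DataFrame({"alcohol": [14.0, 13.0], "hue": [1.0, 0.5]})

result = (
    wine
    .pipe(add_ratio, "alcohol", "hue", name="alcohol_per_hue")
    .query("alcohol_per_hue > 20")
)

print(result)
```

Because add_ratio returns a data frame, the chain can continue with query or any other method after it.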

With the introduction of pipe, you can write almost anything in a method chain, which raises the question: how much chaining is too much?

This is an entirely subjective question and must be left to the discretion of the programmer.

Most people find the sweet spot to be around 7 or 8 methods in a single chain.

I don’t use any hard limits on the number of methods in a chain.

Instead, I try to represent a single coherent thought in a single method chain.

Some staunch critics of method chaining accuse it of improving readability at the cost of making debugging tricky, which is true.

Imagine a chain that’s ten methods long that you are debugging after a month.

The data frame structure or the column names have changed since then and now your chain starts throwing errors.

It’s now hard to step through the chain and see the changes each call makes to the data frame, although you can easily find which method call is breaking the code.

This needs to be addressed before starting to use long method chains in production or in notebooks.

Combining Pipe and lambda functions:

Image by Grofers

I frequently run into issues in method chaining whenever the shape of the data frame changes.

If you can track the shape along the chain, debugging becomes a lot easier.

Let’s define a custom pipe function.

The key to writing a pipe function is that it should take in a data frame and return a data frame.

Two things to note in this function are the fn argument, which can take a lambda function, and the call to display.

The lambda function lends flexibility, and the display call renders data frames and plots nicely in JupyterLab or a notebook.
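A function along these lines might look like the sketch below. The exact signature is an assumption: the text only names csnap, its fn argument, and the display call, so the default for fn and the msg parameter are illustrative. The fallback to print is added here so the sketch also runs outside Jupyter.

```python
import pandas as pd

try:
    from IPython.display import display  # rich rendering in Jupyter
except ImportError:
    display = print  # plain-text fallback outside notebooks


def csnap(df, fn=lambda d: d.shape, msg=None):
    """Show a snapshot of df (its shape by default) and return df unchanged."""
    if msg:
        print(msg)
    display(fn(df))
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# The helper hands back the very same data frame, so a chain can continue.
same = df.pipe(csnap, msg="Shape:")
```

Returning the input untouched is what makes it safe to drop this step anywhere in a chain.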

In this example, we start with 13 columns, assign increases the column count to 14, and the subsequent query reduces the rows to 22.

The pipe at the end prints 5 random rows from the data frame.

This can be easily changed to a head or tail function.

Since the argument is a lambda function, it gives innumerable possibilities.
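Here is a small sketch of that kind of shape tracking. The csnap helper is a hypothetical version that uses print so the block runs anywhere, and the data is made up, so the shapes differ from the 13 columns and 22 rows mentioned in the text.

```python
import pandas as pd


def csnap(df, fn=lambda d: d.shape, msg=None):
    """Hypothetical snapshot helper: print fn(df), then return df unchanged."""
    if msg:
        print(msg)
    print(fn(df))
    return df


# Made-up values for illustration.
wine = pd.DataFrame(
    {
        "alcohol": [14.2, 13.2, 14.4, 14.8, 12.9, 14.1],
        "hue": [1.04, 1.05, 0.86, 1.25, 0.70, 1.10],
        "ci": [5.6, 4.4, 7.8, 5.2, 9.0, 6.1],
    }
)

result = (
    wine
    .pipe(csnap, msg="start")          # prints (6, 3)
    .assign(color_filter=lambda d: (d["hue"] > 1.0).astype(int))
    .pipe(csnap, msg="after assign")   # prints (6, 4)
    .query("alcohol > 14 and color_filter == 1")
    .pipe(csnap, msg="after query")    # prints (3, 4)
    # Any lambda works here: a random sample, d.head(), d.tail(), d.dtypes, ...
    .pipe(csnap, fn=lambda d: d.sample(2, random_state=0), msg="random rows")
)
```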

When you start writing a chain, appending a pipe with the csnap function to the end helps you see the changes along the chain.

Once finished, you can remove the pipes or comment out just that line.

This is a naïve way of removing pipes.

Instead, if you are moving the code to production, you can use a logger object and write the output to an external file.

Example of logging to a file

Logging lets us leave the pipe statements in place: instead of removing them, change the logger level to INFO to suppress the debug output in production.
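One way this could look, as a sketch using the standard logging module. The helper name clog, the logger name, and the log file location are all hypothetical choices made for this example.

```python
import logging
import os
import tempfile

import pandas as pd

# Hypothetical log location; a temporary directory keeps the sketch self-contained.
log_path = os.path.join(tempfile.mkdtemp(), "chain.log")

logger = logging.getLogger("chain_demo")  # hypothetical logger name
handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.DEBUG)


def clog(df, fn=lambda d: d.shape, msg=""):
    """Hypothetical logging pipe: log fn(df) at DEBUG level, return df unchanged."""
    logger.debug("%s %s", msg, fn(df))
    return df


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

out = (
    df
    .pipe(clog, msg="start:")
    .query("a > 1")
    .pipe(clog, msg="after query:")
)

# In production, raise the level to INFO: the pipes stay, the debug lines stop.
logger.setLevel(logging.INFO)
out.pipe(clog, msg="not written:")

with open(log_path) as fh:
    log_lines = fh.read().splitlines()

print(log_lines)
```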

Let’s look at the other custom pipe functions.

setcols is used to set column names in a chain.

Usually, when we read data from an external source, the column names contain a mix of upper and lower case along with spaces and special characters.

These issues can be fixed like this:

Iris data set before and after column rename

Unlike the csnap function, the setcols function creates a copy of the data frame, which makes the call costly.

But this is necessary to make sure that we are not writing on the global copy of the data frame.
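A minimal sketch of such a helper, under stated assumptions: the signature of setcols, the clean function, and the sample column names are all illustrative rather than taken from the original code.

```python
import re

import pandas as pd


def setcols(df, fn=lambda cols: cols, cols=None):
    """Hypothetical sketch: return a copy of df with new column names.

    Pass an explicit list via cols, or a function fn that maps the old
    column index to the new names.
    """
    out = df.copy()  # work on a copy so the caller's data frame is untouched
    out.columns = list(cols) if cols is not None else list(fn(out.columns))
    return out


def clean(name):
    """Lower-case a name and collapse spaces and special characters to underscores."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


# Made-up frame with messy column names.
iris = pd.DataFrame({"Sepal Length (cm)": [5.1], "Petal.Width": [0.2]})

tidy = iris.pipe(setcols, fn=lambda cols: [clean(c) for c in cols])

print(list(tidy.columns))
```

The copy is what costs time and memory, and it is also what keeps the original column names intact on iris.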

Many pandas methods work similarly, with an inplace flag that decides whether a new data frame is returned or the top-level reference is reassigned.

Jeff Reback says: "There is no guarantee that an in-place operation is actually faster. Often, they are the same operation that works on a copy, but the top-level reference is reassigned."

Let's wrap up this section with one final example.

R has a versatile select function for selecting or deselecting columns in a wide data frame without listing them all out.

cfilter helps us achieve the same versatility using a lambda function.
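A sketch of how such a helper could work, assuming a signature in which fn returns a boolean mask and axis says whether the mask applies to rows or columns. Both the signature and the sample data are assumptions for illustration.

```python
import pandas as pd


def cfilter(df, fn, axis="rows"):
    """Hypothetical sketch: keep the rows or columns where fn(df) is True."""
    mask = fn(df)
    if axis == "rows":
        return df.loc[mask]
    if axis == "columns":
        return df.loc[:, mask]
    raise ValueError("axis must be 'rows' or 'columns'")


# Made-up values for illustration.
iris = pd.DataFrame(
    {
        "sepal_length": [5.1, 7.0],
        "sepal_width": [3.5, 3.2],
        "petal_length": [1.4, 4.7],
        "species": ["setosa", "versicolor"],
    }
)

# Select columns by a name pattern instead of listing them.
sepal_only = iris.pipe(cfilter, lambda d: d.columns.str.startswith("sepal"), axis="columns")

# Deselect columns by negating the mask.
no_width = iris.pipe(cfilter, lambda d: ~d.columns.str.contains("width"), axis="columns")

# The same helper filters rows.
long_petals = iris.pipe(cfilter, lambda d: d["petal_length"] > 2, axis="rows")

print(list(sepal_only.columns), list(no_width.columns))
```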

Code Formatting:

Keeping code formatting consistent is a headache when multiple people work on the same code base with different IDEs.

Method chaining further complicates this in the case of multi-line function calls.

Enter Black:

Image from Black

Taken from their GitHub description:

Black is the uncompromising Python code formatter.

By using it, you agree to cede control over minutiae of hand-formatting.

In return, Black gives you speed, determinism, and freedom from pycodestyle nagging about formatting.

You will save time and mental energy for more important matters.

Blackened code looks the same regardless of the project you’re reading.

Formatting becomes transparent after a while and you can focus on the content instead.

Black makes code review faster by producing the smallest diffs possible.

One major advantage of using Black is that it understands fluent interfaces and formats chained calls accordingly, unlike most IDEs’ default formatters.
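To give a feel for the layout, here is a hand-written sketch of a long chain before and after it is broken up one call per line. The second form approximates the style Black aims for with call chains; it was not generated by running Black, and the data is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 1, 2], "b": [6, 4, 5]})

# Before: the whole chain on one long line.
before = df.rename(columns={"a": "alpha", "b": "beta"}).assign(total=lambda d: d["alpha"] + d["beta"]).query("total > 5").sort_values("total")

# After: wrapped in parentheses, one call per line, split at the dots.
after = (
    df.rename(columns={"a": "alpha", "b": "beta"})
    .assign(total=lambda d: d["alpha"] + d["beta"])
    .query("total > 5")
    .sort_values("total")
)

print(after)
```

Only the layout changes; both expressions produce the same data frame.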

Setting up Black for PyCharm:

The first step is to pip install black.

In PyCharm, you can integrate it either as a File Watcher, so that formatting is done automatically every time you save the file, or as a pre-commit hook, so that the code is formatted every time a commit is made, which keeps formatting consistent across the project.

Detailed setup instructions can be found here.

Setting up Black for Jupyter Notebook/Lab:

The first step is to pip install nb_black.

The next step is to load the appropriate extension based on the environment.

In a notebook, use %load_ext nb_black; in JupyterLab, use %load_ext lab_black.

A quick demo of Black in action:

Before and after of Black formatting

Reference:

1) https://tomaugspurger.github.io/method-chaining

2) https://stackoverflow.com/questions/22532302/pandas-peculiar-performance-drop-for-inplace-rename-after-dropna/22533110#22533110

3) https://github.com/ambv/black
