Optimizing a Sort & Match Method in Pandas

Such a wide array of individuals provides many unique perspectives and various specialized subskills which will expose you to new approaches you’d never imagine.

3.

FASTER CODE:Usually, the best method of solving a problem is the result of multiple perspectives.

With more minds solving a problem, each is able to apply their specialized knowledge and skills to a part or process.

The resulting solution is leaner, cleaner and faster.

Our collaboration, for example, led to a 200X increase in speed.

SECTION I.

The Problem at HandThe other week, someone presented a tricky pandas question at Open Python NYC.

Their dataframe contained, among others, three columns named last_name, first_name, and attr.

Each last_name would define a family and the family member with first_name == 1 would be considered the parent.

The goal was to distribute the value of attr from the parent to all of its children thereby ensuring that all family members share the same attr.

The initial (left) and desired (right) dataframes are shown:SECTION II.

That’s Easy to Solve (Hold my Beer)Here’s how each of the authors initially approached the solution.

METHOD 3: Pd.

Merge Solution (Sebastian)Sebastian: My initial intuition was to avoid explicitly iterating over the dataframe.

I knew that pandas and numpy built-in functions were implemented in a lower-level language and highly optimized, which is why I opted for a method that uses pd.

merge for the task at hand.

def method3(df): """ This method completely does without for-loops and only uses pandas-native functions.

A temporary DataFrame 'parent_attrs' is built that maps every last_name to a parent attr.

Then, this temporary DataFrame is merged with the original DataFrame 'df' on the column 'last_name' """ parent_attrs = df[df['first_name'] == '1'][['last_name','attr']] parent_attrs.

columns = ['last_name','inputted_attr'] df = df.

merge(parent_attrs, on='last_name') return dfMETHOD 4: Iterate With a Dictionary (Yaakov)Yaakov: I chose to create a dictionary with last_names as keys and attr as values by iterating over respective columns as arrays.

I then applied this dictionary to the dataframe using a lambda function.

My reasoning for creating a dictionary was to avoid memory usage and reduce indexing.

Additionally, looping to create the dictionary provides robust options for logic.

def method4(df): """ We return to using for loops, but now instead of using iterrows() we only loop over the relevant column: last_name.

Further, instead of building a list of input attrs that we then add to the DataFrame, we directly apply the mapping last_name -> parent_attrs[last_name] """ # Will return a dict of index & attr for each # Initialize your dict parent_attrs = {} i=0 # Use the attrs to speed up process, treat as an array for l_name in df['last_name'].

values: attr_now = df['attr'].

values[i] first_name_now = df['first_name'].

values[i] # Add last name to dict if first_name_now == '1': parent_attrs[l_name] = attr_now i+=1 # Finally, apply your dictionary to the df!.df['inputted_attr'] = df['last_name'].

apply(lambda x: parent_attrs[x]) return dfMETHODS 1 & 2:We chose some additional approaches too, though our hypothesis was that these would be less efficient than our chosen approaches.

Of note, these additional methods are less pythonic than our chosen methods — further motivation for innovation.

Here’s our first 4 methods:RESULTSAnd the most efficient method is…Method 3!.Sebastian’s approach to stick to the built in functions within pandas was a great choice and is ~80-100X faster than method 1.

At 10⁴ rows, Method 1 takes 12.

6 seconds where Method 3 takes 0.

126 seconds.

That’s a 100X or 10,000% reduction in computing time!Of note, Method 4 does not converge with method 3 as the number of rows or columns in the dataframe increases.

While it might with increased data complexity within the dataframe, this is not relevant since the data itself can be cleaned or reduced.

Sebastion won this one, fair and straight!SECTION III.

Good is not Good EnoughThough we had found a fast solution, we wanted something even faster.

We put our heads together and came up with additional methods for solving.

METHOD 5: Iterate with a Dictionary Without IndexingAfter analyzing Yaakov’s approach in Method 4, Sebastian identified that indexing the attr column with i was unnecessary if the columns were all referenced together.

Modifying produced Method 5.

def method5(df): """ This method is similar to method4 but instead of using a running index 'i', we loop over all three relevant columns ['attr', 'first_name', 'last_name'] simultaneously """ # Will return a dict of index & attr for each parent_attrs = {} # Use the attrs to speed up process, treat as an array for attr_now, first_name_now, l_name in df[['attr','first_name','last_name']].

values: # Add last name to dict if first_name_now == '1': parent_attrs[l_name] = attr_now # Finally, apply your dictionary to the df!.df['inputted_attr'] = df['last_name'].

apply(lambda x: parent_attrs[x]) return dfMETHOD 6: Create a Dictionary from a Pd.

filter, then ApplyAfter analyzing all approaches, the authors speculated that instead of iterating to create the dictionary, it could be produced with a pd.

filter (like the one used in Method 3) then converted to a dictionary using the pandas method.

Then, using pd.

apply (from Method 4/5), return matches for all values for the dataframe.

def method6(df): """ We combine the best of method3 and method5.

First a dictionary with the mapping rules is built (without a loop) and then applied to the DataFrame via 'apply'.

""" # Split the name string into first and last name parent_attrs = df[df['first_name'] == '1'][['last_name','attr']] parent_attrs = parent_attrs.

set_index('last_name')['attr'].

to_dict() df['inputted_attr'] = df['last_name'].

apply(lambda x: parent_attrs[x]) return dfRESULTSDid we beat our previous times?… YES WE DID!.The combination of our approaches resulted in the most optimal solution.

Method 6 is 1.

24–1.

7X faster than Method 3 (118-240X faster than Method 1).

At 10⁴ rows, Method 3 takes 0.

126 seconds where Method 6 takes 0.

102 seconds.

That’s a 1.

24X or 124% reduction in computing time!Method 1 takes 12.

6 seconds where Method 6 takes 0.

102 seconds.

That’s a 123X or 12,300% reduction in computing time!The best method is really the best of both of our methods, combined.

Conclusion:The best method for this unique sorting and matching problem in python is to create a dictionary from a pandas filter then use .

apply() to reference the dictionary.

Both authors agree that they wouldn’t have reached this method independently.

Attending the NYC Python Meetup gave us the unique opportunity to share code, ideas, have a unique discussion, and produce really fast code.

If you’re in NYC, come join us on Wednesdays at the NYC Python Meetup at the Microsoft building in Times Square.

More info is available here: https://www.

meetup.

com/nycpython/.

. More details

Leave a Reply