Minimally Sufficient Pandas Cheat Sheet

Minimally Sufficient Pandas Cheat SheetTed PetrouBlockedUnblockFollowFollowingJan 31This article summarizes the very detailed guide presented in Minimally Sufficient Pandas.

What is Minimally Sufficient Pandas?It is a small subset of the library that is sufficient to accomplish nearly everything that it has to offer.

It allows you to focus on doing data analysis and not the syntaxHow will Minimally Sufficient Pandas benefit you?All common data analysis tasks will use the same syntaxFewer commands will be easier to commit to memoryYour code will be easier to understand by others and by youIt will be easier to put Pandas code in productionIt reduces the chance of landing on a Pandas bug.

Specific GuidanceSelecting a Single Column of DataUse the brackets and not dot notation to select a single column of data because the dot notation cannot column names with spaces, those that collide with DataFrame methods and when the column name is a variable.

>>> df[‘colname’] # do this>>> df.

colname # not thatThe deprecated ix indexerThe ix indexer is ambiguous and confusing (and now deprecated) as it allows selection by both label and integer location.

Every trace of ix should be removed and replaced with the explicitlocor iloc indexers.

Selection with at and iatThe at and iat indexers give a small increase in performance when selecting a single DataFrame cell.

Use NumPy arrays if your application relies on performance for selecting a single cell of data and not at or iat.

read_csv vs read_tableThe only difference between these two functions is the default delimiter.

Use read_csv for all cases as read_table is deprecated.

isna vs isnull and notna vs notnullisna is an alias of isnull and notna is an alias of notnull.

Use isna and notna as they end with ‘na’ like the other missing value methods fillna and dropna.

Arithmetic and Comparison Operators vs MethodsUse the operators( +, *, >, <=, etc.

) and not their corresponding methods ( add, mul, gt, le, etc…) in all cases except when absolutely necessary such as when you need to change the direction of the alignment.

Builtin Python functions vs Pandas methods with the same nameUse the Pandas method over any built-in Python function with the same name.

Standardizing groupby aggregationThere are a few different syntaxes available to do a groupby aggregation.

Use df.

groupby('grouping column').

agg({'aggregating column': 'aggregating function'}) as it can handle more complex cases.

Handling a MultiIndexA DataFrame with a MultiIndex offers little benefit over one with a single-level index.

I advise against using them.

Instead, flatten them after a call to groupbyby renaming columns and resetting the index.

The equivalency of groupby aggregation and pivot_tableA groupby aggregation and a pivot_table produce the same exact data with a different shape.

Use gropuby when you want to continue an analysis and pivot_table when you want to compare groups.

The equivalency of pivot_table and pd.

crosstabThe pivot_table method and the crosstab function are very similar.

Only use crosstab when finding the relative frequency.

pivot vs pivot_tableThe pivot method pivots data without aggregating.

It is possible to duplicate its functionality with pivot_table by selecting an aggregation function.

Consider using only pivot_table and not pivot.

The similarity between melt and stackBoth the melt and stack methods reshape the data in a very similar manner.

Use melt over stack because it allows you to rename columns and it avoids a MultiIndex.

The similarity between pivot and unstackBoth pivot and unstack work reshape data similarly but from above, pivot_table can handle all cases that pivot can, so I suggest using it over both of the others.

Best of the DataFrame APIThe above examples are the most common areas of Pandas where multiple options are available to its users.

There are many other attributes and methods that are not discussed.

Below, I provide a categorized list of the minimum amount of DataFrame attributes and methods that can accomplish nearly all of your data analysis tasks.

It reduces the number from over 240 to less than 80.

AttributescolumnsdtypesindexshapeTvaluesAggregation MethodsThese result in a single value for each columnallanycountdescribeidxmaxidxminmaxmeanmedianminmodenuniquesumstdvarNon-Aggretaion Statistical Methodsabsclipcorrcovcummaxcummincumprodcumsumdiffnlargestnsmallestpct_changeprodquantilerankroundSubset SelectionheadilocloctailMissing Value HandlingdropnafillnainterpolateisnanotnaGroupingexpandinggroupbypivot_tableresamplerollingJoining DataappendmergeOtherasfreqastypecopydropdrop_duplicatesequalsisinmeltplotrenamereplacereset_indexsampleselect_dtypesshiftsort_indexsort_valuesto_csvto_jsonto_sqlFunctionspd.










. More details

Leave a Reply