Hybrid plot scales and custom violinboxplot implementation

Madalina Ciortan, Jan 5

In a nutshell, this post addresses the following 2 questions:

1. How can we create comprehensive visualizations of data distributions that have far outliers? Typically we use both linear and log scales (the latter for capturing outliers), but here we will investigate the possibility of creating hybrid axes, with an arbitrary mixture of scale types applied on chosen intervals.

2. How can we combine the best of boxplots, violinplots and dynamic scales that follow the data distribution? We will propose a custom implementation of a violinboxplot offering a wide range of customization parameters, which govern for instance the rendering of outliers, custom annotations for modes and counts, and a split axis between linear and log scales based on an arbitrary percentile.

It handles both arrays of data and dataframes grouped by a list of columns.

All examples in this post have been implemented in Python/matplotlib and are available on GitHub.

Hybrid matplotlib axis scales

Let's start by choosing a function (sin) to generate the data set we will use to demonstrate the hybrid scales concept.

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.arange(-80, 80, 0.1)
    y = np.sin(x)
    plt.title('Linear scale plot of a sinusoid')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.plot(x, y);

If we treat this data set as a black box, a data scientist may want, for any number of reasons, to vary the resolution of the plot by using different scales on different intervals, with minimal effort.

For instance, they may want to visualize:

- 0.5 <= y <= 1 using a linear scale
- 0.1 <= y <= 0.5 using a log scale
- -1 <= y <= 0.1 using a linear scale

The first, naive solution is to create 3 different plots with the chosen axis scales on the chosen intervals.
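As a point of comparison, the naive solution could look roughly like the sketch below: three independent panels, each with its own scale and y limits. This is only an illustration; the interval bounds are the ones used by the code later in this post (0.01 as the lower log bound), and the `panels` list and the `Agg` backend line are assumptions added so the snippet runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(-80, 80, 0.1)
y = np.sin(x)

# (scale, y limits) for each interval of interest
panels = [("linear", (0.5, 1)), ("log", (0.01, 0.5)), ("linear", (-1, 0.01))]

# One separate panel per interval: each has its own x axis and title
fig, axes = plt.subplots(len(panels), 1, figsize=(8, 9))
for ax, (scale, limits) in zip(axes, panels):
    ax.plot(x, y)
    ax.set_yscale(scale)
    ax.set_ylim(limits)
    ax.set_title(f"{scale} scale, y in [{limits[0]}, {limits[1]}]")
fig.tight_layout()
```

Each panel repeats the x axis and the title, which is exactly the visual overhead a unified plot avoids.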

In this post we will investigate matplotlib's ability to render the original plot under different scales, thus providing a unified visualization.

There are 2 approaches we will present in this post:

- Using an axis divider
- Using GridSpec

Axis divider

    from mpl_toolkits.axes_grid1 import make_axes_locatable

Matplotlib's make_axes_locatable function allows us to append a new axis to a given axis.

In the example below, a log axis is created from the original linear axis.

By setting arbitrary y limits we control which part of the plot is rendered, and we can create the impression of plot continuity.

The sharex parameter makes both axes share the same x axis and prevents the x tick labels from being re-rendered.

    plt.title('Split plot in 2 parts: linear: [0.5, 1] and log: [0.01, 0.5]')
    linearAxis = plt.gca()
    linearAxis.plot(x, y)
    linearAxis.set_ylim((0.5, 1))

    divider = make_axes_locatable(linearAxis)
    logAxis = divider.append_axes("bottom", size=1, pad=0.02, sharex=linearAxis)
    logAxis.plot(x, y)
    logAxis.set_yscale('log')
    logAxis.set_ylim((0.01, 0.5));

We can use append_axes on a given input axis in 4 potential locations (left, right, top, bottom).
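To illustrate a location other than top or bottom, here is a minimal sketch that appends an axis on the right and splits the x axis instead of the y axis. The data (`t`, `signal`) and the chosen limits are made up for the illustration; `sharey` plays the role that `sharex` plays in the vertical case.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.axes_grid1 import make_axes_locatable

# Hypothetical data spanning several orders of magnitude on x
t = np.linspace(0.1, 1000, 5000)
signal = np.sqrt(t)

fig, linear_ax = plt.subplots()
linear_ax.plot(t, signal)
linear_ax.set_xlim((0.1, 10))  # linear part of the x axis

divider = make_axes_locatable(linear_ax)
# "right" places the new axis beside the original one; sharey keeps a common y axis
log_ax = divider.append_axes("right", size="100%", pad=0.02, sharey=linear_ax)
log_ax.plot(t, signal)
log_ax.set_xscale("log")
log_ax.set_xlim((10, 1000))  # log part of the x axis
log_ax.tick_params(labelleft=False)  # avoid drawing the y tick labels twice
```

A horizontal split of this kind is the layout needed for horizontal boxplots and violinplots, where the outliers sit far to the right.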

The code below illustrates chaining 2 axes, one at the top and one at the bottom.

    logAxis = plt.gca()
    logAxis.plot(x, y)
    logAxis.set_yscale('log')
    logAxis.set_ylim((0.01, 0.5))

    divider = make_axes_locatable(logAxis)
    linearAxis = divider.append_axes("top", size=1, pad=0.02, sharex=logAxis)
    linearAxis.plot(x, y)
    linearAxis.set_ylim((0.5, 1))
    linearAxis.set_xscale('linear')
    linearAxis.set_title('Plot split in 3 scales: linear: [0.5, 1], log: [0.01, 0.5], linear: [-1, 0.01]');

    linearAxis1 = divider.append_axes("bottom", size=1, pad=0.02, sharex=logAxis)
    linearAxis1.plot(x, y)
    linearAxis1.set_yscale('linear')
    linearAxis1.set_ylim((-1, 0.01));

GridSpec implementation

Another option is to use matplotlib's GridSpec, which provides more flexibility in sizing and arranging the components.

We can define upfront the number of subplots, their relative sizes (height_ratios) and the distance between subplots (hspace).

Once the independent axes have been created, we can set the scales and the desired limits.

    import matplotlib.gridspec as grd

    gs = grd.GridSpec(3, 1, wspace=0.01, hspace=0.05, height_ratios=[0.33, 0.33, 0.33])
    ax1 = plt.subplot(gs[0])
    ax2 = plt.subplot(gs[1])
    ax3 = plt.subplot(gs[2])

    # Only the bottom axis keeps its x ticks
    ax1.set_xticks([])
    ax2.set_xticks([])

    ax1.plot(x, y)
    ax1.set_yscale('linear')
    ax1.set_ylim((0.5, 1))

    ax2.plot(x, y)
    ax2.set_yscale('log')
    ax2.set_ylim((0.01, 0.5))

    ax3.plot(x, y)
    ax3.set_yscale('linear')
    ax3.set_ylim((-1, 0.01));

Custom violinboxplot

Let's start by generating a few data distributions reflecting multiple scenarios:

- unimodal data following a Gaussian distribution
- a combination of Gaussian data with outliers
- a dataset with multiple (7) such distributions, illustrating a comparative visualization of input distributions
- a dataframe to be grouped by one or multiple columns, to illustrate the comparative data distribution

    data1 = [np.round(np.random.normal(10, 0.4, 50), 2)]
    data1SharpEnd = [[e for e in data1[0] if e > 9.9]]
    data1Spread = [
        np.concatenate([
            np.round(np.random.normal(10, 0.2, 1000), 2),
            np.round(np.random.normal(80, 0.3, 5), 2)
        ])
    ]
    data2 = [
        np.concatenate([
            np.round(np.random.normal(10, std/10, 1000), 2),
            np.round(np.random.normal(80, std, np.random.randint(0, 24) * std), 2)
        ])
        for std in range(1, 8)  # 7 distributions, one per label in labels7
    ]
    labels7 = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

Based on one of the existing datasets, we can define a dataframe:

    import pandas as pd

    df = pd.DataFrame()
    df['values'] = data1Spread[0]
    df['col1'] = np.random.choice(['A', 'B'], df.shape[0])
    df['col2'] = np.random.choice(['C', 'D'], df.shape[0])

In order to better understand the underlying data distribution, let's create a plotting function which leverages both boxplots and violinplots:

    def plotDistributions(inputData, title):
        """
        This method plots inputData with:
        - matplotlib boxplot
        - matplotlib violinplot
        """
        globalMax = np.max(np.concatenate(inputData))
        globalMin = np.min(np.concatenate(inputData))

        plt.figure(figsize=(14, 4))
        plt.suptitle(title)

        # Left panel: boxplot, with the global min/max marked in red
        plt.subplot(121)
        plt.grid()
        plt.title('Matplotlib boxplot')
        plt.boxplot(inputData, vert=False);
        plt.axvline(x=globalMax, c='red', label='Global max', alpha=0.5)
        plt.axvline(x=globalMin, c='red', label='Global min', alpha=0.5)
        plt.legend()

        # Right panel: violinplot of the same data
        plt.subplot(122)
        plt.grid()
        plt.title('Matplotlib violinplot')
        plt.violinplot(inputData, vert=False, showmeans=False, showmedians=True, showextrema=True);
        plt.axvline(x=globalMax, c='red', label='Global max', alpha=0.5)
        plt.axvline(x=globalMin, c='red', label='Global min', alpha=0.5)
        plt.legend()

We can visualize the dataframe using seaborn:

    import seaborn as sns

    sns.violinplot(x='values', y='col1', data=df)
    plt.figure()
    sns.violinplot(x='values', y='col2', data=df)

However, seaborn expects y to be a single column, which is used as the group-by key to aggregate the results.

If we want to aggregate based on a combination of multiple features, we have to do it prior to calling the plotting function.
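One way to do this aggregation beforehand is to build a single combined key column, as in the sketch below. The dataframe is regenerated here with made-up values so the snippet is self-contained; the `group` column name and the `_` separator are arbitrary choices for the illustration, not part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "values": rng.normal(10, 0.2, 200),
    "col1": rng.choice(["A", "B"], 200),
    "col2": rng.choice(["C", "D"], 200),
})

# Combine several columns into one grouping key before plotting
df["group"] = df["col1"] + "_" + df["col2"]

# One array of values per combined group (groupby sorts the keys)
grouped = {name: part["values"].to_numpy() for name, part in df.groupby("group")}
labels = list(grouped)

fig, ax = plt.subplots()
ax.violinplot(list(grouped.values()), showmedians=True)
ax.set_xticks(list(range(1, len(labels) + 1)))
ax.set_xticklabels(labels)
```

The same combined column can also be handed to seaborn, for example as y='group' in sns.violinplot.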

What drawbacks can we identify in the above plots?

- It would be nice to have the combined resolution of boxplots and violin plots in one graph. Seaborn offers, through the inner parameter, a way to incorporate a boxplot, but its customisation possibilities are limited.
- As shown in the second graphic, if we are dealing with a distribution with far outliers, the overall visualization loses the details at the extremes. What if we could use the examples discussed in the first section to create a customized, unified view with arbitrary scales on target intervals?

Some other points to consider are:

- How can we enrich plots with custom annotations indicating, for instance, the number of points in each dataset and other arbitrary measures, such as the mode?
- Could we provide a hyperparameter that removes from the visualisation altogether the points we consider outliers?

If we start with this last point, we can come up with a method that removes all points farther away from the mean than a given number of standard deviations (by default 3).

    def removeOutliers(data, thresholdStd=3):
        """
        This method returns all values which are within thresholdStd
        standard deviations of the mean
        """
        noOutliers = []
        mean = np.mean(data)
        std = np.std(data)
        if std == 0:
            # All values are identical: nothing to remove
            return data
        for y in data:
            z_score = (y - mean) / std
            if np.abs(z_score) <= thresholdStd:
                noOutliers.append(y)
        return noOutliers

Implementation of violinboxplot

Find the complete code on GitHub.

We will only present the results on the data sets introduced previously.

This custom implementation:

- (optionally) renders in green the mode of each distribution
- (optionally) renders in black, at the end of each line, the count of each distribution
- can render or hide outliers
- accepts an optional logpercentile parameter specifying a percentile value (0.9) below which the rendering is done on a linear scale; values above this percentile use a log scale
- takes as input a list of arrays or a dataframe, which can be grouped by a list of columns (y)

[Figures: results using dataframes and aggregations on one or multiple columns]

Future work

One future improvement is to automatically detect the modes of all rendered distributions and to estimate the PDF by using the code presented in the previous article (https://github.com/ciortanmadalina/modality_tests/blob/master/kernel_density.ipynb).

This can give an estimation of some of the zones of interest where, for instance, a log scale would be a good choice.
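To give a feel for what such a mode detection could look like, here is a rough sketch that uses only NumPy. It is not the code from the notebook linked above: the helper names `gaussian_kde_pdf` and `find_modes`, the bandwidth rule (Silverman's rule of thumb) and the `min_height` threshold are all assumptions made for this illustration.

```python
import numpy as np

def gaussian_kde_pdf(samples, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    if bandwidth is None:
        # Silverman's rule of thumb (an assumption; other bandwidth rules work too)
        bandwidth = 1.06 * samples.std(ddof=1) * len(samples) ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (len(samples) * bandwidth * np.sqrt(2 * np.pi))

def find_modes(grid, pdf, min_height=0.01):
    """Return grid positions of local maxima at least min_height * max(pdf) high."""
    is_peak = (pdf[1:-1] > pdf[:-2]) & (pdf[1:-1] >= pdf[2:])
    is_tall = pdf[1:-1] >= min_height * pdf.max()
    return grid[1:-1][is_peak & is_tall]

# A main cluster around 10 plus a small group of far outliers around 80
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 0.2, 1000), rng.normal(80, 0.3, 50)])

grid = np.linspace(data.min() - 1, data.max() + 1, 2000)
pdf = gaussian_kde_pdf(data, grid)
modes = find_modes(grid, pdf)
```

The detected modes could then be used to decide where a linear interval should end and a log interval should begin, for instance by placing the split between two well separated modes.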