# Effective Data Visualization for other Humans

What insights does this table provide into the correlations between prestige and occupations?We can narrow our focus and use data visualization to help us understand the complexity in data by deconstructing our collection into smaller pieces.

Choosing the Right Visual MattersCredit: Harry Quan — UnsplashIrrespective of the accuracy of its content, choosing the wrong chart can convey misleading and potentially harmful messages to your audience.

We have the responsibility to uphold the integrity of the messages we send.

Additionally, using the wrong visual can misclassify otherwise accurate information.

We can minimize these errors by classifying our data under the correct data visualization.

Bar Charts for Comparing Numeric QuantitiesA practical approach to compare quantities between categories is to use a bar chart.

In our data set, we can compare the number of occupation types among our collection of occupations.

The x-axis represents the different categories, and the y-axis indicates the unit of measurement.

Bar charts are easy to interpret and allows the end user to quickly compare the number or frequency of different discrete categories of data.

Discrete data refers to something we count, rather than measure.

For example, we have six wc occupation types.

We don’t have 6.

3333 wc occupation types.

Here’s the code used to generate the bar graph in Python3.

from matplotlib import pyplot as plttypeCounts = df['type'].

value_counts()typeCounts.

plot(kind='bar', title='Type Counts')plt.

xlabel('Type')plt.

ylabel('Number of Occupation Type')plt.

show()Histograms for Visualizing Distributions for Continuous DataContinuous data refers to measurements along a scale.

In contrast to Discrete data, Continuous data, or quantitative data, is represented by values that are measured rather than something we count.

For our data set, percentage values for income, education, and prestige are recorded to quantify a measurement on a category in the data sample.

Using a histogram, we can effectively show our viewers which percentage group yielded the highest frequency to graduate high school.

Using histograms, we can effectively show the distribution or shape of our set of continuous data.

Our distribution of high school graduates shows more occupations were more common in the lower percentage groups of students who graduated from high school.

Here’s the code used to generate the histogram in Python3.

from matplotlib import pyplot as pltdf['education'].

plot.

xlabel('Percentage')plt.

ylabel('Frequency')plt.

show()Pie Charts for Comparing Relative QuantitiesPie charts can be useful when you are trying to compare parts of a whole.

Compared to bar charts, pie charts can be used to define a visual representation of the full context of a particular category.

We can use this to compare relative quantities by separating these values into categories.

When used effectively, pie charts can effectively display a proportion to the whole set of data.

Comparing the frequencies between occupation types from the bar chart, we can generate a more effective visualization for proportions using a pie chart.

In comparison to the bar chart, proportions are easier to identify in a pie chartScatter Plots for Comparing Quantitative ValuesWe can classify quantitive data as something you can categorize.

In other words, this is information that contains unique qualities that can you can identify and categorize your data set into visible parts.

Scatter plots are compelling when you want to visualize and explore relationships between numeric features.

Additionally, scatter plots are useful in identifying data points that are outliers, values that are significantly outside the range of your observed data set.

Let’s take a look at the occupations data by visualizing how prestige compares to income levels.

Notice the top left and bottom right are mostly empty.

Higher income generated higher prestige evaluated, and low-income occupations made less prestige.

With this method, you can effectively demonstrate how two variables are correlated using a scatter plot.

You can adequately visualize the extent of a relationship and highlight to a viewer how one variable may affect another.

Here is the corresponding code to generate the scatter plot:from matplotlib import pyplot as pltoccupations = df[['prestige', 'income']]occupations.

plot(kind='scatter', title='Occupations', x='prestige', y='income')plt.

xlabel('Prestige')plt.

ylabel('Avg Income %')plt.

show()Line Charts for Changes in ValuesWe often want to evaluate our data sets over some time.

Line Charts are a useful visualization technique to how values change along with a series.

For this example, we’ll analyze US macroeconomic variables between the years 1947–1962 from the Longley Dataset from the Datasets Package.

import statsmodels.

api as smdf = sm.

datasets.

longley.

datadf [TOTEMP] [GNPDEFL] [GNP] [UNEMP] [ARMED] [POP] [YEAR]0 60323.

0 83.

0 234289.

0 2356.

0 1590.

0 107608.

0 1947.

01 61122.

0 88.

5 259426.

0 2325.

0 1456.

0 108632.

0 1948.

02 60171.

0 88.

2 258054.

0 3682.

0 1616.

0 109773.

0 1949.

03 61187.

0 89.

5 284599.

0 3351.

0 1650.

0 110929.

0 1950.

04 63221.

0 96.

2 328975.

0 2099.

0 3099.

0 112075.

0 1951.

05 63639.

0 98.

1 346999.

0 1932.

0 3594.

0 113270.

0 1952.

06 64989.

0 99.

0 365385.

0 1870.

0 3547.

0 115094.

0 1953.

07 63761.

0 100.

0 363112.

0 3578.

0 3350.

0 116219.

0 1954.

08 66019.

0 101.

2 397469.

0 2904.

0 3048.

0 117388.

0 1955.

09 67857.

0 104.

6 419180.

0 2822.

0 2857.

0 118734.

0 1956.

010 68169.

0 108.

4 442769.

0 2936.

0 2798.

0 120445.

0 1957.

011 66513.

0 110.

8 444546.

0 4681.

0 2637.

0 121950.

0 1958.

012 68655.

0 112.

6 482704.

0 3813.

0 2552.

0 123366.

0 1959.

013 69564.

0 114.

2 502601.

0 3931.

0 2514.

0 125368.

0 1960.

014 69331.

0 115.

7 518173.

0 4806.

0 2572.

0 127852.

0 1961.

015 70551.

0 116.

9 554894.

0 4007.

0 2827.

0 130081.

0 1962.

0Reviewing the types of data available here, we have a set of 16 observations containing the following six features:TOTEMP: Total EmploymentGNPDEFL: GNP deflatorGNP: GNP (Gross National Product)UNEMP: Number of unemployedARMED: Size of armed forcesPOP: PopulationYEAR: Year (1947–1962)Let’s create our visualization of the population change for our sample data.

With this approach, you can quickly convey to your audience trends or patterns within your data.

From this chart, we can see the population sizes increase from year to year and shows an increasing trend.

Here is the corresponding code to generate the line graph:df.

plot(title='Population', x='YEAR', y='POP')plt.

xlabel('Year')plt.

ylabel('Population')plt.

show()Box Plots for Visualizing a DistributionAlso known as the box and whiskers plot, you can utilize this approach for data sets that only have one variable.

This data referred to as Univariate data.

Let’s look at the unemployment data set from the Longley dataset as plain-text and a box plot.

Transforming patternless text into a perceptible visual form.

[Unemployment]0 2356.

01 2325.

02 3682.

03 3351.

04 2099.

05 1932.

06 1870.

07 3578.

08 2904.

09 2822.

010 2936.

011 4681.

012 3813.

013 3931.

014 4806.

015 4007.

0Box plots distribute data across quartiles and show where the data between the 25th and 75th percentile lie.

Within the box of the plot, we can visualize the range of data within the 50th percentile of values along with the median, shown in the green line.

Here is the corresponding code for the box plot:import pandas as pdimport statsmodels.

api as smfrom matplotlib import pyplot as pltdff = sm.

datasets.

longley.

DataFrame({'Unemployment': dff['UNEMP']})print(df) # tableplt.

figure()df['Unemployment'].

plot(kind='box',title='Unemployment')Comparing and Scaling Your VisualsTo get the most value from your data, you want to maximize the value from the information you have by comparing relationships among your data.

Let’s utilize some of the techniques we explored so far to optimize the effectiveness of our data visualizations.

Comparing Variable DistributionsWhat if we used our box plots to compare distributions among two or more variables?.Having multiple variables are referred to as bivariate (2) variables or multivariate (2+) variables.

Let’s compare the distribution of unemployment, total employment, and GNP data from our sample.

[Unempl.

] [Total Emp] [GNP]0 2356.

0 60323.

0 234289.

01 2325.

0 61122.

0 259426.

02 3682.

0 60171.

0 258054.

03 3351.

0 61187.

0 284599.

04 2099.

0 63221.

0 328975.

05 1932.

0 63639.

0 346999.

06 1870.

0 64989.

0 365385.

07 3578.

0 63761.

0 363112.

08 2904.

0 66019.

0 397469.

09 2822.

0 67857.

0 419180.

010 2936.

0 68169.

0 442769.

011 4681.

0 66513.

0 444546.

012 3813.

0 68655.

0 482704.

013 3931.

0 69564.

0 502601.

014 4806.

0 69331.

0 518173.

015 4007.

0 70551.

0 554894.

Here are the corresponding box plots.

Notice anything that could be misleading with our box plots?No good!.This data is not scaled correctly, giving misleading results.

What does this plot tell you?.I’m not sure either.

This plot is no good.

Our data not scaled in proportion to each other.

Scaled proportionallyWhen comparing data, the variables in observation must have the same measure and scale to each other.

Be sure to scale your data proportionally.

This data is not particularly useful, but it is to demonstrate data on the same scale.

Effectively Highlight a TrendSometimes with data sets, you’ll want to emphasize particular trends through your data visualizations.

Earlier, we went over how to scatter plot your data.

With many data points, using a trendline is an effective technique to make a trend clear within your data.

Here we’ll construct a scatter plot along with a trendline for Total Employment over our time period.

TrendlineHere we make it apparent to our end user a colinearity exists between Total Employment and our time period.

Colinear relationships are variables whose value increases and decreases by another.

ConclusionsCredit: Frank Mckenna — UnsplashWe are perpetually innovating solutions to simplify the complexities within our data and uncover patterns that are not observable otherwise.

To manage data effectively, we’ve explored ways we deconstruct data into manageable pieces that the human brain can effectively digest.

For software developers, analysts, product managers, and researchers alike, how we communicate our findings to an audience is what ultimately drives value into the work we do for the greater whole.

Thank you.

If you wish to take a deeper dive into mathematics and data visualization, I recommend taking Microsoft’s EdX DAT256: Essential Mathematics for Machine Learning.

ReferencesLongley dataset — statsmodels v0.

10.

0rc2+14.

gceea76ce6 documentationThe Longley dataset contains various US macroeconomic variables that are known to be highly collinear.

It has been used…www.

statsmodels.

orgCLPS 1091 S01: Research Methods And DesignWe all are forgetful.

I’m sure we all have had that moment where we went into a room and forgot why we went there in…blogs.

brown.

eduThe Datasets Package — Statsmodels v0.

10.

0rc2+8.

g551671aa1 documentationThe Rdatasets project gives access to the datasets available in R’s core datasets package and many other common R…www.

statsmodels.

orgCourse | DAT256x — Module04 | edXEssential Math for Machine LearningWhat Great Data Visualization Looks Like: 12 Complex Concepts Made EasyPerhaps today was a rough day.

Maybe you woke up late.