Intro to Statistics: Scatter Plots

M. Emmanuel, Mar 23

We continue the series of articles on the Udacity Intro to Statistics course with a new article about scatter plots.

In our first article, we reviewed a basic linear scenario where all data points lay on a perfectly straight line.

Lesson 3: Scatter Plots

Lesson 3 is a short lesson which covers linearity and scatter plots.

Lesson data sets show linear and non-linear scenarios.

Although the examples and scenarios are really basic, the included source code shows how to use Python for basic data analysis and how to draw scatter plots.

A linear relationship has a direct implication for a simple two-variable data set.

In this particular example, the consequence of linearity is that our sample data has a fixed dollar amount per square foot.

We are presented with the following data set and asked whether there is a fixed price per ft²:

Quiz: Is there a fixed dollar amount per square foot? (Image from Udacity Intro to Statistics)

We could draw the scatter plot, but in this case, as mentioned in the course, there is no need: it is easy to spot that the same size (1400 ft²) appears with two different prices ($98,000 and $91,000).

With two prices for the same size, the relationship cannot be linear.

In real-world data, it might not be so easy to spot this.
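On a larger data set, the same check (does any size appear with more than one price?) could be automated. The following is a minimal sketch, not part of the course material: the helper name `conflicting_sizes` is hypothetical, and the second price for 1400 ft² is taken as the 91,000 USD quoted in the text.

```python
from collections import defaultdict

# Quiz data: sizes in ft2 and prices in USD.
# Assumption: the second 1400 ft2 house costs 91,000 USD, as quoted in the text.
size = [1400, 2400, 1800, 1900, 1400, 1100]
cost = [98000, 168000, 126000, 133000, 91000, 77000]


def conflicting_sizes(sizes, costs):
    """Return {size: sorted prices} for every size listed with more than one price."""
    prices_by_size = defaultdict(set)
    for s, c in zip(sizes, costs):
        prices_by_size[s].add(c)
    return {s: sorted(p) for s, p in prices_by_size.items() if len(p) > 1}


conflicts = conflicting_sizes(size, cost)
print(conflicts)  # {1400: [91000, 98000]}
```

A non-empty result means the data cannot follow a single fixed price per square foot. An empty result proves nothing by itself, so the per-pair ratio still has to be checked.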

The following source code computes the price per square foot for each data pair, showing that one value differs from the others.

    # -*- coding: utf-8 -*-
    import matplotlib
    matplotlib.use('Agg')  # non-interactive backend: render straight to a file
    import matplotlib.pyplot as plt

    size = [1400, 2400, 1800, 1900, 1400, 1100]
    cost = [98000, 168000, 126000, 133000, 91000, 77000]

    # Price per square foot for each (size, cost) pair
    ftusd = [c / s for s, c in zip(size, cost)]
    print(ftusd)

    plt.scatter(size, cost)
    plt.xlabel('Size in ft2')
    plt.ylabel('Price in USD')
    plt.savefig('lesson3_2.png')

Scatter plot for non-linear data set

Changing the size of the second-to-last entry from 1400 to 1300 makes the data set linear, with a fixed price of 70 USD per square foot.

Linear data set

A quick visual inspection is enough to determine that we are now dealing with a linear data set.
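The visual impression can also be confirmed numerically. The sketch below assumes the corrected data set (1300 ft² at 91,000 USD) and uses a hypothetical helper, `fixed_price_per_ft2`. It only tests the stricter case used in the lesson, a line through the origin where price divided by size is constant, not linearity in general.

```python
import math

# Corrected data set: the second-to-last size is 1300 ft2 instead of 1400 ft2.
size = [1400, 2400, 1800, 1900, 1300, 1100]
cost = [98000, 168000, 126000, 133000, 91000, 77000]


def fixed_price_per_ft2(sizes, costs, rel_tol=1e-9):
    """Return the common price per ft2, or None if the ratio is not constant."""
    ratios = [c / s for s, c in zip(sizes, costs)]
    first = ratios[0]
    # Compare with a tolerance so floating-point rounding does not cause false negatives.
    if all(math.isclose(r, first, rel_tol=rel_tol) for r in ratios):
        return first
    return None


price = fixed_price_per_ft2(size, cost)
print(price)  # 70.0
```

With the original size of 1400 ft² for the 91,000 USD entry, the same function returns `None`, because that pair works out to 65 USD per square foot.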

Outliers

The lesson includes other data sets covering scatter plots.

Finally, we find an example of what are commonly known in statistics as outliers.

An outlier is an isolated value, or a small set of isolated values, that lies far from the mean, far from the line representing a linear relationship or, more generally, far from the overall pattern of the data set.

“An outlier is an observation that lies outside the overall pattern of a distribution” (Moore and McCabe 1999)

Quiz: Is this linear? (Image from Udacity Intro to Statistics)

    # -*- coding: utf-8 -*-
    import matplotlib
    matplotlib.use('Agg')  # non-interactive backend: render straight to a file
    import matplotlib.pyplot as plt

    size = [1700, 2100, 1900, 1300, 1600, 2200]
    cost = [53000, 44000, 59000, 82000, 50000, 68000]

    plt.scatter(size, cost)
    plt.xlabel('Size in ft2')
    plt.ylabel('Price in USD')
    plt.savefig('scatter3_1.png')

Non-linear relationship; in this context the outliers are clearly identified.

We can see that the data set includes two data pairs which are far from the linear pattern.
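One simple way to find those two pairs without a plot is to compare each pair's price per square foot with the median ratio. This is only a sketch and not the course's method: the helper `flag_outliers` and the 20% threshold (`max_rel_dev`) are arbitrary assumptions chosen for this small data set.

```python
from statistics import median

size = [1700, 2100, 1900, 1300, 1600, 2200]
cost = [53000, 44000, 59000, 82000, 50000, 68000]


def flag_outliers(sizes, costs, max_rel_dev=0.2):
    """Return the (size, cost) pairs whose price per ft2 deviates from the
    median ratio by more than max_rel_dev (a hypothetical 20% threshold)."""
    ratios = [c / s for s, c in zip(sizes, costs)]
    typical = median(ratios)  # the median is barely affected by a few extreme values
    return [
        (s, c)
        for s, c, r in zip(sizes, costs, ratios)
        if abs(r - typical) / typical > max_rel_dev
    ]


outliers = flag_outliers(size, cost)
print(outliers)  # [(2100, 44000), (1300, 82000)]
```

Four of the six pairs sit at roughly 31 USD per square foot, while the two flagged pairs come out near 21 and 63 USD per square foot. The median is used instead of the mean so that the outliers themselves do not shift the reference value.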

Summary

Although the examples covered here are basic ones, scatter plots are extremely useful tools to inspect data visually.

Scatter plots are easy to draw and can reveal important information about our data.

They highlight outliers and linearity, and they reveal patterns.

This early analysis of the data is important for deciding what to do next, as it helps the analysis focus on the relevant areas.
