Descriptive Statistics Fundamentals For Data Science Aspirants

Applied statistics fundamentals for Data Science aspirants

Pramod Chandrayan, Jun 29

A few lines I wrote, dedicated to data engineers:
Data, data everywhere, consumers are now more aware.
So mine the data with utmost care, and serve them everywhere.
Yes, that is how valuable it is to treat and process data with the required precision, so that you can serve your customers/consumers effectively and responsibly.
In Applied statistics we try to ensure the data is reliable and clean to help us build a model which works well to find the hidden patterns.
To analyze a given set of input data, the field of applied statistics broadly makes use of:
1. Descriptive Statistics
2. Inferential Statistics

Today we will cover descriptive statistics in detail, along with a few basics of inferential statistics.
We will cover inferential statistics in more detail in the next part of this Applied Statistics in Data Science series.

Descriptive Statistics:
It enables a simpler and more meaningful interpretation of data, and helps you visualize data in a better way (in the form of simple graphs).

As per Investopedia, descriptive statistics are brief descriptive coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of a population.
Descriptive statistics are broken down into measures of central tendency and measures of variability (spread).
Measures of central tendency include the mean, median, and mode, while measures of variability include the standard deviation, variance, the minimum and maximum variables, and the kurtosis and skewness.
Key Caveats Of Descriptive Statistics:
As the definition above shows, descriptive statistics are simply a way to describe our data.
However, they do not allow us to draw conclusions beyond the data we have analyzed (that part is handled by inferential statistics).

The key mechanism a descriptive statistic employs to summarize and describe our data sets is finding:
Central tendency in a given data set
Spread of the data (variability of data)

A: Measure Of Central Tendency
It’s a way of finding/describing the central position of a frequency distribution within the given data set.
A measure of central tendency is a single value, that attempts to describe a set of data by identifying the central position within that set of data.
In applied statistics, a central tendency (or measure of central tendency) is a central or typical value for a probability distribution.
It may also be called a center or location of the distribution.
Colloquially, measures of central tendency are often called averages.
How Do We Measure Central Tendency?
To measure central tendency in a given data set, we use the 3 M’s:
Mean
Median
Mode

Let’s get into the details of each of these quickly.

Measuring Through Mean (Arithmetic Mean):
We have all used this in our school/college days, and it is the measure we are most familiar with.

As per wiki: In statistics, the arithmetic mean, or simply the mean or average when the context is clear, is the sum of a collection of numbers divided by the count of numbers in the collection.
The collection is often a set of results of an experiment or an observational study, or frequently a set of results from a survey.
The mean (or average) is the most popular and well known measure of central tendency.
It can be used with both discrete and continuous data, although its use is most often with continuous data.
The mean as you all know is equal to the sum of all the values in the data set divided by the number of values in the data set.
So, if we have n values in a data set and they have values x1, x2, …, xn, the sample mean, usually denoted by $\bar{x}$ (pronounced x bar), is:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

This formula is usually written in a slightly different manner using the Greek capital letter $\Sigma$, pronounced “sigma”, which means “sum of…”:

$$\bar{x} = \frac{\sum x}{n}$$

Here x bar represents the sample mean; it is important to understand that we are talking about the sample mean and not the population mean.
A sample is a small set of data carved out of a population (which is a huge collection of data).

To acknowledge that we are calculating the population mean and not the sample mean, we use the Greek lower case letter “mu”, denoted as µ:

$$\mu = \frac{\sum x}{n}$$

One key property of the mean is that it includes every value in your data set as part of the calculation.
In addition, the mean is the only measure of central tendency where the sum of the deviations of each value from the mean is always zero.
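As a minimal sketch of the two properties above, the following Python snippet computes a mean and checks that the deviations from it sum to zero. The data values and the helper name `sample_mean` are assumptions made for illustration only.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


# Illustrative data (an assumption for this sketch, not from a real study).
data = [11, 13, 20, 22, 30, 34, 45]

x_bar = sample_mean(data)

# Deviation of each value from the mean. These always sum to zero
# (up to floating-point rounding when the mean is not exactly representable).
deviations = [x - x_bar for x in data]
total_deviation = sum(deviations)

print(x_bar, total_deviation)
```

Because every value enters the sum, changing any single value changes the mean, which is also why the mean is sensitive to outliers.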
Measuring Through Median:
Median Definition: The median is the middle number in a sorted list of numbers.
To determine the median value in a sequence of numbers, the numbers must first be arranged in value order from lowest to highest.
If there is an odd amount of numbers, the median value is the number that is in the middle, with the same amount of numbers below and above.
If there is an even amount of numbers in the list, the middle pair must be determined, added together and divided by two to find the median value.
The median can be used to determine an approximate average, or mean.
The median is sometimes used as opposed to the mean when there are outliers in the sequence that might skew the average of the values.
The median of a sequence can be less affected by outliers than the mean.
The median and the mode are the only measures of central tendency that can be used for ordinal data, in which values are ranked relative to each other but are not measured absolutely.
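The odd/even rule described above can be sketched in a few lines of Python. The function name `median` is an illustrative choice, and the two lists are the ones used in this article’s worked example.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)  # the values must be put in order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: the average of the middle pair.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_list = [30, 13, 20, 34, 11, 22, 45]
even_list = [3, 13, 2, 34, 11, 26, 47, 17]

print(median(odd_list))   # 22
print(median(even_list))  # 15.0
```

In practice the standard library’s `statistics.median` does the same job; the hand-written version is shown only to make the rule explicit.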
Example:
Case 1: When there is a middle value that separates the data set into 2 equal subsets.
List: 30, 13, 20, 34, 11, 22, 45
Arrange the values in ascending order, so the list becomes: 11, 13, 20, 22, 30, 34, 45
Here the median is 22, which divides the values into two equal halves (3 values on each side).

Case 2: To find the median value in a list with an even amount of numbers, first arrange the numbers in order from lowest to highest.
List: 3, 13, 2, 34, 11, 26, 47, 17
Arranged in order, the list becomes: 2, 3, 11, 13, 17, 26, 34, 47
The median is the average of the two numbers in the middle, 13 and 17:
13 + 17 = 30
30 / 2 = 15
Fifteen is the median value in this list of numbers.

Measuring Through Mode:
Mode: The value that occurs most frequently in a given data set.
To determine the mode, you might again order the scores as shown above, and then count each one.
The most frequently occurring value is the mode.
If X is a discrete random variable, the mode is the value x (i.e., X = x) at which the probability mass function takes its maximum value.
In other words, it is the value that is most likely to be sampled.
For example, the mode of the sample
List1: 1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17
is 6.

Given the list of data
List2: 1, 1, 2, 4, 4
the mode is not unique (both 1 and 4 occur twice). Such a data set may be said to be bimodal, while a set with more than two modes may be described as multimodal.
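To make the unimodal and bimodal cases concrete, here is a small Python sketch that returns every value tied for the highest count. The helper name `modes` is an assumption for this illustration; the two lists are List1 and List2 from above.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


list1 = [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]
list2 = [1, 1, 2, 4, 4]

print(modes(list1))  # [6]
print(modes(list2))  # [1, 4]
```

Returning a list rather than a single value keeps the bimodal and multimodal cases visible instead of silently picking one mode.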
Normally, the mode is used for categorical data where we wish to know which is the most common category, as illustrated below:
[Figure: most common category in a categorical data set]

When To Use What In Descriptive Statistics To Measure Central Tendency?
Here is a summary of the best measure of central tendency for each type of variable.

Type of variable: best measure of central tendency
Nominal: Mode
Ordinal: Median
Interval/Ratio (not skewed): Mean
Interval/Ratio (skewed): Median

The Case Of Skewed Distribution:
Sometimes data is not normally distributed.
It is important to test whether our data is normally distributed, because normality is a common assumption underlying many statistical analyses.
When you have a normally distributed sample, you can use either the mean or the median as your measure of central tendency.
In fact, in any symmetrical distribution with a single peak, the mean, median and mode are equal.
However, in this situation, the mean is widely preferred as the best measure of central tendency because it is the measure that includes all the values in the data set for its calculation.
[Figure: a right-skewed distribution (source)]
In the figure above, you can observe that there is a long tail on the right side, so the distribution of the data is not symmetric.
We can see that the mean (10.1) is being dragged in the direction of the skew.
In these situations, the median is generally considered to be the best representative of the central location of the data.
Remember:The more skewed the distribution, the greater the difference between the median and mean, and the greater emphasis should be placed on using the median as opposed to the mean.
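A quick way to see this effect is to compare the mean and median of a small right-skewed data set. The income figures below are hypothetical values invented for this sketch; only the standard-library `statistics` module is used.

```python
import statistics

# Hypothetical monthly incomes (in thousands); the last value creates a long right tail.
incomes = [30, 32, 35, 36, 38, 40, 250]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(f"mean   = {mean_income:.2f}")  # dragged toward the tail by the single large value
print(f"median = {median_income}")    # stays with the bulk of the data
```

Here six of the seven values lie well below the mean, so the median is the more representative summary of a typical income.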
B: Spread Of The Data(Variability Of Data)Measures of spread describe how similar or varied the set of observed values are for a particular variable (data item).
A measure of spread, sometimes also called a measure of dispersion, is used to describe the variability in a sample or population.
It is usually used in conjunction with a measure of central tendency, such as the mean or median, to provide an overall description of a set of data.
Measures of spread fall into 3 important categories:
Range
Quartiles and the interquartile range
Variance and standard deviation

Let’s quickly cover all of these.

Range:
The range is the difference between the highest and lowest scores in a data set and is the simplest measure of spread.
Range = maximum value - minimum value
Example: 22, 45, 56, 32, 10, 9, 54
In the above data set, Max = 56 and Min = 9.
So range = Max - Min = 56 - 9 = 47
The range is not a very popular measure of spread, but it does set the boundaries of the scores.
This can be useful if you are measuring a variable that has either a critical low or high threshold or both, that should not be crossed.
In statistical analysis, the range is represented by a single number.
In financial data, this range most commonly refers to the highest and lowest price value for a given day or other time period.
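The range calculation from the example above takes one line in Python. The helper name `value_range` is an illustrative assumption; the scores are the ones from the example.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)


scores = [22, 45, 56, 32, 10, 9, 54]

print(value_range(scores))  # 47
```

Since only the two extreme values matter, a single unusual observation can change the range completely, which is one reason it is rarely used on its own.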
Quartiles & Interquartile Range:
The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles.
Let’s first understand what quartiles are, and then go deeper into the IQR concept through some examples.

Quartiles:
Quartiles divide an ordered dataset into four equal parts, and refer to the values of the point between the quarters.
A dataset may also be divided into Quintiles (five equal parts) or deciles (ten equal parts).
A quartile is a type of quantile.
The first quartile (Q1) is defined as the middle number between the smallest number and the median of the data set.
The second quartile (Q2) is the median of the data.
The third quartile (Q3) is the middle value between the median and the highest value of the data set.
Example 1:
List = [25, 33, 14, 31, 54, 76, 57, 87, 81]
Quartiles are defined on ordered data, so sort the list first: 14, 25, 31, 33, 54, 57, 76, 81, 87
Let’s find the median first:
Median = 54, which separates the data set into two equal halves.
So Q2 = 54 (median of the whole list)
Q1 = 31 (median of the lower half, values 1 to 5)
Q3 = 76 (median of the upper half, values 5 to 9)
For the above example:
IQR (interquartile range) = Q3 - Q1 = 76 - 31 = 45

Example 2: (Source)
Data set in a plain-text box plot:

                            +−−−−−+−+
              * |−−−−−−−−−−−|     | |−−−−−−−−−−−|
                            +−−−−−+−+

+−−−+−−−+−−−+−−−+−−−+−−−+−−−+−−−+−−−+−−−+−−−+−−−+   number line
0   1   2   3   4   5   6   7   8   9   10  11  12

For the data set in this box plot:
lower (first) quartile Q1 = 7
median (second quartile) Q2 = 8.5
upper (third) quartile Q3 = 9
interquartile range, IQR = Q3 - Q1 = 2
lower 1.5*IQR whisker = Q1 - 1.5 * IQR = 7 - 3 = 4
upper 1.5*IQR whisker = Q3 + 1.5 * IQR = 9 + 3 = 12

The interquartile range is often used to find outliers in data.
Outliers here are defined as observations that fall below Q1 − 1.5 IQR or above Q3 + 1.5 IQR.
In a boxplot such as the one discussed above, the highest and lowest occurring values within this limit are indicated by the whiskers of the box, and any outliers are shown as individual points.
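The quartile and outlier rules above can be sketched with Python’s standard `statistics.quantiles` function (available from Python 3.8). The helper name `iqr_outliers` and the extra value 200 are assumptions for illustration. Note that several quartile conventions exist; `method="inclusive"` is chosen here because, for the Example 1 list, it agrees with taking the median of each half with the overall median included.

```python
import statistics


def iqr_outliers(values):
    """Return (q1, q3, iqr, outliers) using the 1.5 * IQR rule."""
    # quantiles() sorts the data internally and returns the three cut points.
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, q3, iqr, outliers


data = [25, 33, 14, 31, 54, 76, 57, 87, 81]  # the list from Example 1

q1, q3, iqr, outliers = iqr_outliers(data)
print(q1, q3, iqr, outliers)

# Appending a hypothetical extreme value to show the rule flagging it.
print(iqr_outliers(data + [200])[3])
```

For the Example 1 list this gives Q1 = 31, Q3 = 76 and an IQR of 45 with no outliers, while the appended value 200 falls above the upper fence and is flagged.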
Quartiles are a useful measure of spread because they are much less affected by outliers or a skewed data set than the equivalent measures of mean and standard deviation.
For this reason, quartiles are often reported along with the median as the best choice of measure of spread and central tendency, respectively, when dealing with skewed data and/or data with outliers.

Variance & Standard Deviation:
Variance is one of the most popular ways to measure the spread of a given data set around the mean.
So let’s first try to understand what variance actually means.

Variance definition:
Variance (represented mathematically as σ²) is a measurement of the spread between numbers in a data set.
It measures how far each number in the set is from the mean(central tendency) and is calculated by taking the differences between each number in the set and the mean, squaring the differences (to make them positive) and dividing the sum of the squares by the number of values in the set.
In datasets with a small data spread, all values are very close to the mean, resulting in a small variance and standard deviation.
Where a dataset is more dispersed, values are spread further away from the mean, leading to a larger variance and standard deviation.
The smaller the variance and standard deviation, the more the mean value is indicative of the whole dataset.
Therefore, if all values of a dataset are the same, the standard deviation and variance are zero.
Variance Formula:
The population variance σ² (pronounced sigma squared) of a discrete set of numbers is expressed by the following formula:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

where:
Xi represents the ith unit, starting from the first observation to the last
μ represents the population mean
N represents the number of units in the population

Remember that the above formula refers to the entire population of a data set.
For a sample, we calculate the variance as given below.
The variance of a sample s² (pronounced s squared) is expressed by a slightly different formula:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

where:
xi represents the ith unit, starting from the first observation to the last
x̅ represents the sample mean
n represents the number of units in the sample

Standard Deviation:
The standard deviation is the square root of the variance.
The standard deviation for a population is represented by σ, and the standard deviation for a sample is represented by s.
A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data.
In addition to measuring the variability of a population, the standard deviation is also used to measure confidence in statistical conclusions.
For example, the margin of error in polling data is determined by calculating the expected standard deviation in the results if the same poll were to be conducted multiple times.
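To show how the population and sample versions differ in practice, here is a minimal Python sketch that implements both by hand. The function names and the data values are assumptions for illustration; the sample version divides by n - 1, which is the usual convention for the sample variance.

```python
import math


def population_variance(values):
    """Mean of the squared deviations from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

sigma_squared = population_variance(data)
sigma = math.sqrt(sigma_squared)   # population standard deviation
s_squared = sample_variance(data)
s = math.sqrt(s_squared)           # sample standard deviation

print(sigma_squared, sigma)
print(s_squared, s)
```

The sample variance is always a little larger than the population variance computed on the same numbers, and the gap shrinks as the number of observations grows.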
Understanding Variance & Standard Deviation By Example (Src):
Let’s understand population variance σ² and standard deviation σ with the example given below.

Dataset A:
A = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
So the population mean (μ) of A: (4 + 5 + 5 + 5 + 6 + 6 + 6 + 6 + 7 + 7 + 7 + 8) / 12
Mean (μ) = 6
Calculate the deviation of the individual values from the mean (6, calculated above) by subtracting the mean from each value in the dataset (Xi − μ):
= -2, -1, -1, -1, 0, 0, 0, 0, 1, 1, 1, 2
Square each individual deviation value:
= 4, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 4
Calculate the mean of the squared deviation values:
= (4 + 1 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 1 + 1 + 4) / 12
Variance σ² = 1.17
Calculate the square root of the variance:
Standard deviation σ = 1.08

Dataset B:
B = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]
So the population mean (μ) of B: (1 + 2 + 3 + 4 + 5 + 6 + 6 + 7 + 8 + 9 + 10 + 11) / 12
Mean (μ) = 6
Calculate the deviation of the individual values from the mean (6, calculated above) by subtracting the mean from each value in the dataset:
= -5, -4, -3, -2, -1, 0, 0, 1, 2, 3, 4, 5
Square each individual deviation value:
= 25, 16, 9, 4, 1, 0, 0, 1, 4, 9, 16, 25
Calculate the mean of the squared deviation values:
= (25 + 16 + 9 + 4 + 1 + 0 + 0 + 1 + 4 + 9 + 16 + 25) / 12
Variance σ² = 9.17
Calculate the square root of the variance:
Standard deviation σ = 3.03

Observation: The larger variance and standard deviation in Dataset B further demonstrate that Dataset B is more dispersed than Dataset A.
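The hand calculation for the two datasets can be cross-checked with Python’s standard `statistics` module, as a sketch. `pvariance` and `pstdev` are the population versions (dividing by N), which matches the steps used in the worked example; the variable names are illustrative.

```python
import statistics

dataset_a = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]
dataset_b = [1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11]

for name, data in (("A", dataset_a), ("B", dataset_b)):
    mu = statistics.mean(data)
    variance = statistics.pvariance(data)  # population variance (divide by N)
    std_dev = statistics.pstdev(data)      # population standard deviation
    print(f"Dataset {name}: mean={mu}, variance={variance:.2f}, std dev={std_dev:.2f}")
```

Both datasets share the same mean of 6, so the mean alone cannot tell them apart; the variance and standard deviation are what reveal that B is more spread out.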
Variance Vs Standard Deviation:
I found one interesting infographic, given below, which explains the concept beautifully:
[Infographic: variance vs standard deviation]

What’s Next?
Here we learned about descriptive statistics: how to effectively describe/summarize a given set of data (population/sample) at the initial stage of exploratory data analysis (EDA), using statistical concepts, before we start building our data models.
We also saw that data reliability is of utmost importance if we really want to build effective machine learning models.
Descriptive statistics only help us build our observations around the data provided; if we want to make intelligent predictions, we can’t rely on them alone.
For this, applied statistics has a concept called inferential statistics.

Inferential statistics:
Inferential statistics are concerned with making inferences based on relations found in the sample, to relations in the population.
Inferential statistics help us decide, for example, whether the differences between groups that we see in our data are strong enough to provide support for our hypothesis that group differences exist in general, in the entire population.
We will cover this in detail in the next part of this Applied Statistics for Data Science aspirants series: what it is, and how it helps us measure and establish data reliability to make intelligent predictions about a data population/sample.
Summary:If you are looking to be an effective data science engineer please make sure you clearly understand the fundamentals of applied statistics.
Applied statistics is the foundational stepping stone that will pave a successful career path for you.
When you start understanding data sets confidently, you will be able to measure data skews, find missing values, measure data variability, which in turn will help you clean up your data to make it reliable & useful for data modeling.
Leaving you all with this food for thought:
“Being a data science engineer is more about designing great processes, built around data which is more reliable and trustworthy.”

Keep reading, keep supporting.
Thanks for being there.