What is the logic behind it? The idea revolves around a theorem which states that when you add together a large number of independent random variables, the distribution of their sum will be very close to normal.
As the height of a person is a random variable that depends on other random variables, such as the amount of nutrition a person consumes, the environment they live in, their genetics and so on, the sum of these variables ends up being very close to normally distributed.
This is known as the Central Limit Theorem.
This brings us to the core of the article: we understood from the section above that the sum of many independent random variables tends towards a normal distribution.
If we plot the normal distribution density function, its curve has the following characteristics (the bell-shaped curve above has a mean of 100 and a standard deviation of 1):
The mean is the center of the curve. This is the highest point of the curve, as most of the points are at the mean.
There are an equal number of points on each side of the curve. The center of the curve has the most points.
The total area under the curve is the total probability of all of the values that the variable can take. The total curve area is therefore 100%.
Approximately 68.2% of all of the points are within the range of -1 to 1 standard deviation.
Approximately 95.5% of all of the points are within the range of -2 to 2 standard deviations.
Approximately 99.7% of all of the points are within the range of -3 to 3 standard deviations.
This allows us to easily estimate how volatile a variable is and, given a confidence level, what its likely value is going to be.
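As a quick sanity check (my own illustrative sketch, not part of the original article), these standard-deviation bands can be verified empirically by sampling from a standard normal distribution with NumPy:

```python
import numpy as np

# Draw a large sample from a standard normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

# Fraction of points within 1, 2 and 3 standard deviations of the mean
within_1 = np.mean(np.abs(sample) <= 1)  # theory: ~0.682
within_2 = np.mean(np.abs(sample) <= 2)  # theory: ~0.954
within_3 = np.mean(np.abs(sample) <= 3)  # theory: ~0.997
```

With a million samples, the empirical fractions land very close to the theoretical values.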
Normal Probability Distribution Function
The probability density function of the normal distribution is:
f(x) = (1 / (σ √(2π))) · e^(−(x − μ)² / (2σ²))
where μ is the mean and σ is the standard deviation. The probability density function essentially describes the relative likelihood of a continuous random variable taking a given value.
The normal distribution is a bell-shaped curve where mean = mode = median.
If you plot the probability distribution curve using its computed probability density function then the area under the curve for a given range gives the probability of the target variable being in that range.
This probability distribution curve is based on a probability density function which is itself parameterised by the mean and the standard deviation of the variable.
We could use this probability distribution function to find the relative chance of a random variable taking a value within a range.
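To make the density function concrete, here is a minimal NumPy sketch (my own, assuming only the mean μ and standard deviation σ as parameters) that implements the formula and checks that the total area under the curve is approximately 1:

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    # f(x) = 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))
    coeff = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Riemann-sum approximation of the total area under the curve
x = np.linspace(-6.0, 6.0, 10_001)
dx = x[1] - x[0]
area = float(np.sum(normal_pdf(x)) * dx)  # should be very close to 1
```

The area coming out at 1 is exactly the "total probability is 100%" property discussed above.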
For instance, we could record the daily returns of a stock, group them into appropriate buckets and then find the probability of the stock making a 20–40% gain in the future.
The larger the standard deviation, the more the volatility in the sample.
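For instance, if we assume the returns really are normal, the probability of landing in a range follows in closed form from the mean and standard deviation alone. The sketch below uses the error-function form of the normal CDF; the 10% mean and 25% standard deviation are hypothetical numbers chosen purely for illustration:

```python
import math

def normal_cdf(x, mu, sigma):
    # P(X <= x) for X ~ Normal(mu, sigma), via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical yearly return: mean 10%, standard deviation 25%
p_gain_20_to_40 = normal_cdf(0.40, 0.10, 0.25) - normal_cdf(0.20, 0.10, 0.25)
```

This corresponds to the area under the probability distribution curve between 20% and 40%.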
How Do I Find Feature Distribution In Python?
The simplest method I follow is to load all of the features into a data frame and then use the Python Pandas library:
DataFrame.hist(bins=10) # Make a histogram of each column of the DataFrame
It shows us the probability distributions of all of the variables.
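A minimal sketch of this workflow (the column names and sample data are made up for illustration; note that DataFrame.hist itself needs matplotlib installed to draw the plots):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "height": rng.normal(170.0, 10.0, size=5_000),  # roughly normal
    "returns": rng.standard_t(3, size=5_000),       # heavier-tailed
})

# df.hist(bins=10) would draw one histogram per column (requires matplotlib).
# np.histogram gives the same binned counts without plotting:
counts, edges = np.histogram(df["height"], bins=10)
```

Inspecting the binned counts (or the drawn histograms) quickly reveals which features look bell-shaped and which do not.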
What Does It Mean For A Variable To Have A Normal Distribution?
Now, what's even more fascinating is that once you add a large number of random variables with differing distributions together, your new variable will end up having a normal distribution.
This is essentially known as the Central Limit Theorem.
Furthermore, sums and linear combinations of normally distributed variables are themselves normally distributed. For instance, if A and B are two independent variables with normal distributions then:
A + B is normally distributed
a·A + b (for constants a and b) is normally distributed
The product A × B, however, is generally not normally distributed.
As a result, it is extremely simple to forecast a variable and find the probability of it taking a value within a range, thanks to the well-known probability distribution function.
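Both facts can be checked with a short simulation (my own sketch): sums of many non-normal variables come out approximately normal, and the sum of two independent normals is again normal, with the variances adding:

```python
import numpy as np

rng = np.random.default_rng(1)

# Central Limit Theorem: sums of 30 independent uniforms look normal.
# Theory: mean = 30 * 0.5 = 15, std = sqrt(30 / 12) ~ 1.581
sums = rng.uniform(0.0, 1.0, size=(100_000, 30)).sum(axis=1)

# Sum of two independent normals: means and variances add.
# Theory: mean = 0 + 5 = 5, std = sqrt(3**2 + 4**2) = 5
a = rng.normal(0.0, 3.0, size=100_000)
b = rng.normal(5.0, 4.0, size=100_000)
s = a + b
```

Plotting a histogram of `sums` would show the familiar bell shape even though each underlying uniform variable is flat.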
What If The Sample Distribution Is Not Normal?
You can often transform the distribution of a feature closer to a normal distribution. I have used a number of techniques to make a feature approximately normally distributed:
1. Linear Transformation
Once we gather a sample for a variable, we can compute the Z-score by linearly transforming the sample:
Calculate the mean
Calculate the standard deviation
For each value x, compute Z using: Z = (x − μ) / σ
(Note that a linear transformation changes the scale and location of the distribution, but not its shape.)
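The steps above can be sketched in a few lines of NumPy (the sample values are made up):

```python
import numpy as np

def z_score(sample):
    # Z = (x - mean) / standard deviation, applied element-wise
    sample = np.asarray(sample, dtype=float)
    return (sample - sample.mean()) / sample.std()

z = z_score([12.0, 15.0, 9.0, 21.0, 18.0])
# After the transformation the sample has mean 0 and standard deviation 1
```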
2. Using Box-Cox Transformation
You can use Python's SciPy package to transform data towards a normal distribution:
scipy.stats.boxcox(x, lmbda=None, alpha=None)
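A minimal sketch (the exponential sample is made up; Box-Cox requires strictly positive data, and with lmbda=None the function also returns the lambda it fits by maximum likelihood):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
skewed = rng.exponential(scale=2.0, size=1_000)  # strictly positive, right-skewed

# With lmbda=None, boxcox fits lambda by maximum likelihood and
# returns both the transformed data and the fitted lambda
transformed, fitted_lambda = stats.boxcox(skewed)
```

After the transformation, the heavy right skew of the exponential sample is largely removed.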
3. Using Yeo-Johnson Transformation
Additionally, the Yeo-Johnson power transformer can be used. Python's scikit-learn provides the appropriate class:
sklearn.preprocessing.PowerTransformer(method='yeo-johnson', standardize=True, copy=True)
Problems With Normality
As the normal distribution is simple and well understood, it is also overused in predictive projects.
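A minimal sketch of the Yeo-Johnson transformer mentioned above (the shifted exponential sample is made up; unlike Box-Cox, Yeo-Johnson also accepts zero and negative values):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(3)
# Skewed data containing negative values, which Box-Cox could not handle
data = rng.exponential(scale=2.0, size=(1_000, 1)) - 1.0

pt = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pt.fit_transform(data)
# With standardize=True the output also has mean ~0 and std ~1
```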
Assuming normality has its own flaws.
For instance, we cannot assume that a stock price follows a normal distribution, as the price cannot be negative. Therefore the stock price potentially follows a log-normal distribution, ensuring it is never below zero. Returns, on the other hand, can be negative, so returns can plausibly follow a normal distribution.
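This is easy to see in a simulation (the drift and volatility numbers below are hypothetical): if daily log-returns are normal, the compounded price is log-normal and stays strictly positive even though individual returns go negative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical daily log-returns: normal with a small positive drift
log_returns = rng.normal(loc=0.0005, scale=0.02, size=252)

# Price path: start at 100 and compound the log-returns
prices = 100.0 * np.exp(np.cumsum(log_returns))
```

The exponential guarantees positivity, which is exactly why the log-normal, rather than the normal, is the usual model for prices.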
It is not wise to assume that a variable follows a normal distribution without any analysis. A variable can follow a Poisson, Student-t or Binomial distribution, for instance, and falsely assuming that a variable follows a normal distribution can lead to inaccurate results.
Summary
This article illustrated what the normal distribution is and why it is so important, in particular for data scientists and machine learning practitioners.
Hope it helps.