Explaining probability plots

In this article I would like to explain the concept of probability plots: what they are, how to implement them in Python, and how to interpret the results.
Eryk Lewinson · Apr 16
1. Introduction

You might have already encountered one type of probability plot, the Q-Q plot, while working with linear regression.
One of the assumptions we should check after fitting a regression model is whether the residuals follow a Normal (Gaussian) distribution. This can often be verified visually with a Q-Q plot such as the one presented below.
Example of a Q-Q plot

To fully understand probability plots, let's quickly go over a few definitions from probability theory and statistics:

probability density function (PDF): a function which allows us to calculate the probability of finding a random variable in any interval belonging to the sample space.
It is important to remember that the probability of a continuous random variable taking an exact value is equal to 0.
PDF of the Gaussian distribution

cumulative distribution function (CDF): a function which provides the probability of a random variable taking a value less than or equal to a given value x.
When we are dealing with continuous variables, the CDF is the area under the PDF in the range of minus infinity to x.
General formula for the CDF: F_X(x) = P(X <= x), where X is the random variable and x the point of evaluation

quantile: quoting Wikipedia, "cut points dividing the range of a probability distribution into continuous intervals with equal probabilities".

The following plot presents a sample drawn from the Standard Normal Distribution together with its PDF and CDF.
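These three definitions map directly onto `scipy.stats`; a minimal sketch for the standard normal (the variable names are mine, for illustration):

```python
from scipy.stats import norm

# PDF: density at a point (for a continuous variable this is not itself
# a probability; P(X = x) is exactly 0)
density_at_zero = norm.pdf(0)

# CDF: P(X <= x), the area under the PDF from minus infinity to x
p_below_zero = norm.cdf(0)   # 0.5 by symmetry

# quantile function (inverse CDF): the cut point below which a given
# fraction of the probability mass lies
median = norm.ppf(0.5)
q95 = norm.ppf(0.95)
```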
In this article I will be using two other distributions for comparison:

- Normal distribution with mean 1 and standard deviation 2.5 — N(1, 2.5)
- Skew Normal distribution with alpha = 5

I use the Skew Normal distribution because by adjusting the alpha parameter (while leaving scale and location at their defaults) I control the skewness of the distribution.
As the absolute value of alpha increases, the absolute value of skewness increases as well.
Below we can inspect the difference in distributions by looking at histograms of random variables drawn from them.
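The three distributions can be sampled with `scipy.stats`; a minimal sketch (the sample size and seed are arbitrary choices of mine):

```python
import numpy as np
from scipy.stats import norm, skewnorm, skew

np.random.seed(42)

n = 10_000
sample_std = norm.rvs(size=n)                        # Standard Normal
sample_shifted = norm.rvs(loc=1, scale=2.5, size=n)  # N(1, 2.5)
sample_skewed = skewnorm.rvs(a=5, size=n)            # Skew Normal, alpha = 5

# a positive alpha produces clear positive skewness in the sample
sample_skewness = skew(sample_skewed)
```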
2. Probability plots

We use probability plots to visually compare data coming from different datasets (distributions). The possible scenarios involve comparing:

- two empirical sets
- one empirical and one theoretical set
- two theoretical sets

The most common use of probability plots is the second scenario: comparing observed (empirical) data to data coming from a specified theoretical distribution such as the Gaussian.
I use this variant to explain the particular types of plots below; however, the ideas also apply to the other two cases.
2.1 P-P plot

In short, the P-P (probability–probability) plot is a visualisation that plots the CDFs of the two distributions (empirical and theoretical) against each other.
Example of a P-P plot comparing random numbers drawn from N(0, 1) to the Standard Normal: a perfect match

Some key information on P-P plots:

- Interpretation of the points on the plot: assuming we have two distributions (f and g) and a point of evaluation z (any value), the point on the plot indicates what percentage of data lies at or below z in both f and g (as per the definition of the CDF).
- To compare the distributions, we check whether the points lie on the 45 degree line (x = y). If they deviate from it, the distributions differ.
- P-P plots are well suited to comparing regions of high probability density (the centre of the distribution), because in these regions the empirical and theoretical CDFs change more rapidly than in regions of low probability density.
- P-P plots require fully specified distributions, so if we use the Gaussian as the theoretical distribution, we should specify its location and scale parameters. Changing the location or scale does not necessarily preserve linearity in a P-P plot.
- P-P plots can be used to visually evaluate the skewness of a distribution.
- The plot may produce odd patterns (e.g. points following the axes of the chart) when the distributions barely overlap, so P-P plots are most useful when comparing distributions with nearby or equal locations.
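The mechanics behind these points can be sketched by hand: sort the sample, then pair each observation's empirical CDF value (its plotting position) with the theoretical CDF evaluated at that observation. A minimal sketch, assuming a fully specified theoretical distribution; here N(5, 1), deliberately far from the sample's N(1, 2.5):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sample = np.sort(rng.normal(loc=1, scale=2.5, size=500))

# empirical CDF values (plotting positions) for the sorted sample
ecdf = (np.arange(1, len(sample) + 1) - 0.5) / len(sample)
# theoretical CDF evaluated at the observed values, here vs N(5, 1)
tcdf = norm.cdf(sample, loc=5, scale=1)

# the two CDFs disagree badly because the locations barely overlap
max_dev = np.max(np.abs(tcdf - ecdf))
```

Plotting `tcdf` against `ecdf` would show points hugging the axes rather than the x = y line, which is the "weird pattern" described above.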
Below I present a P-P plot comparing random variables drawn from N(1, 2.5) to N(5, 1).

Random variables drawn from N(1, 2.5) vs N(5, 1)
2.2 Q-Q plot

Similarly to P-P plots, Q-Q (quantile–quantile) plots allow us to compare distributions by plotting their quantiles against each other.
Some key information on Q-Q plots:

- Interpretation of the points on the plot: a point on the chart corresponds to a given quantile coming from both distributions (again, in most cases empirical and theoretical).
- On a Q-Q plot, the reference line depends on the location and scale parameters of the theoretical distribution: the intercept and slope are equal to the location and scale, respectively.
- A linear pattern in the points indicates that the given family of distributions reasonably describes the empirical data.
- Q-Q plots have very good resolution in the tails of the distribution but worse resolution in the centre (where probability density is high).
- Q-Q plots do not require specifying the location and scale parameters of the theoretical distribution, because the theoretical quantiles are computed from a standard distribution within the specified family.
- The linearity of the point pattern is not affected by changing the location or scale parameters.
- Q-Q plots can be used to visually evaluate the similarity of location, scale and skewness of two distributions.
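The quantile pairing can also be sketched by hand, and a straight-line fit illustrates the point above about the intercept and slope recovering location and scale (the sample size and seed are my choice):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sample = rng.normal(loc=1, scale=2.5, size=1000)

# plotting positions mapped through the standard normal inverse CDF
probs = (np.arange(1, len(sample) + 1) - 0.5) / len(sample)
theoretical_q = norm.ppf(probs)
# empirical quantiles are simply the sorted observations
empirical_q = np.sort(sample)

# a straight-line fit recovers the parameters:
# intercept ~ location (1), slope ~ scale (2.5)
slope, intercept = np.polyfit(theoretical_q, empirical_q, 1)
```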
3. Examples in Python

I use the statsmodels library and its ProbPlot class to create the probability plots.
P-P plots

When I started creating P-P plots using statsmodels, I noticed an issue: as I was comparing random draws from N(1, 2.5) to the Standard Normal, the plot showed a perfect fit when it should not have.
I investigated the issue and found a post on StackOverflow explaining that the current implementation always estimates the location and scale parameters of the theoretical distribution, even when they are explicitly provided.
So in the case above, we are checking whether our empirical data comes from some Normal distribution, not the specific one we requested.
That is why I wrote a function, pp_plot, for directly comparing empirical data to a theoretical distribution with the provided parameters.
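The article does not reproduce the helper inline; a minimal sketch of what such a pp_plot function could look like (the exact implementation here is my assumption, only the name comes from the article):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def pp_plot(sample, dist, ax=None):
    """P-P plot of `sample` against a fully specified frozen `dist`."""
    if ax is None:
        _, ax = plt.subplots()
    data = np.sort(np.asarray(sample))
    # plotting positions serve as the empirical CDF values
    ecdf = (np.arange(1, len(data) + 1) - 0.5) / len(data)
    # theoretical CDF at the observed values; no re-fitting of parameters
    tcdf = dist.cdf(data)
    ax.scatter(tcdf, ecdf, s=5)
    ax.plot([0, 1], [0, 1], "r--")  # 45 degree reference line
    ax.set_xlabel("Theoretical CDF")
    ax.set_ylabel("Empirical CDF")
    return ax

rng = np.random.default_rng(2)
ax = pp_plot(rng.normal(loc=1, scale=2.5, size=500), norm(0, 1))
```

Because the frozen distribution is passed in as-is, the comparison really is against N(0, 1), not against a re-estimated Normal.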
Let's first try comparing a random draw from N(1, 2.5) to N(0, 1) using both statsmodels and pp_plot.
We see that in the case of statsmodels it is a perfect fit, as the function estimated both the location and scale parameters of the Normal distribution.
When inspecting the result of pp_plot we see that the distributions differ significantly, which can also be observed on the histograms.
P-P plots of N(1, 2.5) vs the Standard Normal

Let's also try interpreting the shape of the P-P plot produced by pp_plot.
To do so I will once again show the chart, together with the histograms.
The horizontal movement along the x-axis is caused by the fact that the distributions are not entirely overlapping.
When the point is above the reference line, it means that the value of the CDF of the theoretical distribution is higher than that of the empirical one.
The next case is comparing random draw from Skew Normal to Standard Normal.
We see that the plot from statsmodels implies it is not a perfect match, as the function has trouble finding location and scale parameters of a Normal distribution that account for the skewness in the provided data.
The plot also shows that the value of the CDF of Standard Normal is always higher than that of the considered Skew Normal distribution.
P-P plots of Skew Normal (alpha=5) vs Standard Normal

Note: we can also obtain a perfect fit using statsmodels.
To do so we need to specify the theoretical distribution in ProbPlot as skewnorm and pass an additional parameter distargs=(5, ) to indicate the value of alpha.
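Following that note, a sketch of the statsmodels call with the parameters described in the text (the seed and sample size are my choice):

```python
import numpy as np
import scipy.stats as stats
from statsmodels.graphics.gofplots import ProbPlot

np.random.seed(3)
sample = stats.skewnorm.rvs(a=5, size=1000)

# pass the skew-normal family and its shape parameter explicitly;
# fit=False keeps statsmodels from re-estimating anything
pp = ProbPlot(sample, dist=stats.skewnorm, distargs=(5,), fit=False)
fig = pp.ppplot(line="45")
```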
Q-Q plots

Application and interpretation

Let's begin by comparing the Skew Normal distribution to the Standard Normal (with ProbPlot's default settings).
Q-Q plots of Skew Normal (alpha=5) vs Standard Normal

The first thing we can observe is that the points form a curve rather than a straight line, which is usually an indication of skewness in the sample data.
Another way of interpreting the plot is by looking at the tails of the distribution.
In this case, the considered Skew Normal distribution has a lighter left tail (less mass; points on the left side of the Q-Q plot lie above the line) and a heavier right tail (more mass; points on the right side of the Q-Q plot also lie above the line) than one would expect under the Standard Normal distribution.
We need to remember that the skewed distribution is shifted (as can be observed on the histograms), so these results are in line with our expectations.
I also wanted to quickly go over two other variations of the same exercise.
In the first one I specify the theoretical distribution as Skew Normal and pass alpha=5 in distargs.
This results in the following plot, in which we see a linear pattern (though shifted relative to the standardized reference line). The pattern is basically a 45 degree line, indicating a good fit; the standardized reference line simply turns out not to be a good choice in this case.
Q-Q plots of Skew Normal (alpha=5) vs Skew Normal (alpha=5)

The second approach is comparing two empirical samples: one drawn from Skew Normal (alpha=5), the second from the Standard Normal.
I set fit=False in order to turn off the automatic fitting of location, scale and distargs.
The results seem to be in line with the initial approach (which is a good sign 🙂 ).
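The two-sample comparison can be sketched as follows (the sample sizes and seed are my choice; equal sizes let qqplot pair the quantiles directly without interpolation):

```python
import numpy as np
import scipy.stats as stats
from statsmodels.graphics.gofplots import ProbPlot

np.random.seed(4)
# two empirical samples, no fitting of location/scale/distargs
pp_skew = ProbPlot(stats.skewnorm.rvs(a=5, size=1000), fit=False)
pp_norm = ProbPlot(stats.norm.rvs(size=1000), fit=False)

# pass the second sample via `other` to compare the two directly
fig = pp_skew.qqplot(other=pp_norm, line="45")
```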
Example using stock returns

I would also like to show a practical example of using a Q-Q plot to evaluate whether the returns generated by Microsoft stock prices follow the Normal distribution (please refer to this article for more details).
The conclusion is that there is definitely more mass in the tails (indicating more negative and positive returns) than as assumed under Normality.
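The MSFT data itself is not reproduced here; as a stand-in, a simulated heavy-tailed return series (Student's t with 3 degrees of freedom, my choice purely for illustration) shows how such a plot is built and produces a similar fat-tailed pattern:

```python
import numpy as np
import scipy.stats as stats
from statsmodels.graphics.gofplots import ProbPlot

np.random.seed(5)
# simulated daily returns: Student's t (df=3) has much heavier tails
# than the normal, mimicking real equity return series
returns = stats.t.rvs(df=3, size=1000) * 0.01

# fit=True estimates the normal's location and scale before plotting
pp = ProbPlot(returns, dist=stats.norm, fit=True)
fig = pp.qqplot(line="s")
```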
Returns on MSFT vs the Normal distribution

Further implementation details

In the qqplot method of ProbPlot we can specify what kind of reference line we would like to draw.
The options (aside from None for no line) are:

- s: standardized line (expected order statistics are scaled by the standard deviation of the given sample and have the mean added to them)
- q: line fit through the quartiles
- r: regression line
- 45: the y = x line (as used in P-P plots)

Below I show a comparison of the three line-fitting methods, which, as we can see, are very similar.
When working with Q-Q plots, we can also use another statsmodels feature that puts non-exceedance probabilities in place of theoretical quantiles (the probplot method instead of qqplot).
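A minimal sketch of both calls on the same data (the probplot method belongs to the same ProbPlot class; data and seed are my choice):

```python
import numpy as np
import scipy.stats as stats
from statsmodels.graphics.gofplots import ProbPlot

np.random.seed(6)
pp = ProbPlot(stats.norm.rvs(loc=1, scale=2.5, size=500))

fig_qq = pp.qqplot(line="r")   # theoretical quantiles on the x-axis
fig_prob = pp.probplot()       # non-exceedance probabilities instead
```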
You can read more about this methodology here.
Summing Up

In this article I have tried to explain the key concepts behind probability plots, using P-P and Q-Q plots as examples.
You can find the notebook with code used for generating the plots mentioned in the article on my GitHub.
In case you have questions or suggestions, please let me know in the comments or reach out on Twitter.