The Importance of Analyzing Model Assumptions in Machine Learning

By Reilly Meinert, Adeet Patel, & Simon Li

May 24

Checking model assumptions is essential prior to building a model that will be used for prediction.

If assumptions are not met, the model may inaccurately reflect the data and will likely result in inaccurate predictions.

Each model has different assumptions that must be met, so checking assumptions is important both in choosing a model and in verifying that it is the appropriate model to use.

Diagnostics

Diagnostics are used to evaluate the model assumptions and figure out whether or not there are observations with a large, undue influence on the analysis.

They can be used to optimize the model by making sure the model you use is actually appropriate for the data you are analyzing.

There are many ways to assess the validity of a model using diagnostics.

Diagnostics is an overarching name that covers the other topics under model assumptions.

It may include exploring the model’s basic statistical assumptions, examining the structure of a model by considering more, fewer, or different explanatory variables, or looking for data that is poorly represented by the model, such as outliers, or that has a disproportionately large effect on the regression model’s predictions.

Diagnostics can take many forms.

There are numerical diagnostics you can examine.

The statsmodels package provides a summary of many diagnostics through the summary() method of a fitted model.

With this summary, we can see important values such as R², the F-statistic, and many others.

You can also analyze a model using a graphical diagnostic such as plotting the residuals against the fitted/predicted values.

Above is the fitted versus residual plot for our weight-height dataset, using height as the predictor.

For the most part, this plot is random.

However, as fitted values increase, so does the range of residuals.

This means that as BMI increases, the differences between our model and the actual data vary more.

The residuals also tend to be more negative at higher BMIs.

This does not mean that a linear model is incorrect, but it is something to investigate and maybe something to help change or improve the model.
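The pattern described above, where the range of residuals widens as fitted values increase, can also be checked numerically. The sketch below uses synthetic data whose noise grows with the predictor (an assumption built in for illustration) and compares the spread of residuals in the lower and upper halves of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=2000)
# Noise whose spread grows with x, so the data is heteroscedastic by construction.
y = 2.0 * x + 1.0 + rng.normal(0, 0.5 * x)

# Simple linear fit; np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# Compare residual spread for low versus high fitted values.
cutoff = np.median(fitted)
low = residuals[fitted <= cutoff]
high = residuals[fitted > cutoff]
print("residual std (low fitted): ", low.std())
print("residual std (high fitted):", high.std())
```

To see the same thing graphically, the fitted and residuals arrays can be passed to any scatter plot function, with fitted values on the horizontal axis.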

Another residual plot you can do is a scale-location plot.

This plot shows whether our residuals are equally distributed along the range of our predictor.

If all random variables have the same finite variance, they are considered to be homoscedastic.

A plot with randomly spread points indicates the model is appropriate.

The plot shows the square root of the absolute standardized residuals against the fitted values.

In this plot, we want a random distribution that is horizontally banded.

This would indicate that the data is homoscedastic, meaning that the spread of the residuals is roughly equal across the range of the independent variables.

Our line is mostly horizontally banded at the beginning but seems to slope upwards near the end, meaning that there may not be equal variance everywhere.

This may be a result of not fixing the issue we discovered above in the residual-fitted graph, and it is another indicator that something may need to be changed in our model.

When fitting a regression model, you want to make sure that your residuals are relatively random.

If they are not, that may mean that the type of regression you chose was not appropriate.

For example, if you chose a linear regression and the residual plot is clearly not random, that would suggest that the relationship in the data is not linear.

Diagnostics also apply to many of the other topics we are covering, such as multicollinearity, dataset distributions, and outliers, which will be discussed throughout the rest of this post.

Multicollinearity

In statistics, multicollinearity occurs when a dataset’s features, or X variables, are not independent of each other.

For example, height, weight, and height² are not independent, as the calculation for height² depends on height and vice versa.

Multicollinearity also means that there are redundant features in a dataset.

Multicollinearity is a major problem in regression analysis.

This is because the key objective of a regression model is to predict how the dependent Y variable changes when one of the X variables changes (with all other X variables being held constant).

Suppose two variables X1 and X2 are highly correlated with each other (for example, X2 = X1 + 1).

It will be impossible to change X1 without changing X2, and vice versa.

In this case, it would be difficult for a model to predict the relationship between the Y variable and each X variable (with all other X variables being held constant), because the X variables are changing together.

As a result, the model cannot reliably estimate the individual coefficients (the estimates become unstable and have large standard errors), and thus it will not be powerful enough to identify which X variables in the dataset have the most statistical influence on the Y variable.
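The X2 = X1 + 1 example can be made concrete with a small numerical sketch. The data below is synthetic and the variable names are chosen for illustration. It shows that the design matrix loses a rank when an intercept is included, and that two very different coefficient vectors produce identical predictions, which is why the individual coefficients cannot be pinned down.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + 1.0                      # perfectly collinear with x1 and the intercept

# Design matrix with columns: intercept, X1, X2.
design = np.column_stack([np.ones_like(x1), x1, x2])

rank = np.linalg.matrix_rank(design)   # three columns, but only two are independent
corr = np.corrcoef(x1, x2)[0, 1]
print("rank:", rank, "correlation:", corr)

# Two different coefficient vectors that give exactly the same predictions.
beta_a = np.array([0.0, 1.0, 0.0])    # Y = X1
beta_b = np.array([-1.0, 0.0, 1.0])   # Y = X2 - 1, which equals X1
same_predictions = np.allclose(design @ beta_a, design @ beta_b)
print("identical predictions:", same_predictions)
```

In real datasets the correlation is rarely perfect, so the rank does not drop, but the same ambiguity shows up as unstable coefficient estimates.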

Fortunately, multicollinearity does not always need to be fixed.

For example, suppose you have 3 X variables (X1, X2, X3).

If X1 is strongly correlated with X2, but you’re only using X2 and X3 to build your model, then the model will be able to interpret the effects of X2 and X3 on Y without problems.

Also, if your only goal is to predict Y and you don’t need to understand the effects of each X variable on Y, then reducing multicollinearity is not necessary.

In the case that the problem of multicollinearity needs to be fixed, a common approach is feature selection.

Feature selection does more than deal with multicollinearity.

It also increases the computational efficiency of training a model (the time it takes to train a model grows as the number of features increases).

Additionally, it reduces the risk of overfitting (redundant features mean that a model is more likely to fit noise rather than actual patterns in the data).

There are various techniques for performing feature selection, but they all rely on the same fundamental principle.

Ultimately, the objective is to eliminate features that have little to no effect on the Y variable, and keep the most important ones.

For example, one property that can be used is “mutual information,” which, in its normalized form, is a number ranging from 0 to 1 that indicates how much information two features share.

If X1 and X2 are independent variables, it means that neither variable can be used to obtain information about the other variable, and thus their mutual information is 0.

If one variable is a function of the counterpart variable, it means that there is an explicit mathematical mapping between the two variables (if the value of one variable is known, the value of the other can be calculated), and thus their mutual information is 1.

If one variable is a function of both the counterpart variable and other variables, their mutual information is between 0 and 1.
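The three cases above can be illustrated with a small sketch for discrete variables. The helper functions entropy and normalized_mutual_info are hypothetical names written for this example, and dividing by the smaller of the two entropies is one of several possible normalizations, assumed here because it yields values between 0 and 1.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in nats) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def normalized_mutual_info(a, b):
    """Mutual information divided by the smaller entropy (assumed normalization)."""
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    joint = a_idx * (b_idx.max() + 1) + b_idx   # one integer per (a, b) pair
    h_a, h_b, h_ab = entropy(a), entropy(b), entropy(joint)
    mi = h_a + h_b - h_ab
    denom = min(h_a, h_b)
    return mi / denom if denom > 0 else 0.0

rng = np.random.default_rng(4)
n = 20000
a = rng.integers(0, 4, size=n)
b = rng.integers(0, 4, size=n)               # independent of a
c = (a + 1) % 4                               # a function of a alone
d = (a + rng.integers(0, 2, size=n)) % 4      # depends on a and on extra randomness

nmi_independent = normalized_mutual_info(a, b)
nmi_function = normalized_mutual_info(a, c)
nmi_partial = normalized_mutual_info(a, d)
print(nmi_independent, nmi_function, nmi_partial)
```

With a finite sample, the value for independent variables is only approximately 0, so in practice a small threshold is used rather than an exact comparison.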

Figure: Multiple Linear Regression for Height and Weight versus BMI

Figure: Another view, rotated to show linearity and fit of the relationship

Dataset Distributions

The distribution of a dataset shows the different possible values for a characteristic of a population, as well as how often each outcome occurs.

Normal distributions are probably the most well-known distributions, and they often appear in the real world.

In multiple linear regression, it is assumed that we have multivariate normality.

Put in simple terms, each of the variables should be normally distributed.

We can check this visually by plotting the variables in a histogram.

While height and weight are not perfectly normally distributed, we have a large sample, with 10,000 total observations, so we can rely on the Central Limit Theorem.

If we are not sure whether this data is close enough to normally distributed, we can check it with a Q-Q plot.

A Q-Q (quantile-quantile) plot is another diagnostic tool for determining whether data is normally distributed.

It plots the quantiles of the data against the theoretical quantiles, along with the line y = x.

If the points line up along this line, then the distributions are relatively similar.

In our plots below, because most of the points for different independent variables fall very closely to the line and thus “ideal” normal conditions, we can assume our data is normally distributed.

However, because some points at the lower end fall under the line and some points at the higher end fall above it, we know that our data may have heavy tails, and we may need to adjust our model for this.
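The numbers behind a Q-Q plot can be obtained with scipy.stats.probplot, which returns the theoretical and ordered sample quantiles together with a least-squares reference line (rather than the line y = x). The sketch below uses synthetic samples, which are assumptions for illustration: one drawn from a normal distribution and one from a heavy-tailed t distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
normal_sample = rng.normal(size=5000)
heavy_sample = rng.standard_t(df=3, size=5000)   # heavier tails than a normal

# probplot returns (theoretical quantiles, ordered data), (slope, intercept, r).
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist="norm")
(osm, osr), (slope, intercept, r_heavy) = stats.probplot(heavy_sample, dist="norm")

# r close to 1 means the points lie close to the reference line.
print("r (normal sample):      ", r_normal)
print("r (heavy-tailed sample):", r_heavy)

# Heavy tails: the lowest point falls below the line, the highest above it.
low_gap = osr[0] - (slope * osm[0] + intercept)
high_gap = osr[-1] - (slope * osm[-1] + intercept)
print("gap at lower end:", low_gap, "gap at upper end:", high_gap)
```

The signs of the two gaps match the pattern described above: negative at the lower end and positive at the upper end suggests heavy tails.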

Sample Sizes

Before the information explosion, statisticians used to manually collect data, which required valuable time and resources.

The minimum sample size had to be determined ahead of time to ensure that enough data was collected to conduct an effective and accurate analysis.

Today, the opposite is often the case.

We have access to datasets with anywhere from a few thousand to a few million observations.

At first thought, being able to conduct an analysis with over a million observations seems like it’d be great.

However, when analyzing and modeling data, using a massive amount of data is often not appropriate.

There are several reasons to take a sample from a dataset.

Samples that are too large can cause us to overfit our model.

With too many observations, variables whose effects are actually negligible can appear statistically significant in an analysis.
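The effect of a very large sample on statistical significance can be sketched with a simple regression on synthetic data. The true slope of 0.01 and the sample size of one million are assumptions chosen for illustration; the point is that the p-value becomes tiny even though the predictor explains almost none of the variation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 1_000_000
x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)    # the true effect of x is negligible

fit = stats.linregress(x, y)
r_squared = fit.rvalue ** 2

print("slope:    ", fit.slope)
print("p-value:  ", fit.pvalue)      # very small because n is huge
print("R-squared:", r_squared)       # yet x explains almost nothing
```

This is why it helps to look at effect sizes and R² alongside p-values when working with very large datasets.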

However, taking a sample that is too small from our dataset may also cause problems.

An analysis conducted on too small a sample will lack statistical power, which is essential in being able to make accurate predictions based on the model.

We don’t want a sample that is too big or too small, so how do we determine what an appropriate sample size is?

Oftentimes, it is believed that a sample size of 30 is large enough.

However, when we take a random sample of 30 from our dataset, this is the result:

It is easy to see that this sample is not normally distributed, which breaks the assumption of multivariate normality.

Therefore, we need to choose a larger sample size.

In model building, it is very easy to change the size of the random sample you are selecting and visually verify that it is large enough to meet the assumption of multivariate normality.

Outliers

There are no specific assumptions about outliers in model creation, but it is important to note that outliers can greatly influence your model and alter its effectiveness.

A simple way to visually check for outliers is using a boxplot, as shown here:

Because we can visually see that there are outliers, we should check to see how much they influence the model.
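The points that a standard boxplot draws beyond its whiskers can also be found numerically. The sketch below applies the conventional rule of 1.5 times the interquartile range to synthetic data with two planted extreme values; the data and the planted values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
# Mostly well-behaved values plus two planted extremes.
values = np.append(rng.normal(70, 5, size=300), [120.0, 15.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Conventional boxplot whisker limits: 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print("limits:", lower, upper)
print("flagged points:", np.sort(outliers))
```

Being flagged by this rule only means a point is unusual in one variable; whether it actually distorts the model is a separate question.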

The statsmodels.api package calculates diagnostics such as the leverage and Cook’s distance of each point, which are very helpful.

Leverage is a measure of how far the independent variable values of a point are from those of the other observations.

Points with high leverage are points at extreme values of the variables, where the lack of nearby observations leads the fitted regression model to pass close to that particular point.

Below is a graph of Cook’s distance for each point.

Cook’s distance measures the effect that deleting a point has on the regression, so it is worth investigating the points with the highest Cook’s distances.

There are several ways to deal with outliers, and how you choose to deal with them probably depends on your specific model.

They can be completely removed from your data when you create your model, or they may indicate that another model may be more appropriate for your data, depending on how they affect the other assumptions.

Conclusion

As you can see, checking model assumptions is a relatively simple but hugely important step in optimizing model performance and increasing model reliability in machine learning.

Prior to building your model, check to see if your data meets the specific assumptions that go with your chosen model.

Start with a visual check.

If your visualizations are even a bit unclear on whether or not your data meets the specific assumption you are checking for, use a more specific diagnostic tool to either confirm or deny your suspicions.

This way, you can ensure that you are using the most appropriate model for your data, which will lead to better prediction capabilities.

Key Vocabulary: multicollinearity, homoscedasticity, outlier, residual, diagnostic, scale-location plot, Q-Q plot, Cook’s distance