The Problem Of Overfitting And How To Resolve It

Explaining How To Fix The Curse Of Overfitting In Python

Farhad Malik, May 8

In this article, I want to explain one of the most important concepts of machine learning and data science, which we encounter after we have trained our machine learning model.
It is a must-know topic.
This article aims to explain the following topics:

- What Is Overfitting In A Machine Learning Project?
- How Can We Detect Overfitting?
- How Do We Resolve Overfitting?

Photo by Isaac Smith on Unsplash

Introduction: What Is Overfitting?

Let’s set the foundation of the concept first.
Let’s assume you want to predict future price movements of a stock.
You then decide to gather the historic daily prices of the stock for the last 10 days and plot the stock price on a scatter plot as shown below:

[Chart: scatter plot of the historic daily stock prices]

The chart above shows that the actual stock prices are somewhat random.
To capture the stock price movements, you assess and gather data for the following 16 features which you know the stock price is dependent on:

- Industry performance
- Company’s news releases
- Company’s earnings
- Company’s profits
- Company’s future announcements
- Company’s dividends
- Company’s current and future contracts size
- Company’s M&A state
- Company’s management information
- Company’s current contracts
- Company’s future contracts
- Inflation
- Interest Rates
- Foreign Exchange Rates
- Investor Sentiment
- Company’s competitors

Once the data is gathered, cleaned, scaled and transformed, you split the data into training and test data sets.
You then feed the training data into your machine learning model to train it.
Once the model is trained, you decide to test the accuracy of your model by passing in the test data set.
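The split, train and test steps described above can be sketched roughly as follows. This is a minimal illustration and not the setup used for the charts in this article: the data is synthetic, the linear model is an arbitrary choice, and the names `X`, `y`, `true_weights` and `test_score` are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the gathered data: 200 samples with 16 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
true_weights = rng.normal(size=16)
y = X @ true_weights + rng.normal(scale=0.5, size=200)

# Hold back 20% of the rows as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# score() returns the R^2 of the predictions on the held-out rows.
test_score = model.score(scaler.transform(X_test), y_test)
print(f"R^2 on test data: {test_score:.3f}")
```

In this sketch the scaler is fitted on the training rows only, so that no information from the test rows leaks into training.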
What Do We Expect To See?

The chart above shows that the actual stock prices are random.
However, the predicted stock price is a smooth curve.
It has not fitted itself too closely to the training set and is therefore capable of generalising better to unseen data.
However, let’s assume you plot the actual vs predicted stock prices and see one of the following charts:

1. A Straight Line To Show The Predicted Price

What Does It Show?

This means that the algorithm has a very strong pre-conception of the data.
It implies that it has high bias.
This is known as under-fitting.
These models are not good for predicting new data.

2. A Very Strong Closely Fitted Line

What Does It Show?

This is the other extreme.
It might look as if it’s doing a great job at predicting the stock price.
However, this is known as over-fitting.
Such a model is said to have high variance: it has learnt the training data so well that it cannot generalise to make predictions on new and unseen data.
These models are not good for predicting new data.
If we feed the model new data then its accuracy will end up being extremely poor.
It can also indicate that we are not training our model with enough data.
Overfitting is when your model has over-trained itself on the data that is fed to train it.
It could be because there are way too many features in the data or because we have not supplied enough data.
A typical symptom is that the difference between the actual and predicted values is close to 0 on the training data, while the error on unseen data remains large.
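To make the two extremes concrete, here is a small sketch that fits polynomials of increasing degree to 10 noisy points. The data is synthetic (a sine wave plus noise stands in for the prices), and the helper names `noisy_prices` and `mse` are invented for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_prices(x):
    # Hypothetical "price" signal: a smooth trend plus random noise.
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

x_train = np.linspace(0.0, 1.0, 10)   # the 10 observed days
y_train = noisy_prices(x_train)
x_new = np.linspace(0.05, 0.95, 10)   # unseen points between them
y_new = noisy_prices(x_new)

def mse(model, x, y):
    # Mean squared error of the model's predictions.
    return float(np.mean((model(x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    errors[degree] = (mse(model, x_train, y_train), mse(model, x_new, y_new))
    print(f"degree {degree}: train MSE {errors[degree][0]:.4f}, "
          f"new-data MSE {errors[degree][1]:.4f}")
```

The degree 1 fit is the straight line (under-fitting). The degree 9 fit passes through all 10 training points, so its training error is practically 0, yet its error on the new points is not.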
How Do I Detect Over-fitting?

The models that have been over-fit on the training data do not generalise well to new examples.
They are not good at predicting unseen data.
Photo by Stephen Dawson on Unsplash

This implies that they are extremely accurate during training and yield very poor results when predicting unseen data.
If an error measure such as the mean squared error is substantially lower on the training data than on the test data set, it implies that your model is over-fitting the data.
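A simple check along these lines is to compute the same error measure on both data sets and compare them. The sketch below assumes synthetic data and uses scikit-learn decision trees purely as an example of a model that can over-fit; `train_test_mse` is a hypothetical helper, not part of any library.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a smooth signal plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def train_test_mse(model):
    # Fit on the training rows, then report the error on both sets.
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    return train_mse, test_mse

# An unconstrained tree can memorise the training rows.
deep_train, deep_test = train_test_mse(DecisionTreeRegressor(random_state=0))
# A depth-limited tree is forced to produce a coarser fit.
shallow_train, shallow_test = train_test_mse(
    DecisionTreeRegressor(max_depth=3, random_state=0)
)

print(f"deep tree:    train {deep_train:.3f}, test {deep_test:.3f}")
print(f"shallow tree: train {shallow_train:.3f}, test {shallow_test:.3f}")
```

A large gap between the training and test error, as for the unconstrained tree here, is the warning sign to look for.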
Have a look at this article if you want to understand which measures you can use to assess the accuracy of your machine learning model: Must Know Mathematical Measures For Every Data Scientist (Key Mathematical Formulae Explained In Easy To Follow Bullet Points, medium.com)

How Do We Resolve Overfitting?

We can randomly remove features and assess the accuracy of the algorithm iteratively, but it is a very tedious and slow process.
There are essentially four common ways to reduce over-fitting.
Reduce Features:

The most obvious option is to reduce the features.
You can compute the correlation matrix of the features and remove the features that are highly correlated with each other:

    import matplotlib.pyplot as plt
    plt. ...
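One possible way to act on the correlation matrix is sketched below, assuming the features are held in a pandas DataFrame. The function `drop_highly_correlated`, the 0.9 threshold and the column names are all illustrative choices for this example, not part of pandas.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one feature from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Synthetic features: "profits" is almost a rescaled copy of "earnings".
rng = np.random.default_rng(0)
earnings = rng.normal(size=200)
features = pd.DataFrame({
    "earnings": earnings,
    "profits": 0.8 * earnings + rng.normal(scale=0.05, size=200),
    "inflation": rng.normal(size=200),
})

reduced, dropped = drop_highly_correlated(features, threshold=0.9)
print("dropped:", dropped)
print("kept:", list(reduced.columns))
```

For each highly correlated pair, this sketch keeps the column that appears first and drops the later one. Which of the two to keep is a judgment call in a real project.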
Model Selection Algorithms:

You can use model selection algorithms.
These algorithms can choose the features with greater importance.
The problem with these techniques is that we might end up losing valuable information.
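As one example of such an algorithm, scikit-learn's `SelectKBest` scores each feature against the target and keeps the top `k`. The sketch below uses synthetic data in which only the first two features matter. The choice of `f_regression` as the scoring function and `k=2` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 5 features, but only features 0 and 1 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Score every feature with a univariate F-test and keep the best two.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = selector.get_support(indices=True)
X_reduced = selector.transform(X)

print("selected feature indices:", selected)
print("reduced shape:", X_reduced.shape)
```

The three discarded columns carry no signal here, but in real data a dropped feature may still hold some information, which is the risk mentioned above.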
Feed More Data:

You should aim to feed enough data to your models so that the models are trained, tested and validated thoroughly.
Aim to give 60% of data to train the model, 20% of the data to test and 20% of the data to validate the model.
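A 60/20/20 split can be produced by calling scikit-learn's `train_test_split` twice, as in this sketch. The data is a synthetic placeholder and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.normal(size=100)

# First split: 60% for training, 40% held back.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# Second split: divide the held-back 40% equally into test and validation.
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_test), len(X_val))
```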
Regularization:

The aim of regularization is to keep all of the features but impose a constraint on the magnitude of the coefficients.
It is preferred because penalising the coefficients means you do not have to drop any features.
When the constraints are applied to the parameters, the model is less prone to over-fitting as it produces a smoother function.
Regularization parameters, known as penalty factors, are introduced to control the coefficients and ensure that the model does not over-train itself on the training data.
The penalty pushes the coefficients towards smaller values, which reduces overfitting.
When the coefficients take large values, the penalty term increases the cost that the optimisation is trying to minimise.
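The effect of the penalty factor can be seen by fitting the same data with increasing penalty strength. This sketch uses scikit-learn's `Ridge` estimator as one example of a penalised linear model, where the `alpha` argument plays the role of the penalty factor. The data and the chosen `alpha` values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([4.0, -3.0, 2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=50)

coef_norms = []
for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X, y)
    # The size of the coefficient vector shrinks as the penalty grows.
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:>6}: coefficient norm {coef_norms[-1]:.3f}")
```

A stronger penalty yields smaller coefficients and hence a smoother, less flexible model. If the penalty is too strong, the model can start to under-fit instead.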
There are two common regularization techniques:

1. LASSO

Adds a penalty which is the absolute value of the magnitude of the coefficients.
As a result, some of the weights will end up being zero.
This means that the data of some of the features will not be used in the algorithm.

    from sklearn import linear_model
    model = linear_model.Lasso()
    model.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

2. RIDGE

Adds a penalty which is the square of the magnitude of the coefficients.
This ensures that the features do not end up applying high weight on the prediction of the algorithm.

    from sklearn.linear_model import Ridge
    model = Ridge(alpha=1.)
    # X and y are your training features and target values
    model.fit(X, y)

Photo by Sergey Pesterev on Unsplash

Summary

This article highlighted one of the key topics which we encounter once we are testing our machine learning model.
It provided an overview of the following key sections:

- What Is Overfitting In A Machine Learning Project?
- How Can We Detect Overfitting?
- How Do We Resolve Overfitting?

Please read this article if you want an end-to-end guide for a machine learning project: End To End Guide For Machine Learning Project (Explains How To Build A Successful Machine Learning Model In Simple Steps, medium.com)

Hope it helps.