Supervised Learning: Basics of Linear Regression

Victor Roman, Jan 15

1. Introduction

Regression analysis is a subfield of supervised machine learning.

It aims to model the relationship between a certain number of features and a continuous target variable.

In regression problems we try to come up with a quantitative answer, such as predicting the price of a house or the number of seconds that someone will spend watching a video.


2. Simple Linear Regression: Fitting a Line Through Data

Given a set of points, the regression algorithm will model the relationship between a single feature (explanatory variable x) and a continuous-valued response (target variable y).

It does this by starting from an arbitrary line and computing the distance from this line to the data points.

These distances, measured along vertical lines, are the residuals, or prediction errors.

The regression algorithm keeps moving the line at each iteration, trying to find the best-fitting line, in other words, the line with the minimum error.

There are several techniques to perform this task, and the ultimate goal is to get the line as close as possible to as many points as possible.
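As a rough illustration of the residuals described above, the following Python sketch computes the vertical distance from each point to a candidate line. The function name `residuals`, the parameter names `w` and `b`, and the sample points are assumptions made up for this example.

```python
def residuals(points, w, b):
    """Vertical distances from each (x, y) point to the line y = w * x + b."""
    return [y - (w * x + b) for x, y in points]


# Made-up sample points and an arbitrary starting line.
points = [(1.0, 3.0), (2.0, 5.5), (3.0, 6.0)]
errors = residuals(points, w=2.0, b=1.0)

# One simple way to summarize how far the line is from the data.
total_absolute_error = sum(abs(e) for e in errors)
```

A positive residual means the point lies above the line, and a negative one means it lies below.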


2.1 Moving The Line

2.1.1 Absolute Trick

Given a point and a line, the goal is to get the line closer to this point.

To achieve this, the algorithm uses a parameter called the “learning rate”.

The learning rate is a small number that multiplies each adjustment to the line’s parameters, so that the line approaches the point in small steps.

In other words, the learning rate will determine the length of the distance covered in each iteration that will get the line closer to the point.

It is commonly represented as the α symbol.
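The article does not spell out the update rule, so the sketch below assumes one common formulation of the absolute trick: for a line y = w * x + b and a point (x, y), the slope changes by α times x and the intercept by α, with the sign depending on whether the point is above or below the line. The function and parameter names are illustrative only.

```python
def absolute_trick(w, b, x, y, alpha=0.01):
    """Nudge the line y_hat = w * x + b toward the point (x, y)."""
    y_hat = w * x + b
    if y > y_hat:
        # The point is above the line: raise the prediction at x.
        return w + alpha * x, b + alpha
    if y < y_hat:
        # The point is below the line: lower the prediction at x.
        return w - alpha * x, b - alpha
    return w, b  # the line already passes through the point


# The line y = 2x + 3 predicts 13 at x = 5, but the point is at y = 15.
w, b = absolute_trick(w=2.0, b=3.0, x=5.0, y=15.0, alpha=0.01)
```

Note that the size of the step depends only on α and x, not on how far the point is from the line.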



2.1.2 Square Trick

It is based on the following premise: if a point is close to the line, so the distance is small, the line is moved only a little.

If the point is far away, the line is moved a lot more.
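A minimal sketch of this idea, assuming the usual formulation in which the update is scaled by the signed vertical distance between the point and the line. The names `square_trick`, `w`, `b` and `alpha` are chosen for illustration.

```python
def square_trick(w, b, x, y, alpha=0.01):
    """Move the line y_hat = w * x + b toward (x, y), proportionally to the error."""
    error = y - (w * x + b)  # signed vertical distance
    return w + alpha * x * error, b + alpha * error


# Same starting line (y = 2x + 3), one nearby point and one distant point.
near_w, near_b = square_trick(2.0, 3.0, x=5.0, y=13.5)
far_w, far_b = square_trick(2.0, 3.0, x=5.0, y=23.0)
```

Because the error carries its own sign, no separate check for points above or below the line is needed here.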


3. Gradient Descent

Let us say that we have a set of points and we want to develop an algorithm that will find the line that best fits this set of points.

The error, as stated before, will be the distance from the line to the points.

The line is moved and the error is computed.

This process is repeated over and over again, reducing the error a little bit each time, until the perfect line is achieved.

This perfect line will be the one with the smallest error.

To minimize this error, we will use the gradient descent method.

Gradient descent is a method that, at each step, looks at the different directions in which the line could be moved to reduce the error and takes the one that reduces the error the most.

Note: The gradient of a scalar field f is a vector field.

When it is evaluated at a generic point of the domain of f, it indicates the direction in which f increases fastest.

So the gradient descent will take a step in the direction of the negative gradient.

When this algorithm has taken sufficient steps it will eventually get to a local or global minimum, if the learning rate is set to an adequate value.

If the learning rate is too high, the algorithm will keep missing the minimum, because its steps will be too large.

And if it is too low, it will take a very long time to get to the minimum.
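To make the procedure concrete, here is a small sketch of gradient descent for a one-feature line, assuming the mean squared error as the error function. The function name, the default learning rate and the toy data (points lying exactly on y = 2x + 1) are assumptions for this example, not values from the article.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y = w * x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step in the direction of the negative gradient.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
```

On this toy data the result should end up very close to w = 2 and b = 1. A much larger `alpha` would make the updates diverge, and a much smaller one would need far more steps.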


4. Mini Batch Gradient Descent

4.1 Batch Gradient Descent

When we apply the square or absolute trick to all data points, each point gives some values to add to the weights of the model; we add them up and then update the weights with the sum of those values.


4.2 Stochastic Gradient Descent

This is when the gradient descent is done point by point, updating the weights after each individual data point.


4.3 Gradient Descent Method Used in Practice

In practice, neither of the previous methods is typically used, because both are slow, computationally speaking.

The best way to perform a linear regression is to split the data into many small batches, each with approximately the same number of points.

Then use each batch to update the weights.

This method is called Mini-Batch Gradient Descent.
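The sketch below shows one possible way to do this, again assuming the mean squared error and a one-feature line. Shuffling the data every epoch, the batch size, the learning rate and all names are assumptions made for illustration.

```python
import random


def mini_batch_gradient_descent(xs, ys, batch_size=2, alpha=0.02,
                                epochs=2000, seed=0):
    """Fit y = w * x + b, updating the weights once per small batch."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    data = list(zip(xs, ys))
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        # Walk through the shuffled data in chunks of batch_size points.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            errors = [(w * x + b) - y for x, y in batch]
            grad_w = 2.0 / len(batch) * sum(e * x for e, (x, _) in zip(errors, batch))
            grad_b = 2.0 / len(batch) * sum(errors)
            w -= alpha * grad_w
            b -= alpha * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = mini_batch_gradient_descent(xs, ys)
```

Setting `batch_size` to the full dataset size gives batch gradient descent, and setting it to 1 gives the point-by-point stochastic version.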


Higher Dimensions

When we have one input column and one output column, we are facing a two-dimensional problem and the regression is a line.

The prediction will be a constant times the independent variable, plus another constant.

If we have more input columns, there are more dimensions and the fitted model will not be a line anymore, but a plane or a hyperplane (depending on the number of dimensions).
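In symbols, and with the names w, b and n chosen here purely for illustration, the one-feature prediction and its n-feature generalization can be written as:

```latex
\hat{y} = w x + b
\qquad\text{and}\qquad
\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
```

Each input column x_i gets its own constant w_i, and b plays the role of the "other constant" in both cases.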


Multiple Linear Regression

Independent variables are also known as predictors, which are variables we look at to make predictions about other variables.

Variables we are trying to predict are known as dependent variables.

When the outcome we are trying to predict depends on more than one variable, we can make a more complicated model that takes this higher dimensionality into account.

As long as they are relevant to the problem faced, using more predictor variables can help to get a better prediction.

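As seen before, the following image shows a simple linear regression:

[Image: simple linear regression]

And the following picture shows a fitted hyperplane of a multiple linear regression with two features:

[Image: fitted hyperplane of a multiple linear regression with two features]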

As we add more predictors, we add more dimensions to the problem and it becomes harder to visualize it, but the core of the process remains the same.
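A short sketch of why the core stays the same: the prediction is still a weighted sum of the inputs plus a constant, whatever the number of predictors. The function name `predict` and the numbers below are invented for this example.

```python
def predict(weights, bias, features):
    """Multiple linear regression prediction: one weight per predictor."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


# Two predictors with made-up weights: 3.0 * 2.0 + 0.5 * 4.0 + 10.0
price = predict(weights=[3.0, 0.5], bias=10.0, features=[2.0, 4.0])
```

With a single weight and a single feature, the same function reduces to the simple linear regression line.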


Linear Regression Warnings

Linear regression comes with a set of assumptions and is not the best model for every situation.

a) Linear regression works best when the data is linear:

It produces a straight line from the training data.

If the relationship in the training data is not really linear, you will need to either make adjustments (transforming the training data), add features, or use another model.

b) Linear regression is sensitive to outliers:

Linear regression tries to fit the best line through the training data.

If the dataset has some outlying extreme values that do not fit the general pattern, the fitted model can be heavily distorted by them.

We will have to watch out for these outliers and normally remove them.

One common way to deal with outliers is to use an alternative regression method that is especially robust against these extreme values.

This method is the RANdom SAmple Consensus (RANSAC) algorithm, which fits the model to the subset of the data made up of inliers.

The algorithm performs the following steps:

1) It selects a random number of samples to be inliers and fits the model.

2) It tests all other data points against the fitted model and adds the ones that fall within the user-chosen tolerance.

3) It repeats the fitting of the model with the new points.

4) It computes the error of the fitted model against the inliers.

5) It ends the algorithm if the performance meets a certain user-defined threshold or a fixed number of iterations is reached.

Otherwise, it goes back to the first step.
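The steps above can be sketched for a one-feature line as follows. This is a simplified illustration, not a reference implementation: it always runs a fixed number of iterations and keeps the model with the most inliers, instead of stopping early on a performance threshold. The helper `fit_line`, the parameter names and the toy data are all assumptions.

```python
import random


def fit_line(points):
    """Ordinary least-squares fit of y = w * x + b to (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    w = sxy / sxx
    return w, mean_y - w * mean_x


def ransac_line(points, n_samples=2, tolerance=1.0, max_iters=100, seed=0):
    rng = random.Random(seed)
    best_model, best_count, best_error = None, 0, float("inf")
    for _ in range(max_iters):
        # Step 1: fit the model to a random subset treated as inliers.
        sample = rng.sample(points, n_samples)
        if len({x for x, _ in sample}) < 2:
            continue  # a line needs at least two distinct x values
        w, b = fit_line(sample)
        # Step 2: collect every point within the tolerance of that model.
        inliers = [(x, y) for x, y in points if abs(y - (w * x + b)) <= tolerance]
        if len({x for x, _ in inliers}) < 2:
            continue
        # Step 3: refit the model using all the inliers.
        w, b = fit_line(inliers)
        # Step 4: mean absolute error of the refitted model on the inliers.
        error = sum(abs(y - (w * x + b)) for x, y in inliers) / len(inliers)
        # Keep the model with the most inliers (ties broken by lower error).
        if (len(inliers), -error) > (best_count, -best_error):
            best_model, best_count, best_error = (w, b), len(inliers), error
    return best_model


# Ten points on y = 2x + 1 plus two made-up outliers.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(3.0, 40.0), (7.0, -30.0)]
w, b = ransac_line(points)
```

On this toy data the two outliers never gather many inliers around them, so the returned line should stay close to w = 2 and b = 1.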


Polynomial Regression

Polynomial regression is a special case of multiple linear regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x.

In other words, when our data distribution is more complex than a linear one, we generate a curve using linear models to fit the non-linear data.

The independent (or explanatory) variables resulting from the polynomial expansion of the predictor variables are known as higher-degree terms.

It has been used to describe nonlinear phenomena such as the growth rate of tissues and the progression of disease epidemics.
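A small sketch of the polynomial expansion may help here: the single predictor x is turned into the higher-degree terms x, x², x³, and the model remains a weighted sum of those terms. The function names and the weights are made up for this example.

```python
def polynomial_features(x, degree):
    """Expand a single predictor x into [x, x^2, ..., x^degree]."""
    return [x ** d for d in range(1, degree + 1)]


def predict(weights, bias, features):
    """Linear model applied to the expanded features."""
    return sum(w * f for w, f in zip(weights, features)) + bias


features = polynomial_features(2.0, degree=3)
# 1.0 * 2 - 0.5 * 4 + 0.25 * 8 + 3.0
y_hat = predict(weights=[1.0, -0.5, 0.25], bias=3.0, features=features)
```

The curve is non-linear in x, but the prediction is still linear in the weights, which is why this counts as a special case of multiple linear regression.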


Regularization

Regularization is a widely used method to deal with overfitting.

It is done mainly through the following techniques:

a) Reducing the model’s size: Reducing the number of learnable parameters in the model, and with them its learning capacity.

The goal is to get to a sweet spot between too much and not enough learning capacity.

Unfortunately, there is no magical formula to determine this balance; it must be found by trying different numbers of parameters and observing the resulting performance.

b) Adding weight regularization: In general, the simpler the model, the better.

As long as it can learn well, a simpler model is much less likely to overfit.

A common way to achieve this is to constrain the complexity of the model by forcing its weights to take only small values, which regularizes the distribution of weight values.

This is done by adding to the loss function of the model a cost associated with having large weights.

The cost comes in two ways:

L1 regularization: The cost is proportional to the absolute value of the weight coefficients (L1 norm of the weights).

L2 regularization: The cost is proportional to the square of the value of the weight coefficients (L2 norm of the weights).

To decide which of them to apply to our model, it is recommended to keep the following information in mind and take into account the nature of our problem:

The λ parameter: It sets how heavily the regularization cost weighs in the total error.

If we have a large λ, then we are punishing complexity and will end up with a simpler model.

If we have a small λ, we will end up with a more complex model.
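As an illustration of how the two costs and λ fit together, the sketch below adds a weighted penalty to an already computed data error. The function names, the argument `lam` (standing in for λ) and the numbers are assumptions for this example.

```python
def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Sum of the squared weights."""
    return sum(w ** 2 for w in weights)


def regularized_loss(data_loss, weights, lam, penalty=l2_penalty):
    """Total error: the data error plus lam times the weight penalty."""
    return data_loss + lam * penalty(weights)


weights = [1.0, -2.0, 3.0]
loss_l1 = regularized_loss(0.5, weights, lam=0.1, penalty=l1_penalty)
loss_l2 = regularized_loss(0.5, weights, lam=0.1, penalty=l2_penalty)
```

With `lam=0` the penalty disappears and only the data error remains, while a large `lam` makes big weights expensive and pushes the fit toward a simpler model.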


9. Evaluation Metrics

In order to keep track of how well our model is performing, we need to set up some evaluation metrics.

An evaluation metric is the error computed between the generated line (or hyperplane) and the real points, and it is the function that gradient descent will minimize.

Some of the most common when dealing with regression are:

9.1 Mean Absolute Error

Mean Absolute Error, or MAE, is the average of the absolute differences between the real data points and the predicted outcomes.

If we take this as the strategy to follow, each step of the gradient descent would reduce the MAE.


9.2 Mean Square Error

Mean Square Error, or MSE, is the average of the squared differences between the real data points and the predicted outcomes.

This metric penalizes larger distances more heavily, and it is the standard in regression problems.

If we take this as the strategy to follow, each step of the gradient descent would reduce the MSE.

This will be the preferred method to compute the best-fitting line; fitting a line by minimizing the squared errors is also called Ordinary Least Squares, or OLS.
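Both metrics are short enough to write by hand. The following sketch uses made-up values and illustrative function names to show the difference between them.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between real and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between real and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 10.0]
mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 3) / 3
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 9) / 3
```

The single large miss (7 versus 10) contributes three times as much as the small one to the MAE, but nine times as much to the MSE.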


9.3 Coefficient of Determination or R²

The coefficient of determination can be understood as a standardized version of the MSE, which makes the performance of the model easier to interpret.

Technically, R² is the fraction of the response variance that is captured by the model; in other words, it is the share of the variance of the response that the model explains.

It is defined as:

R² = 1 - SSE / SST

where SSE is the sum of squared errors of the model and SST is the total sum of squares of the response around its mean.
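A minimal sketch of this metric, assuming the common definition of R² as one minus the ratio between the sum of squared errors (SSE) and the total sum of squares around the mean of the response (SST). The function name and the sample values are illustrative.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model errors
    sst = sum((t - mean_y) ** 2 for t in y_true)             # spread of the response
    return 1.0 - sse / sst


y_true = [3.0, 5.0, 7.0]
r2 = r2_score(y_true, [3.5, 5.0, 6.5])          # SSE = 0.5, SST = 8
r2_perfect = r2_score(y_true, [3.0, 5.0, 7.0])  # no errors at all
r2_mean = r2_score(y_true, [5.0, 5.0, 5.0])     # always predicting the mean
```

A perfect fit gives 1, always predicting the mean of the response gives 0, and a model that does worse than the mean gives a negative value.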

10. Other Algorithms

Even though throughout this article we have focused on simple and multiple linear regression models, the popular machine learning library scikit-learn (which is the one that we will be using throughout this series) has regression variants of virtually every type of algorithm.

And some of them yield very good results.

Some examples are:

Decision Tree Regressor
Random Forest Regressor
Support Vector Regressor
Lasso
Elastic Net
Gradient Boosting Regressor
Ada Boost Regressor

11. Conclusion

Throughout this article, we have covered the basics of regression models and learned how they work, what their principal dangers are, and how to deal with them.

We also learned what the most common evaluation metrics are.

We now have the knowledge to start working with our first machine learning model, and that is exactly what will be covered in the next article.

So if you want to learn how to use the scikit-learn library on a regression problem, stay tuned!
