How Does Linear Regression Actually Work?

Anas Al-Masri, Mar 18

(Source: https://www.sciencenews.org/article/online-reading-behavior-predicts-stock-movements)

Linear Regression is inarguably one of the most famous topics in both statistics and Machine Learning.

It is so essential that it takes up a significant part of almost any Machine Learning course out there.

However, it can be a bit tricky to wrap your head around, especially if you have no statistics background.

What is Linear Regression?

Linear Regression can be considered a Machine Learning algorithm that allows us to map numeric inputs to numeric outputs by fitting a line to the data points.

In other words, Linear Regression is a way of modelling the relationship between one or more input variables and an output variable.

From the Machine Learning perspective, this is done to ensure generalization — giving the model the ability to predict outputs for inputs it has never seen before.

Why Generalize?

If you read any of my other posts here on Medium, you will notice that I try to emphasize the idea of generalization as much as possible.

Generalization is the essence of Machine Learning.

The whole idea of having this artificial form of intelligence relies on teaching a model so well that it can “act” on its own.

In other words, you want the model to not be limited to whatever it has learned.

Think of it as a child.

If your child has only seen cats his whole life — for some disturbing reason that you imposed on him — and if at some point you decide to show him a picture of a dog, you’d expect him to know that the dog is not a cat.

It is not something he has learned.

Why Linear Regression?

So a group of creative Tech enthusiasts started a company in Silicon Valley.

This start-up — called Banana — is so innovative that they have been growing constantly since 2016.

You, the wealthy investor, would like to know whether to put your money on Banana’s success in the next year or not.

Let’s assume that you don’t want to risk a lot of money, especially since the stakes are high in Silicon Valley.

So you decide to buy a few shares, instead of investing in a big portion of the company.

You take a look at Banana’s stock prices ever since the company was kick-started, and you see the following figure.

Well, you can definitely see the trend.

Banana is growing like crazy, kicking up their stock price from 100 dollars to 500 in just three years.

You only care about what the price is going to be like in the year 2021, because you want to give your investment some time to blossom along with the company.

Optimistically speaking, it looks like you will be growing your money in the upcoming years.

The trend is likely not to go through a sudden, drastic change.

This leads to you hypothesizing that the stock price will fall somewhere above the $500 indicator.

Here’s an interesting thought.

Based on the stock price records of the last couple of years you were able to predict what the stock price is going to be like.

You were able to infer the range of the new stock price (that doesn’t exist on the plot) for a year that we don’t have data for (the year 2021).

Well — kinda.

What you just did is use your model (that head of yours) to generalize: you predicted the y-value for an x-value that is not even in your knowledge.

However, this is not accurate in any way.

You couldn’t specify exactly what the stock price is most likely going to be.

For all you know, it is probably going to be above 500 dollars.

Here is where Linear Regression (LR) comes into play.

The essence of LR is to find the line that best fits the data points on the plot, so that we can estimate where the stock price is likely to fall in the year 2021.

Let’s examine the LR-generated line (in red) above and see why it matters.

It looks like, with just a little modification, we were able to realize that Banana’s stock price is likely to be worth a little more than $600 by the year 2021.
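To make the idea concrete, here is a minimal sketch of fitting such a line in Python. The yearly prices are invented for illustration (the article only states a rise from about $100 to about $500), so the prediction it prints will not match the figure. It uses NumPy's `np.polyfit` as a ready-made least-squares fit, not the training procedure described later.

```python
import numpy as np

# Hypothetical yearly prices for "Banana" (made up for illustration only).
years = np.array([2016, 2017, 2018, 2019], dtype=float)
prices = np.array([100.0, 240.0, 360.0, 500.0])

# Shift the years so the fit works with small numbers (0, 1, 2, 3).
x = years - years[0]

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(x, prices, deg=1)

def predict(year):
    """Predict the price for a calendar year using the fitted line."""
    return intercept + slope * (year - years[0])

print(f"slope = {slope:.1f} dollars/year, intercept = {intercept:.1f}")
print(f"predicted 2021 price = {predict(2021):.1f}")
```

The point is only that the line turns a vague "somewhere above 500" into a single number for a year that is not in the data.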

Obviously, this is an oversimplified example.

However, the process stays the same.

Linear Regression as an algorithm relies on the concept of lowering the cost to maximize the performance.

We will examine this concept, and how we got the red line on the plot, next.

Training The Linear Regressor

Let’s get the technicalities out of the way first.

What I described in the previous section is referred to as Univariate Linear Regression, because we are trying to map one independent variable (x-value) to one dependent variable (y-value).

This is in contrast to Multivariate Linear Regression, where we try to map multiple independent variables (i.e. features) to a dependent variable (i.e. labels).

Now, let’s get down to business.

Any straight line on a plot follows the formula:

$$f(X) = M \cdot X + B$$

where M is the slope of the line, B is the y-intercept that allows vertical movement of the line, and X is the function’s input value.

In terms of Machine Learning, this follows the convention:

$$h(X) = W_0 + W_1 \cdot X$$

where W0 and W1 are weights, X is the input feature, and h(X) is the predicted label (i.e. y-value).
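In code, this hypothesis is a one-line function. A minimal sketch, where the function name `h` follows the formula and the example weights are arbitrary values picked for illustration:

```python
def h(x, w0, w1):
    """Hypothesis of univariate linear regression: h(X) = W0 + W1 * X."""
    return w0 + w1 * x

# Example weights chosen arbitrarily for illustration.
w0, w1 = 2.0, 0.5
print(h(4.0, w0, w1))  # 2.0 + 0.5 * 4.0 = 4.0
```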

The way Linear Regression works is by trying to find the weights (namely, W0 and W1) that lead to the best-fitting line for the input data (i.e. X features) we have.

The best-fitting line is determined in terms of lowest cost.

So, What is The Cost?

Here’s the thing.

Cost could take different forms, depending on the Machine Learning application at hand.

However, in general, cost refers to the loss or error that the model yields in terms of how off it is from the actual Training data.

When it comes to Linear Regression, the cost function we usually use is the Squared Error Cost:

$$J(W_0, W_1) = \frac{1}{2n} \sum_{i=1}^{n} \bigl(h(X_i) - T_i\bigr)^2$$

where J(W0,W1) refers to the total cost of the model with weights W0, W1.

h(Xi) refers to the model’s prediction of the y-value at feature X with index i.

Ti is the actual y-value at index i.

And finally, n is the total number of data points in the data set.

All our cost function is doing is basically getting the distance (e.g. the Euclidean distance) between the y-value the model predicted and the actual y-value in the data set for every data point, then squaring each distance, adding the squares up, and dividing the total by the number of data points we have to get the average cost.

Said distances are illustrated in the above figure as error vectors.

The 2 in the term (1/2n) is merely to ease the process of differentiating the cost function in the next section.
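The cost formula can be sketched in a few lines of Python. The function name and the toy data are made up for illustration; the data lies exactly on the line T = 1 + 2X, so the matching weights give a cost of zero.

```python
import numpy as np

def squared_error_cost(w0, w1, X, T):
    """J(W0, W1) = (1 / 2n) * sum((h(Xi) - Ti)^2)."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    n = X.size
    errors = (w0 + w1 * X) - T      # h(Xi) - Ti for every data point
    return np.sum(errors ** 2) / (2 * n)

# Toy data lying exactly on the line T = 1 + 2 * X (illustrative only).
X = [0.0, 1.0, 2.0, 3.0]
T = [1.0, 3.0, 5.0, 7.0]

perfect = squared_error_cost(1.0, 2.0, X, T)   # weights match the data
off = squared_error_cost(0.0, 2.0, X, T)       # intercept is off by 1

print(perfect)  # 0.0
print(off)      # every error is -1, so J = 4 / (2 * 4) = 0.5
```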

Where is Training in All This?

Training a Machine Learning model is all about using a Learning Algorithm to find the weights (W0, W1 in our formula) that minimize the cost.

For simplicity, let’s use the Gradient Descent algorithm for this.

Although it is a fairly simple topic, Gradient Descent deserves its own post.

Therefore, we will only go through it briefly.

In the context of Linear Regression, training is basically finding those weights and plugging them into the straight line function so that we have the best-fit line (with W0, W1 minimizing the cost).

The algorithm basically follows the pseudo-code:

Repeat until convergence {
    temp0 := W0 - a · (d/dW0) J(W0,W1)
    temp1 := W1 - a · (d/dW1) J(W0,W1)
    W0 := temp0
    W1 := temp1
}

Where a is the learning rate (the step size), and (d/dW0) and (d/dW1) are the partial derivatives of J(W0,W1) with respect to W0 and W1, respectively.

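The gist of this partial differentiation is basically the derivatives:

$$\frac{\partial}{\partial W_0} J(W_0, W_1) = \frac{1}{n} \sum_{i=1}^{n} (W_0 + W_1 \cdot X_i - T_i)$$

$$\frac{\partial}{\partial W_1} J(W_0, W_1) = \frac{1}{n} \sum_{i=1}^{n} (W_0 + W_1 \cdot X_i - T_i) \cdot X_i$$

If we run the Gradient Descent learning algorithm on the model with a small enough step size a, the cost shrinks at every step and the model converges to a minimum cost.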
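For readers who want the missing step, the derivatives quoted above come from applying the chain rule to the Squared Error Cost. A short sketch, with the average over all n data points written out explicitly:

$$
\begin{aligned}
J(W_0, W_1) &= \frac{1}{2n} \sum_{i=1}^{n} (W_0 + W_1 X_i - T_i)^2 \\
\frac{\partial J}{\partial W_0} &= \frac{1}{2n} \sum_{i=1}^{n} 2\,(W_0 + W_1 X_i - T_i) \cdot 1 = \frac{1}{n} \sum_{i=1}^{n} (W_0 + W_1 X_i - T_i) \\
\frac{\partial J}{\partial W_1} &= \frac{1}{2n} \sum_{i=1}^{n} 2\,(W_0 + W_1 X_i - T_i) \cdot X_i = \frac{1}{n} \sum_{i=1}^{n} (W_0 + W_1 X_i - T_i)\, X_i
\end{aligned}
$$

This is where the 2 in the term (1/2n) pays off: it cancels the 2 that differentiation brings down from the square.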

The weights that led to that minimum cost are used as the final values for the line function we mentioned earlier (i.e. h(X) = W0 + W1 · X).

This means that the line equivalent to our h(X) function is actually our Linear Regressor.
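Putting the pieces together, the training loop might look like the following sketch. The learning rate, the stopping tolerance, the step limit and the toy data are all assumptions chosen for illustration; "until convergence" is approximated here by stopping once the cost barely changes.

```python
import numpy as np

def gradient_descent(X, T, a=0.05, tol=1e-10, max_steps=100_000):
    """Fit h(X) = W0 + W1 * X by batch gradient descent on the squared error cost.

    `a` is the learning rate; `tol` and `max_steps` are assumed stopping
    rules standing in for "repeat until convergence".
    """
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    n = X.size
    w0, w1 = 0.0, 0.0                          # arbitrary starting weights
    cost = np.sum((w0 + w1 * X - T) ** 2) / (2 * n)
    for _ in range(max_steps):
        errors = w0 + w1 * X - T               # h(Xi) - Ti
        grad0 = np.sum(errors) / n             # dJ/dW0
        grad1 = np.sum(errors * X) / n         # dJ/dW1
        # Simultaneous update: both gradients were computed from the old weights.
        w0, w1 = w0 - a * grad0, w1 - a * grad1
        new_cost = np.sum((w0 + w1 * X - T) ** 2) / (2 * n)
        if abs(cost - new_cost) < tol:         # cost stopped changing
            break
        cost = new_cost
    return w0, w1

# Toy data lying on the line T = 1 + 2 * X (illustrative only).
X = [0.0, 1.0, 2.0, 3.0, 4.0]
T = [1.0, 3.0, 5.0, 7.0, 9.0]
w0, w1 = gradient_descent(X, T)
print(w0, w1)  # should end up close to 1 and 2
```

The tuple assignment plays the role of temp0 and temp1 in the pseudo-code: both weights are updated from the same old values. A learning rate that is too large would make the cost grow instead of shrink, so `a` usually needs some tuning per data set.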

A Side Note: Performance

Sometimes, when the Training data set includes a huge number of data points whose values are inconsistent, we resort to a process called Discretization.

This refers to converting the Y values in the data set from continuous to discrete, resulting in succinct, clean and usable ranges of data rather than the data values themselves.

However, this leads to data loss, as you would technically be breaking up data points into bins that symbolize ranges of continuous values.

Another major factor in the effectiveness of the model afterwards would be its dependence on the number of bins/ranges we choose.
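As a rough sketch of what Discretization could look like in practice, the snippet below bins some made-up Y values with NumPy's `np.digitize`. The bin edges and the choice to represent each bin by its midpoint are assumptions for illustration, not a recommended recipe.

```python
import numpy as np

# Made-up continuous target values, for illustration only.
y = np.array([3.2, 7.8, 12.5, 18.1, 24.9, 29.4])

# Bin edges are an assumption; choosing them (and how many bins to use)
# is the application-dependent decision mentioned in the text.
edges = np.array([0.0, 10.0, 20.0, 30.0])

# np.digitize returns, for each value, the index i with edges[i-1] <= y < edges[i].
bin_index = np.digitize(y, edges)

# Represent every bin by its midpoint; the exact values are lost here.
midpoints = (edges[:-1] + edges[1:]) / 2
y_discrete = midpoints[bin_index - 1]

print(bin_index)   # [1 1 2 2 3 3]
print(y_discrete)  # [ 5.  5. 15. 15. 25. 25.]
```

Six distinct values collapse into three, which is exactly the data loss described above.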

In case of bad Linear Regression model performance, we usually go for a higher-degree polynomial function.

This is basically the introduction of new variables (for example, X squared) into the Regressor function so that we allow more flexibility to it.

However, this will cause the LR line not to be a straight line anymore.

It turns out that, in terms of Linear Regression, “linear” does not refer to “straight line”, but rather to the function being linear in its weights: h(X) is still a weighted sum of its input terms.

This means that our Linear Regressor does not actually have to be a straight line, as we are usually used to seeing in mathematics.

This flexibility in regression could improve the performance drastically.

However, higher-degree polynomials could lead to higher variance and noticeably higher computational cost.

More often than not, this leads to over-fitting.

This is a big topic that I will talk about extensively in a separate post.
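To see how a curved fit can still be a Linear Regressor, here is a small sketch that fits a straight line and a quadratic to the same made-up data. It uses NumPy's `np.linalg.lstsq` as a ready-made least-squares solver in place of Gradient Descent; the data and the helper name `cost` are assumptions for illustration.

```python
import numpy as np

# Made-up data that follows a curve, T = 1 + 2*X + 0.5*X^2 (illustrative only).
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
T = 1.0 + 2.0 * X + 0.5 * X ** 2

# Design matrices: one column per weight. X^2 is just another input column,
# so the model stays linear in the weights W0, W1, W2.
A_line = np.column_stack([np.ones_like(X), X])
A_quad = np.column_stack([np.ones_like(X), X, X ** 2])

# np.linalg.lstsq finds the weights that minimize the squared error.
w_line, *_ = np.linalg.lstsq(A_line, T, rcond=None)
w_quad, *_ = np.linalg.lstsq(A_quad, T, rcond=None)

def cost(A, w):
    """Squared error cost (1 / 2n) * sum((prediction - T)^2)."""
    return np.sum((A @ w - T) ** 2) / (2 * len(T))

print("straight line cost:", cost(A_line, w_line))
print("quadratic cost:    ", cost(A_quad, w_quad))
```

On this data the quadratic fit drives the cost to essentially zero while the straight line cannot. On noisy real data, adding ever higher powers would eventually start fitting the noise, which is the over-fitting risk mentioned above.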

Conclusion

Linear Regression is the process of finding a line that best fits the data points available on the plot, so that we can use it to predict output values for inputs that are not present in the data set we have, with the belief that those outputs would fall on the line.

Performance (and error rates) depends on various factors, including how clean and consistent the data is.

There are different ways of improving the performance (i.e. generalizability) of the model.

However, each one has its own pros and cons, which makes the choice of methods application-dependent.
