Machine Learning Models: Linear Regression

Isaiah Nields · Feb 1

Over the last few weeks, I have been working my way through Deep Learning by Goodfellow, Bengio, and Courville.
And I’ve been learning a ton from it.
The book explores a broad range of machine learning + deep learning topics and delves deep into the math underlying everything.
It’s been a great resource to get a nuts and bolts understanding of the field.
To solidify what I’m reading about, I’ve decided to code up the models that I’m studying with vanilla Python and numpy.
If possible, I’ll also create a few basic visualizations of what’s going on.
Here, I’ll start with the first and most basic model of the book: linear regression.
Without further ado, here’s a rundown of what I’ve worked on.
Motivation

Before I get into the dirty details of the implementation, I’d first like to give a general idea of why one would use linear regression.
What is the motivation for what we are doing? Well, in general, linear regression is a great model for predicting a continuous variable, y, based on a continuous (and sometimes discrete) input, x.
The linear regression model combines the input, x, in such a way to give a good prediction for the output, y.
For example, we might use a linear regression model to predict the price of a house, y, based on some input features, x (e.g. square footage, number of bedrooms, etc.).
One key thing to note is that data (x, y pairs) is first needed to train the linear regression model before it’s able to make accurate predictions.
We’ll talk more about how this is done later.
Implementation!

An implementation of a linear regression model has five major components: the model, the cost function, the parameters, the gradient, and the optimization algorithm (e.g. the normal equation or gradient descent).
We’ll delve a bit more into each of these below.
Note that this implementation generalizes to higher dimensional spaces.
In other words, it can fit a line to (x₁, y) data, a plane to (x₁, x₂, y) data, a 3D hyperplane to (x₁, x₂, x₃, y) data in four dimensions, and so on.
In general, our model fits an (n-1)-dimensional hyperplane to data in an n-dimensional space.
The Model

The model is defined by a basic linear combination of a weight matrix, W, with our data, X, with a bias term, b, tacked on to shift our prediction away from the origin.
Note here that the @ symbol is for matrix multiplication.
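In numpy, the model can be sketched in a couple of lines (the function name `predict` is my own, not from the book):

```python
import numpy as np

def predict(X, W, b):
    """Linear model: each row of X is one training example.

    X has shape (m, n), W has shape (n,), and b is a scalar,
    so X @ W + b returns one prediction per example.
    """
    return X @ W + b
```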
The Cost Function

The cost function sets the objective of linear regression.
It defines the target we’d like to hit.
In general, it’s important to have your cost function clearly defined.
You can’t hit a target until you define what it is, right? Here, we’ll define the cost function as the average squared distance between our prediction for y and the ground-truth value of y across all training examples.
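As a sketch in numpy (`cost` is my own name for it), the average squared distance is just the mean of the squared errors:

```python
import numpy as np

def cost(X, y, W, b):
    """Mean squared error between the model's predictions and y."""
    errors = X @ W + b - y   # prediction minus ground truth, per example
    return np.mean(errors ** 2)
```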
The Parameters

We need values for our initial parameters.
The parameters don’t need to be anywhere close to perfect, just something to begin predicting with.
Here, we’ll initialize W to a vector of zeros and b to 0.
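A minimal sketch of that initialization (`init_params` is my own name for it):

```python
import numpy as np

def init_params(n_features):
    """Start with W as a vector of zeros and b as the scalar 0."""
    W = np.zeros(n_features)
    b = 0.0
    return W, b
```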
Parameter initialization is something like the knowledge you had when you were first born.
You really had no idea what was going on, but it was a start!

The Gradient

We use the gradient to learn what values our parameters should take on in order to minimize our cost function.
The formula for the gradient is found by taking the derivative of our cost function with respect to W and b.
The gradient gives us the direction in which we should shift W and b in order to minimize our cost function.
Remember: by minimizing our cost function we are, by definition, reducing the squared distances between our model’s predictions on the training data and the ground truth, y.
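Differentiating the mean squared error with respect to W and b gives something like the following sketch (the factor of 2 comes from differentiating the square; `gradients` is my own name):

```python
import numpy as np

def gradients(X, y, W, b):
    """Gradient of the mean squared error with respect to W and b."""
    m = len(y)
    errors = X @ W + b - y            # shape (m,)
    dW = (2.0 / m) * (X.T @ errors)   # one partial derivative per weight
    db = 2.0 * np.mean(errors)        # partial derivative for the bias
    return dW, db
```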
The Optimization Algorithm

I’ll cover two optimization algorithms here: gradient descent and the normal equation.
Gradient descent involves using the gradient to minimize the cost function over many successive iterations until it converges (i.e. reaches its lowest point).
Gradient descent has two primary hyperparameters that need to be tuned: the number of epochs (i.e. the number of iterations) and the learning rate (a value that scales how much W and b are adjusted on each step).
If the number of epochs and the learning rate are set appropriately, gradient descent should converge on values for W and b.
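Putting the epochs and the learning rate together, gradient descent might be sketched like this (the hyperparameter defaults here are my own illustration):

```python
import numpy as np

def gradient_descent(X, y, epochs=1000, lr=0.01):
    """Step W and b against the gradient of the MSE cost, `epochs` times."""
    W = np.zeros(X.shape[1])   # zero-initialized parameters
    b = 0.0
    for _ in range(epochs):
        errors = X @ W + b - y
        dW = (2.0 / len(y)) * (X.T @ errors)
        db = 2.0 * np.mean(errors)
        W -= lr * dW           # the learning rate scales each adjustment
        b -= lr * db
    return W, b
```

For instance, run on noiseless data drawn from y = 2x + 1, this should recover W close to [2] and b close to 1.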
The normal equation is an alternate technique to solve for W and b.
Note that this algorithm is O(n³), where n is the number of features, since it requires inverting an n × n matrix, so it scales poorly as the number of features grows.
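One common way to write the normal equation in numpy is to append a column of ones for the bias term and solve the resulting linear system (using np.linalg.solve rather than an explicit inverse; `normal_equation` is my own name):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: solve (Xb.T @ Xb) theta = Xb.T @ y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # bias column of ones
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return theta[:-1], theta[-1]                   # W and b
```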
Putting it together

Let’s put it all together and run the model! First, we’ll initialize our parameters, W and b.
Then we’ll run gradient descent with those parameters, and our training data, X and y, giving us our trained W and b.
From there, we can calculate our cost, averaging over all training examples.
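As a rough end-to-end sketch (the synthetic data, hyperparameters, and names here are my own illustration, not the original code):

```python
import numpy as np

def run_linear_regression():
    """Synthesize data, train with gradient descent, and report the cost."""
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 2))
    true_W, true_b = np.array([1.5, -0.5]), 4.0
    y = X @ true_W + true_b + rng.normal(0, 0.1, size=100)  # noisy targets

    W, b = np.zeros(2), 0.0             # initialize parameters
    for _ in range(20000):              # gradient descent
        errors = X @ W + b - y
        W -= 0.01 * (2.0 / len(y)) * (X.T @ errors)
        b -= 0.01 * 2.0 * np.mean(errors)

    final_cost = np.mean((X @ W + b - y) ** 2)  # average over all examples
    return W, b, final_cost
```

With this setup, the trained W and b should land close to the true values used to generate the data, and the final cost should land close to the variance of the added noise.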
See below for an example of what our model can do!

Looking back on the exercise

When I first set out to code up linear regression from scratch, I was disorganized and honestly didn’t have much respect for the problem.
I mean, how hard could it be? It’s just basic linear regression.
And so I just started coding.
Well, I definitely shot myself in the foot by assuming it would be easy.
I didn’t create a very good plan of how I would be keeping my data and my model compatible nor did I really plan out how each of the pieces of the model would fit together.
This caused a few bugs that threw errors and were pretty basic to fix; other bugs, however, were not so easy.
The last bug that I found in my code, for example, occurred when numpy would broadcast an incorrectly calculated dW value across the entire W matrix.
So lessons learned here: (1) keep your dimensions organized and have a plan of how your model and your data will fit together, (2) don’t assume data science is easy.
Overall, implementing linear regression really helped solidify and structure my theoretical understanding of the topic.
I look forward to implementing more complex models in the future!