# Machine Learning Algorithms In Layman’s Terms, Part 1

You could probably do it manually, but it would take forever.

That’s where gradient descent comes in!

Our “line of best fit” is in red above.

It does this by trying to minimize something called RSS (the residual sum of squares), which is basically the sum of the squares of the differences between our dots and our line, i.e., how far away our real data (dots) is from our line (red line).

We get a smaller and smaller RSS by changing where our line is on the graph, which makes intuitive sense — we want our line to be wherever it’s closest to the majority of our dots.
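To make RSS a little more concrete, here is a minimal sketch in Python. The dots are made-up numbers and the function name `rss` is just for illustration; the only thing it assumes is that a line is described by a slope and an intercept.

```python
def rss(xs, ys, slope, intercept):
    """Residual sum of squares for the line y = slope * x + intercept."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = slope * x + intercept
        total += (y - predicted) ** 2  # squared distance from dot to line
    return total

# Made-up dots that lie roughly along y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

close_line = rss(xs, ys, slope=2.0, intercept=0.0)  # hugs the dots
far_line = rss(xs, ys, slope=0.5, intercept=3.0)    # misses most of them
print(close_line, far_line)
```

The line that sits closer to the dots comes back with the much smaller RSS, which is exactly the "smaller and smaller" idea above.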

We can actually take this further and graph the RSS we get for each different line’s parameters on something called a cost curve.
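One rough way to picture a cost curve in code: try a handful of slopes, compute the RSS for each, and look for the lowest one. This sketch keeps the intercept fixed at zero to stay one-dimensional, and the data and slope grid are invented for illustration.

```python
def rss(xs, ys, slope, intercept=0.0):
    """Residual sum of squares for the line y = slope * x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Made-up dots that lie roughly along y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Try slopes 0.0, 0.5, ..., 4.0 and record the RSS for each one
slopes = [i * 0.5 for i in range(9)]
cost_curve = [(slope, rss(xs, ys, slope)) for slope in slopes]

# The bottom of the curve is the candidate with the smallest RSS
best_slope, best_rss = min(cost_curve, key=lambda pair: pair[1])

for slope, cost in cost_curve:
    print(f"slope={slope:.1f}  RSS={cost:.2f}")
```

Plotting `cost_curve` would give the bowl shape: RSS is high for slopes that are too flat or too steep and lowest somewhere in between.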

Using gradient descent, we can get to the bottom of our cost curve.

At the bottom of our cost curve is our lowest RSS!

Gradient Descent visualized (using Matplotlib), from the incredible Data Scientist Bhavesh Bhatt

There are more granular aspects of gradient descent like the “learning rate” or “step size” (i.e., how fast we want to approach the bottom of our skateboard ramp) and the gradient itself (i.e., what direction we need to take to reach the bottom), but in essence: gradient descent gets our line of best fit by minimizing the space between our dots and our line of best fit.
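Here is a minimal sketch of gradient descent fitting a line, assuming made-up data and hypothetical names (`gradient_descent`, `learning_rate`, `steps`). It minimizes the average squared residual, which is just RSS divided by the number of dots, so the bottom of the curve is in the same place. The learning rate controls how big each step is, and the gradient tells us which way is downhill.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = slope * x + intercept by repeatedly stepping downhill."""
    slope, intercept = 0.0, 0.0  # start from an arbitrary line
    n = len(xs)
    for _ in range(steps):
        # How far each dot is from the current line
        residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
        # Gradient of the average squared residual for each parameter
        grad_slope = -2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = -2.0 / n * sum(residuals)
        # Move a small step in the downhill direction
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept

# Made-up dots that lie roughly along y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

slope, intercept = gradient_descent(xs, ys)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

If the learning rate is set too high for a given data set, the steps overshoot the bottom and the numbers blow up, so the values here are a choice that happens to work for this toy data, not a general recommendation.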

Our line of best fit, in turn, allows us to make predictions!”

## Linear Regression

Me-to-grandma:

“Super simply, linear regression is a way we analyze the strength of the relationship between 1 variable (our “outcome variable”) and 1 or more other variables (our “independent variables”).

A hallmark of linear regression, like the name implies, is that the relationship between the independent variables and our outcome variable is linear.

For our purposes, all that means is that when we plot the independent variable(s) against the outcome variable, we can see the points start to take on a line-like shape, like they do below.

(If you can’t plot your data, a good way to think about linearity is by answering the question: does a certain amount of change in my independent variable(s) result in the same amount of change in my outcome variable? If yes, your data is linear!)

This looks a ton like what we did above! That’s because the line of best fit we discussed before IS our “regression line” in linear regression.

The line of best fit shows us the best possible linear relationship between our points.

That, in turn, allows us to make predictions.

Another important thing to know about linear regression is that the outcome variable, or the thing that changes depending on how we change our other variables, is always continuous.

But what does that mean?

Let’s say we wanted to measure what effect elevation has on rainfall in New York State: our outcome variable (or the variable we care about seeing a change in) would be rainfall, and our independent variable would be elevation.

With linear regression, that outcome variable would have to be specifically how many inches of rainfall, as opposed to just a True/False category indicating whether or not it rained at x elevation.

That is because our outcome variable has to be continuous — meaning that it can be any number (including fractions) in a range of numbers.
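As a rough illustration of the rainfall example, the sketch below fits a straight line with the standard least-squares formulas and then asks it for a rainfall estimate. The elevation and rainfall numbers are invented, not real New York data, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented numbers: elevation in feet, yearly rainfall in inches
elevation = [0.0, 500.0, 1000.0, 1500.0, 2000.0]
rainfall = [40.0, 42.5, 44.0, 47.5, 49.0]

slope, intercept = fit_line(elevation, rainfall)

# The outcome is continuous: any number of inches, fractions included
predicted = slope * 1200.0 + intercept
print(f"Estimated rainfall at 1200 ft: {predicted:.2f} inches")
```

Notice that the answer is a number of inches rather than a True/False label, which is the "continuous outcome" requirement in action.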

The coolest thing about linear regression is that it can predict things using the line of best fit that we spoke about before! If we run a linear regression analysis on our rainfall vs. elevation scenario above, we can find the line of best fit like we did in the gradient descent section (this time shown in blue), and then we can use that line to make educated guesses as to how much rain one could reasonably expect at some elevation.”

## Ridge & LASSO Regression

Me, continuing to hopefully-not-too-scared-grandma:

“So linear regression’s not that scary, right? It’s just a way to see what effect something has on something else.

Cool.

Now that we know about simple linear regression, there are even cooler linear regression-like things we can discuss, like ridge regression.

Like gradient descent’s relationship to linear regression, there’s one back-story we need to cover to understand ridge regression, and that’s regularization.

Simply put, data scientists use regularization methods to make sure that their models only pay attention to independent variables that have a significant impact on their outcome variable.

You’re probably wondering why we care if our model uses independent variables that don’t have an impact.

If they don’t have an impact, wouldn’t our regression just ignore them? The answer is no! We can get more into the details of machine learning later, but basically we create these models by feeding them a bunch of “training” data.

Then, we see how good our models are by testing them on a bunch of “test” data.
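The training/test idea can be sketched in a few lines. Everything here is an assumption for illustration: the helper names `train_test_split` and `mean_squared_error`, the 80/20 split, the toy data, and the stand-in model (a real workflow would fit the model on the training rows only).

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold back a fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def mean_squared_error(predict, rows):
    """Average squared gap between the model's guesses and the real outcomes."""
    return sum((y - predict(x)) ** 2 for x, y in rows) / len(rows)

# Toy data: (independent variable, outcome variable) pairs
rows = [(float(x), 2.0 * x + 1.0) for x in range(10)]
train_rows, test_rows = train_test_split(rows)

def model(x):
    # Stand-in for a model that was fitted on train_rows
    return 2.0 * x + 1.0

train_error = mean_squared_error(model, train_rows)
test_error = mean_squared_error(model, test_rows)
print(train_error, test_error)
```

Comparing the two errors is the whole point: a model that does great on `train_rows` but much worse on `test_rows` is the warning sign.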

So, if we train our model with a bunch of independent variables, with some that matter and some that don’t, our model will perform super well on our training data (because we are tricking it into thinking all of what we fed it matters), but super poorly on our test data.

This is because our model isn’t flexible enough to work well on new data that doesn’t have every. single. little. thing we fed it during the training phase.

When this happens, we say that the model is “overfit.”

To understand over-fitting, let’s look at a (lengthy) example:

Let’s say you’re a new mother and your baby boy loves pasta.

As the months go by, you make it a habit to feed your baby pasta with the kitchen window open because you like the breeze.

Then your baby’s cousin gets him a onesie, and you start a tradition of only feeding him pasta when he’s in his special onesie.

Then you adopt a dog who diligently sits beneath the baby’s highchair to catch the stray noodles while he’s eating his pasta.

At this point, you only feed your baby pasta while he’s wearing the special onesie …and the kitchen window’s open …and the dog is underneath the highchair.

As a new mom you naturally correlate your son’s love of pasta with all of these features: the open kitchen window, the onesie, and the dog.

Right now, your mental model of the baby’s feeding habits is pretty complex!

One day, you take a trip to grandma’s.

You have to feed your baby dinner (pasta, of course) because you’re staying the weekend.

You go into a panic because there is no window in this kitchen, you forgot his onesie at home, and the dog is with the neighbors! You freak out so much that you forget all about feeding your baby his dinner and just put him to bed.

Wow.

You performed pretty poorly when you were faced with a scenario you hadn’t faced before.

At home you were perfect at it, though! It doesn’t make sense!

After revisiting your mental model of your baby’s eating habits and disregarding all the “noise,” or things you think probably don’t contribute to your boy actually loving pasta, you realize that the only thing that really matters is that it’s cooked by you.

The next night at grandma’s you feed him his beloved pasta in her windowless kitchen while he’s wearing just a diaper and there’s no dog to be seen.

And everything goes fine! Your idea of why he loves pasta is a lot simpler now.

That is exactly what regularization can do for a machine learning model.

So, regularization helps your model only pay attention to what matters in your data and gets rid of the noise.

On the left: LASSO regression (you can see that the coefficients, represented by the red rungs, can equal zero when they cross the y-axis). On the right: Ridge regression (you can see that the coefficients approach, but never equal zero, because they never cross the y-axis). Meta-credit: “Regularization in Machine Learning” by Prashant Gupta

In all types of regularization, there is something called a penalty term, whose strength is set by a tuning number usually written as the Greek letter lambda: λ.

This penalty term is what mathematically shrinks the effect of the noise in our data.

In ridge regression, sometimes known as “L2 regularization,” the penalty term is λ times the sum of the squared values of the coefficients of your variables.

(Coefficients in linear regression are basically just numbers attached to each independent variable that tell you how much of an effect each will have on the outcome variable. Sometimes we refer to them as “weights.”) In ridge regression, your penalty term shrinks the coefficients of your independent variables, but never actually does away with them totally.

This means that with ridge regression, noise in your data will always be taken into account by your model a little bit.
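To show where the penalty term sits, here is a small sketch of a ridge-style cost: the usual RSS plus λ times the sum of the squared coefficients. The names `rss`, `ridge_cost`, and `lam` are hypothetical, the data is made up, and this only scores a given set of coefficients rather than finding the best ones.

```python
def rss(rows, coefficients, intercept):
    """Residual sum of squares for a model with several independent variables."""
    total = 0.0
    for features, outcome in rows:
        predicted = intercept + sum(c * f for c, f in zip(coefficients, features))
        total += (outcome - predicted) ** 2
    return total

def ridge_cost(rows, coefficients, intercept, lam):
    """RSS plus the L2 penalty: lam times the sum of squared coefficients."""
    penalty = lam * sum(c ** 2 for c in coefficients)
    return rss(rows, coefficients, intercept) + penalty

# Made-up rows: ((feature_1, feature_2), outcome)
rows = [((1.0, 0.0), 2.0), ((0.0, 2.0), 1.0), ((1.0, 2.0), 3.0)]
coefficients = [2.0, 0.5]

no_penalty = ridge_cost(rows, coefficients, intercept=0.0, lam=0.0)
some_penalty = ridge_cost(rows, coefficients, intercept=0.0, lam=1.0)
print(no_penalty, some_penalty)
```

With `lam` set to zero this is plain RSS. As `lam` grows, large coefficients get more expensive, which is what nudges the model toward smaller weights.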

Another type of regularization is LASSO, or “L1” regularization.

In LASSO regularization, instead of penalizing the squared values of your coefficients, the penalty term is λ times the sum of their absolute values.

Additionally, LASSO has the ability to shrink coefficients all the way to zero.

This essentially deletes those features from your data set because they now have a “weight” of zero (i.e., they’re essentially being multiplied by zero).” With LASSO regression, your model has the potential to get rid of almost all of the noise in your dataset.
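The "shrinks but never reaches zero" versus "can hit exactly zero" difference shows up even in a stripped-down case. The sketch below assumes a single coefficient whose un-regularized value is `b`, with ridge minimizing (b − β)² + λβ² and LASSO minimizing (b − β)² + λ|β|. The function names are made up, and real libraries handle many features at once with more general solvers.

```python
def ridge_shrink(b, lam):
    """Minimizer of (b - beta)^2 + lam * beta^2: scaled down, never exactly zero."""
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    """Minimizer of (b - beta)^2 + lam * |beta|: pulled toward zero, clipped at zero."""
    magnitude = max(abs(b) - lam / 2.0, 0.0)
    return magnitude if b >= 0 else -magnitude

# A small "noisy" coefficient and a large "real" one (made-up values)
for b in (0.3, 2.0):
    print(b, ridge_shrink(b, lam=1.0), lasso_shrink(b, lam=1.0))
```

Under these assumptions the small coefficient survives ridge in a reduced form but is set to exactly zero by LASSO, while the large coefficient is only trimmed by both.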

This is super helpful in some scenarios!

## Logistic Regression

Me-to-grandma:

“So, cool, we have linear regression down.

Linear regression = what effect some variable(s) has on another variable, assuming that 1) the outcome variable is continuous and 2) the relationship(s) between the variable(s) and the outcome variable is linear.

But what if your outcome variable is “categorical”? That’s where logistic regression comes in!

Categorical variables are just variables that can only fall within a single category.

Good examples are days of the week: if you have a bunch of data points about things that happened on certain days of the week, there is no possibility that you’ll ever get a datapoint that could have happened sometime between Monday and Tuesday.

If something happened on Monday, it happened on Monday, end of story.

But if we think of how our linear regression model works, how would it be possible for us to figure out a line of best fit for something categorical? It would be impossible! That is why logistic regression models output a probability of your datapoint being in one category or another, rather than a regular numeric value.

That’s why logistic regression models are primarily used for classification.
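A minimal sketch of how a logistic regression model turns a number into a probability: it passes a linear score through the sigmoid (logistic) function, which squashes any value into the range 0 to 1. The weight, bias, and 0.5 cutoff below are hypothetical values picked for illustration, not fitted from data.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, weight, bias):
    """Probability that the datapoint belongs to the 'True' category."""
    return sigmoid(weight * x + bias)

def classify(x, weight, bias, threshold=0.5):
    """Turn the probability into a category using a cutoff."""
    return predict_probability(x, weight, bias) >= threshold

# Hypothetical model: chance that it rained, given elevation in feet
weight, bias = 0.002, -2.0
p_low = predict_probability(0.0, weight, bias)
p_high = predict_probability(2000.0, weight, bias)
print(p_low, p_high, classify(2000.0, weight, bias))
```

The output is always a probability, and the classification step is just a decision about where to draw the line between categories.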

Scary looking graph that’s actually super intuitive if you stare at it long enough.