Foundations of Data Science: Classification and Regression in Machine Learning

Henry Blais
Feb 2

Classification and Regression are two very important concepts for modeling with Machine Learning.
In this short article, I’ll go over the basics of these concepts and how they can be applied to simple Data Science questions.
In this post, I’ll cover:

1. Predictive Modeling
2. Predictions with Classification
3. Predictions with Regression
4. Overlap between types of Predictive Modeling
5. Applications of Classification and Regression

1. Predictive Modeling

At the most basic level, Predictive Modeling is designed to use historical data to predict unknown data.
The unknown data we predict can be almost anything: categorical attributes, approximations of incomplete data, and even future outcomes!

Speaking more practically, making predictions on data always follows the same general structure: we input existing data (X) to a modeling function (f), which outputs new predicted data (y).
Modeling functions are a diverse and complex group of mathematical functions that help us to derive predictions from our data — I’ll go over some simple functions later on in this post and in future articles, but for now just think of them as a lens or a filter: a tool that adjusts what we see and gives new insight.
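
To make the X, f, y structure concrete, here is a minimal sketch in Python. It assumes the simplest relationship I could think of (y = k * x), and the names `fit`, `X_hist` and `y_hist` as well as the numbers are made up for illustration; real modeling functions are usually far richer than this.

```python
def fit(X, y):
    """Learn a modeling function f from historical inputs X and known outcomes y.

    Assumes the relationship y = k * x and picks the k that minimizes
    squared error on the historical data (an illustrative assumption).
    """
    k = sum(x * t for x, t in zip(X, y)) / sum(x * x for x in X)

    def f(X_new):
        # Apply the learned relationship to new, unseen inputs.
        return [k * x for x in X_new]

    return f, k


# Made-up historical data, for illustration only.
X_hist = [1.0, 2.0, 3.0]
y_hist = [2.0, 4.0, 6.0]

f, k = fit(X_hist, y_hist)
y_pred = f([4.0])  # predict for an observation we have not seen before
print(k, y_pred)
```

The point is only the shape of the workflow: historical data goes in, a function f comes out, and f turns new X into predicted y.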

2. Classification Models and Predictions

Predictive modeling can be used for a lot of great applications, one of which is Classification.
Classification models help us to sort observations from our data into discrete, closed categories.

What is a discrete category?

Discrete categories can be understood as labels that either apply or do not apply.
For example, imagine that you just bought a bagel at your favorite bakery, and got back a handful of change after your purchase.
Looking at the coins in your hands, you can see the different kinds of coins and recognize them as distinct from each other.
A quarter and a nickel might have attributes in common (they are round, metal objects…), but in your mind they are recognizably different objects.
Most importantly, a quarter has no attribute of “nickel-ness”; there is no degree to which a quarter is marginally a nickel. A coin can be one or the other, not both.
Classification models are ideal for using all kinds of data to categorize or classify observations into discrete classes.
Returning to the example above, a classification model might make the distinction between coins for you, by deciding for each individual coin, based on the attributes of that coin (X), whether it either IS or IS NOT a quarter (y).
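
Here is a rough sketch of what that coin decision could look like in Python. It is my own stand-in rather than any particular library's method: it labels a coin by finding the closest reference coin (a nearest-neighbor rule), using approximate published US coin diameters and masses. The function names `classify_coin` and `is_quarter` are hypothetical.

```python
import math

# Approximate US coin specifications: (diameter in mm, mass in g) and label.
REFERENCE_COINS = [
    ((19.05, 2.500), "penny"),
    ((21.21, 5.000), "nickel"),
    ((17.91, 2.268), "dime"),
    ((24.26, 5.670), "quarter"),
]


def classify_coin(diameter_mm, mass_g):
    """Return the label of the reference coin closest to the measurement.

    Distances mix millimeters and grams without rescaling, which is good
    enough for this toy example but not something to rely on in general.
    """
    def distance(reference):
        (ref_diameter, ref_mass), _label = reference
        return math.hypot(diameter_mm - ref_diameter, mass_g - ref_mass)

    _features, label = min(REFERENCE_COINS, key=distance)
    return label


def is_quarter(diameter_mm, mass_g):
    """The discrete answer y: the coin either IS or IS NOT a quarter."""
    return classify_coin(diameter_mm, mass_g) == "quarter"


print(is_quarter(24.3, 5.6))  # slightly noisy measurement of a quarter
print(is_quarter(21.2, 5.1))  # slightly noisy measurement of a nickel
```

Notice that the output is always one of the known labels. There is no way for this function to answer "somewhat a quarter".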
All classification models rely on a few baseline assumptions:

- All observations (in this case, coins) must be separable amongst two or more known classes.
- Classification models can have both discrete AND continuous inputs.
- Classification can be used to predict continuous data (more on this later on), but only when that data is encoded, or rephrased, to imply a categorical split.
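
As a small illustration of that last point, here is one way to rephrase a continuous value as a categorical split. The 68-inch (5'8") threshold and the "tall" / "not tall" labels are arbitrary assumptions of mine, and the heights are the five survey heights used later in this post, written in inches.

```python
def encode_height(height_in, threshold_in=68):
    """Rephrase a continuous height (in inches) as one of two discrete classes.

    The threshold is an arbitrary choice made for this example.
    """
    return "tall" if height_in >= threshold_in else "not tall"


heights_in = [65, 70, 69, 73, 77]  # 5'5", 5'10", 5'9", 6'1", 6'5"
labels = [encode_height(h) for h in heights_in]
print(labels)
```

Once the data is encoded like this, a Classification model can predict the label, but the exact height is no longer part of the answer.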
A typical Classification model seeks to answer a question like this:

“For a given population of citizens, which voters are likely to vote for a specific candidate?”

3. Regression Models and Predictions

Regression models work a little differently! Whereas Classification is used to split up observations into separately defined categories, Regression is helpful for predicting a continuous output variable.

What is a continuous variable?

Continuous variables are numeric variables that can take any value within a range of possibilities.
For example, temperature, value, income, height, and weight can all be understood as continuous variables, because a given value for any of these things doesn’t necessarily fall under a categorical label like discrete variables do.
For instance, if you polled all your friends and asked them exactly how tall they were, you might get a range of responses, say, between 5'5" and 6'5", that don’t fall into a category on their own.
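
Here is a minimal sketch of a Regression model predicting a continuous value like height. It fits a straight line by ordinary least squares in plain Python. The input attribute (arm span) and all of the numbers are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data: arm span (inches) and height (inches) of five friends.
arm_spans_in = [64, 68, 70, 72, 76]
heights_in = [65, 69, 70, 73, 77]

slope, intercept = fit_line(arm_spans_in, heights_in)

# Predict the height of a new friend with a 67-inch arm span.
predicted_height_in = slope * 67 + intercept
print(round(predicted_height_in, 1))
```

The prediction can land anywhere on the line, including heights that nobody in the original survey actually had.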
A Regression problem attempts to use existing data to predict continuous variables in the same way that Classification models predict discrete variables.

4. Overlap between types of Predictive Modeling

At this point you might be thinking that Classification and Regression don’t really sound that different at all.
You’re not far off! Conceptually speaking, it’s not hard to see that discrete variables and continuous variables can each be converted from one type to the other.
For example, let’s say you go through with polling five of your friends, and find that they are 5'5", 5'10", 5'9", 6'1" and 6'5" tall respectively.
I’ve already described this as a continuous variable, but couldn’t we just reframe the question a little, and treat each of those heights as a “class” of height for a Classification model? Yes, you absolutely could.
Similarly, you could very easily take the change in your hand in my first example of discrete variables, and rephrase the classes, so that “a quarter” is now considered “$0.
Converting types of variables is a completely valid and useful part of predictive modeling, but as you work with more and larger datasets, you will find that using the correct type of model for the job is absolutely vital.
Let’s look at how applying a model to the wrong kind of predicted variable can cause trouble for an aspiring Data Scientist.

5. Applications of Classification and Regression

Returning to your height survey, let’s say you’ve decided to use a Classification model to predict the height of all your friends and family, based on other data you know about them.
You’ve already “trained” your model with your starting data (the initial survey of your friends) and decided what the height classes will be: 5'5", 5'10", 5'9", 6'1" and 6'5".
If you now want to predict the heights of some other friends, you could apply the same model (more on this in later articles), and see how it does.
But what if members of your new survey group don’t fall into these categories?

If some of these new observations fall into the range between 5'5" and 5'9", for instance, a Classification model will have no choice but to assign one of the classes it already knows, meaning that anyone who stands 5'7" will automatically be mislabeled in your final predictions!

A Classification model definitely CAN be used this way, but it is less accurate and less useful than a Regression model would be. Using the right tool for the job is a matter of getting the best predictions that you can.
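
A quick sketch shows the size of the problem. Even a perfect Classification model can only answer with a class it was trained on, so the best it can do for an in-between height is the nearest known class. The heights are the five survey classes in inches, and `closest_class` is a hypothetical helper standing in for that best case.

```python
# The five height classes from the survey, in inches:
# 5'5", 5'10", 5'9", 6'1", 6'5"
HEIGHT_CLASSES_IN = [65, 70, 69, 73, 77]


def closest_class(true_height_in, classes=HEIGHT_CLASSES_IN):
    """Best label a classifier could possibly assign: the nearest known class."""
    return min(classes, key=lambda c: abs(c - true_height_in))


true_height_in = 67  # 5'7", which is not one of the known classes
best_label_in = closest_class(true_height_in)
error_in = abs(best_label_in - true_height_in)
print(best_label_in, error_in)
```

For a friend who is 5'7", the best available label is still two inches off, no matter how good the model is.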
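The same principle applies to Regression models: asked to predict a discrete label, a Regression model can return in-between values that don’t correspond to any real category.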
One final note here: you could expand on your survey, and include every conceivable height that your imagination and math can define as a possible height for your friends.
If you went down this rabbit hole, you might find yourself including millions of classes for your Classification model to learn (e.g., 5 feet and one one-millionth of an inch tall), before you try to make new predictions.
In this case, congratulations! You’ve just reinvented the wheel, because what you’re trying to re-create is essentially a Regression model.
By stratifying your classes to this degree, you’re basically simulating a continuous variable, and you would be better served — in terms of your time and the accuracy of your predictions — by just using a Regression model instead.
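
To put rough numbers on that rabbit hole, here is a small back-of-the-envelope sketch. It assumes evenly spaced height classes covering the 5'5" to 6'5" range from the survey example; the function names are hypothetical.

```python
LOW_IN, HIGH_IN = 65, 77  # 5'5" to 6'5", in inches


def class_count(classes_per_inch):
    """Number of evenly spaced classes needed to cover the whole range."""
    return (HIGH_IN - LOW_IN) * classes_per_inch + 1


def worst_case_error_in(classes_per_inch):
    """Largest gap between a true height and its nearest class, in inches."""
    return 0.5 / classes_per_inch


for per_inch in (1, 10, 1_000_000):
    print(per_inch, class_count(per_inch), worst_case_error_in(per_inch))
```

At one class per millionth of an inch, the model has to learn about twelve million classes just to cover one foot of height, which is exactly the situation where a Regression model is the simpler choice.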
Thanks for reading! In my next posts I’ll delve into the practical applications of some common Regression and Classification models, so you can see how these work on some real data!