What this means is that the decision tree tries to form nodes containing a high proportion of samples (data points) from a single class by finding values in the features that cleanly divide the data into classes. We'll talk in low-level detail about Gini Impurity later, but first, let's build a decision tree so we can understand it at a high level.

Decision Tree on a Simple Problem

We'll start with a very simple binary classification problem, shown below. The goal is to divide the data points into their respective classes. Our data has only two features (predictor variables), x1 and x2, and 6 data points (samples) divided into 2 different labels. Effectively, a decision tree is a non-linear model built by constructing many linear boundaries.

To create a decision tree and train (fit) it on the data, we use Scikit-Learn. During training we give the model both the features and the labels, so it can learn to classify points based on the features. To classify a new point, simply move down the tree, using the point's features to answer the questions, until you arrive at a leaf node; the class of that leaf node is the prediction.

To see the tree in a different way, we can draw the splits built by the decision tree on the original data.

Splits made by the decision tree.

Each split is a single line that divides data points into nodes based on feature values.

An inflexible model may not have the capacity to fit even the training data, and in both cases (high variance and high bias) the model is not able to generalize well to new data. The balance between a model so flexible that it memorizes the training data and a model so inflexible that it cannot even learn the training data is known as the bias-variance tradeoff, a foundational concept in machine learning.

The reason the decision tree is prone to overfitting when we don't limit the maximum depth is that it has unlimited flexibility: it can keep growing until it has exactly one leaf node for every single observation, perfectly classifying all of them. By limiting the maximum depth, we have reduced the variance of the decision tree, but at the cost of increasing the bias.

As an alternative to limiting the depth of the tree, which reduces variance (good) and increases bias (bad), we can combine many decision trees into a single ensemble model known as the random forest.

Random Forest

The random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees (which we could call a "forest"), this model uses two key concepts that give it the name random:

1. Random sampling of training data points when building trees
2. Random subsets of features considered when splitting nodes

Random sampling of training observations

When training, each tree in a random forest learns from a random sample of the data points. This procedure of training each individual learner on a different bootstrapped subset of the data and then averaging the predictions is known as bagging, short for bootstrap aggregating.

Random subsets of features for splitting nodes

The other main concept in the random forest is that only a subset of all the features is considered when splitting each node in each decision tree. (The random forest can also be trained considering all the features at every node, as is common in regression.
These options can be controlled in the Scikit-Learn random forest implementation.)

If you can understand a single decision tree, the idea of bagging, and random subsets of features, then you have a pretty good understanding of how a random forest works: the random forest combines hundreds or thousands of decision trees, trains each one on a slightly different set of the observations, and splits nodes in each tree considering only a limited number of the features. The final predictions of the random forest are made by averaging the predictions of each individual tree.

To understand why a random forest is better than a single decision tree, imagine the following scenario: you have to decide whether Tesla stock will go up, and you have access to a dozen analysts who have no prior knowledge about the company.

Once we have the testing predictions, we can calculate the ROC AUC.

Results

The final testing ROC AUC for the random forest was 0.87, compared to 0.67 for the single decision tree with an unlimited max depth. If we look at the training scores, both models achieved 1.0 ROC AUC, which again is as expected, because we gave these models the training answers and did not limit the maximum depth of each tree. Although the random forest overfits (doing better on the training data than on the testing data), it is able to generalize much better to the testing data than the single decision tree. The random forest has lower variance (good) while maintaining the same low bias (also good) as a single decision tree.

We can also plot the ROC curve for the single decision tree (top) and the random forest (bottom). A curve toward the top and left indicates a better model.

Decision Tree ROC Curve

Random Forest ROC Curve

The random forest significantly outperforms the single decision tree.

Another diagnostic measure we can take is to plot the confusion matrix for the testing predictions (see the notebook for details). This shows the predictions the model got correct in the top-left and bottom-right corners and the predictions missed by the model in the lower-left and upper-right corners.
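As a concrete sketch of the single-tree workflow described earlier (fit on features and labels, then walk a new point down the tree), here is a minimal Scikit-Learn example. The six-point, two-feature dataset is invented for illustration; it is not the article's actual data:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: 6 samples with 2 features (x1, x2) -- made up for illustration
X = [[0.5, 0.5], [1.0, 1.5], [1.5, 1.0],
     [2.5, 2.0], [3.0, 3.0], [3.5, 2.5]]
y = [0, 0, 0, 1, 1, 1]

# With no max_depth limit, the tree grows until every leaf is pure,
# which is why it can perfectly classify all of the training data
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X, y)  # training: features and labels together

# A new point is classified by answering the tree's questions
# until it reaches a leaf; that leaf's class is the prediction
print(tree.predict([[1.0, 1.0]]))
print(tree.score(X, y))  # 1.0 on the training data
```

Passing `max_depth` to the constructor is how the depth-limiting trade-off discussed above (lower variance, higher bias) is applied in practice.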
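The two "random" concepts (bootstrapped rows per tree, a feature subset per split) map directly onto Scikit-Learn parameters. A minimal sketch, using synthetic data in place of the article's dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration (the article uses its own dataset)
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + X[:, 1] + 0.5 * rng.randn(200) > 0).astype(int)

# bootstrap=True: each tree trains on a bootstrapped sample of rows (bagging)
# max_features="sqrt": each split considers a random subset of the features
forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, max_features="sqrt", random_state=0
)
forest.fit(X, y)

# max_features=None would instead consider every feature at each split,
# as is common in regression forests
print(len(forest.estimators_))  # the individual trees in the ensemble
```

The final prediction averages the per-tree predicted probabilities, which is what `forest.predict` and `forest.predict_proba` do internally.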
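The evaluation in the Results section can be sketched as follows. Note this uses synthetic stand-in data, so the numbers it prints will not match the article's 0.87 vs. 0.67 figures, which come from the article's own dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration only
rng = np.random.RandomState(1)
X = rng.randn(500, 8)
y = (X[:, :3].sum(axis=1) + rng.randn(500) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Neither model limits max depth, mirroring the article's comparison
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
forest = RandomForestClassifier(
    n_estimators=100, random_state=1
).fit(X_train, y_train)

# ROC AUC is computed from predicted probabilities, not hard labels
tree_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
forest_auc = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])
print(f"tree AUC: {tree_auc:.2f}, forest AUC: {forest_auc:.2f}")

# Confusion matrix: correct predictions sit on the diagonal
# (top-left and bottom-right); misses sit off the diagonal
cm = confusion_matrix(y_test, forest.predict(X_test))
print(cm)
```

The averaging across many decorrelated trees is what typically lifts the forest's testing AUC above the single overfit tree's.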