We try every possible split for the 6 datapoints we have and realize that y=2 is the best split.
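The split search the text describes can be sketched in a few lines of Python. The coordinates below are invented stand-ins for the article's datapoints (the figures aren't reproduced here), assuming 3 blues below y = 2 and 3 reds above it; `gini` and `gini_gain` are my own helper names:

```python
def gini(labels):
    """Gini Impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(parent, left, right):
    """Drop in weighted Gini Impurity achieved by splitting parent into left/right."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical stand-ins for the 6 datapoints (these coordinates are made up):
# 3 blues below y = 2 and 3 reds above it.
points = [(1, 1.0, 'blue'), (2, 0.5, 'blue'), (3, 1.5, 'blue'),
          (1, 3.0, 'red'),  (2, 2.5, 'red'),  (3, 3.5, 'red')]

parent = [c for (_, _, c) in points]
best = None
for threshold in (0.75, 1.25, 2.0, 2.75, 3.25):  # midpoints between consecutive y values
    left = [c for (_, y, c) in points if y < threshold]
    right = [c for (_, y, c) in points if y >= threshold]
    gain = gini_gain(parent, left, right)
    if best is None or gain > best[1]:
        best = (threshold, gain)

print(best)  # (2.0, 0.5): y = 2 gives the largest Gini Gain
```

With this toy data, splitting at y = 2 puts all blues on one side and all reds on the other, so both children are pure and the gain is maximal.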
We make that into a decision node and now have this:

Our decision tree is almost done…

2.3 Training a Decision Tree: When to Stop?

Let’s keep it going and try to make a third decision node.
We’ll use the right branch from the root node this time.
The only datapoints in that branch are the 3 greens.
Again, we try all the possible splits, but they all

- Are equally good.
- Have a Gini Gain of 0 (the Gini Impurity was already 0 and can’t go any lower).
It doesn’t make sense to add a decision node here because doing so wouldn’t improve our decision tree.
Thus, we’ll make this node a leaf node and slap the Green label on it.
This means that we’ll classify any datapoint that reaches this node as Green.
If we continue to the 2 remaining nodes, the same thing will happen: we’ll make the bottom left node our Blue leaf node, and we’ll make the bottom right node our Red leaf node.
That brings us to the final result:

Once all possible branches in our decision tree end in leaf nodes, we’re done.
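Putting the pieces together, the whole training loop this section walks through (find the best split, recurse into each branch, and stop with a leaf as soon as no split has a positive Gini Gain) can be sketched from scratch. The datapoints below are hypothetical stand-ins for the article's colored points, and all function names are my own:

```python
from collections import Counter

def gini(labels):
    """Gini Impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(points):
    """Try every axis-aligned split on both features; return (gain, axis, threshold)."""
    parent = [p[-1] for p in points]
    best = None
    for axis in (0, 1):
        for threshold in sorted({p[axis] for p in points}):
            left = [p[-1] for p in points if p[axis] < threshold]
            right = [p[-1] for p in points if p[axis] >= threshold]
            if not left or not right:
                continue
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(points)
            gain = gini(parent) - weighted
            if best is None or gain > best[0]:
                best = (gain, axis, threshold)
    return best

def build_tree(points):
    """Recursive training: emit a leaf as soon as no split has positive Gini Gain."""
    split = best_split(points)
    if split is None or split[0] <= 0:
        labels = [p[-1] for p in points]
        return Counter(labels).most_common(1)[0][0]  # leaf: the majority label
    _, axis, threshold = split
    left = build_tree([p for p in points if p[axis] < threshold])
    right = build_tree([p for p in points if p[axis] >= threshold])
    return (axis, threshold, left, right)

def classify(tree, point):
    """Descend decision nodes until we reach a leaf label."""
    while isinstance(tree, tuple):
        axis, threshold, left, right = tree
        tree = left if point[axis] < threshold else right
    return tree

# Hypothetical data: blues bottom-left, reds top-left, greens on the right.
points = [(1, 1, 'blue'), (2, 1, 'blue'), (1, 0, 'blue'),
          (1, 3, 'red'), (2, 3, 'red'), (1, 4, 'red'),
          (4, 1, 'green'), (4, 3, 'green'), (5, 2, 'green')]
tree = build_tree(points)
```

The stop rule is exactly the one described above: once a node's points are pure (or no split helps), `best_split` can't produce a positive gain, so the node becomes a leaf.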
We’ve trained a decision tree!

3. Random Forests

We’re finally ready to talk about Random Forests.
Remember what I said earlier?

A Random Forest is actually just a bunch of Decision Trees bundled together.

That’s true, but it’s a bit of a simplification.
3.1 Bagging

Consider the following algorithm to train a bundle of decision trees given a dataset of n points:

1. Sample, with replacement, n training examples from the dataset.
2. Train a decision tree on the n samples.
3. Repeat t times, for some t.
To make a prediction using this model with t trees, we aggregate the predictions from the individual decision trees and either

- Take the majority vote if our trees produce class labels (like colors).
- Take the average if our trees produce numerical values (e.g. when predicting temperature, price, etc.).
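Here's a minimal sketch of bagging for class labels, using scikit-learn's `DecisionTreeClassifier` as the base tree (the article points to scikit-learn later; the helper names and toy data are my own):

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, t, seed=0):
    """Train t decision trees, each on a bootstrap sample of the n datapoints."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(t):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n points WITH replacement
        trees.append(DecisionTreeClassifier().fit([X[i] for i in idx],
                                                  [y[i] for i in idx]))
    return trees

def bagged_predict(trees, x):
    """Aggregate by majority vote across the individual trees."""
    votes = [tree.predict([x])[0] for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Toy, clearly separated data (made up for illustration):
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ['blue', 'blue', 'blue', 'red', 'red', 'red']
trees = bagged_trees(X, y, t=25)
```

For a regression forest you'd swap the majority vote for a plain average of the trees' numeric outputs.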
This technique is called bagging, or bootstrap aggregating.
The sampling with replacement we did is known as a bootstrap sample.
Bagged decision trees are very close to Random Forests — they’re just missing one thing…

3.2 Bagging → Random Forest

Bagged decision trees have only one parameter: t, the number of trees.
Random Forests have a second parameter that controls how many features to try when finding the best split.
Our simple dataset for this tutorial only had 2 features (x and y), but most datasets will have far more (hundreds or thousands).
Suppose we had a dataset with p features.
Instead of trying all features every time we make a new decision node, we only try a subset of the features.
We do this primarily to inject randomness that makes individual trees more unique and reduces correlation between trees, which improves the forest’s performance overall.
This technique is sometimes referred to as feature bagging.
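A tiny sketch of that idea, using only the standard library; the function name is hypothetical, and √p is a common choice of subset size rather than something the text fixes:

```python
import math
import random

def features_to_try(p, rng):
    """Feature bagging: at each new decision node, search only this random
    subset of the p feature indices for the best split. sqrt(p) is a common
    subset size, but it's a tunable parameter, not a fixed rule."""
    k = max(1, round(math.sqrt(p)))
    return rng.sample(range(p), k)

rng = random.Random(0)
print(features_to_try(100, rng))  # 10 distinct feature indices out of 100
```

In scikit-learn, this knob is the `max_features` parameter of `RandomForestClassifier` (e.g. `max_features='sqrt'`).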
4. Now What?

That’s a beginner’s introduction to Random Forests! A quick recap of what we did:

- Introduced decision trees, the building blocks of Random Forests.
- Learned how to train decision trees by iteratively making the best split possible.
- Defined Gini Impurity, a metric used to quantify how “good” a split is.
- Saw that a random forest = a bunch of decision trees.
- Understood how bagging combines predictions from multiple trees.
- Learned that feature bagging is the difference between bagged decision trees and a random forest.
A few things you could do from here:

- Experiment with scikit-learn’s DecisionTreeClassifier and RandomForestClassifier classes on real datasets.
- Try writing a simple Decision Tree or Random Forest implementation from scratch. I’m happy to give guidance or code review! Just tweet at me or email me.
- Read about Gradient Boosted Decision Trees and play with XGBoost, a powerful gradient boosting library.
- Read about ExtraTrees, an extension of Random Forests, or play with scikit-learn’s ExtraTreesClassifier class.
That concludes this tutorial.
I like writing about Machine Learning (but also other topics), so subscribe if you want to get notified about new posts.
Thanks for reading!

Originally published at victorzhou.