Why and How to do Cross Validation for Machine Learning

George Seif · May 24

Cross-validation is a statistical technique for testing the performance of a Machine Learning model.
In particular, a good cross validation method gives us a comprehensive measure of our model’s performance throughout the whole dataset.
All cross validation methods follow the same basic procedure:

(1) Divide the dataset into 2 parts: training and testing
(2) Train the model on the training set
(3) Evaluate the model on the testing set
(4) Optionally, repeat steps 1 to 3 for a different split of the data points

More thorough cross validation methods will include step 4, since the resulting measurement is more robust to the biases that can come with selecting a particular split.
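The four steps above can be sketched in a few lines of Python. This is a toy illustration, not a real model: the "model" here simply memorises the mean of the training targets, and the function name is a placeholder of my own.

```python
import numpy as np

def simple_cross_validate(y, train_fraction=0.8, n_repeats=3, seed=0):
    """Sketch of the generic procedure: split, train, evaluate, repeat."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(n * train_fraction)
    scores = []
    for _ in range(n_repeats):
        # (1) Divide the dataset into training and testing parts
        idx = rng.permutation(n)
        train_idx, test_idx = idx[:n_train], idx[n_train:]
        # (2) "Train": here, just memorise the mean of the training targets
        prediction = y[train_idx].mean()
        # (3) Evaluate on the testing set (mean squared error)
        mse = np.mean((y[test_idx] - prediction) ** 2)
        scores.append(mse)
        # (4) Loop again with a different random split
    return scores

y = np.arange(100, dtype=float)
scores = simple_cross_validate(y)
print(scores)  # one MSE per repeat
```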
Bias that comes from selecting a particular part of the data is known as Selection Bias.
Such methods will take more time since the model will be trained and validated multiple times.
But it does offer the significant advantage of being more thorough as well as having the chance to potentially find a split that squeezes out that last bit of accuracy.
Aside from Selection Bias, cross validation also helps us with avoiding overfitting.
By dividing the dataset into a train and validation set, we can concretely check how our model performs on data it has never seen during training.
Without cross validation, we would never know whether our model generalises to the wider world or only shines on our sheltered training set! With all that theory out of the way, let’s take a look at 3 common cross validation techniques.
[Figure: an illustration of how overfitting works]

Holdout

Holdout cross validation is the simplest and most common.
We simply split the data into two sets: train and test.
The train and test data must not have any of the same data points.
Generally, this split will be close to 85% of the data for training and 15% of the data for testing.
The diagram below illustrates how holdout cross validation would work.
The advantage of using a very simple holdout cross validation is that we only need to train one model.
If it performs well enough, we can go ahead and use it in whatever application we intended to.
This is perfectly suitable as long as your dataset is relatively uniform in terms of distribution and “difficulty.”

The danger and disadvantage of holdout cross validation arises when the dataset is not completely even.
In splitting our dataset we may end up with a training set that is very different from the test set, or noticeably easier or harder.
Thus the single test that we perform with holdout isn’t comprehensive enough to properly evaluate our model.
We end up with bad things like overfitting or inaccurately measuring our model’s projected real-world performance.
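A minimal sketch of a holdout split, using the 85/15 proportions from above on a tiny synthetic dataset (the data and model here are illustrative stand-ins):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = np.arange(100).reshape(-1, 1).astype(float)
y = 2 * X.ravel() + 1  # a perfectly linear toy target

# 85/15 holdout split, shuffled so train and test don't share points
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)  # R^2 on the held-out 15%
print(f"test R^2: {score:.3f}")
```

Note that this trains exactly one model, which is the whole appeal of holdout.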
How to implement holdout cross validation in Scikit-Learn: sklearn.model_selection.train_test_split

K-Fold Cross Validation

Using K-Fold Cross Validation will help you get past a lot of the drawbacks that come with using Holdout.
With K-Fold, we’re going to randomly split our dataset into K equally sized parts.
We will then train our model K times.
For each training run, we select a single partition from our K parts to be the test set and use the rest for training.
For example, if we set K = 10 as in the example below, then we will train 10 models.
Each model will be trained on a unique training set — the parts shown in blue.
Each model will also be tested on a unique test set — the parts shown in green.
To obtain a final accuracy measure, we average out the results of each model evaluated on their respective test sets.
The big advantage that comes with K-Fold Cross Validation is that it’s much less prone to selection bias, since training and testing are performed on several different parts of the dataset.
In particular, if we increase the value of K, we can be even more sure of the robustness of our model since we’ve trained and tested on so many different sub-datasets.
The only possible drawback of this method is that as we gain robustness by increasing K, we also have to train more models, a potentially tedious and expensive process.
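The K = 10 example described above might look like this in code; the dataset and model are toy stand-ins, but the fold logic is the real thing:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

X = np.arange(100).reshape(-1, 1).astype(float)
y = 2 * X.ravel() + 1

kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(X):
    # each of the 10 models trains on 9 folds and tests on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# average over the 10 test folds for the final accuracy measure
print(f"mean R^2 over {len(scores)} folds: {np.mean(scores):.3f}")
```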
How to implement K-Fold cross validation in Scikit-Learn: sklearn.model_selection.KFold

Repeated random sub-sampling

Repeated random sub-sampling is perhaps the most robust of cross validation methods.
Similar to K-Fold, we set a value for K which signifies the number of times we will train our model.
However, in this case K will not represent the number of equally sized partitions.
Instead, on each training iteration, we randomly select points to form the testing set. The number of points we select is determined by the percentage we set for the testing set. For example, if we select 15%, then on each training iteration we will randomly choose 15% of the points in our dataset to be set aside for testing.
The rest of the procedure continues the same way as K-Fold.
Train on the training set, test each model on its unique test set, average out the results at the end to obtain a final accuracy.
The clear advantage of this method over K-Fold is that the proportion of the train-test split is not dependent on the number of iterations.
We can even set different percentages for the test set on each iteration if we wanted to.
Randomisation may also be more robust to selection bias.
The disadvantage of this method is that some points may never be selected to be in the test subset at all — at the same time, some points might be selected multiple times.
This is a direct result of the randomisation.
Yet with K-Fold there is a guarantee that all points will at some time be tested on.
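Scikit-Learn’s ShuffleSplit is one way to express this procedure: 10 iterations, each drawing a fresh random 15% as the test set, independent of the iteration count. As before, the dataset and model are toy placeholders:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import LinearRegression

X = np.arange(100).reshape(-1, 1).astype(float)
y = 2 * X.ravel() + 1

# 10 iterations; 15% of the points are drawn at random for each test set,
# so the test fraction does not depend on the number of iterations
ss = ShuffleSplit(n_splits=10, test_size=0.15, random_state=42)
scores = []
for train_idx, test_idx in ss.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(f"mean R^2: {np.mean(scores):.3f}")
```

Because each split is drawn independently, a given point may land in several test sets or in none, exactly the trade-off described above.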