Well, all it really means is that your training and validation data sets are similar, i.e. they follow the same distributions or patterns.
If that is not the case, then you’re training your model on apples but trying to predict on oranges.
The result will be very poor predictions.
You could do lots of Exploratory Data Analysis (EDA) and check that each feature behaves similarly across both datasets.
But that could be really time-consuming.
A neat and quick way of testing whether you have a representative or good validation set is to run a Random Forest Classifier.
In this Kaggle kernel I did exactly that.
I first prepared both training and validation data and then added an extra column ‘train’, which takes the value of 1 when the data is training data and 0 when it is validation data.
This is the target that the Random Forest Classifier is going to predict.
    import pandas as pd

    # Create the new target
    train_set['train'] = 1
    validation_set['train'] = 0

    # Concatenate the two datasets
    train_validation = pd.concat([train_set, validation_set], axis=0)

The next step is to get your independent (X) and dependent (y) features ready, set up the Random Forest Classifier, and run cross validation.
I am using the metric ROC AUC, which is a common metric for classification tasks.
If the metric is 1 then you’re predicting perfectly.
If the score is 0.5 then you’re as good as the baseline, which is the score that you would get if you always predicted the most common outcome.
If the score is below 0.5 then you’re doing something wrong.
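To make those three reference points concrete, here is a minimal sketch in plain Python. It relies on the standard reading of ROC AUC as the fraction of (positive, negative) pairs in which the positive row gets the higher score, with ties counting as half. The function name `roc_auc` and the toy labels and scores are made up for illustration; in practice you would use scikit-learn's implementation.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical toy labels: three positives followed by three negatives
labels = [1, 1, 1, 0, 0, 0]

perfect = roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])   # every positive outranks every negative
constant = roc_auc(labels, [0.5, 0.5, 0.5, 0.5, 0.5, 0.5])  # same prediction for every row
inverted = roc_auc(labels, [0.1, 0.2, 0.3, 0.7, 0.8, 0.9])  # ranking is exactly backwards

print(perfect, constant, inverted)  # 1.0 0.5 0.0
```

The constant predictor lands on 0.5 because every pair is a tie, which is why 0.5 is the "no information" baseline, and a score well below 0.5 usually means the labels or scores are flipped somewhere.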
    # Import the libraries
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Split up the dependent and independent variables
    X = train_validation.drop('train', axis=1)
    y = train_validation['train']

    # Set up the model
    rfc = RandomForestClassifier(n_estimators=10, random_state=1)

    # Run cross validation
    cv_results = cross_val_score(rfc, X, y, cv=5, scoring='roc_auc')

Now, what do you think the ROC AUC should be if the training and validation sets behave the same way? … That’s right, 0.5! If the score is 0.5 then it means that the training and validation data are indistinguishable, which is what we want.
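If you want to convince yourself of this, here is a rough sketch on synthetic data rather than the competition data. It runs the same check twice: once with a validation set drawn from the same distribution as the training set, and once with a validation set whose features are shifted. The helper name `train_vs_validation_auc`, the column names, the random data and the choice of 50 trees are all assumptions made for this illustration and are not taken from the kernel.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def train_vs_validation_auc(train_set, validation_set, seed=1):
    """Mean cross-validated ROC AUC for telling training rows from validation rows."""
    # Label where each row came from; assign() returns copies, so the inputs stay untouched
    combined = pd.concat(
        [train_set.assign(train=1), validation_set.assign(train=0)],
        axis=0,
        ignore_index=True,
    )
    X = combined.drop('train', axis=1)
    y = combined['train']
    rfc = RandomForestClassifier(n_estimators=50, random_state=seed)
    # With a classifier, cv=5 uses stratified folds, so each fold contains both labels
    return cross_val_score(rfc, X, y, cv=5, scoring='roc_auc').mean()


rng = np.random.default_rng(0)
columns = ['feature_a', 'feature_b', 'feature_c']  # hypothetical feature names

# Hypothetical data: training and validation drawn from the same distribution
train_set = pd.DataFrame(rng.normal(size=(1000, 3)), columns=columns)
similar_validation = pd.DataFrame(rng.normal(size=(1000, 3)), columns=columns)

# Hypothetical data: a validation set whose features are shifted by 3 standard deviations
shifted_validation = pd.DataFrame(rng.normal(loc=3.0, size=(1000, 3)), columns=columns)

auc_similar = train_vs_validation_auc(train_set, similar_validation)
auc_shifted = train_vs_validation_auc(train_set, shifted_validation)

print(f"Same distribution:    {auc_similar:.2f}")
print(f"Shifted distribution: {auc_shifted:.2f}")
```

The first score should hover around 0.5 and the second should sit close to 1. On real data the interesting cases are in between: the further the score climbs above 0.5, the easier it is for the model to tell the two sets apart.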
Once we have run cross validation, let’s get the scores… And great news! The score is indeed 0.5.
That means the Kaggle hosts have set up a representative validation set for us.
Sometimes that’s not the case and this is a great quick way of checking this.
In real life, however, you have to come up with a validation set yourself, and this check will hopefully come in handy to make sure that the one you set up is representative.