How to Create a Random-Split, Cross-Validation, and Bagging Ensemble for Deep Learning in Keras

It is likely that there will be a point of diminishing returns, after which the addition of further members no longer changes the performance of the ensemble. Nevertheless, we can evaluate ensemble sizes from 1 to 10 and plot their performance on the unseen holdout dataset.

We can also evaluate each model on the holdout dataset and average these scores to get a much better approximation of the true performance of the chosen model on the prediction problem. Finally, we can compare this more robust estimate of the general performance of an average model against the performance of each ensemble size on the holdout dataset.

Tying all of this together, the complete example is listed below.

Running the example first fits and evaluates 10 models on 10 different random splits of the dataset into train and test sets. From these scores, we estimate that an average model fit on the dataset will achieve an accuracy of about 83% with a standard deviation of about 1.9%.

We then evaluate the performance of each model on the unseen holdout dataset, and the performance of ensembles of 1 to 10 models. From these scores, we can see that a more accurate estimate of the performance of an average model on this problem is about 82%, and that the earlier estimate was optimistic. Much of the difference between the accuracy scores lies in fractions of a percent.

A graph is created showing the accuracy of each individual model on the unseen holdout dataset as blue dots, and the performance of an ensemble with a given number of members, from 1 to 10, as an orange line with dots. We can see that, at least in this case, using an ensemble of 4 to 8 members results in an accuracy that is better than most of the individual runs (the orange line is above many blue dots).

Line Plot Showing Single Model Accuracy (blue dots) vs Accuracy of Ensembles of Varying Size for Random-Split Resampling
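The ensemble-scoring step can be sketched as follows. This is a minimal sketch, not the tutorial's full listing: it assumes the class probabilities predicted by each model on the holdout set have already been collected into a single array, and the array values below are toy stand-ins rather than real model outputs.

```python
import numpy as np

def ensemble_accuracy(prob_stack, y_true, n_members):
    # average the class probabilities of the first n_members models,
    # then take the argmax as the ensemble prediction (a soft vote)
    mean_probs = prob_stack[:n_members].mean(axis=0)  # (n_samples, n_classes)
    y_pred = mean_probs.argmax(axis=1)
    return float((y_pred == y_true).mean())

# toy stand-in: 10 "models" x 5 holdout samples x 3 classes
rng = np.random.default_rng(1)
prob_stack = rng.dirichlet(np.ones(3), size=(10, 5))
y_true = np.array([0, 1, 2, 1, 0])

# score ensembles of increasing size, as in the line plot
for n in range(1, 11):
    print('%d members: %.3f' % (n, ensemble_accuracy(prob_stack, y_true, n)))
```

Averaging probabilities before taking the argmax (a soft vote) tends to give smoother behavior as members are added than majority voting on hard class labels.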
The graph does show that some individual models can perform better than an ensemble of models (blue dots above the orange line), but we are unable to identify these models in advance. Here, we demonstrate that, without additional data (e.g. an out-of-sample dataset), an ensemble of 4 to 8 members will give better average performance than a randomly selected model trained on a single train-test split. More repeats (e.g. 30 or 100) may result in more stable ensemble performance.

A problem with repeated random splits as a resampling method for estimating the average performance of a model is that it is optimistic. An approach designed to be less optimistic, and widely used as a result, is the k-fold cross-validation method. The method is less biased because each example in the dataset is used exactly once in a test set to estimate model performance, unlike random train-test splits, where a given example may be used to evaluate a model many times.

The procedure has a single parameter called k that refers to the number of groups into which a given data sample is to be split. The average of the scores of each model provides a less biased estimate of model performance. A typical value for k is 10.

Because neural network models are computationally very expensive to train, it is common to use the best-performing model during cross-validation as the final model. Alternately, the models resulting from the cross-validation process can be combined to provide a cross-validation ensemble that is likely to have better performance, on average, than a given single model.

We can use the KFold class from scikit-learn to split the dataset into k folds. It takes as arguments the number of splits, whether or not to shuffle the sample, and the seed for the pseudorandom number generator used prior to the shuffle. Once the class is instantiated, it can be enumerated to get each split of indexes into the dataset for the train and test sets. Once the scores are calculated on each fold, their average can be used to report the expected performance of the approach.
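A minimal sketch of enumerating the splits is shown below; the toy data sample and the choice of 3 splits are illustrative only, standing in for the tutorial's training data and k=10.

```python
from numpy import array
from sklearn.model_selection import KFold

# a toy data sample; in the tutorial this would be the training inputs
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# 3 splits, shuffled, with a fixed seed for the pseudorandom generator
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# each iteration yields index arrays into the dataset for train and test
for train_ix, test_ix in kfold.split(data):
    print('train: %s, test: %s' % (data[train_ix], data[test_ix]))
```

Because shuffle=True, fixing random_state makes the fold assignment reproducible across runs, which matters when the saved per-fold models are later reloaded for the ensemble.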
Now that we have collected the 10 models evaluated on the 10 folds, we can use them to create a cross-validation ensemble. It may seem intuitive to use all 10 models in the ensemble; nevertheless, we can evaluate the accuracy of each ensemble of 1 to 10 members, as we did in the previous section. The complete example of analyzing the cross-validation ensemble is listed below.

Running the example first prints the performance of each of the 10 models on each of the folds of the cross-validation. The average performance of these models is reported as about 82%, which appears to be less optimistic than the random-splits approach used in the previous section. Next, each of the saved models is evaluated on the unseen holdout set. The average of these scores is also about 82%, highlighting that, at least in this case, the cross-validation estimate of the general performance of the model was reasonable.

A graph of single model accuracy (blue dots) and ensemble size vs. accuracy (orange line) is created. As in the previous example, the real difference between the performance of the models is in the fractions of a percent of model accuracy. The orange line shows that as the number of members increases, the accuracy of the ensemble increases to a point of diminishing returns. We can see that, at least in this case, using four or more of the models fit during cross-validation in an ensemble gives better performance than almost all individual models. We can also see that a default strategy of using all models in the ensemble would be effective.

Line Plot Showing Single Model Accuracy (blue dots) vs Accuracy of Ensembles of Varying Size for Cross-Validation Resampling

A limitation of random splits and k-fold cross-validation from the perspective of ensemble learning is that the models are very similar. The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples. Importantly, samples are
constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen. This allows a given observation to be included in a given small sample more than once. This approach to sampling is called sampling with replacement.

The method can be used to estimate the performance of neural network models. Examples not selected in a given sample can be used as a test set to estimate the performance of the model. The bootstrap is a robust method for estimating model performance. It does suffer a little from an optimistic bias, but it is often almost as accurate as k-fold cross-validation in practice.

The benefit for ensemble learning is that each data sample is biased, allowing a given example to appear many times in the sample. This, in turn, means that the models trained on those samples will be biased, importantly in different ways. The result can be ensemble predictions that are more accurate. Generally, use of the bootstrap method in ensemble learning is referred to as bootstrap aggregation, or bagging.

We can use the resample() function from scikit-learn to select a subsample with replacement. More details
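The sampling-with-replacement step can be sketched with resample(); the toy list below is illustrative, standing in for the tutorial's training rows, and the unselected examples form the out-of-bag set usable as a test set.

```python
from sklearn.utils import resample

# a toy data sample; each value stands in for one training example
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# select a bootstrap sample of the same size, with replacement,
# so a given example may appear more than once
sample = resample(data, replace=True, n_samples=len(data), random_state=1)
print('bootstrap sample: %s' % sample)

# examples not selected form the out-of-bag set, usable as a test set
oob = [x for x in data if x not in sample]
print('out of bag: %s' % oob)
```

Repeating this selection once per ensemble member gives each model a differently biased training sample, which is what makes the resulting bagging ensemble's predictions differ usefully.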