How Do You Know You Have Enough Training Data?

What Happens in the Case of Deep Learning?Figure 1.

Figure 1 shows how the performance of machine learning algorithms changes with increasing data size in the case of traditional machine learning [10] algorithms (regression, etc.

) and in the case of deep learning [11].

Specifically, for traditional machine learning algorithms, performance grows according to a power law and then reaches a plateau.

Regarding deep learning, there is significant ongoing research as to how performance scales with increasing data size [12]-[16], [18].

Figure 1 shows the current consensus for much of this research; for deep learning, performance keeps increasing with data size according to a power law.

For example, in [13], the authors used deep learning techniques for classification of 300 million images, and they found that performance increased logarithmically with increasing training data size.

Let us include here some noteworthy, contradictory to the above, results in the field of deep learning.

Specifically, in [15] the authors used convolutional networks for a dataset of 100 million Flickr images and captions.

Regarding training data size, they report that performance increases with growing data size; however, it plateaus after 50 million images.

In [16], the authors found that image classification accuracy increases with training data size; however, model robustness, which also increased initially, after a certain model-dependent point, started to decline.

A Methodology to Determine Training Data Size in ClassificationThis is based on the well-known learning curve, which in general is a plot of error versus training data size.

[17] and [18] are excellent references to learn more about learning curves in machine learning, and how they change with increasing bias or variance.

Python offers a learning curve function in scikit-learn [17].

In classification, we typically use a slightly different form of the learning curve; it is a plot of classification accuracy versus training data size.

The methodology for determining training data size is straightforward: Determine the exact form of the learning curve for your domain, and then, simply find the corresponding point on the plot for your desired classification accuracy.

For example, in references [19],[20], the authors use the learning curve approach in the medical domain and they represent it with a power law function:Learning curve equationwhere y is the classification accuracy, x is the training set, and b1,b2 correspond to the learning rate and decay rate.

The parameters change according to the problem domain, and they can be estimated using nonlinear regression or weighted nonlinear regression.

Is Growth of Training Data, The Best Way to Deal With Imbalanced Data?© hin255/AdobeStockThis question is addressed in [9].

The authors raise an interesting point; in the case of imbalanced data, accuracy is not the best measure of the performance of a classifier.

The reason is intuitive: Let us assume that the negative class is the dominant one.

Then we can achieve high accuracy, by predicting negative most of the time.

Instead, they propose precision and recall (also known as sensitivity) as the most appropriate measure of the performance for imbalanced data.

In addition to the apparent problem of accuracy described above, the authors claim that measuring precision is inherently more important for imbalanced domains.

For example, in a hospital alarm system [9], high precision means that when an alarm sounds, it is highly likely that there is indeed a problem with a patient.

Armed with the appropriate performance measure, the authors compared the imbalance correction techniques in package imbalanced-learn[21] (Python scikit-learn library) with simply using a larger training data set.

Specifically, they used K-Nearest neighbor with imbalance-correction techniques on a drug discovery-related dataset of 50,000 examples and then compared with K-NN on the original dataset of approximately 1 million examples.

The imbalance-correcting techniques in the above package include under-sampling, over-sampling and ensemble learning.

The authors repeated the experiment 200 times.

Their conclusion is simple and profound: No imbalance-correcting technique can match adding more training data when it comes to measuring precision and recall.

And with this, we have reached the end of our quick tour.

The references below can help you learn more about the subject.

Thank you for reading!References[1] The World’s Most Valuable Resource Is No Longer Oil, But Data, https://www.


com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data May 2017.

[2] Martinez, A.


, No, Data Is Not the New Oil, https://www.


com/story/no-data-is-not-the-new-oil/ February 2019.

[3] Haldan, M.

, How Much Training Data Do You Need?, https://medium.


haldar/how-much-training-data-do-you-need-da8ec091e956[4] Wikipedia, One in Ten Rule, https://en.


org/wiki/One_in_ten_rule[5] Van Smeden, M.

et al.

, Sample Size For Binary Logistic Prediction Models: Beyond Events Per Variable Criteria, Statistical Methods in Medical Research, 2018.

[6] Pete Warden’s Blog, How Many Images Do You Need to Train A Neural Network?, https://petewarden.

com/2017/12/14/how-many-images-do-you-need-to-train-a-neural-network/[7] Sullivan, L.

, Power and Sample Size Distribution, http://sphweb.




html[8] Wikipedia, Vapnik-Chevronenkis Dimension, https://en.


org/wiki/Vapnik%E2%80%93Chervonenkis_dimension[9] Juba, B.

and H.


Le, Precision-Recall Versus Accuracy and the Role of Large Data Sets, Association for the Advancement of Artificial Intelligence, 2018.

[10] Zhu, X.

et al.

, Do we Need More Training Data?.https://arxiv.


01508, March 2015.

[11] Shchutskaya, V.

, Latest Trends on Computer Vision Market, https://indatalabs.


716[12] De Berker, A.

, Predicting the Performance of Deep Learning Models, https://medium.

com/@archydeberker/predicting-the-performance-of-deep-learning-models-9cb50cf0b62a[13] Sun, C.

et al.

, Revisiting Unreasonable Effectiveness of Data in Deep Learning Era, https://arxiv.


02968, Aug.


[14] Hestness, J.

, Deep Learning Scaling is Predictable, Empirically, https://arxiv.



pdf[15] Joulin, A.

, Learning Visual Features from Large Weakly Supervised Data, https://arxiv.


02251, November 2015.

[16] Lei, S.

et al.

, How Training Data Affect the Accuracy and Robustness of Neural Networks for Image Classification, ICLR Conference, 2019.

[17] Tutorial: Learning Curves for Machine Learning in Python, https://www.


io/blog/learning-curves-machine-learning/[18] Ng, R.

, Learning Curve, https://www.


com/machinelearning-learning-curve/[19]Figueroa, R.


, et al.

, Predicting Sample Size Required for Classification Performance, BMC medical informatics and decision making, 12(1):8, 2012.

[20] Cho, J.

et al.

, How Much Data Is Needed to Train A Medical Image Deep Learning System to Achieve Necessary High Accuracy?, https://arxiv.


06348, January 2016.

[21] Lemaitre, G.

, F.

Nogueira, and C.


Aridas, Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning, https://arxiv.


06570.. More details

Leave a Reply