Predicting the performance of deep learning models

Power-law scaling explains how a model’s performance will change as we feed it more data

Archy de Berker
Apr 14

It’s widely acknowledged that the recent successes of Deep Learning rest heavily upon the availability of huge amounts of data.
Vision was the first domain in which the promise of DL was realised, probably because of the availability of large datasets such as ImageNet.
The recent surge of simulators for RL further illustrates that as we push further to apply these techniques to real-world problems, data scarcity quickly becomes the bottleneck.
But how much data is enough?

In commercial contexts, this question comes up a lot.
When time and money are at stake, it’d be useful to be able to make some concrete statements about how improvements in model architecture are likely to weigh up against simply gathering more data.
Should we pay a team of engineers for 6 months to finesse our models, or should we pay a team of crowdsourced helpers for 6 months to collate the data we need?

The fact that we can’t easily answer this question reflects the immaturity of deep learning as a field, a shortcoming that led Ali Rahimi to declare ‘Machine learning has become alchemy’ in his NIPS 2017 talk (he’s wrong in one sense at least; alchemy never made anybody any money, whilst deep learning has made some people very rich indeed).
Yann LeCun’s widely publicised Facebook post in response threw down the gauntlet: ‘if you are not happy with our understanding of the methods you use everyday, fix it’.

Figure: Ali Rahimi putting the cat amongst the pigeons by suggesting that deep learning is ‘alchemy’

A paper from Baidu, titled ‘Deep Learning Scaling is Predictable, Empirically’, goes some way to answering this challenge.
As the title suggests, their answer to the question is an empirical, not a theoretical, one.
The paper is accompanied by an excellent blogpost, which I refer you to for a more detailed discussion of the findings that I will summarise here.
Before we dive into it, a small digression: the study of scaling laws has fascinated biologists for a long time.
This plot, from Max Kleiber in 1947, shows that the metabolic rate of an animal (heat produced per day) scales in a log-log fashion (more on this below) with the body weight of that animal.
In fact, it seems to scale as $\text{weight}^{3/4}$, which is why the red line is steeper than the one labelled surface, which scales as $\text{weight}^{2/3}$, but shallower than the one labelled weight.
Fascinatingly, nobody really knows why this law holds, although it seems very robust.
Back to Baidu and the world of artificial intelligence, and we are producing similar plots 70 years later.

Essentially, the paper documents that increases in data produce decreases in test-set loss with the same kind of power-law relationship, which ends up as a straight line when plotted on a log-log scale.
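To see why a power law turns into a straight line on log-log axes, it helps to write it out. The symbols here are assumed shorthand rather than anything fixed by the text: $\varepsilon$ for test-set loss, $m$ for the number of training samples, $\alpha$ for a constant and $\beta$ for the exponent (negative when loss falls as data grows).

```latex
\varepsilon(m) = \alpha\, m^{\beta}
\quad\Longrightarrow\quad
\log \varepsilon(m) = \log \alpha + \beta \log m
```

Under that assumption, the slope of the line on the log-log plot is $\beta$ and the intercept is $\log \alpha$.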
Fascinatingly, the exponent of this relationship (the slope of the line on the log-log plot) ends up being more or less the same for any architecture you throw at the problem at hand.
So the datasets themselves define this exponent: the models merely shift the intercept.
To hammer this home: the effect of adding more data is essentially the same for any model, given the dataset.
That’s pretty extraordinary.
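As a rough illustration of "same exponent, different intercept", here is a minimal sketch that fits a power law by ordinary least squares in log-log space. The data is synthetic: the exponent of -0.35, the constants 5.0 and 8.0 and the training-set sizes are made up for the example and are not taken from the paper or from the experiments below. `fit_power_law` is a hypothetical helper, not part of any library.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = alpha * size**beta by least squares on (log size, log loss)."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(loss) for loss in losses]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx                             # slope of the log-log line
    alpha = math.exp(y_mean - beta * x_mean)     # exp(intercept)
    return alpha, beta


# Synthetic losses for two imaginary models (assumed values, for illustration).
sizes = [175 * 2 ** k for k in range(9)]
model_a_losses = [5.0 * m ** -0.35 for m in sizes]
model_b_losses = [8.0 * m ** -0.35 for m in sizes]

alpha_a, beta_a = fit_power_law(sizes, model_a_losses)
alpha_b, beta_b = fit_power_law(sizes, model_b_losses)
print(f"model A: alpha={alpha_a:.3f}, beta={beta_a:.3f}")
print(f"model B: alpha={alpha_b:.3f}, beta={beta_b:.3f}")
```

The two fits recover the same slope and different intercepts, which is the pattern the paper reports across architectures. Real measurements are noisy, so a fit on actual results would only match approximately.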
They don’t provide any code for the paper, so I threw together some experiments in PyTorch to explore their conclusions.
Code

You can download the full Jupyter notebook here or read on for some gists.
I built on the code provided in the PyTorch tutorial to produce a simple CNN to test against the CIFAR dataset (a small image classification task with 10 classes).
I made it configurable with a hyperparameter dictionary because the optimal hyperparameters are very sensitive to dataset size; as we’ll see, this is important for replicating the Baidu results.
I split the training data into a training and a validation set, and subsampled the training set as suggested in the paper.
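The subsampling step might look something like the sketch below. It assumes that the sizes shrink by a factor of two each time and that the subsets are nested (each smaller set is contained in the next larger one). The 45,000-image training split is also an assumption: CIFAR-10 ships 50,000 training images, and the size of the validation hold-out is not stated. `halving_sizes` and `nested_subsets` are hypothetical helpers.

```python
import random


def halving_sizes(n_full, n_sizes):
    """Return n_sizes training-set sizes, smallest first, each half the next."""
    return sorted(n_full // (2 ** k) for k in range(n_sizes))


def nested_subsets(n_full, sizes, seed=0):
    """Map each size to a list of example indices; smaller sets nest in larger ones."""
    rng = random.Random(seed)
    order = list(range(n_full))
    rng.shuffle(order)  # one fixed shuffle, so every subset is a prefix of it
    return {size: order[:size] for size in sizes}


N_TRAIN = 45_000  # assumed: 50,000 CIFAR-10 training images minus a 5,000 validation split
sizes = halving_sizes(N_TRAIN, n_sizes=9)
subsets = nested_subsets(N_TRAIN, sizes)

for size in sizes:
    print(size, "examples, first indices:", subsets[size][:3])
```

Each index list could then be passed to something like `torch.utils.data.Subset` to build the per-size training set.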
I then trained 9 models, one for each dataset size, with a stopping condition defined by increasing validation error for 3 epochs in a row (the original paper is a little vague on the specifics of validation).
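One way to express that stopping condition is sketched below. It reads "increasing validation error for 3 epochs in a row" as "each of the last three epochs had a higher validation loss than the epoch before it". That reading is an assumption, and `should_stop` is a hypothetical helper rather than the notebook's actual code.

```python
def should_stop(val_losses, patience=3):
    """True once validation loss has risen for `patience` consecutive epochs."""
    if len(val_losses) <= patience:
        return False  # not enough history to see `patience` increases yet
    recent = val_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Made-up validation losses, one per epoch, purely to exercise the rule.
history = []
stopped_at = None
for epoch, loss in enumerate([1.00, 0.90, 0.80, 0.85, 0.90, 0.95, 0.70]):
    history.append(loss)
    if should_stop(history):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at)
```

A common alternative is to stop when the loss has failed to beat its best value for some number of epochs, which is more tolerant of noisy validation curves.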
I then evaluated each of them against the test set.
As you would expect, the test-set accuracy increases with the increasing size of the train set.
Moreover, it looks sort of power-law-ish.
The loss decreases in a similar fashion.
However, neither the log-log plot of the accuracy nor that of the loss looks as cute as the ones in the Baidu paper.
In fact, they each show some kind of vaguely logarithmic form themselves, suggesting that we have a sub-power law relationship.
The reason for this is fairly obvious: I didn’t do the exhaustive hyper-parameter search that they did at each training-set size.
As such, we’re not finding the best model for each dataset size.
Most probably, our models are lacking the capacity to fully capture the larger datasets, and we’re therefore not making best use of the data.
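A quick way to check whether a curve is bending away from a power law is to look at the slope between each pair of neighbouring points on the log-log plot: a pure power law gives the same slope everywhere, while a flattening curve gives slopes that drift towards zero. The sketch below uses synthetic numbers only. The "flattening" curve is a power law with a constant added, chosen simply because it bends, not as a claim about what causes the bend in the experiments above. `local_slopes` is a hypothetical helper.

```python
import math


def local_slopes(sizes, losses):
    """Slope between consecutive points in (log size, log loss) space."""
    points = [(math.log(m), math.log(loss)) for m, loss in zip(sizes, losses)]
    return [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    ]


sizes = [175 * 2 ** k for k in range(9)]
pure = [5.0 * m ** -0.35 for m in sizes]              # straight line on log-log axes
flattening = [5.0 * m ** -0.35 + 0.3 for m in sizes]  # bends as size grows

pure_slopes = local_slopes(sizes, pure)
flattening_slopes = local_slopes(sizes, flattening)

print("pure:      ", [round(s, 3) for s in pure_slopes])
print("flattening:", [round(s, 3) for s in flattening_slopes])
```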
Adding hyperparameter tuning

You’ll remember that in the model definition we set the layer sizes using a hyperparameter dictionary, thus making it easy to fiddle with the shape of our network through hyperparameter tuning.
As such, it’s relatively straightforward for us to implement some random search. We can then repeat the training loop for each dataset size, sampling parameters at random, and use this to train a bunch of networks for each dataset size, keeping the one that performs best on the validation set.
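A minimal sketch of that random search, written with the standard library only, might look like this. Everything named here is hypothetical: the keys and ranges in `SEARCH_SPACE` are illustrative guesses at what a small CNN's hyperparameter dictionary could contain, and `fake_validation_loss` is a stand-in for actually training a network and returning its validation loss.

```python
import math
import random

# Hypothetical search space: lists are sampled uniformly, tuples log-uniformly.
SEARCH_SPACE = {
    "conv1_channels": [6, 16, 32],
    "conv2_channels": [16, 32, 64],
    "fc1_units": [64, 120, 256],
    "learning_rate": (1e-4, 1e-1),
}


def sample_hyperparameters(rng):
    """Draw one random configuration from SEARCH_SPACE."""
    params = {}
    for name, space in SEARCH_SPACE.items():
        if isinstance(space, tuple):
            low, high = space
            params[name] = 10 ** rng.uniform(math.log10(low), math.log10(high))
        else:
            params[name] = rng.choice(space)
    return params


def random_search(train_and_validate, n_trials=10, seed=0):
    """Try n_trials random configurations; keep the lowest validation loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = sample_hyperparameters(rng)
        val_loss = train_and_validate(params)
        if val_loss < best_loss:
            best_params, best_loss = params, val_loss
    return best_params, best_loss


def fake_validation_loss(params):
    """Stand-in objective so the sketch runs without training anything."""
    return abs(math.log10(params["learning_rate"]) + 2) + 1 / params["fc1_units"]


best_params, best_loss = random_search(fake_validation_loss, n_trials=10)
print(best_params, best_loss)
```

In the real experiment, `random_search` would be called once per training-set size, with `train_and_validate` training a network on that subset.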
I’m performing this tuning on a MacBook without a GPU, so I limited myself to 10 searches for each dataset size, hoping I could prove the point without requisitioning an AWS instance.
We can then go hunting for our power laws again, and sure enough, they’re looking a lot more dapper.

Not quite as nice as Kleiber’s, but not bad.
Conclusions

The original paper tests a variety of models in a variety of tasks; the closest to the one performed here is ImageNet with ResNets.
It’s pleasing to see that the results are so easily replicable with a different network, on a different dataset.
In their discussion, the authors note:

‘We have yet to find factors that affect the power-law exponent. To beat the power-law as we increase data set size, models would need to learn more concepts with successively less data.’
This is precisely the kind of scaling that you see with humans; the more you know, the easier it is to acquire new knowledge.
I wrote previously about the difficulty of quantifying progress towards superintelligence.
It seems that the advent of models that beat the power-law exponent — that get more data efficient as they learn — might be an important empirical milestone on that path.
Originally published at deberker.com on April 14, 2019.