Traditional generalization theory notions include the VC dimension, Rademacher complexity, and PAC-Bayes bounds.
VC Dimension and Dead Neurons

The Vapnik–Chervonenkis (VC) dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be: the VC dimension of a class H is defined as the largest number of points that can be shattered by members of H.
A set of points is said to be shattered by a class of functions if, no matter how we assign a binary label to each point, a member of the class can perfectly separate them.
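To make the definition concrete, here is a small Python sketch (the helper `shatters` and the toy threshold class are my own, not from any library): it brute-forces every binary labeling of the points and checks whether some hypothesis in the class realizes it.

```python
from itertools import product

def shatters(hypotheses, points):
    """Check whether `hypotheses` shatters `points`: every binary
    labeling must be realized by some hypothesis in the class."""
    realizable = {tuple(h(x) for x in points) for h in hypotheses}
    return all(tuple(lab) in realizable
               for lab in product([0, 1], repeat=len(points)))

# 1-D threshold classifiers: h_t(x) = 1 if x >= t else 0
thresholds = [lambda x, t=t: int(x >= t) for t in [-1.5, -0.5, 0.5, 1.5]]

print(shatters(thresholds, [0.0]))        # True: a single point is shattered
print(shatters(thresholds, [-1.0, 1.0]))  # False: labeling (1, 0) is unrealizable
```

Thresholds can shatter any single point but no pair of points (they can never label the left point 1 and the right point 0), so their VC dimension is 1.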
Figure 1 — Ballooning number of parameters — yet the test error does not explode

The experiments of Zhang et al. show that the true "parametric complexity" (which of course nobody knows right now) of deep nets trained on realistic data is only very crudely upper-bounded by Bartlett's VC calculation from 20+ years ago (# of nodes × # of layers).
This might not really be a surprise, for example in light of the empirical evidence for dead neurons: when the size of the network is large enough and the non-linearity is chosen to be a ReLU, many weights turn out to be zero.
How can we estimate the VC dimension of a model, then?

PAC Learnability and Rademacher complexity

The definition of Probably Approximately Correct (PAC) learnability is straightforward: there exists an algorithm that, for every distribution D and every ε, δ > 0, finds a hypothesis that is "ε-optimal" with probability 1 − δ.
The existence of an algorithm for every distribution is a strong requirement: Rademacher complexity is instead defined with respect to a specific, yet unknown, distribution D.
Figure — A sketched derivation of the Rademacher complexity

In a nutshell, the Rademacher complexity measures the ability of a hypothesis class H to fit random ±1 binary labels.
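As a toy illustration of this definition, one can Monte-Carlo-estimate the empirical Rademacher complexity of a simple class; the finite class of 1-D threshold classifiers below is my own choice, not something from the papers cited here.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
xs = rng.uniform(-1, 1, size=m)

# Finite hypothesis class: 1-D threshold classifiers with outputs in {-1, +1}.
ts = np.linspace(-1, 1, 41)
H = (xs[None, :] >= ts[:, None]).astype(float) * 2 - 1   # shape (|H|, m)

# Monte-Carlo estimate of the empirical Rademacher complexity:
#   E_sigma [ sup_{h in H} (1/m) * sum_i sigma_i * h(x_i) ]
trials = 2000
sigmas = rng.choice([-1.0, 1.0], size=(trials, m))
sup_corr = (sigmas @ H.T / m).max(axis=1)  # best correlation per sign draw
print(sup_corr.mean())                     # well below 1 for this simple class
```

A class rich enough to fit every random labeling would push this estimate to 1; the simple threshold class stays far below that, which is exactly what a useful complexity measure should detect.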
Compared to the VC dimension, Rademacher complexity is distribution-dependent and is defined for any class of real-valued functions (not only discrete-valued functions).
As was the case for Bartlett's VC calculation, Rademacher complexity does not provide useful generalization bounds for Deep Learning. Empirical tests show, in fact, that many neural networks fit the training set with random labels perfectly, so we expect near-maximal Rademacher complexity for the corresponding model class H.
This is, of course, a trivial upper bound on the Rademacher complexity that does not lead to useful generalization bounds in realistic settings.
In other words, things don't work out in theory and we are left with "alchemy" and some best practices: the only way to actually probe the generalization of such complicated learning architectures is to try training a classifier and detect any lack of generalization via a held-out set.
Every practitioner in the world already does this (without realizing it), and kudos to Zhang et al. for highlighting that theory currently offers nothing better.

Towards novel approaches

Classical approaches to generalization theory are hard to compute for today's complicated ML models, let alone to use as a guide in designing learning systems.
Classical approaches to generalization theory are also only descriptive: in other words, if generalization does not happen, we can justify the empirical finding by tapping into measures of complexity (VC dimension, Rademacher complexity), but we do not have any prescriptive principle that could guide us.
Furthermore, empirical evidence shows that hypothesis classes with high, even infinite, capacity might work well in practice. This is true not just for Deep Learning models but also for other Machine Learning methodologies: for example, SVMs with some kernels (e.g. Radial Basis Functions) have infinite VC dimension.
Even simpler linear models can trick us: the hypothesis space of over-parameterized linear models can memorize any training data and drive the training and test errors arbitrarily close to zero (including to zero), with the norm of the parameters arbitrarily large, even when the parameters are arbitrarily far from the ground-truth parameters.

Deep Learning stresses traditional approaches to generalization theory to the extreme: a main puzzle of deep networks revolves around the absence of overfitting despite large over-parameterization, and despite the large capacity demonstrated by zero training error on randomly labeled data.
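The linear-model phenomenon is easy to reproduce. A minimal numpy sketch (shapes and names are mine): with far more parameters than samples, the minimum-norm solution interpolates even random targets exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 200           # 20 samples, 200 parameters: heavily over-parameterized
X = rng.normal(size=(m, d))
y = rng.normal(size=m)   # arbitrary (even random) targets

# Minimum-norm interpolating solution via the pseudoinverse.
w = np.linalg.pinv(X) @ y

print(np.max(np.abs(X @ w - y)))  # training error is (numerically) zero
```

Memorization alone therefore tells us nothing about the quality of the learned parameters, which is exactly why capacity-based arguments struggle here.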
The overall idea of complexity is being revisited.
Focusing on Deep Learning, there are a number of novel approaches to generalization.
Norm-based capacity measures

One approach is to look at capacity measures based on norms of the weight matrices, normalized by the margin.
The output classification margin of a data sample is the value assigned by the model to the correct class minus the maximal value over all the other classes.
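In code, assuming a vector of raw scores (logits) and the index of the correct class (the helper name is mine), the margin is simply:

```python
import numpy as np

def classification_margin(logits, correct):
    """Margin of one sample: score of the correct class minus the
    largest score among all the other classes."""
    logits = np.asarray(logits, dtype=float)
    others = np.delete(logits, correct)
    return logits[correct] - others.max()

print(classification_margin([2.0, 0.5, -1.0], correct=0))  # 1.5
print(classification_margin([2.0, 3.0, -1.0], correct=0))  # -1.0 (misclassified)
```

A positive margin means the sample is classified correctly with some slack; dividing a norm-based complexity measure by the margin rewards models that classify confidently.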
Figure 2 — Note that path norms sum over all possible paths going from the input to the output of the network, passing through one single neuron in each layer

Norm-based measures do not explicitly depend on the number of parameters in the model and therefore have a better potential to represent its capacity: norm-based measures can explain the generalization of Deep Neural Networks (DNNs), as the complexity of models trained on random labels is always higher than the complexity of models trained on the true labels, corresponding to the favorable generalization abilities of the latter.

Figure 3 — Different complexity measures of a VGG network (a Convolutional Neural Network architecture) on CIFAR10 data. In all experiments, the training error of the learned network is zero. The plots indicate that these measures can explain generalization, as the complexity of the model learned with random labels is always higher than the one learned with true labels. Furthermore, the gap between the complexity of models learned with true and random labels increases as we increase the size of the training set.

Another interesting capacity measure related to norm-based approaches is the Lipschitz constant of a network.
The Lipschitz constant is the product of the spectral norms of the weight matrices. The spectral norm is the maximum singular value of a matrix: how much the matrix can stretch a vector.
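Putting the two definitions together, here is a sketch of this Lipschitz upper bound for a toy fully-connected ReLU net (the weight shapes are arbitrary choices of mine; the product form relies on ReLU being 1-Lipschitz):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy fully-connected net: three weight matrices (biases ignored).
weights = [rng.normal(size=(64, 32)),
           rng.normal(size=(32, 32)),
           rng.normal(size=(32, 10))]

def spectral_norm(W):
    # Largest singular value: the most W can stretch any unit vector.
    return np.linalg.svd(W, compute_uv=False)[0]

# Upper bound on the Lipschitz constant of the whole network:
# the product of the layers' spectral norms (ReLUs contribute a factor of 1).
lipschitz_bound = np.prod([spectral_norm(W) for W in weights])
print(lipschitz_bound)
```

Note that this is only an upper bound: the worst-case stretching directions of successive layers rarely align, so the true Lipschitz constant of a trained net is typically much smaller.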
Empirical evidence shows that the Lipschitz constant correlates with excess risk (test error minus training error).
However, this measure grows over time, despite the excess risk plateauing; a growth that can be neutralized by, once again, normalizing the Lipschitz constant by the margin (see Figure 4).

Figure 4 — AlexNet trained with SGD on CIFAR10

Compression approach

The basic theorem of generalization states that, if the training set has m samples, then the generalization error, defined as the difference between the error on training data and on test data, is of the order of sqrt(N'/m), where N' is the number of effective parameters (or complexity measure) of the net [23, 24]. Take a net C with N trainable parameters and try to compress it to another one, C', with fewer parameters N' and roughly the same training error as C. Then the basic theorem guarantees that, so long as the number of training samples exceeds N', C' (the compressed net!) does generalize well [23, 24].

I find the compression approach extremely compelling.
On one hand, we pin down the generalization bounds of DNNs.
On the other hand, we reap a wide range of practical and operational advantages. The deployment of smaller (trained) models to production has many benefits: smaller models are faster, consume less energy (important in mobile and embedded applications), and eat up less memory.
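To get a feel for the numbers in the basic theorem, here is a back-of-the-envelope sketch (the parameter counts below are hypothetical, chosen only for illustration):

```python
import math

def generalization_gap_bound(effective_params, m):
    """Order-of-magnitude bound from the basic theorem:
    gap ~ sqrt(N' / m), where N' is the effective parameter count
    and m is the number of training samples."""
    return math.sqrt(effective_params / m)

# Hypothetical numbers: a net with 10M raw parameters, compressed down
# to 5k effective parameters, trained on 60k samples.
print(generalization_gap_bound(10_000_000, 60_000))  # ~12.9: vacuous
print(generalization_gap_bound(5_000, 60_000))       # ~0.29: potentially meaningful
```

The raw parameter count gives a bound far above 1 (error rates are at most 1, so it says nothing), while the compressed count yields a bound that actually constrains the test error.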
Recent research provided some empirical evidence of what has been dubbed as the “Lottery Ticket Hypothesis”: a randomly-initialized, dense neural network contains a sub-network (winning ticket) that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations .
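A minimal sketch of the magnitude-pruning step typically used to search for winning tickets (this is only one building block of the full iterative procedure; the helper name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))  # toy stand-in for a trained layer's weights

def magnitude_mask(W, keep_fraction):
    """Keep only the largest-magnitude weights and zero out the rest,
    producing a binary mask that defines the candidate sub-network."""
    k = int(round(W.size * keep_fraction))
    threshold = np.sort(np.abs(W), axis=None)[-k]
    return (np.abs(W) >= threshold).astype(float)

mask = magnitude_mask(W, keep_fraction=0.2)
print(mask.sum() / mask.size)  # roughly 0.2 of the weights survive
```

In the lottery-ticket procedure, the surviving weights are then reset to their original initialization and the masked sub-network is retrained from there; that reset is what distinguishes a winning ticket from plain post-hoc pruning.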
The adoption of a training strategy capable of identifying winning tickets translates into a) faster learning, b) higher test accuracy, and c) … ;)

Alongside the "Lottery Ticket Approach", there are a lot of other interesting approaches to network compression.
One idea that I find particularly appealing is inspired by Tensor Networks: "Tensor Train" decompositions of the weight matrices of the fully-connected layers of a DNN have already shown promising empirical results.
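Tensor Train proper factors a reshaped weight tensor into a chain of small cores; as the simplest matrix-shaped relative of that idea, here is a rank-truncation sketch via SVD (toy shapes of my choosing, with an exactly low-rank matrix by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
# A toy weight matrix of exact rank 8, standing in for a trained dense layer.
W = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 8
W_compressed = (U[:, :r] * s[:r]) @ Vt[:r, :]  # stored as two thin factors

original_params = W.size                                 # 65,536
compressed_params = U[:, :r].size + r + Vt[:r, :].size   # 4,104
print(compressed_params / original_params)  # ~6% of the original parameters
rel_err = np.linalg.norm(W - W_compressed) / np.linalg.norm(W)
print(rel_err)  # ~0 here, since W is exactly rank 8
```

Real layers are only approximately low-rank, so the truncation rank trades parameter count against training error, exactly the trade-off the compression-based generalization bounds exploit.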
Cheng et al. provide a survey of network compression methodologies, although such compression usually involves retraining the compressed net, something that the approach to generalization theory based on the basic theorem and compression offered by [23, 24] does not consider.
Conclusion

In order to ensure the reliability of Deep Learning algorithms, we need to be able to estimate useful (tight) generalization bounds. This is an open problem to which traditional approaches (VC dimension, Rademacher complexity) fail to provide an answer, while novel approaches are still being developed.
Disclaimer

The opinions expressed in this blog post are mine and only mine.
So are errors and inaccuracies.
References

- AI Alliance consultation guidelines (Robustness): eu/futurium/en/ai-alliance-consultation/guidelines/1#Robustness
- Poggio, Tomaso, et al. "Theory of Deep Learning III: explaining the non-overfitting puzzle." arXiv preprint arXiv:1801. [link]
- Sanjeev Arora, "Generalization Theory and Deep Nets, An introduction" (2017) [link]
- Desh Raj, "Introduction to Learning Theory — Part 1" [link]
- Bartlett, Peter L., Dylan J. Foster, and Matus J. Telgarsky. "Spectrally-normalized margin bounds for neural networks." Advances in Neural Information Processing Systems.
- "Growing pains — 2018 Global CEO Outlook" [link]
- Jordan Boyd-Graber, "Classification: Rademacher Complexity," Machine Learning, Lecture 6 [link]
- Zhang, Chiyuan, et al. "Understanding deep learning requires rethinking generalization." arXiv preprint arXiv:1611.
- Cosma Rohilla Shalizi, "Undergraduate Capacity Control" (2012) [link]
- Yoshida, Yuichi, and Takeru Miyato. "Spectral norm regularization for improving the generalizability of deep learning." arXiv preprint arXiv:1705.
- Boyd, S., EE263 Lecture notes (2015), Singular Value Decomposition [link]
- Shawe-Taylor, J. and Rivasplata, O. "Statistical Learning Theory: A Hitchhiker's Guide" (2012) [link]
- Kawaguchi, Kenji, Leslie Pack Kaelbling, and Yoshua Bengio. "Generalization in deep learning." arXiv preprint arXiv:1710. [link]
- Vidal, Rene, et al. "Mathematics of deep learning." arXiv preprint arXiv:1712. [link]
- Jakubovitz, Daniel, Raja Giryes, and Miguel R. D. Rodrigues. "Generalization error in deep learning." arXiv preprint arXiv:1808. [link]
- Rob Schapire, COS 511: Theoretical Machine Learning, Lecture Notes, Lecture 09 [link]
- Frankle, Jonathan, and Michael Carbin. "The lottery ticket hypothesis: Finding sparse, trainable neural networks." arXiv preprint arXiv:1803.
- Novikov, Alexander, et al. "Tensorizing neural networks." Advances in Neural Information Processing Systems.
- Zhou, Hattie, et al. "Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask." Uber Engineering blog post. [link]
- Bagnall, Alexander, and Gordon Stewart. "Certifying the True Error: Machine Learning in Coq with Verified Generalization Guarantees." [link]
- Neyshabur, Behnam, Ryota Tomioka, and Nathan Srebro. "Norm-based capacity control in neural networks." Conference on Learning Theory. [link]
- Neyshabur, Behnam, et al. "Exploring generalization in deep learning." Advances in Neural Information Processing Systems.
- Belkin, Mikhail, et al. "Reconciling modern machine learning and the bias-variance trade-off." arXiv preprint arXiv:1812. [link]
- Bhaskara, CS 5966/6966: Theory of Machine Learning, Lecture 6 [link]
- Arora, Sanjeev, et al. "Stronger generalization bounds for deep nets via a compression approach." arXiv preprint arXiv:1802. [link]
- Arora, Sanjeev, "Proving generalization of deep nets via compression" (2018) [link]
- Cheng, Yu, et al. "A survey of model compression and acceleration for deep neural networks." arXiv preprint arXiv:1710.