The Emergence of Modern Conv Nets
Karthik Raja, Jul 5
Photo by Alina Grubnyak on Unsplash
For humans, vision feels easy, since we do it all day long without thinking about it.
But consider just how hard the problem is, and how amazing it is that we can see at all.
To see the world, we have to deal with all sorts of “nuisance” factors, such as a change in pose or lighting.
Amazingly, the human visual system does this all so seamlessly that we don’t even have to think about it.
Computer Vision is a very active field of research that tries to help machines see the world as humans do.
The field has made tremendous progress in the last decade because of modern Deep Learning techniques and the availability of large sets of images online.
In this article, we are going to talk about object recognition, the task of classifying an image into one of a set of object categories, and how modern conv nets played a huge part in it.
ImageNet Dataset
Researchers built ImageNet, a massive object recognition dataset consisting of 1.2 million images in about 1000 object categories.
Based on this dataset, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a competition that evaluates algorithms for object detection and image classification at a massive scale.
Because the object categories are so diverse, one usually reports top-5 accuracy: the algorithm makes five predictions for every image, and if any one of the five is right, the image counts as correctly classified.
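To make the top-5 metric concrete, here is a minimal sketch in plain Python. The function name `top_k_accuracy` and the toy scores are made up for illustration (six classes instead of ImageNet's 1000); this is not the official ILSVRC evaluation code.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 5) -> float:
    """Fraction of images whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this image.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example: 3 images, 6 classes, made-up scores.
scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],  # true class 2 is ranked first
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],  # true class 5 is ranked last
    [0.10, 0.35, 0.05, 0.30, 0.15, 0.05],  # true class 3 is ranked second
]
labels = [2, 5, 3]

top1 = top_k_accuracy(scores, labels, k=1)
top5 = top_k_accuracy(scores, labels, k=5)
print(f"top-1 accuracy: {top1:.3f}, top-5 accuracy: {top5:.3f}")
```

The third image is a miss under top-1 but a hit under top-5, which is exactly the leniency the metric is meant to give when categories are easy to confuse.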
Now let’s talk about some of the architectures that revolutionized the image classification task.
LeNet
Before turning to modern deep learning algorithms and the ImageNet dataset, let’s look at an earlier conv net architecture, LeNet.
[Figure: LeNet Architecture]
The LeNet architecture is summarized in the following table, from which we can draw some useful conclusions:
[Table: LeNet Summary]
Most of the units are in the C1 (first convolution) layer.
Most of the connections are in the C3 (second convolution) layer.
Most of the weights are in the F5 (fully connected) layer.
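These three conclusions can be checked with a little arithmetic. The sketch below assumes the commonly cited LeNet-5 layer sizes (32x32 input, 5x5 kernels, 6 and 16 feature maps, then 120, 84, and 10 fully connected units), ignores biases, and assumes C3 is connected to all six maps of the previous layer. The pooling layers are left out because they have no weights. The helpers `conv_layer` and `dense_layer` are hypothetical names for this example.

```python
def conv_layer(in_maps: int, out_maps: int, out_size: int, kernel: int) -> dict:
    """Counts for a convolution layer with square kernels and square output maps."""
    units = out_maps * out_size * out_size   # one unit per output position per map
    fan_in = in_maps * kernel * kernel       # inputs feeding each unit
    return {
        "units": units,
        "connections": units * fan_in,
        "weights": out_maps * fan_in,        # kernels are shared across positions
    }


def dense_layer(n_in: int, n_out: int) -> dict:
    """Counts for a fully connected layer: every connection has its own weight."""
    return {"units": n_out, "connections": n_in * n_out, "weights": n_in * n_out}


# Assumed LeNet-5 sizes; layer names follow the article.
lenet = {
    "C1": conv_layer(in_maps=1, out_maps=6, out_size=28, kernel=5),
    "C3": conv_layer(in_maps=6, out_maps=16, out_size=10, kernel=5),
    "F5": dense_layer(16 * 5 * 5, 120),
    "F6": dense_layer(120, 84),
    "output": dense_layer(84, 10),
}


def largest(metric: str) -> str:
    """Name of the layer with the largest value of the given metric."""
    return max(lenet, key=lambda name: lenet[name][metric])


for name, counts in lenet.items():
    print(name, counts)
print("most units:", largest("units"))
print("most connections:", largest("connections"))
print("most weights:", largest("weights"))
```

Weight sharing is what separates the columns: a convolution layer has many connections but few weights, while a fully connected layer has exactly one weight per connection.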
Convolution layers are generally the most expensive part of the network in terms of running time, since that is where most of the connections are.
Memory is another scarce resource: backprop requires storing all of the activations in memory during training.
The activations don’t need to be stored at test time.
The weights constitute the vast majority of trainable parameters of the model.
LeNet was carefully designed to push the limits of all of these resource constraints using the computing power of 1998.
Try increasing the sizes of various layers and checking that you substantially increase the usage of one or more of these resources.
As we’ll see, conv nets have grown significantly larger to exploit new computing resources.
The Modern Conv Nets:
[Figure: AlexNet Architecture]
AlexNet was the conv net architecture that started a revolution in computer vision by smashing the ILSVRC benchmark.
Like LeNet, it consists mostly of convolution, pooling, and fully connected layers.
AlexNet is 100 to 1000 times bigger than LeNet, but the two have almost the same structure.
Moreover, like LeNet, most of the weights are in fully connected layers, and most of the connections are in convolutional layers.
[Figure: Comparison of the early Conv Nets]
The credit goes largely to sudden, dramatic advances in hardware, especially GPUs (Graphics Processing Units).
GPUs are geared towards highly parallel processing; one of the things they do well is matrix multiplication.
Since most neural networks depend heavily on matrix multiplication, GPUs gave roughly a 30-fold speedup in practice for training neural nets.
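To see why matrix multiplication matters so much, here is a rough sketch of a fully connected layer applied to a small batch, written in plain Python. The names `matmul` and `relu` and the toy numbers are made up for illustration; a real framework would hand this same computation to an optimized GPU kernel.

```python
def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise max(0, z), the nonlinearity applied after the layer."""
    return [[max(0.0, z) for z in row] for row in m]


# A batch of 2 inputs with 3 features each.
X = [[1.0, 2.0, 3.0],
     [0.0, -1.0, 1.0]]

# Weights of a layer mapping 3 inputs to 2 hidden units.
W = [[0.5, -1.0],
     [0.0, 2.0],
     [1.0, 0.5]]

Z = matmul(X, W)  # pre-activations, one row per input in the batch
H = relu(Z)       # hidden activations
print(H)
```

Every entry of `Z` is an independent dot product, so all of them can be computed at the same time. That independence is what highly parallel hardware exploits.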
AlexNet achieved a top-5 error of about 16%, which was substantially better than the competitors.
The results prompted some of the world’s largest software companies to start up research labs focused on deep learning.
In 2013, the ILSVRC winner was based on tweaks to AlexNet.
In 2014, the second-place entry was VGGNet, another conv net based on more or less similar principles.
The winning entry for 2014, GoogLeNet, or Inception, deserves mention.
As we can see, architectures have gotten more complicated since the days of LeNet.
But the interesting point is that its designers did a lot of work to reduce the number of trainable parameters (weights) from AlexNet’s 60 million to roughly 5 million.
The reason has to do with saving memory at “test time.”
[Figure: Inception Architecture]
Traditionally, there was no need to distinguish between training and test time, because both were done on a single machine.
At Google, however, training could be distributed over lots of computers in a data center.
The trained network, on the other hand, was supposed to be runnable on an Android cell phone so that images wouldn’t have to be sent to Google’s servers for classification.
On a cell phone, it would have been extravagant to spend 240MB to store AlexNet’s 60 million parameters, so it was crucial to cut down on parameters to make it fit in memory.
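The 240MB figure follows from simple arithmetic, sketched below. The assumptions are that each parameter is stored as a 32-bit float (4 bytes) and that 1MB means 10^6 bytes; the helper `param_memory_mb` is a hypothetical name for this example.

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters are stored as 32-bit floats


def param_memory_mb(num_params: int, bytes_per_param: int = BYTES_PER_FLOAT32) -> float:
    """Memory needed to store the parameters alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


alexnet_mb = param_memory_mb(60_000_000)
print(f"AlexNet parameters: {alexnet_mb:.0f} MB")

# Storing the same parameters in 16 bits would halve the footprint.
alexnet_fp16_mb = param_memory_mb(60_000_000, bytes_per_param=2)
print(f"Same parameters at 16 bits each: {alexnet_fp16_mb:.0f} MB")
```

This counts the weights only. The activations computed while classifying an image need memory too, but as noted earlier they do not have to be kept around at test time.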
They achieved this in two ways.
First, they eliminated the fully connected layers, which, as we already saw, contain most of the parameters in LeNet and AlexNet.
GoogLeNet is convolutions all the way.
Second, they broke the larger convolutions into smaller pieces, placing 1×1 convolutions inside each layer to shrink the number of channels first.
This is analogous to how linear bottleneck layers can reduce the number of parameters.
They call this layer-within-a-layer architecture “Inception,” after the movie about dreams-within-dreams.
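The saving from a bottleneck can be checked with a quick parameter count. The sketch below compares a direct 5x5 convolution with a 1x1 convolution that first reduces the number of channels, followed by the same 5x5 convolution. The channel counts (192 in, 16 in the bottleneck, 32 out) are illustrative assumptions, biases are ignored, and `conv_weights` is a hypothetical helper.

```python
def conv_weights(in_channels: int, out_channels: int, kernel: int) -> int:
    """Number of weights in a convolution with square kernels, ignoring biases."""
    return kernel * kernel * in_channels * out_channels


# Direct 5x5 convolution from 192 channels to 32 channels.
direct = conv_weights(in_channels=192, out_channels=32, kernel=5)

# Bottleneck: a 1x1 convolution down to 16 channels, then the 5x5 convolution.
bottleneck = (conv_weights(in_channels=192, out_channels=16, kernel=1)
              + conv_weights(in_channels=16, out_channels=32, kernel=5))

print(f"direct 5x5:           {direct} weights")
print(f"1x1 bottleneck + 5x5: {bottleneck} weights")
print(f"reduction:            {direct / bottleneck:.1f}x")
```

The expensive 5x5 kernels now see only 16 input channels instead of 192, and the 1x1 convolution that does the squeezing is cheap because its kernel covers a single pixel.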
Performance on ImageNet improved astonishingly fast during the years the competition was run.
Here are the figures:
It’s unusual for error rates to drop by a factor of 6 over a period of 5 years, especially on a task like object recognition that hundreds of researchers had already worked hard on and where performance had seemed to plateau.
Human performance is around 5% top-5 error.
The organizers stopped running the object recognition competition because performance was already so good.
Thanks to Prof. Roger Grosse and Prof. Jimmy Ba at the University of Toronto for teaching an excellent class on Neural Networks and Deep Learning.
Some of the content in this article is taken from their teaching notes.
Reference: Roger Grosse, Jimmy Ba Lecture Slides http://www.