Implementing a ResNet model from scratch.

A basic description of how ResNet works and a hands-on approach to understanding the state-of-the-art network.

Gracelyn Shi, Jan 17

When implementing the ResNet architecture in a deep learning project I was working on, I found it a huge leap from the basic, simple convolutional neural networks I was used to.

One prominent feature of ResNet is that it uses a micro-architecture within its larger macro-architecture: residual blocks!

I decided to look into the model myself to gain a better understanding of it, and to see why it was so successful at ILSVRC.

I implemented the exact same ResNet model class in Deep Learning for Computer Vision with Python by Dr. Adrian Rosebrock [1], which followed the ResNet model from the 2015 ResNet academic publication, Deep Residual Learning for Image Recognition by He et al. [2].

ResNet

When ResNet was first introduced, it was revolutionary for providing a new solution to a huge problem for deep neural networks at the time: the vanishing gradient problem.

Although neural networks are universal function approximators, beyond a certain depth adding more layers makes training slower and causes accuracy to saturate.

Source: https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035

This is due to the backpropagation of gradients from the final layers to the earliest ones: multiplying a number between 0 and 1 many times makes it increasingly smaller, so the gradient begins to “disappear” by the time it reaches the earlier layers.

That means the earlier layers are not only slower to train but are also more prone to error.
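To make the arithmetic concrete, here is a minimal sketch. It assumes, purely for illustration, that every layer scales the gradient by the same factor; the function name gradient_scale and the factor 0.5 are made up for this example and are not part of any library.

```python
def gradient_scale(per_layer_factor, num_layers):
    """Overall factor applied to a gradient after it passes back through
    num_layers layers, assuming each layer scales it by per_layer_factor."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale


# Hypothetical factor of 0.5 per layer, chosen only to show the trend.
shallow = gradient_scale(0.5, 5)    # five layers deep
deep = gradient_scale(0.5, 50)      # fifty layers deep

print("5 layers: ", shallow)
print("50 layers:", deep)
```

With five layers the gradient is already about 3% of its original size; with fifty layers it is practically zero, which is why the earliest layers barely get updated.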

That’s a huge problem, as the earliest layers are the building blocks of the whole network: they are responsible for identifying the basic, core features!

To mitigate this problem, ResNet incorporates identity shortcut connections, which let the signal skip over one or more layers, creating a residual block.
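Here is a toy sketch of that idea using plain Python lists instead of tensors. The names residual_block, zero_branch and halve are hypothetical; the branch argument stands in for the stacked layers F(x), and the block returns F(x) + x.

```python
def residual_block(x, branch):
    """Return branch(x) + x element-wise; the '+ x' term is the identity shortcut."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("branch output must have the same shape as the input")
    return [f + s for f, s in zip(fx, x)]


def zero_branch(x):
    # A branch that contributes nothing: it outputs all zeros.
    return [0.0 for _ in x]


def halve(x):
    # A stand-in for learned layers: scales every value by 0.5.
    return [0.5 * v for v in x]


print(residual_block([1.0, 2.0, 3.0], zero_branch))  # input passes through unchanged
print(residual_block([2.0, 4.0], halve))
```

Because the output is F(x) + x, its derivative with respect to x always contains a constant 1 coming from the shortcut, so the gradient has a direct path back to earlier layers even when the branch's own gradient is small.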

A single residual block; the original one proposed by He et al. Source: [1]

The authors then proposed an “optimized” residual block, adding an extension called a bottleneck.

The bottleneck reduces the dimensionality in the first two CONV layers (each learns 1/4 of the filters learned by the final CONV layer) and then increases it again in the final CONV layer.
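To get a feel for why the bottleneck is called “optimized”, we can count weights. The sketch below is my own back-of-the-envelope comparison, not code from the book or the paper: it assumes 1 x 1, 3 x 3, 1 x 1 kernels for the bottleneck, two 3 x 3 convolutions for the plain block, and no bias terms. All function names are hypothetical.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a square convolution without a bias term."""
    return kernel_size * kernel_size * in_channels * out_channels


def bottleneck_weights(in_channels, K):
    """Weights in a 1x1 (K/4 filters) -> 3x3 (K/4 filters) -> 1x1 (K filters) branch."""
    reduced = K // 4
    return (conv_weights(1, in_channels, reduced)
            + conv_weights(3, reduced, reduced)
            + conv_weights(1, reduced, K))


def plain_weights(in_channels, K):
    """Weights in two stacked 3x3 convolutions that each learn K filters."""
    return conv_weights(3, in_channels, K) + conv_weights(3, K, K)


print(bottleneck_weights(256, 256))  # 69632
print(plain_weights(256, 256))       # 1179648
```

Under these assumptions, the bottleneck branch needs roughly 17 times fewer weights for a 256-channel input while still producing 256 output channels.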

Here are two residual modules stacked on top of each other. Source: Deep Learning for Computer Vision using Python: Practitioner Bundle [1]

Finally, He et al. published a second paper on the residual module called Identity Mappings in Deep Residual Networks, which provided an even better version of the residual block: the pre-activation residual module.

This allows the gradients to propagate through the shortcut connections to any of the earlier layers without hindrance.

Instead of starting with a convolution (weight), we start with a series of (BN => RELU => CONV) * N layers (assuming bottleneck is being used).

Then, the residual module outputs the addition operation that’s fed into the next residual module in the network (since residual modules are stacked on top of each other).

(a) original bottleneck residual module. (e) full pre-activation residual module. Called pre-activation because BN and ReLU layers occur before the convolutions. Source: [2]

The overall network architecture looked like this, and our model will be similar to it.

Source: [2]

Let’s start coding the actual network in Python.

This specific implementation was inspired by both He et al. in their Caffe distribution and the mxnet implementation from Wei Wu.

We’re going to write it as a class (ResNet) so we can call on it later while training a deep learning model.

We begin with our standard CNN imports, and then start building our residual_module function.

Take a look at the parameters:

- data: input to the residual module
- K: number of filters that will be learned by the final CONV layer (the first two CONV layers will learn K/4 filters)
- stride: controls the stride of the convolution (will help us reduce spatial dimensions without using max pooling)
- chanDim: defines the axis along which batch normalization is performed
- red (i.e. reduce): controls whether we are reducing spatial dimensions (True) or not (False), as not all residual modules will reduce the dimensions of our spatial volume
- reg: regularization strength applied to all CONV layers in the residual module
- bnEps: controls the ε responsible for avoiding “division by zero” errors when normalizing inputs
- bnMom: controls the momentum for the moving average

Now let’s look at the rest of the function.
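Before moving on, a quick aside on the last two parameters. The sketch below is a plain-Python illustration of what an epsilon and a momentum do in batch normalization. It is not the Keras layer itself: the function names are hypothetical, the values 2e-5 and 0.9 are only example settings, and the moving-average formula follows the convention I understand Keras to use (new = momentum * old + (1 - momentum) * batch statistic).

```python
def normalize(values, eps):
    """Subtract the mean and divide by sqrt(variance + eps)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def update_moving_average(moving, batch_stat, momentum):
    """Blend the running statistic with the statistic of the current batch."""
    return momentum * moving + (1.0 - momentum) * batch_stat


# A constant batch has zero variance; eps keeps the division well defined.
constant_batch = normalize([3.0, 3.0, 3.0], eps=2e-5)
print(constant_batch)

# With momentum 0.9 the running mean moves only 10% of the way to the batch mean.
running_mean = update_moving_average(0.0, 10.0, momentum=0.9)
print(running_mean)
```

A higher momentum makes the running statistics change more slowly, which matters at inference time when those running values are used instead of batch statistics.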

First, we initialize the (identity) shortcut (connection), which is really just a reference to the input data.

At the end of the residual module, we simply add the shortcut to the output of our pre-activation/bottleneck branch (Line 3).

On Lines 6–9, the first block of the ResNet module follows a BN => RELU => CONV pattern.

The CONV layer applies K/4 filters of size 1 x 1.

Notice that the bias term is turned off for the CONV layer: the BN layers that follow already include a bias term, so there’s no need for a second one.
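A small sketch shows why a bias in front of batch normalization is redundant. It uses a hypothetical standardize function on a plain list of activations (standing in for one channel) and an arbitrary bias of 5.0: the constant shift is removed again by the mean subtraction.

```python
def standardize(values, eps=1e-5):
    """Normalize a list of activations to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
with_bias = [a + 5.0 for a in activations]  # same activations plus a constant bias

without = standardize(activations)
shifted = standardize(with_bias)

# The largest difference between the two normalized outputs.
max_diff = max(abs(a - b) for a, b in zip(without, shifted))
print(max_diff)
```

Since the normalized outputs match, any shift the network needs can be learned by the BN layer's own offset instead.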

As per the bottleneck, the second CONV layer learns K/4 filters that are 3 x 3.

The final block will increase dimensionality once again, applying K filters with the dimensions 1 x 1.

To avoid applying max pooling, we need to check if reducing spatial dimensions is necessary.

If we are asked to reduce spatial dimensions, a convolutional layer with a stride > 1 will be applied to the shortcut (Lines 2–4).

Finally, we add together the shortcut and the final CONV layer creating the output to our ResNet module (Line 7).
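To summarize the module, here is a library-free sketch that only lists the layers in order rather than building them. It is my own reading of the description above, with some assumptions labeled in the comments: the stride is applied in the 3 x 3 CONV layer, and the shortcut convolution is 1 x 1 with K filters so that both branches have matching shapes. The function name residual_module_plan is hypothetical.

```python
def residual_module_plan(K, stride, red=False):
    """Describe a pre-activation bottleneck residual module.

    Returns (branch, shortcut): two lists of layer descriptions whose
    outputs are added together at the end of the module.
    """
    reduced = K // 4  # the first two CONV layers learn K/4 filters
    branch = [
        "BN", "RELU", f"CONV 1x1, {reduced} filters, stride (1, 1), no bias",
        # Assumption: the stride is applied in the 3x3 convolution.
        "BN", "RELU", f"CONV 3x3, {reduced} filters, stride {stride}, no bias",
        "BN", "RELU", f"CONV 1x1, {K} filters, stride (1, 1), no bias",
    ]
    if red:
        # Assumption: a 1x1 CONV with K filters so shapes match for the addition.
        shortcut = [f"CONV 1x1, {K} filters, stride {stride}, no bias"]
    else:
        shortcut = ["identity"]
    return branch, shortcut


branch, shortcut = residual_module_plan(128, (2, 2), red=True)
for layer in branch:
    print(layer)
print("shortcut:", shortcut)
```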

We finally have the “building block” to begin constructing our deep residual network.

Let’s start building the build method.

Take a look at the parameters stages and filters (which are both lists).

In our architecture (shown above) we’re stacking N number of residual modules on top of each other (N = stage value).

Each residual module in the same stage learns the same number of filters.

After each stage learns its respective filters, it is followed by dimensionality reduction.

We repeat this process until we are ready to apply the average pooling layer and softmax classifier.

Stages and Filters

For example, let’s set stages=(3, 4, 6) and filters=(64, 128, 256, 512).

The first filter (64) is applied to the only CONV layer not part of the residual module — the first CONV layer in the network.

Then, three (stage = 3) residual modules are stacked on top of each other — each one will learn 128 filters.

The spatial dimensions will be reduced, and then we stack four (stage = 4) residual modules on top of each other — each learning 256 filters.

Finally, we reduce spatial dimensions again and move on to stacking six (stage = 6) residual modules on top of each other, each learning 512 filters.
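The pairing of stages and filters in this example can be sketched in a few lines. The helper describe_stages is hypothetical and only mirrors the bookkeeping described above: the first filter belongs to the initial CONV layer, and each stage uses the next entry in filters.

```python
def describe_stages(stages, filters):
    """Pair every stage with its filter count.

    Returns (first_conv_filters, [(num_modules, filters_per_module), ...]).
    """
    if len(filters) != len(stages) + 1:
        raise ValueError("filters needs exactly one more entry than stages")
    first_conv = filters[0]
    stage_plan = [(stages[i], filters[i + 1]) for i in range(len(stages))]
    return first_conv, stage_plan


first_conv, plan = describe_stages((3, 4, 6), (64, 128, 256, 512))
print(first_conv)  # 64
print(plan)        # [(3, 128), (4, 256), (6, 512)]
```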

ResNet architecture. Circled numbers are the filter values, while the brackets show the stacks. Notice how there is a dimensionality reduction after every stage. This figure is unrelated to the written example earlier.

Let’s go back to building the build method.

Initialize inputShape and chanDim based on whether we are using “channels last” or “channels first” ordering (Lines 3–4).
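As a rough sketch of that step, the helper below picks the input shape and the channel axis from an ordering string. The function name and its data_format parameter are hypothetical; in a Keras model the ordering string would typically come from the backend's image_data_format() setting.

```python
def input_shape_and_channel_axis(width, height, depth, data_format="channels_last"):
    """Return (inputShape, chanDim) for the given channel ordering."""
    if data_format == "channels_last":
        # (rows, columns, channels): the channel axis is the last one.
        return (height, width, depth), -1
    if data_format == "channels_first":
        # (channels, rows, columns): the channel axis is axis 1 once the batch axis is added.
        return (depth, height, width), 1
    raise ValueError(f"unknown data format: {data_format}")


print(input_shape_and_channel_axis(32, 32, 3))
print(input_shape_and_channel_axis(32, 32, 3, data_format="channels_first"))
```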

As mentioned above, ResNet uses a BN layer as the first layer, adding an extra level of normalization to the input (Lines 2–4).

Then, we apply CONV => BN => ACT => POOL to reduce the spatial size (Lines 7–13).

Now, let’s start stacking residual layers on top of each other.

To reduce volume size without using pooling layers, we can change the stride of the convolution.

The first residual module of the first stage uses a stride of (1, 1), signaling the absence of downsampling.

For every stage after that, the first residual module uses a stride of (2, 2), which decreases the volume size.

This is shown on Line 5.

Then, we loop over the number of layers in the current stage (number of residual modules that will be stacked on top of each other) on Lines 10–13.

We use [i + 1] as the index into filters as the first filter was already used.

Once we’ve stacked stages[i] residual modules on top of each other, we return to Lines 6–7, where we decrease the spatial dimensions of the volume and repeat the process.
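The stacking loop can be sketched without any deep learning library by recording what each residual module would receive. This is my interpretation of the walkthrough, not the original listing: I assume the first module of every stage is the one that gets the stage's stride and red=True, and that the remaining stages[i] - 1 modules use a stride of (1, 1). The names stack_plan and spatial_size_after are hypothetical, and the size calculation assumes square inputs with “same” padding while ignoring the layers before the first stage.

```python
def stack_plan(stages, filters):
    """List (K, stride, red) for every residual module, in order."""
    plan = []
    for i, num_modules in enumerate(stages):
        # Only the first stage keeps the spatial size; later stages downsample.
        stride = (1, 1) if i == 0 else (2, 2)
        # filters[0] was used by the first CONV layer, hence the i + 1 index.
        plan.append((filters[i + 1], stride, True))
        for _ in range(num_modules - 1):
            plan.append((filters[i + 1], (1, 1), False))
    return plan


def spatial_size_after(plan, size):
    """Track the side length of a square volume through the planned modules."""
    for _, stride, _ in plan:
        size = -(-size // stride[0])  # ceiling division
    return size


plan = stack_plan((3, 4, 6), (64, 128, 256, 512))
print(len(plan))                     # 13 residual modules in total
print(plan[0], plan[3], plan[7])     # first module of each stage
print(spatial_size_after(plan, 32))  # 32 -> 32 -> 16 -> 8
```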

To avoid large, dense fully-connected layers, we’ll apply average pooling instead to reduce the spatial size of the volume:

Finally, we’ll create a dense layer for the total number of classes we are going to learn and then apply a softmax activation to generate our final output probabilities!

That concludes our build function, and now we have our fully constructed ResNet model!

You can call on this class to implement the ResNet architecture in your deep learning projects.
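For intuition about the classification head described above, here is a tiny plain-Python sketch: average pooling collapses each feature map to a single value, a dense layer maps those values to one score per class, and softmax turns the scores into probabilities. Everything here is a toy stand-in with made-up numbers and hypothetical function names, not the Keras layers themselves.

```python
import math


def global_average_pool(feature_maps):
    """Collapse each 2D feature map (one per channel) to its mean value."""
    return [sum(sum(row) for row in fmap) / sum(len(row) for row in fmap)
            for fmap in feature_maps]


def dense(inputs, weights, biases):
    """One score per class: a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(ws, inputs)) + b
            for ws, b in zip(weights, biases)]


def softmax(scores):
    """Convert scores to probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two 2x2 feature maps and three classes, with made-up weights.
maps = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 4.0]]]
pooled = global_average_pool(maps)
scores = dense(pooled, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
probs = softmax(scores)
print(pooled, scores, probs)
```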

If you have any questions, feel free to comment down below or reach out!

My LinkedIn: https://www.linkedin.com/in/gracelyn-shi-963028aa/

Email me at gracelyn.shi@gmail.com

References

[1] A. Rosebrock, Deep Learning for Computer Vision with Python (2017)

[2] K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition (2015), https://arxiv.org/abs/1512.03385