Residual blocks — Building blocks of ResNetUnderstanding a residual block is quite easy. In traditional neural networks, each layer feeds into the next layer. In a network with residual blocks, each layer feeds into the next layer and directly into the layers about 2–3 hops away. That’s it. But understanding the intuition behind why it was required in the first place, why it is so important and how similar it looks to some other state of the art architectures is where we are going to focus on. There are more than one interpretations of why residual blocks are awesome and how & why they are one of the key ideas that can make a neural network show state of the art performances on wide range of tasks. Before diving into the details, here is a picture of how a residual block actually looks like.Single Residual BlockWe know neural networks are universal function approximators and that the accuracy increases with increasing number of layers. But there is a limit to the number of layers added that result in accuracy improvement. So, if neural networks were universal function approximators then it should have been able to learn any simplex or complex function. But it turns out that, thanks to some problems like vanishing gradients and curse of dimensionality, if we have sufficiently deep networks, it may not be able to learn simple functions like an identity function. Now this is clearly undesirable.Also, if we still keep increasing the number of layers, we will see that the accuracy will start to saturate at one point and eventually degrade. And, this is usually not caused due to overfitting. So, it might seem that the shallower networks are learning better than their deeper counterparts and this is quite counter-intuitive. But this is what is seen in practice and is popularly known as the degradation problem.Without root causing into the degradation problem and the inability of a deep neural network to learn identity functions, let us start thinking about some possible solutions. In degradation problem, we know that shallower networks perform better than the deeper counterparts that have few more layers added to them. So, why not skip these extra layers and at-least match the accuracy of the shallow sub networks. But, how do you skip layers?You can skip the training of few layers using skip-connections or residual connections. This is what we see in the image above. In fact, if you look closely, we can directly learn an identity function by relying on skip connections only. This is the exact reason why skip connections are also called as identity shortcut connections too. One solution for all the problems!But why call it residual? Where is the residue? It’s time we let the mathematicians within us to come to the surface. Let us consider a neural network block, whose input is x and we would like to learn the true distribution H(x). Let us denote the difference (or the residual) between this asR(x) = Output — Input = H(x) — xRearranging it, we get,H(x) = R(x) + xOur residual block is overall trying to learn the true output, H(x) and if you look closely at the image above, you will realize that since we have an identity connection coming due to x, the layers are actually trying to learn the residual, R(x). So to summarize, the layers in a traditional network are learning the true output (H(x))whereas the layers in a residual network are learning the residual (R(x)). Hence, the name: Residual Block.It has also been observed that it is easier to learn residual of output and input, rather than only the input. As an added advantage, our network can now learn identity function by simply setting residual as zero. And if you truly understand backpropagation and how severe the problem of vanishing gradient becomes with increasing number of layers, then you can clearly see that because of these skip connections, we can propagate larger gradients to initial layers and these layers also could learn as fast as the final layers, giving us the ability to train deeper networks. The image below shows how we can arrange the residual block and identity connections for the optimal gradient flow. It has been observed that pre-activations with batch normalizations give the best results in general (i.e. the right-most residual block in the image below gives most promising results).Types of Residual BlockThere are even more interpretations of residual blocks and ResNets, other than the ones already discussed above. While training ResNets, we either train the layers in residual blocks or skip the training for those layers using skip connections. So for different training data points, different parts of networks will be trained at different rates based on how the error flows backwards in the network. This can be thought of as training an ensemble of different models on the dataset and getting the best possible accuracy.Skipping training in some residual block layers can be looked from an optimistic point of view too. In general, we do not know the optimal number of layers (or residual blocks) required for a neural network which might depend on the complexity of the dataset. Instead of treating number of layers an important hyperparameter to tune, by adding skip connections to our network, we are allowing the network to skip training for the layers that are not useful and do not add value in overall accuracy. In a way, skip connections make our neural networks dynamic, so that it may optimally tune the number of layers during training.The image below shows multiple interpretations of a residual block.Different interpretations of Residual BlockLet us go little into the history of skip connections. The idea of skipping connections between the layers was first introduced in Highway Networks. Highway networks had skip connections with gates that controlled how much information is passed through them and these gates can be trained to open selectively. This idea is also seen in the LSTM networks that control how much information flows from the past data points seen by the network. These gates work similar to control of memory flow from the previously seen data points. Same idea is shown in the image below.Similar to LSTM BlockResidual blocks are basically a special case of highway networks without any gates in their skip connections. Essentially, residual blocks allows the flow of memory (or information) from initial layers to last layers. Despite the absence of gates in their skip connections, residual networks perform as good as any other highway network in practice. And before ending this article, below is an image of how the collection of all residual blocks completes into a ResNet .ResNet architecturesIf you are interested in knowing more about ResNet overall and its different variants, checkout this article.