Pooling can be used to down sample the content of feature maps, reducing their width and height whilst maintaining their salient features.
A problem with deep convolutional neural networks is that the number of feature maps often increases with the depth of the network.
This problem can result in a dramatic increase in the number of parameters and computation required when larger filter sizes are used, such as 5×5 and 7×7.
To address this problem, a 1×1 convolutional layer can be used that offers a channel-wise pooling, often called feature map pooling or a projection layer.
This simple technique can be used for dimensionality reduction, decreasing the number of feature maps whilst retaining their salient features.
It can also be used directly to create a one-to-one projection of the feature maps to pool features across channels or to increase the number of feature maps, such as after traditional pooling layers.
In this tutorial, you will discover how to use 1×1 filters to control the number of feature maps in a convolutional neural network.
After completing this tutorial, you will know:Let’s get started.
A Gentle Introduction to 1×1 Convolutions to Reduce the Complexity of Convolutional Neural NetworksPhoto copyright, some rights reserved.
This tutorial is divided into five parts; they are:Recall that a convolutional operation is a linear application of a smaller filter to a larger input that results in an output feature map.
A filter applied to an input image or input feature map always results in a single number.
The systematic left-to-right and top-to-bottom application of the filter to the input results in a two-dimensional feature map.
One filter creates one corresponding feature map.
A filter must have the same depth or number of channels as the input, yet, regardless of the depth of the input and the filter, the resulting output is a single number and one filter creates a feature map with a single channel.
Let’s make this concrete with some examples:The depth of the output of one convolutional layer is only defined by the number of parallel filters applied to the input.
Take my free 7-day email crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Download Your FREE Mini-CourseThe depth of the input or number of filters used in convolutional layers often increases with the depth of the network, resulting in an increase in the number of resulting feature maps.
It is a common model design pattern.
Further, some network architectures, such as the inception architecture, may also concatenate the output feature maps from multiple convolutional layers, which may also dramatically increase the depth of the input to subsequent convolutional layers.
A large number of feature maps in a convolutional neural network can cause a problem as a convolutional operation must be performed down through the depth of the input.
This is a particular problem if the convolutional operation being performed is relatively large, such as 5×5 or 7×7 pixels, as it can result in considerably more parameters (weights) and, in turn, computation to perform the convolutional operations (large space and time complexity).
Pooling layers are designed to downscale feature maps and systematically halve the width and height of feature maps in the network.
Nevertheless, pooling layers do not change the number of filters in the model, the depth, or number of channels.
Deep convolutional neural networks require a corresponding pooling type of layer that can downsample or reduce the depth or number of feature maps.
The solution is to use a 1×1 filter to down sample the depth or number of feature maps.
A 1×1 filter will only have a single parameter or weight for each channel in the input, and like the application of any filter results in a single output value.
This structure allows the 1×1 filter to act like a single neuron with an input from the same position across each of the feature maps in the input.
This single neuron can then be applied systematically with a stride of one, left-to-right and top-to-bottom without any need for padding, resulting in a feature map with the same width and height as the input.
The 1×1 filter is so simple that it does not involve any neighboring pixels in the input; it may not be considered a convolutional operation.
Instead, it is a linear weighting or projection of the input.
Further, a nonlinearity is used as with other convolutional layers, allowing the projection to perform non-trivial computation on the input feature maps.
This simple 1×1 filter provides a way to usefully summarize the input feature maps.
The use of multiple 1×1 filters, in turn, allows the tuning of the number of summaries of the input feature maps to create, effectively allowing the depth of the feature maps to be increased or decreased as needed.
A convolutional layer with a 1×1 filter can, therefore, be used at any point in a convolutional neural network to control the number of feature maps.
As such, it is often referred to as a projection operation or projection layer, or even a feature map or channel pooling layer.
Now that we know that we can control the number of feature maps with 1×1 filters, let’s make it concrete with some examples.
We can make the use of a 1×1 filter concrete with some examples.
Consider that we have a convolutional neural network that expected color images input with the square shape of 256x256x3 pixels.
These images then pass through a first hidden layer with 512 filters, each with the size of 3×3 with the same padding, followed by a ReLU activation function.
The example below demonstrates this simple model.
Running the example creates the model and summarizes the model architecture.
There are no surprises; the output of the first hidden layer is a block of feature maps with the three-dimensional shape of 256x256x512.
A 1×1 filter can be used to create a projection of the feature maps.
The number of feature maps created will be the same number and the effect may be a refinement of the features already extracted.
This is often called channel-wise pooling, as opposed to traditional feature-wise pooling on each channel.
It can be implemented as follows:We can see that we use the same number of features and still follow the application of the filter with a rectified linear activation function.
The complete example is listed below.
Running the example creates the model and summarizes the architecture.
We can see that no change is made to the width or height of the feature maps, and by design, the number of feature maps is kept constant with a simple projection operation applied.
The 1×1 filter can be used to decrease the number of feature maps.
This is the most common application of this type of filter and in this way, the layer is often called a feature map pooling layer.
In this example, we can decrease the depth (or channels) from 512 to 64.
This might be useful if the subsequent layer we were going to add to our model would be another convolutional layer with 7×7 filters.
These filters would only be applied at a depth of 64 rather than 512.
The composition of the 64 feature maps is not the same as the original 512, but contains a useful summary of dimensionality reduction that captures the salient features, such that the 7×7 operation may have a similar effect on the 64 feature maps as it might have on the original 512.
Further, a 7×7 convolutional layer with 64 filters itself applied to the 512 feature maps output by the first hidden layer would result in approximately one million parameters (weights).
If the 1×1 filter is used to reduce the number of feature maps to 64 first, then the number of parameters required for the 7×7 layer is only approximately 200,000, an enormous difference.
The complete example of using a 1×1 filter for dimensionality reduction is listed below.
Running the example creates the model and summarizes its structure.
We can see that the width and height of the feature maps are unchanged, yet the number of feature maps was reduced from 512 to 64.
The 1×1 filter can be used to increase the number of feature maps.
This is a common operation used after a pooling layer prior to applying another convolutional layer.
The projection effect of the filter can be applied as many times as needed to the input, allowing the number of feature maps to be scaled up and yet have a composition that captures the salient features of the original.
We can increase the number of feature maps from 512 input from the first hidden layer to double the size at 1,024 feature maps.
The complete example is listed below.
Running the example creates the model and summarizes its structure.
We can see that the width and height of the feature maps are unchanged and that the number of feature maps was increased from 512 to double the size at 1,024.
Now that we are familiar with how to use 1×1 filters, let’s look at some examples where they have been used in the architecture of convolutional neural network models.
In this section, we will highlight some important examples where 1×1 filters have been used as key elements in modern convolutional neural network model architectures.
The 1×1 filter was perhaps first described and popularized in the 2013 paper by Min Lin, et al.
in their paper titled “Network In Network.
”In the paper, the authors propose the need for an MLP convolutional layer and the need for cross-channel pooling to promote learning across channels.
This cascaded cross channel parametric pooling structure allows complex and learnable interactions of cross channel information.
— Network In Network, 2013.
They describe a 1×1 convolutional layer as a specific implementation of cross-channel parametric pooling, which, in effect, that is exactly what a 1×1 filter achieves.
Each pooling layer performs weighted linear recombination on the input feature maps, which then go through a rectifier linear unit.
[…] The cross channel parametric pooling layer is also equivalent to a convolution layer with 1×1 convolution kernel.
— Network In Network, 2013.
The 1×1 filter was used explicitly for dimensionality reduction and for increasing the dimensionality of feature maps after pooling in the design of the inception module, used in the GoogLeNet model by Christian Szegedy, et al.
in their 2014 paper titled “Going Deeper with Convolutions.
”The paper describes an “inception module” where an input block of feature maps is processed in parallel by different convolutional layers each with differently sized filters, where a 1×1 size filter is one of the layers used.
Example of the Naive Inception ModuleTaken from Going Deeper with Convolutions, 2014.
The output of the parallel layers are then stacked, channel-wise, resulting in very deep stacks of convolutional layers to be processed by subsequent inception modules.
The merging of the output of the pooling layer with the outputs of convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage.
Even while this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.
— Going Deeper with Convolutions, 2014.
The inception module is then redesigned to use 1×1 filters to reduce the number of feature maps prior to parallel convolutional layers with 5×5 and 7×7 sized filters.
This leads to the second idea of the proposed architecture: judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise.
[…] That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions.
Besides being used as reductions, they also include the use of rectified linear activation which makes them dual-purpose— Going Deeper with Convolutions, 2014.
The 1×1 filter is also used to increase the number of feature maps after pooling, artificially creating more projections of the downsampled feature map content.
Example of the Inception Module With Dimensionality ReductionTaken from Going Deeper with Convolutions, 2014.
The 1×1 filter was used as a projection technique to match the number of filters of input to the output of residual modules in the design of the residual network by Kaiming He, et al.
in their 2015 paper titled “Deep Residual Learning for Image Recognition.
”The authors describe an architecture comprised of “residual modules” where the input to a module is added to the output of the module in what is referred to as a shortcut connection.
Because the input is added to the output of the module, the dimensionality must match in terms of width, height, and depth.
Width and height can be maintained via padding, although a 1×1 filter is used to change the depth of the input as needed so that it can be added with the output of the module.
This type of connection is referred to as a projection shortcut connection.
Further, the residual modules use a bottleneck design with 1×1 filters to reduce the number of feature maps for computational efficiency reasons.
The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions.
— Deep Residual Learning for Image Recognition, 2015.
Example of a Normal and Bottleneck Residual Modules With Shortcut ConnectionsTaken from Deep Residual Learning for Image Recognition, 2015.
This section provides more resources on the topic if you are looking to go deeper.
In this tutorial, you discovered how to use 1×1 filters to control the number of feature maps in a convolutional neural network.
Specifically, you learned:Do you have any questions?.Ask your questions in the comments below and I will do my best to answer.
…with just a few lines of python codeDiscover how in my new Ebook: Deep Learning for Computer VisionIt provides self-study tutorials on topics like: classification, object detection (yolo and rcnn), face recognition (vggface and facenet), data preparation and much more…Skip the Academics.
Click to learn more.