Sure, there are.
Depthwise separable convolutions reduce the number of parameters in the convolution.
As such, for a small model, the model capacity may be decreased significantly if the 2D convolutions are replaced by depthwise separable convolutions.
As a result, the model may become sub-optimal.
However, if properly used, depthwise separable convolutions can give you the efficiency without dramatically damaging your model performance.
Flattened convolutions

The flattened convolution was introduced in the paper “Flattened convolutional neural networks for feedforward acceleration”.
Intuitively, the idea is to apply filter separation.
Instead of applying one standard convolution filter to map the input layer to an output layer, we separate this standard filter into three 1D filters.
This idea is similar to that of the spatially separable convolution described above, where a spatial filter is approximated by two rank-1 filters.
The image is adapted from the paper.
Notice that if the standard convolution filter is a rank-1 filter, it can always be separated into the outer product of three 1D filters.
But this is a strong condition and the intrinsic rank of the standard filter is higher than one in practice.
As pointed out in the paper: “As the difficulty of classification problem increases, the more number of leading components is required to solve the problem… Learned filters in deep networks have distributed eigenvalues and applying the separation directly to the filters results in significant information loss.”
To alleviate this problem, the paper restricts connections in receptive fields so that the model can learn 1D separated filters during training.
The paper claims that flattened networks, which consist of a consecutive sequence of 1D filters across all directions in 3D space, provide performance comparable to standard convolutional networks at a much lower computation cost, thanks to the significant reduction in learned parameters.
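The rank-1 case mentioned above can be checked with a small sketch. It is not the paper's implementation: the helper names, the toy 3 x 3 x 3 sizes, and the order in which the 1D filters are applied (depth, then height, then width) are assumptions made for illustration. It only shows that, for a rank-1 filter, three consecutive 1D filters give the same response on a patch as the full 3D filter, with far fewer weights.

```python
def rank1_filter(depth_vec, height_vec, width_vec):
    """Build a 3D filter as the outer product of three 1D filters."""
    return [[[d * h * w for w in width_vec] for h in height_vec] for d in depth_vec]


def apply_full(filt, patch):
    """Dot product of a 3D filter with an input patch of the same size."""
    return sum(
        filt[c][y][x] * patch[c][y][x]
        for c in range(len(filt))
        for y in range(len(filt[0]))
        for x in range(len(filt[0][0]))
    )


def apply_flattened(depth_vec, height_vec, width_vec, patch):
    """Apply the three 1D filters one after another."""
    # Collapse the channel axis: result is a 2D map of shape (height, width).
    lateral = [
        [sum(depth_vec[c] * patch[c][y][x] for c in range(len(depth_vec)))
         for x in range(len(width_vec))]
        for y in range(len(height_vec))
    ]
    # Collapse the height axis: result is a 1D row of length width.
    vertical = [
        sum(height_vec[y] * lateral[y][x] for y in range(len(height_vec)))
        for x in range(len(width_vec))
    ]
    # Collapse the width axis: result is a single scalar response.
    return sum(width_vec[x] * vertical[x] for x in range(len(width_vec)))


# Toy values (assumed for illustration only).
depth_vec, height_vec, width_vec = [1, 2, 3], [1, 0, -1], [2, 1, 2]
patch = [[[c + y * x for x in range(3)] for y in range(3)] for c in range(3)]

full = apply_full(rank1_filter(depth_vec, height_vec, width_vec), patch)
flat = apply_flattened(depth_vec, height_vec, width_vec, patch)
print(full == flat)  # the two responses agree for a rank-1 filter
print("weights:", 3 * 3 * 3, "for the 3D filter vs", 3 + 3 + 3, "for three 1D filters")
```

For filters of higher rank the equality no longer holds, which is why the paper trains the 1D filters directly instead of factorizing learned 3D filters.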
Grouped Convolution

Grouped convolution was introduced in the AlexNet paper (link) in 2012.
The main reason for implementing it was to allow the network to be trained across two GPUs with limited memory (3 GB per GPU on the GTX 580 cards used in the paper).
The AlexNet architecture below shows two separate convolution paths in most of the layers.
It is doing model parallelization across two GPUs (of course, one can parallelize across more GPUs if they are available).
This image is adapted from the AlexNet paper.
Here we describe how the grouped convolutions work.
First of all, conventional 2D convolutions follow the steps shown below.
In this example, the input layer of size (7 x 7 x 3) is transformed into the output layer of size (5 x 5 x 128) by applying 128 filters (each filter is of size 3 x 3 x 3).
Or, in the general case, the input layer of size (Hin x Win x Din) is transformed into the output layer of size (Hout x Wout x Dout) by applying Dout kernels (each of size h x w x Din).
Standard 2D convolution.
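The shape arithmetic of this example can be sketched in a few lines. The function names are hypothetical, and the sketch assumes stride 1, no padding, and no bias terms, which is what the (7 x 7 x 3) to (5 x 5 x 128) example implies.

```python
def conv2d_output_shape(h_in, w_in, kernel_h, kernel_w, d_out, stride=1, padding=0):
    """Output size (Hout, Wout, Dout) of a standard 2D convolution."""
    h_out = (h_in + 2 * padding - kernel_h) // stride + 1
    w_out = (w_in + 2 * padding - kernel_w) // stride + 1
    return h_out, w_out, d_out


def conv2d_weight_count(kernel_h, kernel_w, d_in, d_out):
    """Number of weights: Dout kernels, each of size h x w x Din (biases ignored)."""
    return kernel_h * kernel_w * d_in * d_out


print(conv2d_output_shape(7, 7, 3, 3, 128))  # (5, 5, 128)
print(conv2d_weight_count(3, 3, 3, 128))     # 3456
```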
In grouped convolution, the filters are separated into different groups.
Each group is responsible for a conventional 2D convolution over a certain depth.
The following examples can make this clearer.
Grouped convolution with 2 filter groups.
Above is the illustration of grouped convolution with 2 filter groups.
In each filter group, the depth of each filter is only half of that in the nominal 2D convolution.
The filters are of depth Din / 2.
Each filter group contains Dout / 2 filters.
The first filter group (red) convolves with the first half of the input layer ([:, :, 0:Din/2]), while the second filter group (blue) convolves with the second half of the input layer ([:, :, Din/2:Din]).
As a result, each filter group creates Dout/2 channels.
Overall, two groups create 2 x Dout/2 = Dout channels.
We then stack these channels in the output layer with Dout channels.
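To make the channel slicing concrete, here is a minimal pure-Python sketch of a grouped convolution. It is written for readability, not speed; the function names, the nested-list layout ([channel][row][col]) and the toy sizes are assumptions for illustration. Each group convolves only its own slice of the input channels, and the per-group outputs are stacked along the channel axis.

```python
def conv2d(inp, filters):
    """Stride-1 convolution without padding (cross-correlation, as in most DL libraries).

    inp:     [channel][row][col]
    filters: [filter][channel][row][col]
    returns: [filter][row][col]
    """
    d_in, h_in, w_in = len(inp), len(inp[0]), len(inp[0][0])
    k_h, k_w = len(filters[0][0]), len(filters[0][0][0])
    out = []
    for f in filters:
        channel = []
        for i in range(h_in - k_h + 1):
            row = []
            for j in range(w_in - k_w + 1):
                row.append(sum(
                    f[c][u][v] * inp[c][i + u][j + v]
                    for c in range(d_in) for u in range(k_h) for v in range(k_w)
                ))
            channel.append(row)
        out.append(channel)
    return out


def grouped_conv2d(inp, filter_groups):
    """filter_groups[g] holds the filters of group g, each of depth Din / groups."""
    groups = len(filter_groups)
    depth = len(inp) // groups  # number of input channels seen by each group
    out = []
    for g, filters in enumerate(filter_groups):
        # Group g only sees input channels [g * depth, (g + 1) * depth).
        out.extend(conv2d(inp[g * depth:(g + 1) * depth], filters))
    return out


# Toy sizes (assumed): 4 input channels of 5 x 5, 2 groups,
# 3 filters of size 3 x 3 x 2 per group, so Dout = 6.
inp = [[[c + i * j for j in range(5)] for i in range(5)] for c in range(4)]
ones = [[[[1] * 3 for _ in range(3)] for _ in range(2)] for _ in range(3)]
out = grouped_conv2d(inp, [ones, ones])
print(len(out), len(out[0]), len(out[0][0]))  # 6 3 3
```

Changing the second half of the input channels leaves the first three output channels untouched, which is exactly the independence between filter groups described above.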
Grouped convolution vs. depthwise convolution

You may have already noticed some links and differences between grouped convolution and the depthwise convolution used in depthwise separable convolution.
If the number of filter groups equals the number of input channels, each filter has depth Din / Din = 1.
This is the same filter depth as in depthwise convolution.
On the other hand, each filter group now contains Dout / Din filters.
Overall, the output layer is of depth Dout.
This differs from depthwise convolution, which does not change the layer depth.
The layer depth is extended later by 1×1 convolution in the depthwise separable convolution.
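The comparison comes down to simple arithmetic on the group count. The helper below is hypothetical and the channel counts are toy values chosen for illustration.

```python
def group_layout(d_in, d_out, groups):
    """Filter depth and filters per group for a grouped convolution."""
    if d_in % groups or d_out % groups:
        raise ValueError("groups must divide both d_in and d_out")
    return {
        "filter_depth": d_in // groups,
        "filters_per_group": d_out // groups,
        "output_depth": d_out,
    }


# Two filter groups: each filter spans half of the input channels.
print(group_layout(d_in=4, d_out=8, groups=2))
# As many groups as input channels: filter depth 1, Dout / Din filters per group.
print(group_layout(d_in=4, d_out=8, groups=4))
# Additionally Dout == Din: one depth-1 filter per channel, layer depth unchanged,
# which matches the depthwise convolution step.
print(group_layout(d_in=4, d_out=4, groups=4))
```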
There are a few advantages of doing grouped convolution.
The first advantage is efficient training.
Since the convolutions are divided into several paths, each path can be handled separately by a different GPU.
This allows the model to be trained over multiple GPUs in parallel.
Such model parallelization over multiple GPUs allows more images to be fed into the network per step, compared to training everything on one GPU.
In this setting, model parallelization is considered preferable to data parallelization.
The latter splits the dataset into batches, and we then train on each batch.
However, when the batch size becomes too small, we are essentially doing stochastic rather than batch gradient descent.
This would result in slower and sometimes poorer convergence.
Grouped convolutions become important for training very deep neural nets, as in the ResNeXt shown below.
The image is adapted from the ResNeXt paper.
The second advantage is that the model is more efficient, i.e., the number of model parameters decreases as the number of filter groups increases.
In the previous examples, filters have h x w x Din x Dout parameters in a nominal 2D convolution.
Filters in a grouped convolution with 2 filter groups have (h x w x Din/2 x Dout/2) x 2 parameters.
The number of parameters is reduced by half.
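The same count generalizes to any number of filter groups: the total is h x w x Din x Dout divided by the number of groups. A short sketch, with a hypothetical function name and assumed toy sizes (3 x 3 kernels, 64 input and 128 output channels, biases ignored):

```python
def grouped_conv_weight_count(kernel_h, kernel_w, d_in, d_out, groups=1):
    """Weights in a grouped convolution: `groups` sets of (h x w x Din/g x Dout/g)."""
    if d_in % groups or d_out % groups:
        raise ValueError("groups must divide both d_in and d_out")
    per_group = kernel_h * kernel_w * (d_in // groups) * (d_out // groups)
    return per_group * groups


for g in (1, 2, 4, 8):
    print(g, grouped_conv_weight_count(3, 3, 64, 128, groups=g))
```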
The third advantage is a bit surprising.
Grouped convolution may provide a better model than a nominal 2D convolution.
Another fantastic blog post (link) explains it.
Here is a brief summary.
The reason is linked to the sparse relationships between filters.
The image below shows the correlation across filters of adjacent layers.
The relationship is sparse.
The correlation matrix between filters of adjacent layers in a Network-in-Network model trained on CIFAR10.
Pairs of highly correlated filters are brighter, while less correlated filters are darker.
The image is adapted from this article.
How about the correlation map for grouped convolution?
The correlations between filters of adjacent layers in a Network-in-Network model trained on CIFAR10, when trained with 1, 2, 4, 8 and 16 filter groups.
The image is adapted from this article.
The image above is the correlation across filters of adjacent layers, when the model is trained with 1, 2, 4, 8, and 16 filter groups.
The article proposes one line of reasoning (link): “The effect of filter groups is to learn with a block-diagonal structured sparsity on the channel dimension… the filters with high correlation are learned in a more structured way in the networks with filter groups.
In effect, filter relationships that don’t have to be learned are no longer parameterized.
In reducing the number of parameters in the network in this salient way, it is not as easy to over-fit, and hence a regularization-like effect allows the optimizer to learn more accurate, more efficient deep networks.”
AlexNet conv1 filter separation: as noted by the authors, filter groups appear to structure learned filters into two distinct groups, black-and-white and color filters.
The image is adapted from the AlexNet paper.
In addition, each filter group learns a unique representation of the data.
As noted by the AlexNet authors, filter groups appear to structure the learned filters into two distinct groups: black-and-white filters and color filters.
Shuffled Grouped Convolution

Shuffled grouped convolution was introduced in ShuffleNet from Megvii Inc (Face++).
ShuffleNet is a computation-efficient convolution architecture, designed especially for mobile devices with very limited computing power.
The ideas behind the shuffled grouped convolution are linked to the ideas behind grouped convolution (used in MobileNet and ResNeXt, for example) and depthwise separable convolution (used in Xception).
Overall, the shuffled grouped convolution involves grouped convolution and channel shuffling.
From the section on grouped convolution, we know that the filters are separated into different groups.
Each group is responsible for a conventional 2D convolution over a certain depth.
The total number of operations is significantly reduced.
For example, in the figure below, we have 3 filter groups.
The first filter group convolves with the red portion of the input layer.
Similarly, the second and third filter groups convolve with the green and blue portions of the input.
The kernel depth in each filter group is only 1/3 of the total channel count in the input layer.
In this example, after the first grouped convolution GConv1, the input layer is mapped to the intermediate feature map.
This feature map is then mapped to the output layer through the second grouped convolution GConv2.
Grouped convolution is computationally efficient.
But the problem is that each filter group only handles information passed down from a fixed portion of the previous layer.
For example, in the image above, the first filter group (red) only processes information that is passed down from the first 1/3 of the input channels.
The last filter group (blue) only processes information that is passed down from the last 1/3 of the input channels.
As such, each filter group is limited to learning a few specific features.
This property blocks information flow between channel groups and weakens representations during training.
To overcome this problem, we apply the channel shuffle.
The idea of channel shuffle is that we want to mix up the information from different filter groups.
In the image below, we get the feature map after applying the first grouped convolution GConv1 with 3 filter groups.
Before feeding this feature map into the second grouped convolution, we first divide the channels in each group into several subgroups.
Then we mix up these subgroups.
After such shuffling, we continue performing the second grouped convolution GConv2 as usual.
But now, since the information in the shuffled layer has already been mixed, we essentially feed each group in GConv2 with different subgroups in the feature map layer (or in the input layer).
As a result, we allow information to flow between channel groups and strengthen the representations.
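The shuffle itself is just a reordering of channels. The sketch below follows the reshape, transpose, flatten recipe described in the ShuffleNet paper, written as plain list indexing; the function name and the channel labels are made up for illustration.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so every group of the next layer sees every previous group.

    Equivalent to reshaping the channel axis to (groups, n), transposing it to
    (n, groups), and flattening it back.
    """
    if len(channels) % groups:
        raise ValueError("number of channels must be divisible by groups")
    n = len(channels) // groups  # channels per group
    return [channels[g * n + i] for i in range(n) for g in range(groups)]


# Output of GConv1 with 3 groups (red, green, blue), 3 channels each.
labels = ["r0", "r1", "r2", "g0", "g1", "g2", "b0", "b1", "b2"]
shuffled = channel_shuffle(labels, groups=3)
print(shuffled)  # ['r0', 'g0', 'b0', 'r1', 'g1', 'b1', 'r2', 'g2', 'b2']
```

After the shuffle, each consecutive block of three channels, which is what one group of GConv2 consumes, contains one subgroup from each of the red, green, and blue groups.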
Pointwise grouped convolution

The ShuffleNet paper (link) also introduced the pointwise grouped convolution.
Typically, for grouped convolution such as in MobileNet (link) or ResNeXt (link), the group operation is performed on the 3 x 3 spatial convolution, but not on the 1 x 1 convolution.
The ShuffleNet paper argues that the 1 x 1 convolutions are also computationally costly.
It suggests applying grouped convolution to the 1 x 1 convolutions as well.
The pointwise grouped convolution, as the name suggests, performs the group operation on 1 x 1 convolutions.
The operation is identical to that of grouped convolution, with only one modification: it is performed on 1 x 1 filters instead of N x N filters (N > 1).
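A rough way to see the saving is to count multiply-adds. The function below is a hypothetical helper, and the feature map and channel sizes are assumed values for illustration, not figures from the paper.

```python
def conv_multiply_adds(h_out, w_out, kernel_h, kernel_w, d_in, d_out, groups=1):
    """Multiply-adds of a (grouped) convolution: each output value sums over
    kernel_h x kernel_w x (Din / groups) inputs."""
    if d_in % groups or d_out % groups:
        raise ValueError("groups must divide both d_in and d_out")
    return h_out * w_out * kernel_h * kernel_w * (d_in // groups) * d_out


# 1 x 1 convolution on an assumed 28 x 28 feature map with 240 channels in and out.
dense = conv_multiply_adds(28, 28, 1, 1, 240, 240)
grouped = conv_multiply_adds(28, 28, 1, 1, 240, 240, groups=3)
print(dense, grouped, dense // grouped)  # cost drops by a factor equal to the group count
```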
In the ShuffleNet paper, the authors used three types of convolutions we have covered: (1) shuffled grouped convolution; (2) pointwise grouped convolution; and (3) depthwise separable convolution.
Such an architecture design significantly reduces the computation cost while maintaining accuracy.
For example, the classification errors of ShuffleNet and AlexNet are comparable on actual mobile devices.
However, the computation cost has been dramatically reduced from 720 MFLOPs in AlexNet down to 40–140 MFLOPs in ShuffleNet.
With its relatively small computation cost and good model performance, ShuffleNet gained popularity in the field of convolutional neural nets for mobile devices.
Thank you for reading the article.
Please feel free to leave questions and comments below.
Reference

Blogs & articles
“An Introduction to different Types of Convolutions in Deep Learning” (Link)
“Review: DilatedNet — Dilated Convolution (Semantic Segmentation)” (Link)
“ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices” (Link)
Separable convolutions: “A Basic Introduction to Separable Convolutions” (Link)
Inception network: “A Simple Guide to the Versions of the Inception Network” (Link)
“A Tutorial on Filter Groups (Grouped Convolution)” (Link)
“Convolution arithmetic animation” (Link)
“Up-sampling with Transposed Convolution” (Link)
“Intuitively Understanding Convolutions for Deep Learning” (Link)

Papers
Network in Network (Link)
Multi-Scale Context Aggregation by Dilated Convolutions (Link)
Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs (Link)
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices (Link)
A guide to convolution arithmetic for deep learning (Link)
Going deeper with convolutions (Link)
Rethinking the Inception Architecture for Computer Vision (Link)
Flattened convolutional neural networks for feedforward acceleration (Link)
Xception: Deep Learning with Depthwise Separable Convolutions (Link)
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (Link)
Deconvolution and Checkerboard Artifacts (Link)
ResNeXt: Aggregated Residual Transformations for Deep Neural Networks (Link)