ESPNetv2 for Semantic Segmentation

Sachin Mehta · Jun 7

Nowadays, a number of real-world applications, such as autonomous vehicles, involve visual scene understanding.

Semantic segmentation is one of the main tasks that opens the way for visual scene understanding.

However, it is one of the most computationally expensive tasks in computer vision.

This article provides an overview of the efficient semantic segmentation network described in the ESPNetv2 paper.

ESPNet vs ESPNetv2: ESPNetv2 (accepted at CVPR’19) is a general-purpose architecture that can be used for modeling both visual and sequential data.

ESPNetv2 extends ESPNet (accepted at ECCV’18) with depth-wise dilated separable convolutions and generalizes it across different tasks, including image classification, object detection, semantic segmentation, and language modeling.

Source code: Our source code, along with pre-trained models on different datasets, is available on GitHub.

Semantic segmentation

Semantic segmentation is a fine-grained inference task that predicts a label for each pixel in an image.

Examples of foreground-background and full scene segmentation tasks are shown below.

Figure 1: The top row visualizes a foreground-background segmentation task (e.g., the PASCAL VOC dataset), while the bottom row visualizes a full scene segmentation task (e.g., the Cityscapes dataset).

Overview of encoder-decoder networks

Most efficient segmentation networks, including ENet and U-Net, use an encoder-decoder structure.

In simple words, the encoder-decoder structure comprises two components: (1) an encoder and (2) a decoder.

The encoder takes an RGB image as input and learns representations at multiple scales by performing convolutional and down-sampling operations.

As a consequence of down-sampling operations, spatial resolution and fine-grained details are lost.

The decoder recovers this lost information by performing up-sampling and convolutional operations.

The figure below visualizes a vanilla encoder-decoder network.

Figure 2: A vanilla encoder-decoder network.

The green boxes in the encoder and the decoder represent convolutional layers, while the red and orange boxes represent down-sampling and up-sampling layers, respectively.

As is common practice, information is shared between the encoder and the decoder using skip connections.

These skip connections have been shown to be very effective.

See U-Net for more details.
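To make the encoder-decoder data flow concrete, here is a minimal PyTorch sketch of a vanilla encoder-decoder with a single skip connection. The class name, channel widths, and number of stages are illustrative assumptions, not the architecture from any specific paper.

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Minimal encoder-decoder sketch with one skip connection (illustrative)."""
    def __init__(self, in_ch=3, num_classes=21):
        super().__init__()
        # Encoder: convolution, then down-sampling halves the spatial resolution
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        # Decoder: up-sampling restores resolution; the skip re-injects detail
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU())
        self.classifier = nn.Conv2d(16, num_classes, 1)  # per-pixel class scores

    def forward(self, x):
        s1 = self.enc1(x)             # full-resolution features (saved for skip)
        e = self.enc2(self.down(s1))  # half-resolution features
        d = self.up(e)                # back to full resolution
        d = self.dec(torch.cat([d, s1], dim=1))  # fuse via skip connection
        return self.classifier(d)

x = torch.randn(1, 3, 64, 64)
out = TinyEncoderDecoder()(x)
print(out.shape)  # torch.Size([1, 21, 64, 64])
```

Note that the output has one channel of scores per class at the input resolution, which is exactly what per-pixel prediction requires.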

Segmentation architecture in ESPNetv2

Like ESPNet, ESPNetv2 uses an encoder-decoder architecture for semantic segmentation; however, it uses more powerful and efficient encoding and decoding blocks: (1) the Extremely Efficient Spatial Pyramid of Dilated Convolutions (EESP) module for the encoder and (2) the Efficient Pyramid Pooling (EPP) module for the decoder.

The EESP module for the encoder: To reduce computation, the EESP module replaces the computationally expensive standard convolutional layers in the ESP module with efficient depth-wise dilated separable convolutions.

Figure 3 provides a comparison between the ESP and the EESP module.

For more details about these blocks, please see our paper, ESPNetv2.

Figure 3: Comparison between the ESP module and the EESP module.

Each convolutional layer (Conv-n: n×n standard convolution, GConv-n: n×n group convolution, DConv-n: n×n dilated convolution, DDConv-n: n×n depth-wise dilated convolution) is denoted by (# input channels, # output channels, and dilation rate).

HFF denotes hierarchical feature fusion.

See ESPNet and ESPNetv2 papers for more details.
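The efficiency gain of the depth-wise dilated separable factorization can be illustrated in a few lines of PyTorch: a depth-wise dilated convolution filters each channel separately (enlarging the receptive field via dilation at no extra cost), and a point-wise 1×1 convolution then mixes channels. The helper name and channel counts below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def ddconv(channels, kernel_size=3, dilation=1):
    """Depth-wise dilated convolution followed by a point-wise (1x1)
    convolution -- the efficient factorization of a standard dilated
    convolution (illustrative sketch)."""
    padding = dilation * (kernel_size - 1) // 2  # keep spatial size unchanged
    return nn.Sequential(
        # depth-wise: groups == channels, so one filter per channel
        nn.Conv2d(channels, channels, kernel_size, padding=padding,
                  dilation=dilation, groups=channels, bias=False),
        # point-wise: mixes information across channels
        nn.Conv2d(channels, channels, 1, bias=False),
    )

x = torch.randn(1, 8, 32, 32)
y = ddconv(8, dilation=4)(x)
print(y.shape)  # spatial size preserved: torch.Size([1, 8, 32, 32])

# Parameter comparison against a standard dilated convolution
standard = nn.Conv2d(8, 8, 3, padding=4, dilation=4, bias=False)
params = lambda m: sum(p.numel() for p in m.parameters())
print(params(standard), params(ddconv(8, dilation=4)))  # 576 vs 136
```

For 8 channels, the factorized version needs 8·3·3 + 8·8 = 136 weights versus 8·8·3·3 = 576 for the standard convolution, and the gap widens as the channel count grows.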

The EPP module for the decoder: Sub-sampling allows a network to learn scale-invariant representations.

These operations are very effective and are key components of different (and popular) computer vision algorithms, including SIFT and convolutional neural networks.

To enable ESPNetv2 to learn scale invariant representations efficiently, we introduced an efficient pyramid pooling (EPP) module, which is sketched in Figure 4.

To be efficient and effective, EPP projects N-dimensional feature maps to a low-dimensional space, say M-dimensional (N >> M), and then learns representations at different scales using depth-wise convolutions.

Let us assume that we have b branches.

We concatenate the output of these b branches to produce an output in bM-dimensional space.

To facilitate the learning of richer inter-scale representations, we first shuffle these bM-dimensional feature maps and then fuse them using a group convolution.

A point-wise convolution is then applied to learn linear combinations between the feature maps obtained after group convolution.

Note that the fundamental operation in both the EESP and the EPP is the same, i.e., re-sampling.

In the EESP, re-sampling is achieved using the dilated convolutions, while in the EPP, re-sampling is achieved using up- and down-sampling operations.

Figure 4: The EPP module enables the network to learn scale-invariant representations efficiently.

Point-wise, depth-wise, and group-wise convolutions are represented in blue, green, and purple, respectively.
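The EPP data flow described above (project to M channels, filter depth-wise at b scales via re-sampling, concatenate to bM channels, shuffle, group convolution, point-wise convolution) can be sketched in PyTorch as follows. This is a minimal illustration of the flow only; the class name, scale factors, and channel sizes are assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EPPSketch(nn.Module):
    """Illustrative sketch of the EPP module's data flow."""
    def __init__(self, n_in=64, m=16, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        b = len(scales)                                    # number of branches
        self.project = nn.Conv2d(n_in, m, 1, bias=False)   # N -> M channels
        self.dw = nn.ModuleList(
            nn.Conv2d(m, m, 3, padding=1, groups=m, bias=False)  # depth-wise
            for _ in scales)
        self.groups = b
        self.fuse = nn.Conv2d(b * m, b * m, 3, padding=1, groups=b, bias=False)
        self.expand = nn.Conv2d(b * m, n_in, 1, bias=False)  # point-wise

    def forward(self, x):
        p = self.project(x)
        size = p.shape[2:]
        outs = []
        for s, conv in zip(self.scales, self.dw):
            # re-sample, filter depth-wise at that scale, re-sample back
            y = p if s == 1.0 else F.interpolate(
                p, scale_factor=s, mode='bilinear', align_corners=False)
            y = conv(y)
            if s != 1.0:
                y = F.interpolate(y, size=size, mode='bilinear',
                                  align_corners=False)
            outs.append(y)
        z = torch.cat(outs, dim=1)        # bM-dimensional feature maps
        # channel shuffle so each group in the fuse conv sees every scale
        bsz, c, h, w = z.shape
        z = z.view(bsz, self.groups, c // self.groups, h, w)
        z = z.transpose(1, 2).reshape(bsz, c, h, w)
        z = self.fuse(z)                  # group convolution fuses scales
        return self.expand(z)             # point-wise linear combinations

x = torch.randn(1, 64, 32, 32)
out = EPPSketch()(x)
print(out.shape)  # torch.Size([1, 64, 32, 32])
```

The channel shuffle is the step that lets each group of the group convolution see feature maps from every scale; without it, each group would only fuse maps from a single branch.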

Comparison between ESPNet and ESPNetv2

Table 1 gives the quantitative performance comparison on the private test sets (evaluated using an online server) of two widely used datasets, Cityscapes and PASCAL VOC 2012.

We can clearly see that ESPNetv2 is both more efficient and more accurate than ESPNet.

Note that ESPNetv2 achieves a remarkable mean intersection over union (mIOU) score of 68 with an image size of 384×384, giving competitive performance to many deep and heavy-weight segmentation architectures (see the PASCAL VOC 2012 leaderboard for more details).

The widely used image size on the PASCAL dataset is 512×512 (or 500×500).

Table 1: The performance of ESPNet and ESPNetv2 compared in terms of FLOPs and accuracy (mean intersection over union) on the private test sets of two widely used datasets.
