Review: RefineNet — Multi-path Refinement Network (Semantic Segmentation)

Outperforms FCN, DeconvNet, SegNet, CRF-RNN, DilatedNet, DeepLab-v1, DeepLab-v2 on Seven Datasets

Sik-Ho Tsang · Apr 6

In this story, RefineNet, by the University of Adelaide and the Australian Centre for Robotic Vision, is reviewed.
A generic multi-path refinement network that explicitly exploits all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections.
The deeper layers that capture high-level semantic features can be directly refined using fine-grained features from earlier convolutions.
A chained residual pooling is also introduced which captures rich background context in an efficient manner.
This is a 2017 CVPR paper with more than 400 citations.
(Sik-Ho Tsang @ Medium)

Outline
1. Problems of ResNet and Dilated Convolution
2. RefineNet
3. Ablation Study
4. Comparison with State-of-the-Art Approaches

1. Problems of ResNet and Dilated Convolution

(a) ResNet (b) Dilated (Atrous) Convolution

(a) ResNet: It suffers from downscaling of the feature maps, which is bad for semantic segmentation.
(b) Dilated (Atrous) Convolution: It is introduced in DeepLab and DilatedNet.
Though it can help to keep the resolution of output feature maps larger, atrous filters are computationally expensive to train and quickly reach memory limits even on modern GPUs.
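To make this trade-off concrete, here is a minimal pure-Python sketch of a 1-D dilated convolution (a toy stand-in for the 2-D atrous filters in DeepLab and DilatedNet; the helper name and kernel are illustrative assumptions): with suitable zero padding the output keeps the input resolution, while spacing the taps `dilation` apart widens the receptive field. The cost is that every layer must store full-resolution feature maps, which is where the memory pressure comes from.

```python
# Toy 1-D dilated convolution (illustrative, not the paper's implementation).
def dilated_conv1d(x, kernel, dilation=1):
    k = len(kernel)
    pad = dilation * (k - 1) // 2          # "same" padding for odd kernels
    xp = [0.0] * pad + list(x) + [0.0] * pad
    out = []
    for i in range(len(x)):
        # taps are spaced `dilation` apart instead of being adjacent
        out.append(sum(kernel[j] * xp[i + j * dilation] for j in range(k)))
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
same_res = dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=2)
# Output length equals input length: resolution is preserved, but each
# output value now covers a 5-sample receptive field instead of 3.
```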
2. RefineNet

(a) At the top left of the figure is the ResNet backbone.
Along the ResNet, different resolutions of feature maps go through Residual Conv Unit (RCU).
Pre-Activation ResNet is used.
(b) RCU: A residual block is used, but with batch normalization removed.
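The RCU can be sketched as follows (a toy 1-D stand-in for the paper's 2-D convolutions; `conv1d` and the identity kernels are illustrative assumptions, not the actual learned weights): ReLU → conv → ReLU → conv, then an identity shortcut added back, with no batch normalization anywhere.

```python
# Toy Residual Conv Unit (RCU): ReLU -> conv -> ReLU -> conv + identity.
def relu(x):
    return [max(0.0, v) for v in x]

def conv1d(x, kernel):                      # "same" 1-D convolution
    k, pad = len(kernel), (len(kernel) - 1) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * xp[i + j] for j in range(k)) for i in range(len(x))]

def rcu(x, w1, w2):
    out = conv1d(relu(x), w1)
    out = conv1d(relu(out), w2)
    return [a + b for a, b in zip(x, out)]  # residual (identity) connection

feat = [1.0, -2.0, 3.0, -4.0]
refined = rcu(feat, [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])  # identity kernels
```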
(c) Fusion: Multi-resolution fusion is then used to merge the feature maps using element-wise summation.
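A minimal sketch of this fusion step, assuming nearest-neighbour upsampling on toy 1-D maps (in the paper each path also passes through a learned convolution before upsampling, which is omitted here):

```python
# Toy multi-resolution fusion: upsample the coarser map to the finer
# resolution, then merge by element-wise summation.
def upsample_nearest(x, factor):
    return [v for v in x for _ in range(factor)]

def fuse(high_res, low_res):
    up = upsample_nearest(low_res, len(high_res) // len(low_res))
    return [a + b for a, b in zip(high_res, up)]

fine = [1.0, 2.0, 3.0, 4.0]        # e.g. a 1/8-resolution path
coarse = [10.0, 20.0]              # e.g. a 1/16-resolution path
fused = fuse(fine, coarse)         # element-wise sum at the finer resolution
```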
(d) Chained Residual Pooling: The output feature maps of all pooling blocks are fused together with the input feature map through summation of residual connections.
It aims to capture background context from a large image region.
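Chained residual pooling can be sketched as a chain of stride-1 max-pooling blocks, where each block takes the previous block's output as input and every block's output is summed back onto the running result (the convolution that follows each pooling layer in the paper is omitted in this toy 1-D version). Large windows with stride 1 gather context without losing resolution.

```python
# Toy chained residual pooling: stride-1 max-pooling blocks in a chain,
# each output fused with the accumulated result by summation.
def maxpool_same(x, window=3):
    pad = (window - 1) // 2
    xp = [float("-inf")] * pad + list(x) + [float("-inf")] * pad
    return [max(xp[i:i + window]) for i in range(len(x))]

def chained_residual_pooling(x, n_blocks=2):
    out, chain = list(x), list(x)
    for _ in range(n_blocks):
        chain = maxpool_same(chain)         # next block pools the previous one
        out = [a + b for a, b in zip(out, chain)]
    return out

ctx = chained_residual_pooling([1.0, 5.0, 2.0, 0.0])
# Resolution is unchanged; strong activations spread context to neighbours.
```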
(a) Output Conv: At the right of the figure, another RCU is placed to apply non-linear operations to the multi-path fused feature maps and generate features for further processing or for the final prediction.
3. Ablation Study

Backbones, Chained Residual Pooling, and Multi-Scale Evaluation

With the deeper ResNet-152 backbone, chained residual pooling, and test-time multi-scale evaluation, higher IoU is obtained consistently on two datasets.
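Test-time multi-scale evaluation can be sketched as follows on a toy 1-D input; the scales and the identity "network" are illustrative assumptions (the idea is simply to run the model on several resized copies of the input, resize the predictions back, and average them):

```python
# Toy test-time multi-scale evaluation (hypothetical scales and network).
def predict(image):
    return list(image)                      # stand-in network: identity scores

def resize(x, length):                      # nearest-neighbour resize (1-D toy)
    return [x[min(int(i * len(x) / length), len(x) - 1)] for i in range(length)]

def multi_scale_predict(image, scales=(0.5, 1.0, 2.0)):
    n = len(image)
    preds = []
    for s in scales:
        scaled = resize(image, max(1, int(n * s)))
        preds.append(resize(predict(scaled), n))   # back to full resolution
    return [sum(v) / len(scales) for v in zip(*preds)]

avg = multi_scale_predict([0.0, 1.0, 2.0, 3.0])
```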
Different RefineNet Variants

(a) Single RefineNet: It takes all four inputs from the four blocks of ResNet and fuses all-resolution feature maps in a single process.
(b) 2-Cascaded RefineNet: It employs only two RefineNet modules instead of four.
The bottom one, RefineNet-2, has two inputs from ResNet blocks 3 and 4, and the other one has three inputs, two coming from the remaining ResNet blocks and one from RefineNet-2.
(c) 4-Cascaded 2-Scale RefineNet: Two scales of the image are used as input, with two ResNets to generate the feature maps. The input image is scaled by factors of 1.2 and 0.6 and fed into two independent ResNets.
Different RefineNet Variants

The 4-Cascaded 2-Scale RefineNet has the best results due to the larger capacity of the network, but it also results in longer training times.
Thus, 4-Cascaded RefineNet is used for comparison with state-of-the-art approaches.
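The 4-cascaded arrangement can be sketched on toy 1-D feature maps, where a simple "upsample and sum" stands in for the RCU, fusion, and pooling inside each RefineNet block (the helper names and toy resolutions are illustrative assumptions): the coarsest block refines ResNet block 4 alone, and each subsequent block fuses the previous refined map with the next finer ResNet block output.

```python
# Toy 4-cascaded RefineNet wiring (upsample + sum stands in for each block).
def upsample_to(x, length):
    f = length // len(x)
    return [v for v in x for _ in range(f)]

def refine(paths):
    n = max(len(p) for p in paths)
    out = [0.0] * n
    for p in paths:
        out = [a + b for a, b in zip(out, upsample_to(p, n))]
    return out

# ResNet block outputs at 1/32, 1/16, 1/8, 1/4 of input resolution (toy 1-D).
r4, r3, r2, r1 = [1.0], [1.0, 1.0], [1.0] * 4, [1.0] * 8

pred = refine([r4])            # RefineNet-4: coarsest path alone
pred = refine([pred, r3])      # RefineNet-3 fuses with ResNet block 3
pred = refine([pred, r2])      # RefineNet-2
pred = refine([pred, r1])      # RefineNet-1 -> 1/4-resolution score map
```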
4. Comparison with State-of-the-Art Approaches

Person-Part

The Person-Part dataset provides pixel-level labels for six person parts: Head, Torso, Upper/Lower Arms, and Upper/Lower Legs.
The rest are background.
There are 1717 training images and 1818 test images.
RefineNet outperforms DeepLabv1 & DeepLabv2 by a large margin.
NYUD-v2

It consists of 1449 RGB-D images showing interior scenes, with 40 classes.
The standard training/test split with 795 and 654 images is used.
Without using depth information for training, RefineNet outperforms FCN-32s.
PASCAL VOC 2012

It includes 20 object categories and one background class. It is split into a training set, a validation set, and a test set, with 1464, 1449, and 1456 images, respectively.
The Conditional Random Field (CRF) post-processing used in DeepLabv1 & DeepLabv2 was also tried for further refinement, but it gave only a marginal improvement of 0.1% on the validation set. Thus, CRF is not used for RefineNet.
It significantly outperforms FCN-8s, DeconvNet, CRF-RNN, and DeepLabv1 & DeepLabv2.
Cityscapes

Cityscapes Test Set

It is a dataset of street-scene images from 50 different European cities.
This dataset provides fine-grained pixel-level annotations of roads, cars, pedestrians, bicycles, sky, etc.
The provided training set has 2975 images and the validation set has 500 images.
19 classes are considered for training and evaluation.
Again, RefineNet outperforms FCN-8s, DeconvNet, and DeepLabv1 & DeepLabv2, as well as DilatedNet.
PASCAL-Context

It provides segmentation labels of the whole scene for the PASCAL VOC images, with 60 classes (one of them background).
The training set contains 4998 images and the test set has 5105 images.
Again, RefineNet outperforms FCN-8s and DeepLabv2.
SUN-RGBD

It contains around 10,000 RGB-D indoor images and provides pixel-labeling masks for 37 classes.
Without using depth information for training, RefineNet again is the best among all approaches.
ADE20K MIT

ADE20K dataset (150 classes) val set

It is a scene parsing dataset which provides dense labels of 150 classes on more than 20K scene images. The categories include a large variety of objects (e.g., person, car) and stuff (e.g., sky, road).
The provided validation set consisting of 2000 images is used for quantitative evaluation.
Again, RefineNet is better than FCN-8s, SegNet and DilatedNet, and even a cascaded version of SegNet and DilatedNet.
Reference

[2017 CVPR] [RefineNet] RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation