Review: MNC — Multi-task Network Cascade, Winner in 2015 COCO Segmentation (Instance Segmentation)

Three Stages: Differentiating Instances, Estimating Masks, and Categorizing Objects.

SH Tsang, Jan 5

This time, MNC (Multi-task Network Cascade), by Microsoft Research, is briefly reviewed.

The model consists of three networks, respectively differentiating instances, estimating masks, and categorizing objects.

These networks form a cascaded structure, and are designed to share their convolutional features.

MNC won 1st place in the 2015 COCO segmentation challenge.

It was published in 2016 CVPR and has more than 300 citations.

(SH Tsang @ Medium)

What Are Covered

1. Multi-task Network Cascades (MNC) Architecture (3 Stages)
2. Cascades with More Stages (5 Stages)
3. Results

1. Multi-task Network Cascades (MNC) Architecture

There are three stages: proposing box-level instances, regressing mask-level instances, and categorizing each instance, as outlined above.

Before going into each stage, convolutional feature maps are obtained by VGG16.

These convolutional feature maps are shared for all stages.



Regressing Box-level Instances

In the first stage, the network structure and loss function follow the Region Proposal Network (RPN) of Faster R-CNN, which is built from convolutional layers.

On top of the shared features, a 3×3 convolutional layer is used for reducing dimensions, followed by two sibling 1×1 convolutional layers for regressing box locations and classifying object/non-object.

This loss function serves as the loss term L1 of stage 1:

L1 = L1(B)

where B is the network output of this stage, a list of boxes B = {Bi}.

Bi is a box indexed by i.

The box Bi is centered at (xi, yi) with width wi and height hi, and pi is the objectness probability.
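The box representation described above can be sketched as a small data structure. This is only an illustration: the class name `BoxProposal` and the `corners` helper are assumptions, not names from the paper.

```python
from dataclasses import dataclass


@dataclass
class BoxProposal:
    """One stage-1 output Bi = (x, y, w, h, p); the class name is illustrative."""
    x: float  # box center, horizontal
    y: float  # box center, vertical
    w: float  # box width
    h: float  # box height
    p: float  # objectness probability

    def corners(self):
        """Return (x_min, y_min, x_max, y_max) for the center/size form."""
        return (self.x - self.w / 2, self.y - self.h / 2,
                self.x + self.w / 2, self.y + self.h / 2)


box = BoxProposal(x=50.0, y=40.0, w=20.0, h=10.0, p=0.9)
print(box.corners())  # (40.0, 35.0, 60.0, 45.0)
```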



Regressing Mask-level Instances

The second stage takes the shared convolutional features and stage-1 boxes as input.

It outputs a pixel-level segmentation mask for each box proposal.

In this stage, a mask-level instance is still class-agnostic.

Given a box predicted by stage 1, RoI pooling extracts a 14×14 feature from the box.

Two extra fully-connected (fc) layers are applied to this feature for each box.

The first fc layer (with ReLU) reduces the dimension to 256, followed by the second fc layer that regresses an m×m (m=28) pixel-wise mask.

This mask is trained by binary logistic regression against the ground-truth mask, which serves as the loss term L2 of stage 2:

L2 = L2(M | B)

where M is the network output of this stage, a list of masks, and B is the list of stage-1 boxes that M depends on.
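A minimal sketch of a per-pixel binary logistic regression loss for one mask follows. It assumes the predicted mask already holds sigmoid probabilities and that the loss is averaged over pixels; the review does not say whether the paper sums or averages. The function name `mask_loss` is illustrative.

```python
import math


def mask_loss(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy between a predicted mask
    (probabilities in [0, 1]) and a binary ground-truth mask.
    Averaging over pixels is an assumption of this sketch."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy 2×2 example; a real stage-2 mask would be 28×28.
pred = [[0.9, 0.1], [0.8, 0.2]]
target = [[1, 0], [1, 0]]
print(mask_loss(pred, target))
```

The loss shrinks as the predicted probabilities move toward the ground-truth labels.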

Compared with DeepMask, MNC only regresses masks from a few proposed boxes and so reduces computational cost.
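To give a feel for the size of the stage-2 head, the layer dimensions from the text can be tallied as below. The 512-channel depth is an assumption (the depth of the last VGG16 conv feature map); the review itself only gives the 14×14 pooling size, the 256-d hidden layer, and m=28.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


roi_size = 14   # RoI pooling output is 14×14 (from the text)
channels = 512  # assumption: depth of the VGG16 conv feature map
hidden = 256    # width of the first fc layer (from the text)
m = 28          # mask resolution m×m (from the text)

fc1 = fc_params(roi_size * roi_size * channels, hidden)
fc2 = fc_params(hidden, m * m)
print(fc1, fc2)  # 25690368 201488
```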



Categorizing Instances

The third stage takes the shared convolutional features, stage-1 boxes, and stage-2 masks as input.

It outputs category scores for each instance.

Given a box predicted by stage 1, we also extract a feature by RoI pooling.

This feature map is then “masked” by the stage-2 mask prediction.

This leads to a feature focused on the foreground of the prediction mask.

The masked feature is given by the element-wise product:

FMask = FRoI · M

where FRoI is the feature after RoI pooling and M is the mask prediction obtained from stage 2.
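The element-wise masking can be sketched in plain Python as below. It assumes the mask has already been brought to the same spatial size as the RoI feature and is applied identically to every channel; the function name `mask_feature` is illustrative.

```python
def mask_feature(f_roi, mask):
    """Element-wise product of a C×H×W RoI feature with an H×W mask,
    applying the same mask to every channel."""
    return [[[v * m for v, m in zip(f_row, m_row)]
             for f_row, m_row in zip(channel, mask)]
            for channel in f_roi]


f_roi = [[[1.0, 2.0], [3.0, 4.0]],
         [[5.0, 6.0], [7.0, 8.0]]]  # 2 channels, each 2×2
mask = [[1.0, 0.0], [0.5, 1.0]]
print(mask_feature(f_roi, mask))
# [[[1.0, 0.0], [1.5, 4.0]], [[5.0, 0.0], [3.5, 8.0]]]
```

Positions where the mask is near zero are suppressed, so the following fc layers see mostly foreground.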

Two 4096-d fc layers are applied on the masked feature FMask.

This is called the mask-based pathway.

The RoI-pooled features are also fed directly into two 4096-d fc layers, forming the box-based pathway.

The mask-based and box-based pathways are concatenated.

On top of the concatenation, a softmax classifier of N+1 ways is used for predicting N categories plus one background category.
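A toy sketch of the final classifier: the two pathway features are concatenated and passed through a linear layer with N+1 outputs followed by softmax. The sizes, weights, and the function name `classify` are made up for illustration; the real pathways produce 4096-d features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(mask_feat, box_feat, weights, biases):
    """Concatenate the two pathway features, then apply an (N+1)-way
    linear layer and softmax. `weights` has one row per class."""
    joint = mask_feat + box_feat  # list concatenation
    scores = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Toy sizes: 2-d feature per pathway, N = 2 categories plus background.
probs = classify([1.0, 0.0], [0.0, 1.0],
                 weights=[[0.0, 0.0, 0.0, 0.0],
                          [1.0, 0.0, 0.0, 1.0],
                          [0.0, 1.0, 1.0, 0.0]],
                 biases=[0.0, 0.0, 0.0])
print(probs)
```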

The box-level pathway may address the cases when the feature is mostly masked out by the mask-level pathway (e.g., on background).

The loss term L3 is:

L3 = L3(C | B, M)

where C is the network output of this stage, which is a list of category predictions for all instances.

The loss of the whole network becomes:

L = L1(B) + L2(M | B) + L3(C | B, M)

2. Cascades with More Stages (5 Stages)

5-stage MNC

First, run the entire 3-stage network and obtain the regressed boxes on stage 3.

These boxes are then considered as new proposals.

Stages 2 and 3 are performed for the second time on these proposals.

This is in fact 5-stage inference.
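The order of operations in 5-stage inference can be sketched as follows. The three callables are hypothetical stand-ins for the real networks, and their signatures are assumptions chosen to mirror the inputs and outputs described in the text.

```python
def five_stage_inference(features, stage1, stage2, stage3):
    """Run stages 1-3, then reuse the stage-3 boxes as proposals for a
    second pass through stages 2 and 3.

    stage1(features) -> boxes
    stage2(features, boxes) -> masks
    stage3(features, boxes, masks) -> (categories, regressed_boxes)
    All three callables are hypothetical stand-ins.
    """
    boxes = stage1(features)                                    # stage 1
    masks = stage2(features, boxes)                             # stage 2
    _, refined = stage3(features, boxes, masks)                 # stage 3
    masks = stage2(features, refined)                           # stage 4
    categories, final_boxes = stage3(features, refined, masks)  # stage 5
    return final_boxes, masks, categories
```

The shared convolutional features are computed once and passed to every stage, so the extra two stages only repeat the lighter per-box heads.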




3. Results

PASCAL VOC 2012

Ablation experiments on PASCAL VOC 2012 validation.

With VGG16 used for extracting the features but without sharing features among stages: 60.2% mAP.

Sharing the features: 60.5% mAP.

End-to-end training with 3 stages: 62.6% mAP.

5 stages: 63.5% mAP.

Comparison on PASCAL VOC 2012 validation for Instance Segmentation.

MNC obtains the highest mAP at IoU thresholds of both 0.5 and 0.7.
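For instance segmentation, the IoU behind these thresholds is computed between masks rather than boxes. A minimal sketch, with a made-up pair of masks and an illustrative function name `mask_iou`:

```python
def mask_iou(a, b):
    """IoU of two binary masks given as equally sized lists of 0/1 rows."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    return inter / union if union else 0.0


pred = [[1, 1, 0],
        [1, 1, 0]]
gt = [[1, 1, 1],
      [1, 1, 1]]
iou = mask_iou(pred, gt)       # 4 overlapping pixels / 6 in the union
print(iou >= 0.5, iou >= 0.7)  # True False
```

The same prediction counts as a hit at the 0.5 threshold but a miss at the stricter one, which is why scores drop as the threshold rises.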


The inference time is the shortest among the state-of-the-art approaches too.

Detailed Inference Time Per Image on Nvidia K40 GPU.

The most time-consuming part is the VGG16 feature extraction (conv).

Evaluation of (box-level) object detection

As boxes can be predicted by MNC as well, box-level object detection is evaluated.

With MNC trained on the union of VOC 2007 trainval+test and VOC 2012 trainval, the highest mAP of 75.9% is obtained, which is substantially better than Fast R-CNN and Faster R-CNN.



MS COCO

Segmentation result (%) on the MS COCO test-dev set.

Using VGG16 as backbone for feature extraction, 19.5% mAP@[.5:.95] and 39.7% mAP@0.5 are obtained.

Using ResNet-101 as backbone for feature extraction, even higher mAPs are obtained, i.e., 24.6% mAP@[.5:.95] and 44.3% mAP@0.5.


With global context modeling, multi-scale testing, and ensembling, the final results of 28.2% mAP@[.5:.95] and 51.5% mAP@0.5 are obtained, which won 1st place in the COCO segmentation track.
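The mAP@[.5:.95] metric is the COCO convention of averaging AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. A small sketch, where `ap_at_iou` is a hypothetical callable and the AP curve is made up rather than taken from any real result:

```python
def coco_map(ap_at_iou):
    """Average AP over the ten IoU thresholds 0.50, 0.55, ..., 0.95.
    `ap_at_iou` is a hypothetical callable returning AP at one threshold."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(ap_at_iou(t) for t in thresholds) / len(thresholds)


def toy_ap(threshold):
    """Made-up AP curve that falls linearly with the threshold (not real data)."""
    return 1.0 - threshold


print(round(coco_map(toy_ap), 3))  # 0.275
```

Because the stricter thresholds pull the average down, mAP@[.5:.95] is always at or below mAP@0.5 for a curve that decreases with the threshold, as in the numbers reported above.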



Qualitative Results

PASCAL VOC 2012 Validation Set

The paper also gives details of the differentiable RoI warping layer and of the network settings, which I haven’t covered here.

If interested, please visit the paper.

References

[2016 CVPR] [MNC] Instance-aware Semantic Segmentation via Multi-task Network Cascades

My Related Reviews

Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet]

Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [ION] [MultiPath] [YOLOv1] [SSD] [YOLOv2 / YOLO9000] [DSSD]

Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [ParseNet] [DilatedNet] [PSPNet]

Instance Segmentation
[DeepMask] [SharpMask] [MultiPath]
