Review: MultiPath / MPN, 1st Runner Up in 2015 COCO Detection & Segmentation (Object Detection / Instance Segmentation)

Multiple network layers, foveal structure and integral loss, information flows along multiple paths in the network

SH Tsang, Jan 3

In this story, MultiPath, by Facebook AI Research, is reviewed.
It is also called MPN in SharpMask.
Three modifications are made to improve Fast R-CNN:
- A foveal structure to exploit object context at multiple object resolutions.
- Skip connections that give the detector access to features at multiple network layers.
- An integral loss function and corresponding network adjustment that improve localization.
Coupled with SharpMask object proposals, the combined system improves results over the baseline Fast R-CNN detector with Selective Search by 66% overall and by 4× on small objects.
It placed second in both the COCO 2015 detection and segmentation challenges.
It was published at BMVC 2016 and has more than 100 citations.
(SH Tsang @ Medium)

What Are Covered
- Foveal Structure
- Skip Connections
- Integral Loss Function
- Ablation Study
- Results

Figure: MultiPath Architecture

Foveal Structure

Besides the original 1× ROI pooling region, additional 1.5×, 2× and 4× ROI pooling regions, as shown above, are also exploited.
This provides differently sized foveal regions.
These four ROI pooling regions go through fully connected (FC) layers (FC5 and FC6) and are then concatenated into a single long feature vector (4096×4).
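
As a rough illustration of the foveal regions described above, the sketch below builds the four boxes around one proposal. The function name `foveal_regions`, the (x1, y1, x2, y2) box format, and the choice to scale side lengths (rather than area) about the box center are assumptions made for this example, not details taken from the paper.

```python
def foveal_regions(box, scales=(1.0, 1.5, 2.0, 4.0)):
    """Return one box per scale, each centered on the input proposal.

    box is (x1, y1, x2, y2). Scaling the side lengths about the center
    is an assumption of this sketch.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = x2 - x1, y2 - y1
    regions = []
    for s in scales:
        half_w, half_h = s * w / 2.0, s * h / 2.0
        regions.append((cx - half_w, cy - half_h, cx + half_w, cy + half_h))
    return regions


# Hypothetical 40x20 proposal; each region would be ROI-pooled separately.
regions = foveal_regions((100.0, 100.0, 140.0, 120.0))

# Four 4096-d FC outputs concatenated give the 4096x4 vector from the text.
feature_dim = 4096 * len(regions)
```

In a real detector the larger regions would usually also be clipped to the image bounds, which this sketch leaves out.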

Skip Connections

In the original Fast R-CNN with VGG16 as backbone, only the conv5 layer is used for ROI pooling.
At this layer, features have been downsampled by a factor of 16.
However, 40% of COCO objects have area less than 32×32 pixels and 20% less than 16×16 pixels, so these objects will have been downsampled to 2×2 or 1×1 at this stage, respectively.
ROI pooling will upsample them to 7×7, but most spatial information will have been lost due to the 16× downsampling of the features.
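
The arithmetic behind these numbers is small enough to check directly. This is only a worked example of the figures quoted above; `feature_map_side` is a name made up for the illustration.

```python
STRIDE = 16  # conv5 downsampling factor quoted in the text


def feature_map_side(object_side_px, stride=STRIDE):
    """Side length, in feature-map cells, covered by a square object."""
    return object_side_px / stride


for side in (32, 16):
    cells = feature_map_side(side)
    print(f"{side}x{side} px object -> {cells:g}x{cells:g} cells at conv5")
```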
Thus, skip pooling, as suggested by ION, is used: features from conv3, conv4 and conv5 are all ROI-pooled.
Earlier layers usually produce larger activation values than later layers, as noted in ParseNet.
Thus, each pooled ROI is L2-normalized and re-scaled back up by an empirically determined scale prior to concatenation.
After that, a 1×1 convolution is performed to reduce the dimension to fit the classifier input dimension.
These skip connections give the classifier access to information from features at multiple resolutions.
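
The normalize, re-scale, concatenate and reduce steps above can be sketched at a single spatial location as follows. This is a toy version under stated assumptions: the activation values, the scale of 10 and the 1×1 convolution weights are all made up, and normalizing per location (rather than over the whole pooled 7×7 tensor) is a simplification chosen for brevity.

```python
import math


def l2_normalize_and_scale(features, scale):
    """L2-normalize a list of activations, then multiply by `scale`."""
    norm = math.sqrt(sum(v * v for v in features)) or 1.0  # avoid dividing by 0
    return [scale * v / norm for v in features]


def conv1x1(channels, weights):
    """1x1 convolution at one spatial location: a linear map over channels."""
    return [sum(w * c for w, c in zip(row, channels)) for row in weights]


# Toy activations from three layers (made-up values).
conv3 = [30.0, 40.0]  # earlier layer, larger magnitudes
conv4 = [3.0, 4.0]
conv5 = [0.6, 0.8]
SCALE = 10.0  # hypothetical; the text only says it is set empirically

normalized = [l2_normalize_and_scale(f, SCALE) for f in (conv3, conv4, conv5)]
concatenated = [v for f in normalized for v in f]  # 6 channels after concat

# Hypothetical 1x1 conv weights reducing 6 channels to 2.
weights = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
]
reduced = conv1x1(concatenated, weights)
```

After normalization the three layers contribute on the same scale, even though their raw magnitudes differ by a factor of 50.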

Integral Loss Function

In the PASCAL and ImageNet datasets, the scoring metric considers only an Intersection over Union (IoU) threshold of 50, i.e. AP⁵⁰.
However, the COCO dataset evaluates AP over a range of IoU thresholds from 50 to 95.
In the original Fast R-CNN, the loss function focuses only on optimizing AP⁵⁰. Its first term, Lcls, is the classification log loss, while its second term, Lloc, is the bounding box localization loss.
The label k* ≥ 1 only if the IoU is greater than 50.
Otherwise k* = 0 and the second term of the loss is ignored.
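
For reference, this loss is usually written in the standard Fast R-CNN form below. The symbol names are assumed from that formulation: p is the predicted class distribution, k* the ground-truth class label, t and t* the predicted and target boxes, and λ a weight balancing the two terms.

```latex
L(p, k^*, t, t^*) = L_{\mathrm{cls}}(p, k^*) + \lambda \, [k^* \geq 1] \, L_{\mathrm{loc}}(t, t^*)
```

The bracket [k* ≥ 1] equals 1 for foreground proposals and 0 for background, which is what switches off the localization term when k* = 0.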
To fit the COCO evaluation metric, Lcls is modified into an integral over the IoU threshold u, which is approximated as a sum with du = 5.
Specifically, only 6 IoU thresholds, u = 50, 55, …, 75, are considered.
The modified loss combines the per-threshold terms, where n = 6 is the number of thresholds and u takes the values 50, 55, …, 75.
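
As a hedged sketch of the classification part only, assuming the per-threshold terms are averaged, the modified Lcls can be written as below. Here k*_u is the label at threshold u (k*_u ≥ 1 only if the IoU reaches u) and p_u is the prediction of the classifier head for that threshold. Both symbols are notation assumed for this sketch, and the localization term is left out because the text above does not spell out how it is handled.

```latex
\frac{1}{n} \sum_{u \in \{50, 55, \dots, 75\}} L_{\mathrm{cls}}(p_u, k_u^*), \qquad n = 6
```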
During training, fewer proposals overlap the ground truth as u increases.
Therefore, u is restricted to u ≤ 75; otherwise, the proposals contain too few positive samples for training.
This integral loss function is shown in the right part of the architecture figure above.
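
To make the per-threshold labels concrete, here is a minimal sketch of the classification part of the integral loss. It assumes one classifier head per threshold, averages the log losses over the heads, and treats a proposal as positive when its IoU is at least u. The function names and the sample IoUs are made up for the illustration.

```python
import math

THRESHOLDS = (50, 55, 60, 65, 70, 75)  # the n = 6 IoU thresholds, in percent


def labels_per_threshold(iou_percent, gt_class, thresholds=THRESHOLDS):
    """Label for each u: the ground-truth class if IoU >= u, else 0 (background)."""
    return [gt_class if iou_percent >= u else 0 for u in thresholds]


def integral_cls_loss(head_probs, iou_percent, gt_class, thresholds=THRESHOLDS):
    """Mean log loss over heads; head_probs[i][k] is head i's probability of class k."""
    labels = labels_per_threshold(iou_percent, gt_class, thresholds)
    losses = [-math.log(probs[k]) for probs, k in zip(head_probs, labels)]
    return sum(losses) / len(losses)


# A proposal with IoU 62 is positive for u = 50, 55, 60 and background above.
labels = labels_per_threshold(iou_percent=62, gt_class=3)

# Uniform predictions over 4 classes (background + 3) from each of the 6 heads.
uniform_heads = [[0.25, 0.25, 0.25, 0.25] for _ in THRESHOLDS]
loss = integral_cls_loss(uniform_heads, iou_percent=62, gt_class=3)

# Positives shrink as u grows, which is why u is capped at 75.
sample_ious = [52, 58, 63, 71, 77, 90]  # made-up proposal IoUs, in percent
positives = [sum(iou >= u for iou in sample_ious) for u in THRESHOLDS]
```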

Some Training Details

During training, there are 4 images per batch and 64 object proposals per image.
Training takes about 3 days on 4 NVIDIA Titan X GPUs.
A non-maximum suppression threshold of 30 and 1000 proposals per image are used.
No weight decay is used.
The network requires 150 ms to compute features and 350 ms to evaluate the foveal regions, for a total of 500 ms per COCO image.
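
For readers unfamiliar with non-maximum suppression, a plain greedy version is sketched below. It assumes the threshold of 30 quoted above means an IoU of 0.3, and it is a generic textbook NMS, not the authors' implementation. The boxes and scores are made up.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.3):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap any already-kept box by more than `threshold`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the surviving boxes
```

The second box overlaps the first with an IoU of 81/119, so it is suppressed, while the third box does not overlap anything and survives.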

Ablation Study

Figure: Left: Model improvements of the MultiPath network. Right: 4-region foveal setup versus the 10 regions used in multiregion.

Left: With the foveal structure and skip connections, an AP⁵⁰ of 46.4% is obtained.
With the integral loss, AP⁵⁰ drops to 44.8%, as expected.
This is because the integral loss is specifically designed for the COCO evaluation metric.
Accordingly, the overall AP improves from 27.0 to 27.9 when the integral loss is used.
Right: multiregion uses ten contextual regions around each object with different crops.
In MultiPath, only 4 foveal regions are used.
Without the integral loss, MultiPath obtains an AP⁵⁰ of 45.2%.
With the integral loss, an overall AP of 26.9% is obtained.
MultiPath is consistently better than multiregion.

Figure: Left: MultiPath with different IoU thresholds and with the integral loss. Right: Integral loss with different numbers of thresholds u.

Left: Each standard model performs best at the threshold used for training, while using the integral loss yields good results at all settings.
Right: The integral loss achieves the best AP with 6 heads.

Region Proposal Techniques

Figure: AP⁵⁰ and overall AP versus number and type of proposals.
Figure: AP⁵⁰ and overall AP with different approaches. (SS: Selective Search, DM: DeepMask)

In the original Fast R-CNN, the first step is to use Selective Search (SelSearch) to generate a number of region proposals.
For each proposal, ROI pooling is performed on conv5 and the result goes through FC layers for classification and localization.
Therefore, the proposal technique is essential.
Results saturate at around 400 DeepMask proposals per image.
Using just 50 DeepMask proposals matches the accuracy obtained with 2000 Selective Search proposals.

Additional Techniques

- trainval: COCO validation data is added for training.
- hflip: the image is flipped horizontally and the results are averaged.
- FMP: fractional max pooling. In brief, multiple ROI pooling operations are run with perturbed pooling parameters and the softmax outputs are averaged.
- ensembling: a 6-model ensemble is used.

With the above 4 techniques, both AP⁵⁰ and overall AP are greatly improved.
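
The averaging used by hflip (and, in the same spirit, by FMP) can be sketched as follows. This only shows the bookkeeping: mapping a box into the flipped image and averaging class scores across passes. The helper names and the sample scores are made up, and the real system may combine results differently.

```python
def flip_box(box, image_width):
    """Map an (x1, y1, x2, y2) box to its position in the horizontally flipped image."""
    x1, y1, x2, y2 = box
    return (image_width - x2, y1, image_width - x1, y2)


def average_scores(score_lists):
    """Element-wise mean of several per-class score vectors for one proposal."""
    n = len(score_lists)
    return [sum(col) / n for col in zip(*score_lists)]


# A proposal in a 100-pixel-wide image and its counterpart in the flipped image.
flipped = flip_box((10, 5, 30, 25), image_width=100)

# Made-up softmax outputs for the original and the flipped pass.
averaged = average_scores([[0.2, 0.8], [0.4, 0.6]])
```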

COCO 2015 Detection & Segmentation

Figure: Top: Segmentation Results. Bottom: Detection Results.

MultiPath placed second in both the Detection and Segmentation Challenges.
Overall AP on small objects is improved by 4× and AP⁵⁰ by 82%.
If a ResNet backbone were used, AP could be further improved.

Qualitative Results

While there are missing objects and false positives, many of the detections are quite good.
The results in the paper and on the COCO detection leaderboard are a bit different.
But the results in SharpMask are the same as on the leaderboard.
(I am not sure, but) perhaps SharpMask, an improved DeepMask, was in the end used for region proposals with MultiPath for the submission.

References

[2016 BMVC] [MultiPath / MPN] A MultiPath Network for Object Detection

Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet]

Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [ION] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000]

Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [ParseNet] [DilatedNet] [PSPNet]

Instance Segmentation
[DeepMask] [SharpMask]