Review: YOLOv3 — You Only Look Once (Object Detection)
Improved YOLOv2, Comparable Performance with RetinaNet, 3.8× Faster!
SH Tsang, Feb 7
In this story, YOLOv3 (You Only Look Once v3), by University of Washington, is reviewed.
YOLO is a very famous object detector.
I think everybody must know it.
Below is the demo by the authors (YOLOv3). As the author was busy on Twitter and GANs, and also helped out with other people's research, YOLOv3 has only a few incremental improvements over YOLOv2. For example, a better feature extractor, Darknet-53 with shortcut connections, as well as a better object detector with feature map upsampling and concatenation.
And it is published as a 2018 arXiv technical report with more than 200 citations.
(SH Tsang @ Medium)

Outline
1. Bounding Box Prediction
2. Class Prediction
3. Predictions Across Scales
4. Feature Extractor: Darknet-53
5. Results
1. Bounding Box Prediction

Bounding Box Prediction: Predicted Box (Blue), Prior Box (Black Dotted)

It is the same as in YOLOv2.
tx, ty, tw, th are predicted.
During training, sum of squared error loss is used.
And the objectness score is predicted using logistic regression.
It is 1 if the bounding box prior overlaps a ground truth object by more than any other bounding box prior.
Only one bounding box prior is assigned for each ground truth object.
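For reference, the decoding follows the YOLOv2 formulation, where (cx, cy) is the offset of the grid cell and (pw, ph) is the size of the bounding box prior. A minimal NumPy sketch (the function and variable names are mine, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw predictions into a box, as in the YOLOv2/YOLOv3 papers:
    bx = sigmoid(tx) + cx, by = sigmoid(ty) + cy,
    bw = pw * exp(tw),     bh = ph * exp(th)."""
    bx = sigmoid(tx) + cx   # center x, offset into the grid cell
    by = sigmoid(ty) + cy   # center y
    bw = pw * np.exp(tw)    # width, scaled from the prior
    bh = ph * np.exp(th)    # height
    return bx, by, bw, bh
```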
2. Class Prediction

Softmax is not used.
Instead, independent logistic classifiers are used and binary cross-entropy loss is used.
This is because there may be overlapping labels (multilabel classification), for example when YOLOv3 is moved to a more complex domain such as the Open Images Dataset.
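As a rough sketch of this idea (my own code, not the authors'), each class gets its own sigmoid and binary cross-entropy term, so one box can carry overlapping labels such as "woman" and "person":

```python
import numpy as np

def class_loss(logits, labels, eps=1e-7):
    """Independent logistic classifiers with binary cross-entropy.
    logits, labels: arrays of shape (num_classes,); labels are multi-hot 0/1,
    so several classes can be 1 at once (multilabel)."""
    p = 1.0 / (1.0 + np.exp(-logits))   # per-class sigmoid, not a softmax
    return -np.sum(labels * np.log(p + eps) + (1.0 - labels) * np.log(1.0 - p + eps))
```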
3. Predictions Across Scales

3 different scales are used.
Features are extracted from these scales like FPN.
Several convolutional layers are added to the base feature extractor Darknet-53 (which is mentioned in the next section).
The last of these layers predicts the bounding box, objectness and class predictions.
On the COCO dataset, 3 boxes are predicted at each scale. Therefore, the output tensor is N×N×[3×(4+1+80)], i.e., 4 bounding box offsets, 1 objectness prediction, and 80 class predictions.
Next, the feature map from 2 layers back is taken and upsampled by 2×. A feature map is also taken from earlier in the network and merged with the upsampled features using concatenation. This is actually a typical encoder-decoder design, similar to how SSD evolved into DSSD.
This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map.
Then, a few more convolutional layers are added to process this combined feature map, and eventually predict a similar tensor, although now twice the size.
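A minimal sketch of this merge step in PyTorch (the tensor shapes and channel counts are illustrative assumptions, not the paper's exact layer layout):

```python
import torch
import torch.nn.functional as F

# Coarse feature map from 2 layers back in the detection head (13×13 grid),
# and a finer-grained map taken from earlier in the network (26×26 grid).
coarse = torch.randn(1, 256, 13, 13)      # assumed channel count
earlier = torch.randn(1, 512, 26, 26)     # assumed channel count

up = F.interpolate(coarse, scale_factor=2, mode="nearest")  # 2× upsampling
merged = torch.cat([up, earlier], dim=1)  # concatenation -> (1, 768, 26, 26)
```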
k-means clustering is used here as well to find better bounding box priors. Finally, on the COCO dataset, 9 priors are used: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), and (373×326).
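To make the tensor shapes concrete, here is a small sketch assuming a 416×416 input (a common YOLOv3 setting; strides of 32, 16 and 8 give 13×13, 26×26 and 52×52 grids). The grouping of the 9 priors over the 3 scales follows the usual YOLOv3 configuration and is an assumption, since the paper lists the priors without stating the grouping:

```python
num_anchors_per_scale = 3
num_classes = 80                                          # COCO
channels = num_anchors_per_scale * (4 + 1 + num_classes)  # 3×(4+1+80) = 255

# 9 k-means priors grouped coarsest-to-finest (assumed grouping):
anchors = [
    [(116, 90), (156, 198), (373, 326)],  # 13×13 grid, large objects
    [(30, 61),  (62, 45),   (59, 119)],   # 26×26 grid, medium objects
    [(10, 13),  (16, 30),   (33, 23)],    # 52×52 grid, small objects
]

for n in (13, 26, 52):
    print(f"output tensor: {n}×{n}×{channels}")           # N×N×[3×(4+1+80)]
```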
4. Feature Extractor: Darknet-53

Darknet-53

The Darknet-19 classification network is used in YOLOv2 for feature extraction. Now, in YOLOv3, a much deeper network, Darknet-53, is used, i.e., it has 53 convolutional layers.
Both YOLOv2 and YOLOv3 also use Batch Normalization.
Shortcut connections are also used as shown above.
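The repeated unit is a 1×1 convolution followed by a 3×3 convolution, with a shortcut around them. A rough PyTorch sketch of one such residual unit (my own module, assuming the usual Darknet convention of BatchNorm plus LeakyReLU after each convolution):

```python
import torch.nn as nn

class DarknetResidual(nn.Module):
    """One Darknet-53 residual unit: a 1×1 conv halves the channels, a 3×3
    conv restores them, and a shortcut adds the input back."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),  # 0.1 slope is the Darknet convention (assumed)
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # shortcut connection
```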
1000-Class ImageNet Comparison (Bn Ops: Billions of Operations; BFLOP/s: Billion Floating Point Operations Per Second; FPS: Frames Per Second)

1000-class ImageNet Top-1 and Top-5 error rates are measured as above. Single-crop 256×256 image testing is used, on a Titan X GPU.
Compared with ResNet-101, Darknet-53 has better performance (as the authors mention in the paper) and it is 1.5× faster. Compared with ResNet-152, Darknet-53 has similar performance (as the authors mention in the paper) and it is 2× faster.
5. Results

mAP@0.5 (COCO)

As shown above, compared with RetinaNet, YOLOv3 got comparable mAP@0.5 with a much faster inference time. For example, YOLOv3-608 got 57.9% mAP in 51 ms while RetinaNet-101-800 only got 57.5% mAP in 198 ms, which makes YOLOv3 3.8× faster.
Overall mAP (COCO)

For overall mAP, YOLOv3's performance drops significantly. Nevertheless, YOLOv3-608 got 33.0% mAP in 51 ms inference time while RetinaNet-50-500 only got 32.5% mAP in 73 ms inference time. And YOLOv3 is on par with the SSD variants while being 3× faster.
More Details

YOLOv3 is much better than the SSD variants.
And it is found that YOLOv3 has relatively good performance on AP_S but relatively bad performance on AP_M and AP_L.
Qualitative Results: Predicted Boxes Nearly Identical to the Ground-Truth Boxes

Actually, there are not many details on YOLOv3 in the technical report; thus, I can only review it briefly. It is recommended to go back and forth between the YOLOv2 and YOLOv3 papers when reading about YOLOv3.
(And there are passages in the paper discussing how overall mAP is measured: is it really reflecting the actual detection accuracy? If interested, please read the paper.)

Reference
[2018 arXiv] [YOLOv3] YOLOv3: An Incremental Improvement

My Previous Reviews

Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet]

Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [ION] [MultiPathNet] [NoC] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [FPN] [RetinaNet] [DCN]

Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [ParseNet] [DilatedNet] [PSPNet] [DeepLabv3]

Biomedical Image Segmentation
[CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet]

Instance Segmentation
[DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution
[SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN]