Review: STN - Spatial Transformer Network (Image Classification)

With STN, data is spatially transformed within the network, which learns invariance to translation, scale, rotation and more generic warping.

SH Tsang, Jan 28

In this story, Spatial Transformer Network (STN), by Google DeepMind, is briefly reviewed.

STN helps to crop out and scale-normalize the appropriate region, which can simplify the subsequent classification task and lead to better classification performance, as below:

[Figure: (a) Input Image with Random Translation, Scale, Rotation, and Clutter, (b) STN Applied to Input Image, (c) Output of STN, (d) Classification Prediction]

It was published in 2015 NIPS and has more than 1300 citations.

Spatial transformations such as affine transformation and homography registration have been studied for decades. In this paper, however, spatial transformation is handled by a neural network. With learning-based spatial transformation, the transformation applied is conditioned on the input image or feature map. It is also closely related to another paper, “Deformable Convolutional Networks” (2017 ICCV). Thus, I decided to read this first.

(SH Tsang @ Medium)

Outline
1. Quick Review on Spatial Transformation Matrices
2. Spatial Transformer Network (STN)
3. Sampling Kernel
4. Experimental Results
5. Some Other Tasks

1. Quick Review on Spatial Transformation Matrices

There are mainly 3 transformations learnt by STN in the paper. More sophisticated transformations can be applied as well.

1.1. Affine Transformation

[Figure: Affine Transform]

Depending on the values in the matrix, we can transform (X1, Y1) to (X2, Y2) with different effects, as follows:

[Figure: Translation, Scaling, Rotation, and Shearing]

If interested, please Google “Registration”, “Homography Matrix”, or “Affine Transform”.
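To make the four effects concrete, here is a minimal NumPy sketch of my own (not code from the paper). The helper names `translation`, `scaling`, `rotation`, `shearing` and `apply_affine` are hypothetical, and I assume the usual 2×3 matrix acting on homogeneous coordinates (X1, Y1, 1).

```python
import numpy as np

def translation(tx, ty):
    # Shift every point by (tx, ty).
    return np.array([[1.0, 0.0, tx], [0.0, 1.0, ty]])

def scaling(sx, sy):
    # Scale the x and y axes independently.
    return np.array([[sx, 0.0, 0.0], [0.0, sy, 0.0]])

def rotation(angle):
    # Counter-clockwise rotation about the origin, angle in radians.
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0]])

def shearing(kx, ky):
    # Shear x by kx * y and y by ky * x.
    return np.array([[1.0, kx, 0.0], [ky, 1.0, 0.0]])

def apply_affine(A, points):
    """Map an (N, 2) array of (X1, Y1) points to (X2, Y2) with a 2x3 matrix A."""
    points = np.asarray(points, dtype=float)
    homogeneous = np.hstack([points, np.ones((len(points), 1))])  # (N, 3)
    return homogeneous @ A.T  # (N, 2)

# Corners of a unit square, transformed in two different ways.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
moved = apply_affine(translation(2.0, -1.0), corners)
rotated = apply_affine(rotation(np.pi / 2), corners)
```

Only the six numbers in the matrix change between the four effects, which is why a network can produce any of them by regressing six values.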

1.2. Projective Transformation

Projective transformation can also be learnt in STN as below:

[Figure: Projective Transformation]
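As a rough illustration of how a projective mapping differs from an affine one, the sketch below applies a 3×3 homography and then divides by the third homogeneous coordinate. This is my own NumPy example; the function name `apply_homography` and the matrices are made up for illustration.

```python
import numpy as np

def apply_homography(H, points):
    """Map (N, 2) points with a 3x3 projective matrix H."""
    points = np.asarray(points, dtype=float)
    homogeneous = np.hstack([points, np.ones((len(points), 1))])  # (N, 3)
    mapped = homogeneous @ H.T                                     # (N, 3)
    # The perspective divide is what makes the mapping projective.
    return mapped[:, :2] / mapped[:, 2:3]

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

# Last row (0, 0, 1): the homography reduces to an affine translation.
H_affine = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 3.0], [0.0, 0.0, 1.0]])
shifted = apply_homography(H_affine, square)

# A non-trivial last row makes the scale depend on position,
# so the square becomes a trapezoid-like quadrilateral.
H_proj = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.0, 1.0]])
warped = apply_homography(H_proj, square)
```

The last row of the matrix is what adds the extra degrees of freedom over the affine case.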

1.3. Thin Plate Spline (TPS) Transformation

[Figure: Thin Plate Spline (TPS) Transformation]

[Figure: An example]

TPS transformation is more complicated than the previous two transformations. (I have learnt affine and projective mapping before, but I haven’t touched TPS. If there are mistakes, please tell me.)

To be brief, suppose we have a point (x, y) at a location other than the input points (xi, yi). We use the TPS equations to transform the point based on a bias, a weighted sum of x and y, and a function of the distance between (x, y) and each (xi, yi). (Here, the function is a radial basis function, RBF.)

Therefore, if we use TPS, the network needs to learn a0, a1, a2, b0, b1, b2, Fi, and Gi, which is 6+2N parameters in total for N input points.

As we can see, TPS allows a more flexible deformation, with more degrees of freedom, than the previous two transformations.
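The description above can be sketched in NumPy as follows. This is only my reading of it: I assume the common TPS kernel U(r) = r² log r as the RBF (the original equations are not reproduced in this text), and the function names `tps_kernel` and `tps_transform` are hypothetical.

```python
import numpy as np

def tps_kernel(r):
    """Assumed RBF: U(r) = r^2 * log(r), with U(0) defined as 0."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def tps_transform(points, control_points, a, b, F, G):
    """Transform (M, 2) points given N control points (xi, yi).

    a = (a0, a1, a2) and b = (b0, b1, b2) are the bias and linear weights,
    F and G are the N RBF weights for the x and y outputs respectively.
    """
    points = np.asarray(points, dtype=float)
    control_points = np.asarray(control_points, dtype=float)
    # Distance from every point to every control point, shape (M, N).
    dist = np.linalg.norm(points[:, None, :] - control_points[None, :, :], axis=-1)
    basis = tps_kernel(dist)
    x, y = points[:, 0], points[:, 1]
    new_x = a[0] + a[1] * x + a[2] * y + basis @ np.asarray(F, dtype=float)
    new_y = b[0] + b[1] * x + b[2] * y + basis @ np.asarray(G, dtype=float)
    return np.stack([new_x, new_y], axis=1)

# N = 4 control points, so 6 + 2N = 14 parameters in total.
control = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
num_params = 6 + 2 * len(control)

# With zero RBF weights and an identity linear part, nothing moves.
query = np.array([[0.5, 0.5], [2.0, 0.0]])
same = tps_transform(query, control, [0, 1, 0], [0, 0, 1], np.zeros(4), np.zeros(4))
```

Setting any Fi or Gi to a non-zero value bends the mapping locally around the corresponding control point, which is where the extra flexibility comes from.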

2. Spatial Transformer Network (STN)

[Figure: Affine Transformation]

STN is composed of a Localisation Net, a Grid Generator and a Sampler.

2.1. Localisation Net

The localisation net takes the input feature map U, with width W, height H and C channels, and outputs θ, the parameters of the transformation Tθ.

θ can be learnt as a full affine transform as above, or it can be more constrained, such as the transform used for attention, which contains only scaling and translation as below:

[Figure: Only scaling and translation]
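For reference, the two parameterisations of θ mentioned above can be written out as follows. This is my transcription of the notation as I recall it from the paper, so please treat the symbol names (s for the isotropic scale, t_x and t_y for the translation) as assumptions.

```latex
% Full affine transform: 6 parameters
A_\theta =
\begin{bmatrix}
\theta_{11} & \theta_{12} & \theta_{13} \\
\theta_{21} & \theta_{22} & \theta_{23}
\end{bmatrix}

% Attention-style transform: isotropic scale and translation only, 3 parameters
A_\theta =
\begin{bmatrix}
s & 0 & t_x \\
0 & s & t_y
\end{bmatrix}
```

With the constrained form, the localisation net only needs to regress three numbers, and the transformer can crop and zoom but cannot rotate or shear.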

2.2. Grid Generator

Suppose we have a regular grid G over the output feature map. This G is a set of points with target coordinates (xt_i, yt_i), which act as input to the grid generator.

Then we apply the transformation Tθ on G, i.e. Tθ(G).

After Tθ(G), a set of points with source coordinates (xs_i, ys_i) is outputted. These are the locations in the input feature map U at which samples will be taken. They are determined by the transformation parameters: the result can be translation, scale, rotation or more generic warping, depending on how we set θ as mentioned above.

2.3. Sampler

[Figure: (a) Identity Transformation, (b) Affine Transformation]

Based on the set of sampling coordinates (xs_i, ys_i), we generate a transformed output feature map V by sampling U at those locations.

This V is a translated, scaled, rotated, affine-transformed, projectively transformed or otherwise warped version of U.

It is noted that STN can be applied not only to the input image, but also to intermediate feature maps.
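The grid generator and sampler can be sketched in a few lines of NumPy. This is my own simplified illustration, not the authors' implementation: I assume an affine θ, coordinates normalised to [-1, 1], bilinear interpolation, and zeros for samples that fall outside U. The names `affine_grid` and `bilinear_sample` are hypothetical.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Grid generator: one sampling location per output pixel.

    theta is a 2x3 affine matrix acting on normalised coordinates in [-1, 1].
    Returns an (out_h, out_w, 2) array holding (x, y) locations in the input.
    """
    xs, ys = np.meshgrid(np.linspace(-1, 1, out_w), np.linspace(-1, 1, out_h))
    regular = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (out_h, out_w, 3)
    return regular @ np.asarray(theta, dtype=float).T          # (out_h, out_w, 2)

def bilinear_sample(U, grid):
    """Sampler: read the (H, W, C) feature map U at the grid locations."""
    H, W, C = U.shape
    # Convert normalised coordinates to pixel coordinates.
    x = (grid[..., 0] + 1) * (W - 1) / 2
    y = (grid[..., 1] + 1) * (H - 1) / 2
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    V = np.zeros(grid.shape[:2] + (C,))
    # Accumulate the four neighbouring pixels with bilinear weights.
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            weight = np.maximum(0, 1 - np.abs(x - xi)) * np.maximum(0, 1 - np.abs(y - yi))
            inside = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            xc, yc = np.clip(xi, 0, W - 1), np.clip(yi, 0, H - 1)
            V += (weight * inside)[..., None] * U[yc, xc]
    return V

# A 5x5 single-channel map; zooming in by 2 crops the central 3x3 region.
U = np.arange(25, dtype=float).reshape(5, 5, 1)
zoom = np.array([[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]])
V = bilinear_sample(U, affine_grid(zoom, 3, 3))
```

In a real STN, θ would come from the localisation net, and the whole pipeline would be written in a framework with automatic differentiation so that gradients flow back to θ.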

3. Sampling Kernel

As we can see in the example above, the points of a transformed grid generally fall at sub-pixel (sub-pel) positions, so we have a sampling problem. How those positions are sampled depends on which sampling kernel we use.

The paper defines the sampling in a general form, with two particular kernels: the integer sampling kernel (rounding to the nearest integer) and the bilinear sampling kernel. The bilinear kernel gives a (sub-)differentiable sampling mechanism, which is convenient for backpropagation.
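The equations behind these kernels are written out below as I recall them from the paper; please check the paper for the authoritative form. Here U^c_{nm} is the input value at row n, column m of channel c, V^c_i is the output value for pixel i, (x_i^s, y_i^s) is the location in U that is sampled for output pixel i, and Φ_x, Φ_y are the kernel parameters.

```latex
% General form
V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \,
        k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y)

% Integer sampling kernel (nearest pixel)
V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \,
        \delta(\lfloor x_i^s + 0.5 \rfloor - m) \,
        \delta(\lfloor y_i^s + 0.5 \rfloor - n)

% Bilinear sampling kernel
V_i^c = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \,
        \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)

% Gradients of the bilinear kernel
\frac{\partial V_i^c}{\partial U_{nm}^c}
  = \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)

\frac{\partial V_i^c}{\partial x_i^s}
  = \sum_{n=1}^{H} \sum_{m=1}^{W} U_{nm}^c \, \max(0, 1 - |y_i^s - n|)
    \begin{cases}
      0  & \text{if } |m - x_i^s| \ge 1 \\
      1  & \text{if } m \ge x_i^s \\
      -1 & \text{if } m < x_i^s
    \end{cases}
```

The gradient with respect to y_i^s has the same form with the roles of x and y swapped. The integer kernel, by contrast, has zero gradient with respect to the sampling coordinates almost everywhere, so it does not let the localisation net learn through the sampler.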

4. Experimental Results

4.1. Distorted MNIST

Distortion applied: TC: translated and cluttered, R: rotated, RTS: rotated, translated, and scaled, P: projective distortion, E: elastic distortion.

Spatial transformers: Aff: Affine Transformation, Proj: Projective Transformation, TPS: Thin Plate Spline Transformation.

FCN: FCN here means Fully Connected Network without convolutions (it is NOT Fully Convolutional Network here).

As we can see, ST-FCN outperforms FCN and ST-CNN outperforms CNN. And ST-CNN is consistently better than ST-FCN in all settings.

4.2. SVHN (Street View House Number)

ST-CNN Single: Only one ST at the beginning of the network.

ST-CNN Multi: One ST before each of the first four convolutional layers.

Affine transformation is used here.

Similarly, ST-CNN outperforms Maxout and CNN. (I have a very brief introduction of Maxout, please read it if interested.)

And ST-CNN Multi outperforms ST-CNN Single a bit.

4.3. Fine-Grained Classification

Here, an ImageNet pre-trained Inception-v2 is used as the backbone for classifying 200 bird species. This baseline has 82.3% accuracy.

2/4×ST-CNN: 2 or 4 parallel STs, with higher accuracy.

It is interesting that one ST (red) has learnt to be a head detector, while the other 3 STs (green) have learnt to focus on the central part of the body of a bird.

5. Some Other Tasks

5.1. MNIST Addition

2×ST-CNN: It is interesting that each ST learns to transform one of the two digits, even though each ST receives both input digits.

5.2. Co-localisation

Triplet loss: A hinge loss is used to enforce the distance between the two outputs of the ST to be less than the distance to a random crop, hoping to encourage the spatial transformer to localise the common objects.
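As a small sketch of this hinge loss for a single triplet, assuming the crops have already been encoded into embedding vectors and that squared Euclidean distance is used: the function name `colocalisation_hinge_loss`, the argument names and the margin value are my own placeholders, not values from the paper.

```python
import numpy as np

def colocalisation_hinge_loss(st_crop_a, st_crop_b, random_crop_a, margin=1.0):
    """Hinge loss for one triplet of embedding vectors.

    st_crop_a, st_crop_b: embeddings of the ST outputs for two images
    that contain the common object.
    random_crop_a: embedding of a random crop from the first image.
    """
    d_pos = np.sum((st_crop_a - st_crop_b) ** 2)       # should be small
    d_neg = np.sum((st_crop_a - random_crop_a) ** 2)   # should be large
    return max(0.0, d_pos - d_neg + margin)

# Two ST crops that agree, and a random crop that is far away: zero loss.
a = np.array([1.0, 0.0])
b = np.array([1.0, 0.1])
rand = np.array([-1.0, 0.0])
loss_good = colocalisation_hinge_loss(a, b, rand)

# If the ST crop is closer to the random crop than to the other ST crop,
# the loss becomes positive and pushes the transformer to change its crop.
loss_bad = colocalisation_hinge_loss(a, rand, b)
```

The loss is zero once the two ST outputs are closer to each other than to the random crop by at least the margin, so no bounding-box labels are needed.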

5.3. Higher Dimensional Transformers

STN can also be extended to 3D affine transformation.
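To give a feel for the 3D case, the grid generator only needs one more coordinate and a 3×4 matrix (12 parameters) instead of 2×3. Below is a minimal NumPy sketch of my own, assuming normalised coordinates in [-1, 1]; the name `affine_grid_3d` is hypothetical.

```python
import numpy as np

def affine_grid_3d(theta, out_d, out_h, out_w):
    """3D grid generator: theta is a 3x4 affine matrix on normalised coordinates.

    Returns an (out_d, out_h, out_w, 3) array of (x, y, z) sampling locations.
    """
    zs, ys, xs = np.meshgrid(
        np.linspace(-1, 1, out_d),
        np.linspace(-1, 1, out_h),
        np.linspace(-1, 1, out_w),
        indexing="ij",
    )
    regular = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1)  # (D, H, W, 4)
    return regular @ np.asarray(theta, dtype=float).T             # (D, H, W, 3)

# The identity transform leaves the regular grid unchanged.
identity_3d = np.hstack([np.eye(3), np.zeros((3, 1))])
grid = affine_grid_3d(identity_3d, 2, 3, 4)
```

Sampling would then use trilinear instead of bilinear interpolation.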

There are different network architectures and settings for different datasets. It is better to visit the paper if you want to know the details. Next, I will probably review Deformable Convolutional Networks.

Reference

[2015 NIPS] [ST] Spatial Transformer Networks

My Related Reviews

Image Classification: [LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet]

Object Detection: [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [ION] [MultiPathNet] [NoC] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [FPN] [RetinaNet]

Semantic Segmentation: [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [ParseNet] [DilatedNet] [PSPNet] [DeepLabv3]

Biomedical Image Segmentation: [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet]

Instance Segmentation: [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution: [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net]