Review: V-Net — Volumetric Convolution (Biomedical Image Segmentation)

Fully Convolutional Networks for Volumetric Medical Image Segmentation

SH Tsang, Mar 16

[Figure: Slices from MRI volumes depicting prostate (PROMISE 2012 challenge dataset)]

[Figure: Qualitative results (PROMISE 2012 challenge dataset)]

In this story, V-Net is briefly reviewed.

Most medical data used in clinical practice consists of 3D volumes, such as MRI volumes depicting prostate, while most approaches are only able to process 2D images.

3D image segmentation based on a volumetric, fully convolutional neural network is proposed in this work.

Prostate MRI volume segmentation is a challenging task due to the wide range of appearance and the different scanning approaches.

Deformations and variations of the intensity distribution also occur.

Annotated medical volumes are not easy to obtain.

Experts are required for annotation, which yields a high cost.

Automatic segmentation can help to reduce the cost.

Prostate segmentation nevertheless is an important task having clinical relevance both during diagnosis, where the volume of the prostate needs to be assessed, and during treatment planning, where the estimate of the anatomical boundary needs to be accurate.

This is a 2016 3DV paper with more than 600 citations.

(SH Tsang @ Medium)

Outline

1. V-Net Architecture
2. Dice Loss
3. Results

1. V-Net Architecture

V-Net is shown as above.

The left part of the network consists of a compression path, while the right part decompresses the signal until its original size is reached.

As you can see, it is similar to U-Net, but with some differences.



Left

The left side of the network is divided into different stages that operate at different resolutions.

Each stage comprises one to three convolutional layers.

At each stage, a residual function is learnt.

The input of each stage is used in the convolutional layers and processed through the non-linearities and added to the output of the last convolutional layer of that stage in order to enable learning a residual function.

According to the authors, this architecture converges in a fraction of the time required by a comparable network that does not learn residual functions, such as U-Net.
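The residual idea described above can be sketched in a few lines. This is only a toy illustration, not the authors' implementation: the volume is flattened to a plain list of floats, and `residual_stage` and the stand-in `halve` layer are hypothetical names introduced here in place of real 5×5×5 convolutions.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def residual_stage(x: Vector, layers: Sequence[Callable[[Vector], Vector]]) -> Vector:
    """Pass x through the stage's layers, then add the stage input back.

    The layers only need to learn the residual F(x); the stage outputs x + F(x).
    """
    out = x
    for layer in layers:
        out = layer(out)
    # Element-wise sum of the stage input and the output of the last layer.
    return [o + xi for o, xi in zip(out, x)]


def halve(v: Vector) -> Vector:
    """Hypothetical stand-in for a convolution + non-linearity."""
    return [0.5 * t for t in v]


stage_output = residual_stage([1.0, 2.0], [halve])
```

If the layers output zeros, the stage reduces to the identity mapping, which is the usual intuition for why residual stages are easier to optimize.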

The convolutions performed in each stage use volumetric kernels having size of 5×5×5 voxels.

(A voxel represents a value on a regular grid in 3D space. The term is commonly used for 3D data, for example in the voxelization of point clouds.)

Along the compression path, resolution is reduced by convolution with 2×2×2 voxel-wide kernels applied with stride 2.

Thus, the size of the resulting feature maps is halved, with similar purpose as pooling layers.

And the number of feature channels doubles at each stage of the compression path of the V-Net.
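To make the halving and doubling concrete, here is a small shape calculation. It is a sketch under assumptions: the 128×128×64 input size is the one quoted in the training section of this review, while the 16 starting channels and 5 stages are taken from the architecture figure of the V-Net paper. `compression_path_shapes` is a hypothetical helper name.

```python
def compression_path_shapes(input_shape=(128, 128, 64), base_channels=16, num_stages=5):
    """Return (channels, dim1, dim2, dim3) of the feature maps at each stage.

    Assumes each 2x2x2 stride-2 convolution halves every spatial dimension
    and doubles the number of feature channels.
    """
    shapes = []
    spatial = tuple(input_shape)
    channels = base_channels
    for _ in range(num_stages):
        shapes.append((channels, *spatial))
        spatial = tuple(s // 2 for s in spatial)  # stride-2 down-convolution
        channels *= 2                             # channels double per stage
    return shapes


for shape in compression_path_shapes():
    print(shape)
```

Under these assumptions the deepest stage holds 256 channels on an 8×8×4 grid.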

Replacing pooling operations with convolutional ones helps to have a smaller memory footprint during training, due to the fact that no switches mapping the output of pooling layers back to their inputs are needed for back-propagation.

Downsampling helps to increase the receptive field.
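The effect of downsampling on the receptive field can be checked with the standard receptive-field recurrence. This is a sketch, assuming 1, 2, 3, 3 and 3 convolutions in the five left-side stages (the review only says "one to three"; the per-stage counts are read from the paper's figure). `receptive_fields` is a hypothetical helper name.

```python
def receptive_fields(convs_per_stage=(1, 2, 3, 3, 3), conv_kernel=5,
                     down_kernel=2, down_stride=2):
    """Receptive field (in input voxels, per axis) at the end of each stage."""
    rf = 1    # receptive field of a single input voxel
    jump = 1  # distance in input voxels between neighbouring feature-map voxels
    result = []
    for stage, n_convs in enumerate(convs_per_stage):
        if stage > 0:
            # Down-convolution between stages: 2x2x2 kernel, stride 2.
            rf += (down_kernel - 1) * jump
            jump *= down_stride
        for _ in range(n_convs):
            # Each 5x5x5 convolution widens the field by 4 steps of size `jump`.
            rf += (conv_kernel - 1) * jump
        result.append(rf)
    return result


print(receptive_fields())
```

Each stride-2 step doubles `jump`, so every later 5×5×5 convolution adds twice as many input voxels to the receptive field as it would have one stage earlier.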

PReLU is used as the non-linear activation function. (PReLU was proposed in PReLU-Net.)
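For reference, PReLU is a ReLU whose negative-side slope is a learnable parameter. A minimal scalar sketch follows; the slope value 0.25 used in the example is an arbitrary illustrative choice, not a value taken from this review.

```python
def prelu(x: float, a: float) -> float:
    """Parametric ReLU: identity for positive inputs, slope `a` for negative ones."""
    return x if x > 0 else a * x


# `a` would be learned during training; 0.25 is only an example value.
activations = [prelu(v, 0.25) for v in (-2.0, 0.0, 3.0)]
```

With `a = 0` this reduces to the ordinary ReLU.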



Right

The network extracts features and expands the spatial support of the lower resolution feature maps in order to gather and assemble the necessary information to output a two-channel volumetric segmentation.

[Figure: Convolution for Downsampling (Left), Deconvolution for Upsampling (Right)]

At each stage, a deconvolution operation is employed in order to increase the size of the inputs, followed by one to three convolutional layers involving half the number of 5×5×5 kernels employed in the previous layer.

A residual function is learnt, similar to the left part of the network.

The very last convolutional layer has a 1×1×1 kernel size and computes two feature maps of the same size as the input volume.

Applying soft-max voxelwise converts these two output feature maps into probabilistic segmentations of the foreground and background regions.
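The voxelwise soft-max over the two output channels can be sketched as follows. This is a toy version that treats each channel as a flat list of per-voxel scores; `voxelwise_softmax` and `softmax_volume` are hypothetical names.

```python
import math


def voxelwise_softmax(bg_score: float, fg_score: float):
    """Turn the two channel values of one voxel into (p_background, p_foreground)."""
    m = max(bg_score, fg_score)  # subtract the max for numerical stability
    e_bg = math.exp(bg_score - m)
    e_fg = math.exp(fg_score - m)
    total = e_bg + e_fg
    return e_bg / total, e_fg / total


def softmax_volume(bg_channel, fg_channel):
    """Apply the soft-max independently at every voxel."""
    return [voxelwise_softmax(b, f) for b, f in zip(bg_channel, fg_channel)]


probs = softmax_volume([0.0, 2.0, -1.0], [0.0, -2.0, 3.0])
```

At every voxel the two probabilities sum to 1, so the foreground probability alone is enough to describe the segmentation.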



Horizontal Connections

Similar to U-Net, location information is lost in the compression path (left).

Thus, the features extracted from early stages of the left part of the CNN are forwarded to the right part through horizontal connections.

This can help to provide location information to the right part, and improve the quality of the final contour prediction.

These connections also improve the convergence time of the model.


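2. Dice Loss

The Dice coefficient D between two binary volumes, which ranges between 0 and 1, is:

$$D = \frac{2 \sum_{i}^{N} p_i g_i}{\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2}$$

where the sums run over the N voxels, with p_i the predicted voxels and g_i the ground-truth voxels.

As mentioned, at the end of the network, after softmax, the outputs are the probabilities of each voxel belonging to the foreground and to the background.

The Dice coefficient can be differentiated with respect to the j-th predicted voxel, yielding the gradient:

$$\frac{\partial D}{\partial p_j} = 2 \left[ \frac{g_j \left( \sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2 \right) - 2 p_j \left( \sum_{i}^{N} p_i g_i \right)}{\left( \sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2 \right)^2} \right]$$

Using Dice loss, there is no need to assign weights to samples of different classes to establish the right balance between foreground and background voxels.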
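A minimal sketch of the Dice coefficient and its gradient, using the definition from the V-Net paper, D = 2 Σ p_i g_i / (Σ p_i² + Σ g_i²), on flattened volumes. Turning it into a loss as 1 − D is one common choice and an assumption here; the function names are hypothetical.

```python
def dice_coefficient(p, g):
    """Dice coefficient between predictions p and ground truth g (flat lists).

    Undefined (division by zero) when both p and g are all zeros.
    """
    intersection = sum(pi * gi for pi, gi in zip(p, g))
    denominator = sum(pi * pi for pi in p) + sum(gi * gi for gi in g)
    return 2.0 * intersection / denominator


def dice_gradient(p, g):
    """Analytic gradient dD/dp_j for every voxel j."""
    intersection = sum(pi * gi for pi, gi in zip(p, g))
    denominator = sum(pi * pi for pi in p) + sum(gi * gi for gi in g)
    return [
        2.0 * (gj * denominator - 2.0 * pj * intersection) / denominator ** 2
        for pj, gj in zip(p, g)
    ]


def dice_loss(p, g):
    """Assumed loss form: minimizing 1 - D maximizes the Dice coefficient."""
    return 1.0 - dice_coefficient(p, g)


pred = [0.8, 0.1, 0.6]    # foreground probabilities after softmax
truth = [1.0, 0.0, 1.0]   # binary ground truth
print(dice_coefficient(pred, truth), dice_gradient(pred, truth))
```

Because the coefficient is a ratio of overlaps, a volume dominated by background voxels does not swamp the score, which is why no class re-weighting is needed.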




3. Results

Training

All the volumes processed by the network have a fixed size of 128×128×64 voxels and a spatial resolution of 1×1×1.5 millimeters.

The dataset is small, since one or more experts are required to manually trace a reliable ground truth annotation, and there is a cost associated with its acquisition.

Only 50 MRI volumes, with the corresponding manual ground truth annotations, are used for training. They are obtained from the PROMISE 2012 challenge dataset.

This dataset contains medical data acquired in different hospitals, using different equipment and different acquisition protocols, to represent the clinical variability.

Thus, data augmentation is needed.

In each iteration, randomly deformed versions of the training images are generated using a dense deformation field obtained through a 2×2×2 grid of control points and B-spline interpolation.

Also, the intensity distribution of the data is varied: using histogram matching, the intensity distributions of the training volumes used in each iteration are adapted to those of other randomly chosen scans belonging to the dataset.
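The histogram-matching step can be illustrated with a simple rank-based version. This is a simplified sketch, not the procedure used by the authors: it works on flat lists of intensities, maps each source value to the reference value at the same quantile, and breaks ties by position. `match_histogram` is a hypothetical name.

```python
def match_histogram(source, reference):
    """Remap source intensities so their distribution follows the reference.

    Each source value is replaced by the reference intensity found at the
    same quantile (with linear interpolation between reference samples).
    """
    order = sorted(range(len(source)), key=lambda i: source[i])
    ref_sorted = sorted(reference)
    n, m = len(source), len(ref_sorted)
    matched = [0.0] * n
    for rank, idx in enumerate(order):
        quantile = rank / (n - 1) if n > 1 else 0.0
        pos = quantile * (m - 1)
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        matched[idx] = ref_sorted[lo] * (1.0 - frac) + ref_sorted[hi] * frac
    return matched


# Intensities of a training volume are adapted to those of another scan.
augmented = match_histogram([10, 30, 20], [1.0, 2.0, 3.0])
```

The ordering of the voxels is preserved, so the anatomy looks the same while the intensity distribution imitates the other scan.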

Each mini-batch contains only 2 volumes, due to the high memory requirement.



Testing

30 unseen MRI volumes are processed.

After softmax, the voxels having a higher probability (> 0.5) of belonging to the foreground than to the background are considered part of the anatomy.
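The thresholding rule above amounts to a one-line mask. A small sketch on a flat list of foreground probabilities (`foreground_mask` is a hypothetical name):

```python
def foreground_mask(fg_probs, threshold=0.5):
    """Label a voxel 1 (anatomy) when its foreground probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in fg_probs]


mask = foreground_mask([0.2, 0.5, 0.9])
```

With only two classes, a foreground probability above 0.5 is the same as the foreground channel winning the soft-max.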

Dice coefficient and Hausdorff distance are measured.

Hausdorff distance measures shape similarity. It is the largest distance from a point on one shape to the closest point on the other shape, taken over both shapes.
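A brute-force sketch of the Hausdorff distance between two point sets, for example the boundary voxels of a predicted and a ground-truth segmentation. It is quadratic in the number of points and meant only as an illustration; the function names are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: the worse of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


shape_a = [(0, 0, 0), (1, 0, 0)]
shape_b = [(0, 0, 0), (4, 0, 0)]
print(hausdorff_distance(shape_a, shape_b))
```

Unlike the Dice coefficient, a single far-away voxel is enough to make this distance large, so the two metrics complement each other.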

(If interested, a very brief introduction to Hausdorff distance is in CUMedVision2 / DCAN.)

As shown above, V-Net using Dice loss outperforms V-Net with logistic loss.

V-Net also outperforms most of the prior art, with the exception of Imorphics.

As future work, the authors mention that they will aim at segmenting volumes containing multiple regions in other modalities such as ultrasound, and at higher resolutions by splitting the network over multiple GPUs.

Reference

[2016 3DV] [V-Net] V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation

My Previous Reviews

Image Classification: [LeNet] [AlexNet] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [MSDNet]

Object Detection: [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation: [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [PSPNet] [DeepLabv3] [DRN]

Biomedical Image Segmentation: [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel]

Instance Segmentation: [SDS] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution: [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet]

Human Pose Estimation: [Tompson NIPS’14]
