Computer Vision: A study of different CNN architectures and their applications
Yash Upadhyay · Jan 3
We humans depend heavily on five senses to interpret the world around us.
Though each of our senses is important, we depend primarily on vision for most daily tasks, like reading, driving, or cooking. Most of the time it is the first sense we use for any task. Our eyes help us see the path we walk, the road we drive on, and check for any possible collision.
Vision is so important that it is only natural humans want to recreate it in machines.
The human drive for automation stems not only from reducing human error but also from the fact that machines can work 24/7, tirelessly and without any drop in performance.
This is where computer vision comes in. Computer vision is an interdisciplinary scientific field that aims to make a computer process images and videos and extract details the same way the human mind does.
In the past decade, research in computer vision has accelerated rapidly, and though no machine has come close to mimicking the human brain, computer vision has helped us achieve some extraordinary results.
Artificial neural networks, which have great capabilities for image pattern recognition, are widely used in computer vision algorithms, as they provide much better accuracy than conventional machine learning methods.
Real World Applications
In the modern day, computer vision has found many areas where it can be utilized to automate processes in ways that not only reduce human effort but also provide solutions to tasks that could never be solved within the limitations of human vision.
Healthcare
Photo by rawpixel on Unsplash
Computer vision is widely used in the diagnosis of diseases by processing X-rays, MRIs, and other medical images, and has proved to be as effective, in the matter of precision, as human doctors.
Health problems like pneumonia, brain tumors, diabetes, Parkinson's disease, breast cancer, and many others are being diagnosed with the help of computer vision. With state-of-the-art image processing techniques and computer vision, early diagnosis of possible diseases becomes feasible, allowing treatment at an early stage before the disease progresses. Computer vision has also helped researchers monitor patients' adherence to their prescribed treatments, reducing attrition in clinical trials. It not only helps in diagnosis but also plays a role in surgery, analyzing tissue damage and monitoring the patient's blood loss.
Automobiles
Photo by Saketh Garuda on Unsplash
The automobile industry, with the growing hype around self-driving cars, depends heavily on computer vision as its means of understanding the driving environment: detecting obstacles, pedestrians, lanes, and possible collision paths.
Computer vision is now also used in driver-assistance systems, which help the driver by flagging certain situations. These systems also monitor the driver for correct behavior and driving patterns to reduce accidents caused by negligence, checking whether the driver is driving rashly, is under the influence of alcohol or drugs, or is drowsy.
Computer vision also plays a role in the automated production of cars, where it rejects defective components on the assembly line.
Security and Surveillance
Photo by Veit Hammer on Unsplash
These days our residential complexes, metro stations, roads, schools, and hospitals (in short, every building that needs constant surveillance) have networks of closed-circuit cameras. But as human guards can only monitor a limited number of cameras for a limited period of time, these cameras are most often used merely as evidence after a crime rather than as a tool for averting it.
Computer vision counters this problem: security systems with computer vision capabilities can detect crimes like violence, theft, and trespassing, and with face recognition they can also find criminals in crowded areas like airports and train stations.
Astronomy
Photo by Rahul Bhosale on Unsplash
All our knowledge about the universe derives from measurements of photons, which are mostly images, opening the possibility of applying computer vision in astronomy. As our universe is so vast, the data collected is correspondingly large; studying it manually is not feasible for astronomers, but with computer vision it can be studied at a much faster rate. Computer vision is currently being used to discover new planets and other celestial bodies, in applications like exoplanet imaging and star and galaxy classification.
Agriculture
Machine learning and computer vision in farming
In agriculture, computer vision is used to determine whether the seeds being used are healthy.
Using hyperspectral or multispectral sensors the health of the crops can also be determined.
It can also help in identifying areas with fertile soil and the presence of water bodies, thus determining which areas are suitable for agriculture.
Computer vision is also enabling robots to carry out processes such as harvesting, planting, and weeding. Autonomous tractors that rely on machine vision help reduce the workload on farmers.
Computer vision can also be used to identify livestock and monitor their growth over the course of their lifetime to provide important information about progress towards harvesting.
Industrial
Computer vision aiding in manufacturing processes
In industry, computer vision is used on the assembly line for counting batches, detecting damaged components, and inspecting finished goods; machine vision tools find microscopic defects in products that simply cannot be identified by human vision, and also improve safety in the factory environment. In manufacturing, reading barcodes is essential, as they give each product a unique identification; reading thousands of barcodes a day is not an easy task for humans, but with computer vision it can be done in minutes.
Satellite Imagery
Photo by SpaceX on Unsplash
Computer vision is applied to satellite images to detect natural hazards like floods, tsunamis, hurricanes, and landslides.
Satellite images are also used to analyze pollution and air quality index of areas of focus.
It can also be used to detect various materials on land; recently, mining industries have started using computer vision to detect areas with a high likelihood of containing crude oil or minerals, since exploratory mining merely to check for the presence of ore can be costly and an enormous waste of money.
Now that we have introduced computer vision and its applications, let us look at some of the algorithms used in computer vision.
Convolutional Neural Networks (CNN)
Most computer vision tasks are built around CNN architectures, as the basis of most problems is to classify an image into known labels.
Object detection algorithms like SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once) are built around CNNs.
CNN Architecture
Artificial neural networks were great for tasks that were not possible with conventional machine learning algorithms, but when processing images through fully connected hidden layers, an ANN takes a very long time to train. CNNs were introduced to first reduce the size of the images using convolutional and pooling layers, and then feed this reduced data to fully connected layers.
CNNs are used not only in computer vision but also for text classification in natural language processing.
Let’s talk about layers of CNN.
Convolution Layer
Convolution operation
To perform the convolution operation, a filter (a smaller matrix) of a specified size is slid over the image matrix. The filter's task is to multiply its values by the underlying pixel values; all these multiplications are summed up, yielding a single number. The filter then moves right by n units (the stride, which can vary) and repeats the operation. After passing the filter across all positions, a matrix is obtained that is smaller than the input matrix.
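The sliding-filter operation described above can be sketched in a few lines of plain Python; the 3×3 image and 2×2 filter values below are made up for illustration.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding,
    summing the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # sums each pixel with its lower-right neighbour

print(convolve2d(image, kernel))  # [[6, 8], [12, 14]]
```

Note that the 3×3 input shrinks to a 2×2 output, matching the observation that the resulting matrix is smaller than the input.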
Nonlinear Layer
It is added after each convolution layer and uses an activation function to bring non-linearity to the data.
Non-linearity means that the change in the output is not proportional to the change in the input. We require this non-linearity because if the network were linear, there would be no point in adding multiple layers (multiple linear layers are equivalent to a single layer).
By increasing the nonlinearity we can make a complex network which will be able to find new patterns in the image.
The activation function can be ReLU, tanh, or any other nonlinear activation function.
Pooling Layer
Max Pooling
The pooling layer is used to further downsize the matrix. The most common form is a pooling layer with 2×2 filters applied with a stride of 2, which downsamples every depth slice of the input by 2 along both width and height, discarding 75% of the activations. The pooling layer is generally used to select the most important pixels: the max pooling function keeps only the highest-valued pixel within the filter, which reduces the amount of computation required and hence significantly cuts the time taken to train the neural network.
Pooling can be done in various ways:
Max pooling: the largest element in the window is selected.
Min pooling: the smallest element in the window is selected.
Average (mean) pooling: the mean of the elements in the window is taken.
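Max pooling with a 2×2 filter and stride 2, as described above, can be sketched as follows; the 4×4 input values are made up. Each window keeps only its largest value, so 75% of the activations are discarded.

```python
def max_pool(matrix, size=2, stride=2):
    """Apply max pooling with the given window size and stride."""
    out = []
    for i in range(0, len(matrix) - size + 1, stride):
        row = []
        for j in range(0, len(matrix[0]) - size + 1, stride):
            window = [matrix[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        out.append(row)
    return out

matrix = [[1, 3, 2, 4],
          [5, 6, 1, 2],
          [7, 2, 9, 1],
          [0, 8, 3, 4]]
print(max_pool(matrix))  # [[6, 4], [8, 9]]
```

Swapping `max` for `min` or a mean gives the other pooling variants listed above.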
Fully Connected Layers
Multilayer perceptron neural network
Fully connected layers connect every neuron in one layer to every neuron in the next layer. This is in principle the same as the traditional multilayer perceptron (MLP); the only difference is that the input layer of the MLP takes its input from the output of the previous layers of the CNN.
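The layer's computation is just a weighted sum per output neuron plus a bias; a minimal sketch with made-up weights and inputs:

```python
def dense(inputs, weights, biases):
    """Fully connected layer: outputs[j] = sum_i inputs[i] * weights[i][j] + biases[j]."""
    n_out = len(biases)
    return [sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
            for j in range(n_out)]

inputs = [1.0, 2.0]          # e.g. flattened features from the CNN layers
weights = [[0.5, -1.0],      # weights[i][j]: connection from input i to output j
           [1.0, 0.5]]
biases = [0.0, 1.0]
print(dense(inputs, weights, biases))  # [2.5, 1.0]
```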
CNN-Based Architectures
Many CNN-based architectures have been used to maximize performance in image classification. Some of the most famous are discussed below.
AlexNet (2012)
AlexNet was designed by the SuperVision group, consisting of Alex Krizhevsky, Geoffrey Hinton, and Ilya Sutskever.
It was the winner of the ImageNet LSVRC-2012 competition, a yearly competition focused on image classification, with a top-5 error rate of 15.3%. AlexNet used ReLU (rectified linear unit) instead of tanh activation to add non-linearity, which accelerated training by a factor of 6 while also increasing accuracy.
It also used dropout to deal with overfitting. Another feature of AlexNet was overlapping pooling, which reduces the size of the network and lowered the top-1 and top-5 error rates by 0.4% and 0.3%, respectively.
AlexNet architecture
AlexNet had 5 convolutional layers and 3 fully connected layers; in the nonlinear layers that follow every convolutional and fully connected layer, the ReLU activation function is used. Dropout is applied only before the first and the second fully connected layers.
The network has 62.3 million parameters and needs 1.1 billion computation units in a forward pass. The AlexNet paper specifies that the network takes 90 epochs, over five or six days, to train on two GTX 580 GPUs.
Training uses stochastic gradient descent with a learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005. The learning rate is divided by 10 once the accuracy plateaus, and is decreased 3 times during the training process.
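The step schedule described above (start at 0.01, divide by 10 at each plateau, three drops over 90 epochs) can be sketched as follows; the plateau epochs here are invented for illustration, since in practice they depend on when accuracy actually levels off.

```python
def step_lr(base_lr, epoch, drop_epochs):
    """Divide the learning rate by 10 at each plateau epoch already reached."""
    lr = base_lr
    for e in drop_epochs:
        if epoch >= e:
            lr /= 10.0
    return lr

drops = [30, 60, 80]  # hypothetical plateau points within the 90 epochs
for epoch in (0, 45, 85):
    print(epoch, step_lr(0.01, epoch, drops))
```

At epoch 0 the rate is 0.01; after the first drop it becomes 0.001, and after all three drops it ends at 0.00001.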
GoogLeNet/Inception (2014)
Architecture for GoogLeNet
GoogLeNet was the winner of the ILSVRC 2014, achieving a top-5 error rate of 6.67%. The network used a CNN inspired by LeNet.
Its architecture contains 1×1 convolutions in the middle of the network, and global average pooling is used at the end of the network instead of fully connected layers. It also makes use of the Inception module, whose idea is to apply convolutions of different sizes/types to the same input and stack all the outputs.
It also used batch normalization, image distortions, and RMSprop.
In GoogLeNet, a 1×1 convolution is used as a dimension-reduction module to reduce computation; by easing this computational bottleneck, the network's depth and width can be increased. GoogLeNet's architecture is a 22-layer-deep CNN, yet it reduced the number of parameters from 60 million (AlexNet) to 4 million.
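A back-of-the-envelope sketch of why the 1×1 bottleneck saves computation: the channel sizes below (192 in, 32 out, with a 5×5 convolution and a 16-channel bottleneck) are illustrative, not GoogLeNet's exact figures.

```python
def conv_params(in_ch, out_ch, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * in_ch * out_ch

# Direct 5x5 convolution mapping 192 channels to 32.
direct = conv_params(192, 32, 5)

# Same mapping with a 1x1 bottleneck first: 192 -> 16 -> 32.
bottleneck = conv_params(192, 16, 1) + conv_params(16, 32, 5)

print(direct, bottleneck)  # 153600 15872
```

The bottleneck version needs roughly a tenth of the weights, which is how depth and width can grow without the parameter count exploding.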
VGGNet (2014)
Architecture for VGGNet
VGGNet was created by the Visual Geometry Group (VGG) at the University of Oxford. Though VGGNet was the first runner-up, not the winner, of the ILSVRC 2014 classification task, it still showed a significant improvement over previous networks.
VGG-16 consists of 16 weight layers (13 convolutional and 3 fully connected) and is very appealing because of its very uniform architecture: it uses only 3×3 convolutions, but with lots of filters. It is mostly used for extracting features from images. VGG-16, without its fully connected layers, is used as the base network for the object detection algorithm SSD.
ResNet (2015)
Architecture for ResNet
The Residual Neural Network (ResNet) by Kaiming He et al. won the ILSVRC 2015, achieving a top-5 error rate of 3.57%, which beats human-level performance on this dataset.
It introduced an architecture consisting of 152 layers with skip connections (identity shortcut connections) and features heavy batch normalization.
The whole idea of ResNet is to counter the problem of vanishing gradients by preserving the gradients. Vanishing gradients occur in networks with a large number of layers: the weights of the first layers cannot be updated correctly through backpropagation of the error gradient, because the chain rule multiplies gradient values smaller than one, so by the time the error gradient reaches the first layers its value has gone to zero.
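A minimal sketch of the identity shortcut: the block computes a residual F(x) and adds the input back, so even if F's gradient shrinks, the gradient can still flow through the addition path. The toy F below is just a stand-in for the block's convolutional layers.

```python
def residual_block(x, f):
    """y = F(x) + x: the identity shortcut adds the input back to the
    block's output, giving gradients a direct path to earlier layers."""
    return [fi + xi for fi, xi in zip(f(x), x)]

def toy_f(x):
    # Stand-in for the conv/ReLU layers inside the block;
    # here it just doubles each input value.
    return [2 * v for v in x]

x = [1, 2, 3]
print(residual_block(x, toy_f))  # [3, 6, 9]
```

Even if `toy_f` returned all zeros, the block's output would still be `x` itself, which is why stacking many such blocks does not erase the signal.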
Challenges for Computer Vision
Computer vision is heavily dependent on the quality of images, which is affected by factors like which camera was used, what time of day the image or video was taken, and whether the camera was stable.
Applications like facial recognition and video analysis face huge problems because CCTV footage is often of very low quality and cannot be used to distinguish people.
In the case of object detection, the size of the objects plays an important role in the model's accuracy: small objects aren't easily detected, and even when they are detected, the detection is unstable. Accuracy is also affected by deformation of the objects, the background of the image, and the extent of occlusion.
Another factor that hinders computer vision is the model's knowledge: if an object or image was not present in the training set, the model will produce incorrect results. This can be dangerous; for example, if a weapons detection system deployed at a railway station is trained only on guns and knives, terrorists could bring in bombs that pass undetected through the system, putting lives in danger.