Unfortunately, or perhaps for the better, computers still lack the kind of intelligence needed for true understanding.
Training such models requires a huge amount of data, but improvements are being made, mainly by companies where image search plays an important role, as with Facebook's Rosetta and Google's Attend, Infer, Repeat.
Segmentation
With image segmentation, we can achieve a much more granular detection of objects in our images compared to the object detector's bounding boxes.
With segmentation, each and every pixel is classified.
On the upper half of the image, we find the respective outputs of an image classifier and an object detector.
Below, we find the semantic and instance segmentation's output.
Indeed, semantic segmentation assigns each pixel to a class but does not distinguish multiple occurrences within the same class (multiple sheep in this case), whereas instance segmentation makes this differentiation and identifies unique occurrences within a category.
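To make the distinction concrete, here is a minimal numpy sketch (with a made-up 6×6 "image" containing two "sheep") contrasting the two kinds of per-pixel masks:

```python
import numpy as np

# Toy 6x6 image: class 0 = background, class 1 = "sheep".
# Two sheep occupy separate regions of the frame.
semantic = np.zeros((6, 6), dtype=int)
semantic[1:3, 1:3] = 1   # first sheep
semantic[3:5, 3:5] = 1   # second sheep

# Semantic segmentation: both sheep share class 1 -> indistinguishable.
print(np.unique(semantic))   # [0 1]

# Instance segmentation: each occurrence gets its own ID.
instance = np.zeros_like(semantic)
instance[1:3, 1:3] = 1   # sheep #1
instance[3:5, 3:5] = 2   # sheep #2
print(np.unique(instance))   # [0 1 2]
```

Both masks classify every pixel; only the instance mask tells the two sheep apart.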
Image segmentation is increasingly used in:
- detection of tumors & pathologies
- pedestrian & brake light detection in autonomous vehicles
- satellite imaging recognition
- astronomy
- manufacturing

Feature-based alignment
As explained earlier, extracting features from images is one of the first steps, and the next stage in many vision algorithms is to match these features across different images.
An important component of this matching is to verify whether the set of matching features is geometrically consistent, e.g., whether the feature displacements can be described by a simple 2D or 3D geometric transformation.
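As a sketch of such a consistency check, the following fits a single 2D affine transform to a set of hypothetical feature matches using plain numpy least squares, then inspects the residuals (real pipelines typically also use robust estimators such as RANSAC to reject outliers):

```python
import numpy as np

# Hypothetical matched keypoints in two images (rows are (x, y)).
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# In the second image, everything is translated by (2, 3).
dst = src + np.array([2.0, 3.0])

# Fit a 2D affine transform dst ~= [x, y, 1] @ A by least squares.
ones = np.ones((len(src), 1))
X = np.hstack([src, ones])                    # (N, 3)
A, *_ = np.linalg.lstsq(X, dst, rcond=None)   # (3, 2)

# Small residuals mean the displacements fit one simple transformation.
residuals = np.linalg.norm(X @ A - dst, axis=1)
geometrically_consistent = bool(np.all(residuals < 1e-6))
print(geometrically_consistent)   # True
```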
The computed motions can then be used in other applications such as image stitching, selfie filters, augmented reality, etc.
We will now discuss a particular application of feature-based alignment which is increasingly popular:

Pose Estimation
As the name suggests, it tries to estimate an object’s 3D pose from a set of 2D point projections.
In the movie industry, this is already being used for character animation, which can occur in real time.
Again, for autonomous vehicles this can be used to detect the alertness of a driver.
Also in healthcare, we can detect postural issues such as scoliosis, and on farms it is used to detect and prevent disease outbreaks.
Structure from motion
What if we could recreate a 3D model from a video source? This problem is known as structure from motion.
Factorization
This method, introduced in 1992, offers a novel technique to recreate a 3D model from a video: detect key features, lock onto those features as the motion happens to create a feature motion stream, and from this motion stream recreate a 3D model.
Below we see a simple example: the features are dots on our rotating ball; we lock onto the dots and analyse the flow.
From this we can recreate a model.
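A toy numpy sketch of the rank idea behind this factorization (synthetic points and orthographic cameras; not the full 1992 algorithm, which also recovers the metric structure):

```python
import numpy as np

rng = np.random.default_rng(0)
P = 20                                    # tracked feature points
shape_3d = rng.standard_normal((3, P))    # unknown 3D structure

# Orthographic cameras: each frame projects with some 2x3 matrix.
frames = [rng.standard_normal((2, 3)) for _ in range(5)]
W = np.vstack([M @ shape_3d for M in frames])   # 2F x P measurement matrix

# Centre each row (this removes the per-frame translation in the real method).
W = W - W.mean(axis=1, keepdims=True)

# Key insight: W factors as motion x shape, so its rank is at most 3.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
rank = int(np.sum(s > 1e-8 * s[0]))
print(rank)   # 3 -> a 3D model (up to an ambiguity) is recoverable from the SVD
```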
What if you want to reconstruct something bigger, let's say the Great Wall of China? Apparently, it is quite hard to move it around.
In this case, you fly over or alongside it, capture a video stream, and from this reconstruct a 3D model.
Indeed, in lots of augmented reality applications, factorization is used to virtualize real-life objects, for instance from museum galleries, Google Maps, internet photos and even YouTube.
Dense motion estimation
This is arguably the oldest commonly used technique but also perhaps the least known.
It finds its applications in video compression, stabilisation and video summarisation.
If we take the above sequence of images, we can intuitively understand that some parts within the image stay the same along the sequence.
As such, information from one frame can be reused across multiple frames, thus reducing the video size.
Furthermore, if noise or other artefacts are present we can average or borrow information from adjacent frames and denoise our video.
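A minimal illustration of that borrowing idea, using synthetic frames of a static scene (real denoisers would first estimate motion so moving parts can also be aligned before averaging):

```python
import numpy as np

rng = np.random.default_rng(1)
clean = np.full((5, 8, 8), 100.0)                 # 5 frames of a static scene
noisy = clean + rng.normal(0, 10, clean.shape)    # per-frame sensor noise

# The scene is static, so adjacent frames carry the same signal:
# averaging across time keeps the signal and cancels the noise.
denoised = noisy.mean(axis=0)

err_before = np.abs(noisy[0] - clean[0]).mean()
err_after = np.abs(denoised - clean[0]).mean()
print(err_after < err_before)   # True: temporal averaging reduced the noise
```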
The list of methodologies for dense motion estimation is vast, and while the concept is easy to understand, many techniques require considerable knowledge across a broad array of topics.
This is partly due to the humongous amount of video being consumed, which according to Cisco will make up 80% of total internet traffic.
Computational photography
In a sense, everything discussed in this chapter can be seen as computational photography, but here we will discuss old concepts which recently attained new levels of photographic performance thanks to Deep Learning techniques such as CNNs and GANs.
If you don't know what CNNs or GANs are, that's totally OK; we will cover them in another chapter, so until then, see this as an appetizer of what's to come.
Super Resolution
Super-resolution occurs when images are created with higher spatial resolution and less noise than regular camera images.
Before deep learning, such high-resolution composites were produced by aligning and combining several input images.
Another popular method was to upscale the image pixel-wise and interpolate pixel values.
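A minimal numpy sketch of that pixel-wise upscaling with bilinear interpolation (a simplified version; production resamplers handle edges and anti-aliasing more carefully):

```python
import numpy as np

def upscale_bilinear(img, factor):
    """Upscale a 2D grayscale image by blending between neighbouring pixels."""
    h, w = img.shape
    # For every target pixel, find its fractional position in the source.
    ys = np.linspace(0, h - 1, h * factor)
    xs = np.linspace(0, w - 1, w * factor)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    # Interpolate horizontally on the two neighbouring rows, then vertically.
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

ramp = np.array([[0.0, 10.0], [20.0, 30.0]])
big = upscale_bilinear(ramp, 2)
print(big.shape)   # (4, 4): 2x more pixels, with values blended in between
```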
Then, researchers at Twitter came up with a GAN model named Super Resolution GAN (SRGAN), presented as the first framework capable of inferring photo-realistic natural images for 4× upscaling factors.
Super Resolution was so hot that even the Google Brain team came up with their own model:

left: the input of the model, middle: the model's prediction, right: ground truth

Super Resolution now finds its way into satellite image processing, healthcare, microscopy and astronomy.
Colorization
This one is quite self-explanatory and is also getting increasingly robust.
How does it work? In short, the semantics (context) of the scene and its surface texture provide ample cues for many colour regions in each image.
With this information, it is possible to create a colour classifier at the pixel level and produce a plausible colorization that could potentially fool a human observer.
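As a toy illustration of the idea (entirely made-up features and classes, not a real colorization model), a pixel-level "colour classifier" can be as simple as a nearest-neighbour lookup from local cues to a colour class:

```python
import numpy as np

# Hypothetical training data: per-pixel cues -> colour class.
# Feature = (intensity, local texture score); 0 = sky-blue, 1 = grass-green.
train_feats = np.array([[0.9, 0.1], [0.8, 0.2],    # bright, smooth -> sky
                        [0.4, 0.8], [0.3, 0.9]])   # darker, textured -> grass
train_labels = np.array([0, 0, 1, 1])

def colour_class(feat):
    """1-nearest-neighbour colour classifier on pixel-level features."""
    dists = np.linalg.norm(train_feats - feat, axis=1)
    return int(train_labels[np.argmin(dists)])

print(colour_class(np.array([0.85, 0.15])))   # 0: plausibly sky
print(colour_class(np.array([0.35, 0.85])))   # 1: plausibly grass
```

Modern systems replace this lookup with a deep network, but the principle is the same: context cues in, per-pixel colour decision out.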
Texture analysis & synthesis
Traditional approaches to texture analysis and synthesis boiled down to trying to match the spectrum of the source image while generating shaped noise.
This was in itself not sufficient, and other (complex) techniques were applied with average results at best.
Then came Deep Learning to the rescue again with this notable research.
Once more, GANs (they are a big deal indeed) are the answer.
But with GANs, the question matters more than the answer: "How can an algorithm determine precisely whether an image is real or artificially constructed?"
If this question can be boiled down to equations and served to a GAN, it will produce results as seen below.

Another popular GAN, called Pix2Pix, can translate one image into another, giving some powerful new tools for human creativity.

Stereo correspondence & Rendering
This is the process of taking two or more images and estimating a 3D model of the scene, which happens by finding matching pixels in the images and converting their 2D positions into 3D depths.
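A minimal sketch of that pixel-matching idea on a single rectified scanline, with hypothetical camera parameters for the depth conversion:

```python
import numpy as np

# Two hypothetical rectified scanlines: the same bright object appears
# 4 pixels further left in the right view, i.e. its disparity is 4.
left = np.zeros(32)
left[10:14] = 1.0
right = np.zeros(32)
right[6:10] = 1.0

def match_score(d):
    # Overlap between the left row and the right row shifted by disparity d.
    return np.sum(left[d:] * right[:len(right) - d])

# Find matching pixels by picking the disparity with the best overlap.
best_d = max(range(1, 8), key=match_score)

# Convert the 2D offset into a 3D depth: depth = focal * baseline / disparity.
focal, baseline = 700.0, 0.1   # hypothetical focal length (px) and baseline (m)
depth = focal * baseline / best_d
print(best_d, depth)   # 4 17.5
```

Real stereo matchers do this per pixel over whole images, with smarter matching costs and smoothness constraints, but the disparity-to-depth conversion is exactly this.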
Again, with Deep Learning there are methods available.

That's pretty much it, for now.
Again, the goal was to give a broad overview of the capabilities and applications of computer vision.
Furthermore, I am completely aware that this list is not at all exhaustive and that some of you would like to dig deeper into the how.
As such, we're preparing a set of chapters dedicated to some of the aforementioned algorithms, in which we will dig deeper into the Deep Learning techniques Overture is most passionate about.
In the next post, we will broach in an intuitive way the core elements which lay the foundations of modern computer vision algorithms.