This is the process on which capsule networks are based: inverse rendering.
Let’s take a look at capsules and how they go about solving the problem of providing spatial information.
When we look at some of the logic behind CNNs, we begin to notice where their architecture fails.
Take a look at this picture.
It doesn’t look quite right for a face, even though it has all the necessary components to make up a face.
We know that this is not how faces are supposed to look, but because CNNs only look for features in images and don’t pay attention to their pose, it’s hard for them to tell the difference between that face and a real face.
How a CNN would classify this image.
Capsule networks solve this problem by using groups of neurons that encode spatial information as well as the probability of an object being present.
The length of a capsule’s output vector represents the probability that the feature exists in the image, and the direction of the vector represents its pose information.
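To make the split between length and direction concrete, here is a minimal NumPy sketch. The name `capsule_output` and its values are invented for illustration; in a real network this vector would come from a trained capsule.

```python
import numpy as np

# Hypothetical output of one capsule (values invented for illustration).
capsule_output = np.array([0.3, 0.4, 0.0, 0.0])

# Length of the vector: read as the probability that the feature is present.
presence_probability = np.linalg.norm(capsule_output)

# Unit vector: the direction, which carries the pose information.
pose_direction = capsule_output / presence_probability

print(presence_probability)
print(pose_direction)
```

Scaling the vector up or down changes only the probability, while changing the ratios between its components changes only the pose.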
“A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part.
We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters.” (Source)
In computer graphics applications such as design and rendering, objects are often created by specifying a set of parameters from which they are rendered.
However, in capsule networks it’s the opposite: the network learns how to inversely render an image, looking at it and trying to predict what its instantiation parameters are.
It learns how to predict this by trying to reproduce the object it thinks it detected and comparing it to the labelled example from the training data.
By doing this it gets better and better at predicting the instantiation parameters.
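The Dynamic Routing Between Capsules paper by Sara Sabour, Nicholas Frosst, and Geoffrey Hinton proposed the use of two loss functions as opposed to just one: a margin loss for classification and a reconstruction loss for the decoder.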
The main idea behind this is to create equivariance between capsules.
This means moving a feature around in an image will also change its vector representation in the capsules, but not the probability of it existing.
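As a toy illustration of this idea (not the network itself), the sketch below treats a 2D vector as a capsule output and rotates it, standing in for a change in the feature's pose. The helper `rotate`, the vector values, and the angle are all assumptions chosen for the example.

```python
import numpy as np

def rotate(vector, angle_radians):
    """Rotate a 2D vector counter-clockwise by the given angle."""
    c, s = np.cos(angle_radians), np.sin(angle_radians)
    rotation = np.array([[c, -s], [s, c]])
    return rotation @ vector

# Hypothetical capsule output before the feature changes pose.
before = np.array([0.6, 0.3])
# The same feature after its pose changes (modelled here as a rotation).
after = rotate(before, np.pi / 6)

# The lengths (probability of existence) match; the directions (pose) differ.
print(np.linalg.norm(before), np.linalg.norm(after))
print(before, after)
```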
After lower level capsules detect features, this information is sent up towards higher level capsules that have a good fit with it.
How a Capsule Network would classify this face.
As seen in this picture, all of the pose parameters of the features are used to determine the final result.
Operations within a capsule
As you may already know, a traditional neuron in a neural net performs the following scalar operations:
Weighting of inputs
Sum of weighted inputs
Nonlinearity
These operations are slightly changed within capsules and are performed as follows:
Matrix multiplication of input vectors with weight matrices. This encodes important spatial relationships between low-level features and high-level features within the image.
Weighting of input vectors. These weights decide which higher-level capsule the current capsule will send its output to. This is done through a process of dynamic routing, which I’ll talk more about soon.
Sum of weighted input vectors. (Nothing special about this.)
Nonlinearity using the “squash” function. This function takes a vector and “squashes” it to have a length between 0 and 1 while retaining its direction.
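The four capsule operations above can be sketched in NumPy as follows. The squash formula is the one given in the paper; the sizes, the random inputs, and the uniform coupling coefficients are assumptions made only so the example runs (in the real network the coefficients come from dynamic routing).

```python
import numpy as np

def squash(s, eps=1e-8):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    squared_norm = np.sum(s ** 2)
    scale = squared_norm / (1.0 + squared_norm)
    return scale * s / np.sqrt(squared_norm + eps)

rng = np.random.default_rng(0)

num_inputs, in_dim, out_dim = 3, 8, 16               # hypothetical sizes
u = rng.normal(size=(num_inputs, in_dim))            # outputs of lower-level capsules
W = rng.normal(size=(num_inputs, out_dim, in_dim))   # one weight matrix per input
c = np.full(num_inputs, 1.0 / num_inputs)            # coupling coefficients (uniform here)

# 1. Matrix multiplication of input vectors with weight matrices.
u_hat = np.einsum("ioj,ij->io", W, u)                # shape (num_inputs, out_dim)
# 2. Weighting of the resulting vectors by the coupling coefficients.
weighted = c[:, None] * u_hat
# 3. Sum of the weighted vectors.
s = weighted.sum(axis=0)
# 4. Nonlinearity: squash.
v = squash(s)

print(v.shape, np.linalg.norm(v))
```

Short vectors are squashed to a length near 0 and long vectors to a length just under 1, which is what lets the length be read as a probability.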
Dynamic Routing Between Capsules
Process of Dynamic Routing.
In this process of routing, lower-level capsules send their output to the higher-level capsules that “agree” with it.
For each higher capsule that can be routed to, the lower capsule computes a prediction vector by multiplying its own output by a weight matrix.
If the prediction vector has a large scalar product with the output of a possible higher capsule, there is top-down feedback which has the effect of increasing the coupling coefficient for that higher-level capsule and decreasing it for the others.
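A minimal NumPy sketch of this routing loop follows. It assumes the prediction vectors have already been computed, uses three routing iterations as in the paper, and runs on hand-made prediction vectors chosen so that every lower capsule agrees about higher capsule 0 while the predictions for higher capsule 1 cancel out. The function and variable names are illustrative, not taken from any library.

```python
import numpy as np

def squash(s, eps=1e-8):
    """Squash each vector along the last axis to a length in [0, 1)."""
    squared_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    scale = squared_norm / (1.0 + squared_norm)
    return scale * s / np.sqrt(squared_norm + eps)

def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat: prediction vectors, shape (num_lower, num_higher, dim)."""
    num_lower, num_higher, _ = u_hat.shape
    b = np.zeros((num_lower, num_higher))           # routing logits, start at zero
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                      # coupling coefficients per lower capsule
        s = np.sum(c[:, :, None] * u_hat, axis=0)   # weighted sum for each higher capsule
        v = squash(s)                               # higher-capsule outputs
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)  # agreement = scalar product
    return v, c

# Four lower capsules, two higher capsules, 4D prediction vectors.
u_hat = np.zeros((4, 2, 4))
u_hat[:, 0, :] = [1.0, 2.0, 0.0, 0.0]       # all lower capsules agree on capsule 0
u_hat[0::2, 1, :] = [0.0, 0.0, 1.0, 1.0]    # predictions for capsule 1 disagree...
u_hat[1::2, 1, :] = [0.0, 0.0, -1.0, -1.0]  # ...and cancel each other out

v, c = dynamic_routing(u_hat)
print(c)  # coupling to higher capsule 0 has grown above the initial 0.5
```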
Architecture of Capsule Network on MNIST
CapsNet Architecture.
Encoder
The Encoder takes the image input and learns how to represent it as a 16-dimensional vector which contains all the information needed to essentially render the image.
Conv Layer — Detects features that are later analyzed by the capsules.
As proposed in the paper, contains 256 kernels of size 9x9x1.
Primary(Lower) Capsule Layer — This layer is the lower level capsule layer which I described previously.
It contains 32 different capsules, and each capsule applies eight 9x9x256 convolutional kernels to the output of the previous convolutional layer and produces an 8D vector output.
Digit(Higher) Capsule Layer — This layer is the higher level capsule layer which the Primary Capsules would route to(using dynamic routing).
This layer outputs 16D vectors that contain all the instantiation parameters required for rebuilding the object.
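The layer sizes above can be checked with a little shape arithmetic. This sketch assumes no padding, a stride of 1 in the first convolution, and a stride of 2 in the primary capsule layer; the strides are taken from the paper rather than from the description above, and `conv_output_size` is a hypothetical helper.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial size after a convolution with no padding."""
    return (input_size - kernel_size) // stride + 1

image_size = 28                                     # MNIST images are 28x28
conv1_size = conv_output_size(image_size, 9, 1)     # assumed stride 1
primary_size = conv_output_size(conv1_size, 9, 2)   # assumed stride 2

primary_channels = 32   # the 32 capsule types in the primary layer
primary_dim = 8         # eight convolutional kernels per capsule, one output component each
num_primary_capsules = primary_size * primary_size * primary_channels

num_digit_capsules = 10  # one per digit class
digit_dim = 16           # size of each digit capsule's output vector

# One primary_dim x digit_dim weight matrix per (primary capsule, digit capsule) pair.
routing_weights = num_primary_capsules * num_digit_capsules * primary_dim * digit_dim

print(conv1_size)            # 20
print(primary_size)          # 6
print(num_primary_capsules)  # 1152
print(routing_weights)       # 1474560
```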
The decoder takes the 16D vector from the Digit Capsule and learns how to decode the instantiation parameters given into an image of the object it is detecting.
The decoder is trained with a Euclidean distance loss function that measures how closely the reconstructed image matches the actual image it is being trained on.
This encourages the capsules to keep, inside their vectors, only information that helps in recognizing digits.
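A rough sketch of how the two loss functions mentioned earlier might be combined is shown below. The text above only describes the Euclidean distance (reconstruction) loss; the margin loss and the constants 0.9, 0.1, 0.5, and 0.0005 are taken from the paper, and the function names and example values are assumptions made for illustration.

```python
import numpy as np

def margin_loss(lengths, target_one_hot, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss on the digit-capsule lengths (constants as in the paper)."""
    present = target_one_hot * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - target_one_hot) * np.maximum(0.0, lengths - m_minus) ** 2
    return np.sum(present + absent)

def reconstruction_loss(reconstruction, image):
    """Sum of squared differences between the decoder output and the input pixels."""
    return np.sum((reconstruction - image) ** 2)

def total_loss(lengths, target_one_hot, reconstruction, image, recon_weight=0.0005):
    """Margin loss plus a down-weighted reconstruction loss."""
    return (margin_loss(lengths, target_one_hot)
            + recon_weight * reconstruction_loss(reconstruction, image))

# Invented example: the true digit is 3, but its capsule is only 0.4 long.
target = np.zeros(10)
target[3] = 1.0
lengths = np.full(10, 0.05)
lengths[3] = 0.4

image = np.zeros(784)                # flattened 28x28 input image
reconstruction = np.full(784, 0.1)   # hypothetical decoder output

print(total_loss(lengths, target, reconstruction, image))
```

The reconstruction term is scaled down so that it acts as a regularizer and does not dominate the classification loss during training.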
The decoder is a really simple feed-forward neural net that is described below.
Fully Connected (Dense) Layer 1
Fully Connected (Dense) Layer 2
Fully Connected (Dense) Layer 3: final output with 784 values, one per pixel of the reconstructed 28x28 image
Why don’t we use Capsule Networks?
While CapsNet has achieved state-of-the-art performance on simple datasets such as MNIST, it struggles on more complex data such as that found in CIFAR-10 or ImageNet.
This is because the excess information found in such images throws off the capsules.
Capsule nets are still in a research and development phase and not reliable enough to be used in commercial tasks as there are very few proven results with them.
However, the concept is sound and more progress in this area could lead to the standardization of Capsule Nets for deep learning image recognition.
If you enjoyed my article or learned something new, make sure to:
Connect with me on LinkedIn.
Send me some feedback and comments (aryanmisra@outlook.
Check out the original paper in which this idea was proposed.