No, right.

We need a feature representation of each object present in the image that separates it from other objects and brings similar objects closer, this is where encoder comes into the picture.

It converts the color image representation to some latent space representation in which the features of the same object are closer and that of distinct objects are far away from each other.

Obviously, the encoder needs to be trained.

Once we have features in some latent space which separates distinct objects we need to propagate the information to pixel level and this where decoder comes into play.

Let us understand how graphs can be useful in semantic segmentation.

Graphs come into play once the feature vectors are extracted using an encoder.

Consider the images in Fig2.

Images are in order actual image, ground truth, segmentation using FCN, segmentation using GCU respectively from top to bottom.

In this image, the probability of pillows and bed class comes out to be very close when FCN is used.

Hence you can see in Fig2 pillow and bed are merged.

Now suppose a graphical representation follows an encoder.

The feature vector representation of the image that results from the encoder is forced to divide into as many as 8 regions (i.

e.

graph of 8 vertices) ] if I say in terms of loss the loss is higher if a vertex is empty i.

e.

no pixels are assigned to it.

In the initial stages of training in this case also pillow and bed will be assigned to the same class.

But as we train the network further, each feature vector is assigned to a vertex i.

e.

no vertex is empty.

The probable regions in which the image will be segmented in graph representation are the regions which are in the ground truth.

Due to which the pillow and bed will be segmented into two different regions by the end of the training.

The features assigned to a vertex are processed further by multiplying with some weights and the resulting feature vector along with the feature vector from the encoder are further used for segmentation.

Thus the graphical representation further improved the feature representation of the image.

One obvious thing that can happen is over-segmentation of the image when dividing it into regions.

But the oversegmented regions will recombine after further operations like convolutions.

Now the question is how to convert the grid-based representation of the image to this graphical representation and learn the parameters of graphical representation and the answer is graph convolution units(GCU).

Fig2Graph Convolution Units (GCU)Just like convolution operate on grid-like structures GCU operates on a graph-like structure.

There are 3 major steps in GCUGraph Projection: In this step, the grid-like representation of an image is converted to the graphical representation.

The graph is parameterized by the following parameters:V: Number of the vertex in the graph which implies the number of regions in which the image will be segmented.

W: Feature vectors which represent the region.

The shape is (d, V), where d is the dimension of feature vectorsVariance: It is the variance along each dimension across all the pixels assigned to a particular vertex.

The shape is (d, V), where d is the dimension of the feature vector.

V is a fixed and W and variance are learned during training.

Suppose there is a 2-D feature map of the image with height H and width W and each element has a dimension d.

Probability of each feature vector belonging to each vertex is calculated, which results in a probability matrix Q.

The equation below is used to calculate probabilities:where xᵢⱼ is the feature vector of the iᵗʰ row and jᵗʰ column of the 2-D feature map, wₖ is feature representing kᵗʰ region (vertex) and σₖ is variance along all dimensions of vertex k.

Now, feature encoding of all the vertices is calculated by taking the weighted average of residuals.

More is the residual less is its contribution in calculating the encoded feature.

Equations given below are used:where zₖ is the encoded feature.

The adjacency matrix is calculated by ZᵀZ which gives cosine similarity between different vertex.

In total 3 things are calculated in this stepProbability matrix Q, shape ( HW, d)Encoded feature Z, shape ( d, V)Adjacency matrix A, shape ( V, V)- A representation of similarity between the regions thus it captures the long-range dependencies in an image.

Overall pipeline for GCU2.

Graph Convolution: This step is analogous to forward step of convolution i.

e.

convolution is done on the generated graph.

The equation given below is used:where W g is the weight matrix of shape (d, dₒᵤₜ).

As you can see there is an Adjacency matrix, A, in the equation, long-range dependencies are considered when calculating new encoded features.

So the new encoded features depend on all the regions(A)and the current encoded features (Z).

For reading about graph convolution refer to this article and this.

A very easy to understand and just enough information is given in these articles.

3.

Graph Re-projection: Finally, the graph is converted back to the grid-like structure to visualize or do further operations.

The equation is given belowArchitecture and ImplementationThe architecture used is pretrained ResNet 50/101, dilations are added to the last two layers thus output is downsampled by 8.

It is followed by GCU.

In the original implementation, outputs of 4 GCUs are concatenated to the output of ResNet as shown in the diagram below.

d is 1024 and dₒᵤₜ is 256 in this case.

The depth of the output after concatenating will be 1024( from ResNet50) + 256×4 = 2048.

The concatenated output is upsampled using bilinear interpolation.

Next, convolution layers are used to assign pixels to different classes.

The loss function used to minimize the error is Negative log-likelihood loss function.

Fig3 Training loss curvePytorch implementation by me is available here.

I have given the implementation details below.

Dataset used is ADE20K.

ResNet50 dilated is used, pretrained on ADE20K available here.

The depth of output of ResNet50 is 2048.

GCUs follow the ResNet50.

In the paper, 4 GCU units were concatenated but I have used only 1 GCU with 16 vertices due to limited computing power.

I have written a generalized code so you can easily modify the code for 4 GCUs.

For knowing in more detail about GCU implementation refer to my next post.

d is 2048 and dₒᵤₜ is 256 in this case.

The depth of the output after concatenating will be 2048( from ResNet50) + 256= 2304It is followed by bilinear upsampling operation and then 1 convolution layer.

Images are resized to the size of 512×512 before feeding to the network.

Due to limited computation, I have used the batch size of 1 and trained for 120 epochs each having 1000 iterations.

SGD was used with a momentum of 0.

9.

The learning rate starts with 0.

01 and it decays as the training proceedsCurrently, the model is using 2 GPUs.

One GPU is dedicated to ResNet and other is for all the other computations like GCU, upsampling and convolutions.

Training loss plot is shown in Fig3.

After training with given hyperparameters, training accuracy was about 78%.

This post is based on the paper by Yin Li and Abhinav Gupta.

Any doubts or suggestions feel free to ping me ????.

Also, find me on Twitter and Linkedin.

Adios!!.