Now we are finally ready to place some of the puzzle pieces together.
The Siamese GAN
As we previously stated, a Generative Adversarial Network can be thought of as a Generator and a Discriminator working together to generate realistic images from a particular collection of images, or domain.
The Generator takes a random noise vector as input and “decodes” it to an image, while the Discriminator takes an image as input and outputs a score relative to how realistic the image looks.
Now let’s try and use a Siamese Network as the Discriminator.
What we now have is a Decoder-Encoder architecture that takes a vector as input and produces a vector as output.
This structure is similar to that of an AutoEncoder (Encoder-Decoder), but with the two components swapped.
Figure: Siamese GAN Architecture
How can this network possibly train?
In the case of a Siamese Network trained to recognize faces, the total number of classes is the number of different faces our algorithm must recognize.
Thus in that case, we expect the network to organize the latent space (the output of the network) in such a way that all the vectors encoding the same face are close together, while being far away from all the others.
In the case of GANs however, the total number of classes is 2: fake images created by the Generator and real images.
Our new Discriminator objective is then to arrange the output vectors of the Siamese Network so that real images are encoded close to one another, while fake images are kept far from them.
The Generator on the other hand tries to minimize the distance between real and fake image vectors, or in other words, wants real and fake encoded as the same class.
This new objective reproduces an adversarial behaviour very similar to the “traditional” case, while using a different kind of adversarial loss function.
Now that we understand the basics of the idea, let’s try to iterate on it and improve it.
In our loss function, we considered the distance between vectors.
But what is distance, really?
In our case, we evaluated distances between two vectors that could both move, iteration after iteration, in the vector space output by the Discriminator.
Considering this “relativity” of distance, we can make much more robust measurements by calculating distances from a fixed point in space.
This problem is taken into account by the Triplet Loss for the Siamese Network, which evaluates distances from a neutral (anchor) point.
Figure: Triplet Loss
Here d stands for squared Euclidean distance, a is the “anchor” point (we will consider it fixed in space), n is the negative point and p is the positive one.
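For reference, the standard form of the triplet loss in this notation can be written as follows. The symbol m for the margin is an assumed name, not taken from the text above.

```latex
\mathcal{L}(a, p, n) = \max\bigl(d(a, p) - d(a, n) + m,\ 0\bigr),
\qquad d(x, y) = \lVert x - y \rVert_2^2
```

The loss is zero once the negative point is at least m further from the anchor than the positive point.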
In our case, since we only have to deal with two classes (with the end goal of making one class indistinguishable from the other), we choose a fixed point in space before training and use it as our neutral point.
In my testing I used the origin point (the vector whose values are all zeros).
To reach a better understanding of our latent space and how we want to organize it for our objective, let’s visualize it.
Remember that we are projecting a space with VecLen dimensions onto a 2D plane.
Figure: Initial Condition
At the beginning of training, our images B and G(z) (the images generated by the Generator from noise vector z) are randomly encoded by the Discriminator in our vector space.
Figure: Vector Space during Training
During training, the Discriminator pushes the vectors of B closer to the fixed point, while trying to keep the encodings of G(z) at an arbitrary distance (Margin) from the point.
The Generator on the other hand wants G(z) vectors to be closer to the fixed point and to B vectors as a consequence.
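The dynamics described above can be sketched in a few lines of plain Python. This is only one plausible reading of the objective, not the exact loss used in my testing: the function names, the margin value of 1.0 and the toy vectors are all assumptions made for illustration, and the Discriminator encodings are passed in as ready-made lists instead of coming from a real network.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def discriminator_loss(real_vec, fake_vec, anchor, margin=1.0):
    """Triplet-style loss with a fixed anchor (assumed form).

    Real encodings are the positive point, fake encodings the negative one.
    The loss reaches zero once fakes are at least `margin` further from the
    anchor than reals.
    """
    d_pos = squared_distance(anchor, real_vec)
    d_neg = squared_distance(anchor, fake_vec)
    return max(d_pos - d_neg + margin, 0.0)


def generator_loss(fake_vec, anchor):
    """The Generator wants its encodings as close to the anchor as possible."""
    return squared_distance(anchor, fake_vec)


VEC_LEN = 4                      # hypothetical latent size
anchor = [0.0] * VEC_LEN         # the origin, as in the text
real_vec = [0.1, -0.2, 0.0, 0.1]   # toy encoding of a real image B
fake_vec = [1.0, 0.5, -0.5, 0.0]   # toy encoding of a generated image G(z)

d_loss = discriminator_loss(real_vec, fake_vec, anchor)
g_loss = generator_loss(fake_vec, anchor)
# If the Generator manages to land exactly on the real encoding,
# the Discriminator loss equals the margin.
d_loss_fooled = discriminator_loss(real_vec, real_vec, anchor)
```

In this toy setup the fake encoding is already far enough from the origin, so the Discriminator is satisfied, while the Generator still has a non-zero loss pulling G(z) towards the fixed point.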
Finally, here are some results from the Siamese GAN.
Figure: Random Flower samples
Now, to really understand why the Siamese GAN is extremely similar to a traditional GAN, we need to consider an edge case: what if the Siamese Discriminator outputs a 1-dimensional vector (VecLen=1)?
Now we have the traditional GAN Discriminator outputting a single value: if this scalar is close to a fixed number (our 1-dimensional point), let’s say 1, the image looks realistic; if it is far from it, the image looks fake.
This is equivalent to keeping the score close to 1 for real and close to 0 for fake.
Thus, the loss now becomes the Squared Error, typical of LSGANs (Least Squares GANs).
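To make the reduction explicit: with VecLen=1 and the fixed point at 1, the squared Euclidean distance collapses to a squared error. The targets of 1 for real and 0 for fake follow the text above; the expectation notation and the 1/2 factors follow the usual LSGAN convention and are not something stated here.

```latex
d\bigl(1, D(x)\bigr) = \bigl(D(x) - 1\bigr)^2

\mathcal{L}_D = \tfrac{1}{2}\,\mathbb{E}_{x}\bigl[(D(x) - 1)^2\bigr]
              + \tfrac{1}{2}\,\mathbb{E}_{z}\bigl[(D(G(z)) - 0)^2\bigr]

\mathcal{L}_G = \tfrac{1}{2}\,\mathbb{E}_{z}\bigl[(D(G(z)) - 1)^2\bigr]
```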
So, nothing new here.
Well, not really.
Encoding an image to a latent vector can sometimes be quite useful.
Let’s talk about one practical example.
Image-to-Image Translation
A recent paper introduced TraVeLGAN, a new approach to the problem of unpaired image-to-image translation.
Unlike other methods (CycleGAN for example), TraVeLGAN doesn’t rely on pixel-by-pixel differences between images (it doesn’t use any cycle consistency constraint), which allows image translation between wildly different domains with hardly anything in common.
To achieve that, a traditional Generator-Discriminator architecture is used together with a separate Siamese Network.
Figure: TraVeLGAN Architecture
Let’s say we must turn images from domain A into images belonging to domain B.
We call the images translated by the Generator G(A).
The Siamese Network then encodes images in latent space and aims at reducing the distances between transformation vectors of image pairs.
With S(X) as the vector encoding of X and A1, A2 two images from domain A, the Network must encode vectors such that:
S(A1) - S(A2) is similar to S(G(A1)) - S(G(A2))
where a similarity metric such as Cosine Distance is used.
Doing that, the Siamese Network passes information (in the form of gradient) to the generator on how to preserve the “content” of the original images in the generated ones.
All of this happens while the Discriminator tells the Generator how to create more realistic images that resemble the ones from domain B.
The end result is a Generator that generates images in the style of domain B with somewhat preserved content from domain A (in the case of two completely unrelated domains, some sort of correspondence is maintained).
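The transformation-vector constraint can be sketched in plain Python as below. It is a minimal illustration under stated assumptions, not the TraVeLGAN reference implementation: `travel_loss` and the helper names are hypothetical, the encodings are hand-written 2D lists standing in for the outputs of S, and only a single image pair is considered.

```python
import math


def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]


def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity; eps guards against division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def travel_loss(s_a1, s_a2, s_ga1, s_ga2):
    """Cosine distance between the transformation vectors of one pair.

    s_a1, s_a2:   encodings S(A1), S(A2) of two source images
    s_ga1, s_ga2: encodings S(G(A1)), S(G(A2)) of their translations
    """
    v_source = subtract(s_a1, s_a2)
    v_translated = subtract(s_ga1, s_ga2)
    return cosine_distance(v_source, v_translated)


# Toy encodings: the source pair differs along the first axis.
s_a1, s_a2 = [1.0, 0.0], [0.0, 0.0]

# Translations that differ along the same axis: relationship preserved.
aligned = travel_loss(s_a1, s_a2, [5.0, 3.0], [3.0, 3.0])

# Translations that differ along the other axis: relationship lost.
orthogonal = travel_loss(s_a1, s_a2, [3.0, 5.0], [3.0, 3.0])
```

Note that the translated encodings can sit anywhere in latent space; only the direction of the difference between the two images in a pair is compared.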
After this brief introduction (read the paper for more info!), how can we use our Siamese Discriminator with the TraVeLGAN approach?
We simply remove the ad-hoc Siamese Network and make the already-in-use Discriminator output a vector. We can then apply the previously discussed loss function to tell the Generator how realistic its generated images are, and we can also compute the distances between transformation vectors of image pairs in latent space using Cosine Distance.
Figure: Final Architecture with Siamese Discriminator
Summing everything up, the Discriminator encodes images into vectors such that:
1. Images with lower Euclidean Distances from our fixed point (the origin) have a more realistic Style
2. Transformation vectors of encoded image pairs, D(A1) - D(A2) and D(G(A1)) - D(G(A2)) with D(X) the Discriminator’s encoding of X, have low Cosine Distances from one another, preserving Content
Knowing that the Angle and Magnitude of vectors are independent features, the Discriminator is able to learn a vector space that satisfies both constraints.
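A small sketch can show why the two constraints do not fight each other. The split into a "style" term (squared distance from the origin) and a "content" term (cosine distance between transformation vectors) follows the two points above, but the function name `generator_terms`, the averaging over the pair and the toy vectors are assumptions made for illustration.

```python
import math


def squared_norm(v):
    """Squared Euclidean distance from the origin."""
    return sum(a * a for a in v)


def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity; eps guards against division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(squared_norm(u))
    norm_v = math.sqrt(squared_norm(v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def generator_terms(d_a1, d_a2, d_ga1, d_ga2):
    """Return (style, content) for one image pair (assumed formulation).

    style:   mean squared distance of the translated encodings from the origin
    content: cosine distance between source and translated transformation vectors
    """
    style = 0.5 * (squared_norm(d_ga1) + squared_norm(d_ga2))
    v_source = [a - b for a, b in zip(d_a1, d_a2)]
    v_translated = [a - b for a, b in zip(d_ga1, d_ga2)]
    content = cosine_distance(v_source, v_translated)
    return style, content


d_a1, d_a2 = [1.0, 0.0], [0.0, 1.0]      # toy encodings of A1, A2
d_ga1, d_ga2 = [2.0, 0.0], [0.0, 2.0]    # toy encodings of G(A1), G(A2)

style_far, content_far = generator_terms(d_a1, d_a2, d_ga1, d_ga2)

# Shrink both translated encodings towards the origin by the same factor.
shrunk = [[0.5 * x for x in v] for v in (d_ga1, d_ga2)]
style_near, content_near = generator_terms(d_a1, d_a2, shrunk[0], shrunk[1])
```

Scaling the translated encodings towards the origin lowers the style term while leaving the content term unchanged, which is the independence of magnitude and angle that the Discriminator relies on.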
In my testing I used U-Net with skip connections as the Generator and a traditional fully convolutional Siamese Network as the Discriminator.
Furthermore, I used Attention in both the Generator and Discriminator, together with Spectral Normalization on the convolutional kernels to keep the training stable.
During training, TTUR (different learning rates for Discriminator and Generator) was used.
Here are some results, trained on apple and orange images from ImageNet:
Figure: Apple to Orange Image Translation
Here’s a high definition sample (Landscape to Japanese Print (ukiyo-e)):
Figure: Landscape to Ukiyo-e
Conclusion
Encoding images in latent space is very useful: we have shown that making the Discriminator output a vector instead of a single value, and changing the loss function accordingly, can lead to a more flexible objective landscape.
A task like image-to-image translation can be accomplished using only a single Generator and Discriminator, without any added networks and without the cycle consistency constraint, which relies on pixel-by-pixel differences and can’t handle visually very different domains.
Lots of other applications are to be explored, like working with labelled images and more.
Thank you for your precious attention, have fun!