Meta-learning of Adversarial Generative models

Sharmistha Chatterjee
May 29

Motivation

Convolutional neural networks have been successful in generating realistic human head images when trained on a large dataset of images of a single person.
However, in many practical scenarios, such personalized talking head models need to be learnt from a few image views of a person, sometimes limited to a single image.
To address this limitation, Samsung AI Research and the Skolkovo Institute of Science and Technology proposed a system with few-shot capability. With its high-capacity generators and discriminators, the system:
• Performs lengthy meta-learning on a large dataset of videos.
• Frames few- and one-shot learning of neural talking head models of previously unseen people as adversarial training problems.
• Initializes the parameters of both the generator and the discriminator in a person-specific way.
• Trains quickly and efficiently on just a few images.
• Tunes tens of millions of parameters.
• Scales to generating personalized talking head models of new people and even of portrait paintings.

Insights of Talking head models

The system is designed to generate personalized photorealistic talking head models, which have practical applications in telepresence, video conferencing, multi-player games and the animation industry.
In particular, it is meant to:
• Synthesize plausible video sequences of the speech expressions and mimics of a particular individual.
• Synthesize photorealistic personalized head images from a set of face landmarks, in order to animate the model.
However, the challenges include:
• Human heads have high photometric, geometric and kinematic complexity. This complexity stems not only from modeling faces but also from modeling the mouth cavity, hair, and garments.
• The human visual system is acutely sensitive to even minor mistakes in the appearance modeling of human heads.
Existing approaches address these challenges by synthesizing articulated head sequences through warping a single or multiple static frames.
However, warping-based systems are limited in the amount of motion, head rotation, and disocclusion they can handle without noticeable artifacts.
They therefore fail to achieve full control over the head rotation in the resulting video, and so fall short of a fully-fledged talking head system.
Approaches that instead synthesize frames directly with adversarially trained ConvNets have to train large neural networks, where both the generator and the discriminator have tens of millions of parameters for each talking head.
They require a video several minutes long or a large dataset of photographs, as well as hours of GPU training, to create a new personalized talking head model.

System Design

The few-shot system addresses these problems with a high-performing adversarial network that has the following properties:
• It creates talking head models from a handful of photographs (termed few-shot learning) and with limited training time. A single photograph (one-shot learning) already gives a reasonable result, and adding a few more photographs increases the fidelity of personalization.
• It is built from deep ConvNets that synthesize video frames directly through a sequence of convolutional operations, rather than by warping (transforming an image through operations such as rotation or scaling) a single or multiple static frames.
• It handles a large variety of poses, going beyond the abilities of warping-based systems.
The system acquires its few-shot learning ability through extensive pre-training (meta-learning) on a large corpus of talking head videos of different speakers with diverse appearance.
During meta-learning, the system simulates few-shot learning tasks and learns to transform landmark positions into realistic-looking personalized photographs, given a small training set of images of that person.
Next, a handful of photographs of a new person sets up a new adversarial learning problem with a high-capacity generator and discriminator.
Both the generator and the discriminator are pre-trained via meta-learning.
As a result, this adversarial problem converges after a few training steps to a state that generates realistic and personalized images.
The following figure shows how facial landmarks are used to generate multiple realistic-looking video frames from an input photograph.

Architecture

The foundations of the one-shot system design are taken from recent progress in generative modeling of images.
The architecture uses adversarial training and, more specifically, the ideas behind conditional discriminators, including projection discriminators.
The meta-learning phase uses the adaptive instance normalization mechanism, which has proven useful in large-scale conditional generation tasks.
The model-agnostic meta-learner (MAML) uses meta-learning to obtain the initial state of an image classifier, from which it can quickly converge to image classifiers of unseen classes, given few training samples.
Related approaches further use adversarially trained networks to generate additional examples for classes unseen at the meta-learning stage, whereas here similar adversarial objectives are used to train the image generation models themselves.
To summarize, adversarial fine-tuning is combined with the meta-learning framework.
The former is applied after the initial state of the generator and discriminator networks has been obtained via the meta-learning stage.
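
To make the adaptive instance normalization mechanism mentioned above more concrete, here is a minimal sketch in plain Python. It normalizes a single feature channel and then re-scales and re-shifts it with externally supplied parameters. The function name `adaptive_instance_norm`, the toy activations, and the `scale` and `shift` values are all assumptions made for illustration; in the real system these parameters would be predicted by a network, and the operation would run over full feature maps.

```python
import math


def adaptive_instance_norm(channel, scale, shift, eps=1e-5):
    """Normalize one feature channel to zero mean and unit variance,
    then apply an externally supplied scale and shift."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    std = math.sqrt(var + eps)
    return [scale * (v - mean) / std + shift for v in channel]


# Hypothetical activations of one channel (a flattened feature map).
features = [1.0, 2.0, 3.0, 4.0]

# Hypothetical adaptive parameters; these numbers are invented for the example.
styled = adaptive_instance_norm(features, scale=2.0, shift=0.5)
```

After the call, the channel has a mean close to `shift` and a standard deviation close to `scale`, which is how the supplied parameters steer the output of a convolutional layer.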

Meta Learning Architecture for few-shot learning with generative models

The above figure depicts a typical meta-learning architecture, which involves an embedder network that maps head images (with estimated face landmarks) to embedding vectors containing pose-independent information.
The generator network maps input face landmarks into output frames through a set of convolutional layers, which are modulated by the embedding vectors.
During meta-learning, a set of frames from the same video is passed through the embedder, and the resulting embeddings are averaged and used to predict the adaptive parameters of the generator.
Then the landmarks of a different frame are passed through the generator, and the resulting image is compared with the ground truth.
The objective function includes perceptual and adversarial losses.
The adversarial loss is implemented via a conditional projection discriminator.
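
The data flow of one meta-learning episode can be sketched as follows. This is only an illustration under stated assumptions: `sample_episode` and `average_embeddings` are hypothetical helper names, the episode size of 8 frames is arbitrary, and the three short vectors stand in for real embedder outputs, which would come from a ConvNet.

```python
import random


def sample_episode(num_frames, k, rng):
    """Pick K frame indices for the embedder plus one different frame
    whose landmarks are fed to the generator."""
    indices = rng.sample(range(num_frames), k + 1)
    return indices[:k], indices[k]


def average_embeddings(embeddings):
    """Element-wise mean of K embedding vectors of equal length."""
    k = len(embeddings)
    return [sum(column) / k for column in zip(*embeddings)]


rng = random.Random(0)
context_frames, target_frame = sample_episode(num_frames=100, k=8, rng=rng)

# Hypothetical embedder outputs for three frames of the same video.
frame_embeddings = [
    [0.2, 0.4, 0.9],
    [0.4, 0.0, 1.1],
    [0.3, 0.2, 1.0],
]
video_embedding = average_embeddings(frame_embeddings)
```

Because the target frame is sampled without replacement, it is never one of the frames the embedder has seen, which mirrors the "different frame" requirement in the description above.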
In the meta-learning stage of one- and few-shot learning, the following three networks are trained:
• The embedder E takes a video frame and an associated landmark image and maps these inputs into an N-dimensional vector. Here, φ denotes the embedder's network parameters, which are learned in the meta-learning stage. During meta-learning, the aim is to learn φ such that the N-dimensional vector contains video-specific information (such as the person’s identity) that is invariant to the pose and mimics in a particular frame.
• The generator G takes the landmark image for a video frame not seen by the embedder, together with the predicted video embedding, and outputs a synthesized video frame. The generator is trained to maximize the similarity between its outputs and the ground truth frames. All parameters of the generator are split into two sets: the person-generic parameters ψ and the person-specific parameters ψ̂_i. During meta-learning, only ψ are trained directly, while ψ̂_i are predicted from the embedding using a trainable projection matrix.
• The discriminator D takes a video frame, an associated landmark image y_i(t) and the index i of the training sequence. θ, W, w0 and b denote the learnable parameters associated with the discriminator. The discriminator contains a ConvNet part that maps the input frame and the landmark image into an N-dimensional vector. Based on the output of its ConvNet part and the parameters W, w0 and b, the discriminator predicts a single scalar r (the realism score), which indicates whether the input frame is a real frame of the i-th video sequence and whether it matches the input pose.
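
Two of the computations in the list above are simple enough to sketch directly: predicting the person-specific generator parameters from the embedding with a projection matrix, and turning the discriminator's ConvNet output into a realism score. The score below follows the projection-discriminator form r = v · (W_i + w0) + b, where v is the ConvNet output and W_i is the column of W for the i-th video; that exact form is taken from the referenced paper as I understand it, so treat it as an assumption rather than something this article spells out. All function names and numbers are invented for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def predict_person_specific_params(projection, embedding):
    """Person-specific parameters as a linear projection of the embedding."""
    return matvec(projection, embedding)


def realism_score(v, w_i, w0, b):
    """Assumed projection-discriminator score: v . (w_i + w0) + b."""
    combined = [wi + w for wi, w in zip(w_i, w0)]
    return dot(v, combined) + b


# Hypothetical 3x2 projection matrix and 2-dimensional video embedding.
projection = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
embedding = [0.5, -1.0]
person_specific = predict_person_specific_params(projection, embedding)

# Hypothetical ConvNet output v and discriminator parameters for video i.
score = realism_score(v=[1.0, 2.0], w_i=[0.5, -0.5], w0=[0.5, 1.0], b=0.25)
```

With these toy values, `person_specific` is `[0.5, -2.0, -0.5]` and `score` is `2.25`. In this assumed form, the per-video vector `w_i` carries the information specific to one training sequence, while `w0` and `b` are shared across sequences.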

Metrics

For the quantitative comparisons, all the models are fine-tuned on few-shot learning sets of size T for a person not seen during the meta-learning (or pre-training) stage.
After the few-shot learning, the evaluation is performed on the hold-out part of the same sequence (the self-reenactment scenario), ensuring that the fine-tuning and hold-out parts do not overlap.
Multiple comparison metrics are used to evaluate the photo-realism and identity preservation of generated images:
• Fréchet Inception Distance (FID), measuring perceptual realism.
• Structured Similarity (SSIM), measuring low-level similarity to the ground truth images.
• Cosine similarity (CSIM) between embedding vectors of a state-of-the-art face recognition network, measuring identity mismatch.
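
Of the three metrics, CSIM is the easiest to illustrate, since it is an ordinary cosine similarity applied to face-recognition embeddings. The sketch below shows only that final step; the function name `cosine_similarity` and the toy vectors are assumptions for the example, and real embeddings would come from a face recognition network.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


# Toy stand-ins for face embeddings of a generated and a ground truth image.
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0, so a higher CSIM indicates that the generated face is closer in identity to the real one.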

Examples

The figure below shows the results of the best models obtained on the VoxCeleb2 dataset.
The leftmost column gives the number of training frames, T, and the source column shows an example training frame.
The next columns show the ground truth image and the results for the Ours-FF feed-forward model and for the Ours-FT model before and after fine-tuning.
While the feed-forward variant allows fast (real-time) few-shot learning of new avatars, fine-tuning ultimately provides better realism and fidelity.

Better realism with fine-tuning using few-shot learning

The figure below shows one-shot learning applied to bring still photographs (taken from the VoxCeleb2 dataset) to life.
It shows puppeteering results for one-shot models learned from the photographs in the source column.

Puppeteering with few-shot learning

Comparative Results: Few-shot Learning with Others

The following figure shows comparative results on the VoxCeleb1 dataset, with one- and few-shot learning performed on a video of a person not seen during meta-learning or pre-training.
The number of training frames is equal to T (the leftmost column).
One of the training frames is shown in the source column.
The next columns show the ground truth image, taken from the test part of the video sequence, and the generated results of the compared methods.

Conclusion

The framework for meta-learning of adversarial generative models can train highly realistic virtual talking heads in the form of deep generator networks.
As little as a single photograph is enough to create a new model, while a model trained on 32 images achieves a perfect realism and personalization score in the authors' user study.
The key limitations of the system are its mimics representation and its lack of landmark adaptation.
Using landmarks from a different person leads to a noticeable personality mismatch.
Creating “fake” puppeteering videos (video synthesis of a certain person based on the face landmark tracks of a different person) without such a mismatch therefore requires some form of landmark adaptation.

References:
Few-Shot Adversarial Learning of Realistic Neural Talking Head Models: https://arxiv.