Disentanglement with Variational Autoencoder: A Review

Learning interpretable, factorized representations has been around in machine learning for quite some time. But with recent advances in deep generative models like the Variational Autoencoder (VAE), there has been an explosion of interest in learning such disentangled representations. Since the objective of any generative model is essentially to capture the underlying data-generative factors, a disentangled representation means a single latent unit being sensitive to variations in a single generative factor.

Since the vanilla VAE encourages the posterior distribution over the latent factors q(z|x) to be close to the isotropic Gaussian N(0, I), it promotes disentanglement of the latent generative factors. This is because the covariance ∑ of an isotropic Gaussian is the identity matrix I, meaning all the dimensions are independent. In the ELBO, this is promoted by the second term:

ELBO = E_q(z|x)[log p(x|z)] − KL(q(z|x) || p(z))

However, this learning pressure may not be enough for effective disentanglement, since in a VAE we also want to properly autoencode (reconstruct) our input signals, and the reconstruction loss (the first term) may be too strong compared to the second term. Inspired by this, [β-VAE] places a stronger constraint on the latent bottleneck via a weight β > 1 on the second term. Their objective function thus looks like this:

L = E_q(z|x)[log p(x|z)] − β · KL(q(z|x) || p(z))

As a result of the increased weight on the second term, however, reconstruction accuracy worsens. This raised an important research question for many researchers: "how to achieve better disentanglement without losing reconstruction ability?" The path towards an answer was greatly assisted by the [surgery of ELBO], where the second term (in expectation over the data) was decomposed as:

E_p(x)[KL(q(z|x) || p(z))] = I_q(x; z) + KL(q(z) || p(z))

Here, the first term is the index-code mutual information (MI) and the second term is the marginal KL to the prior.
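To make the β-weighting concrete, here is a minimal sketch of the β-VAE objective for the common case of a diagonal-Gaussian encoder, where KL(q(z|x) || N(0, I)) has a closed form. The function names and the NumPy setting are mine for illustration; setting β = 1 recovers the vanilla VAE loss.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.square(mu) + np.exp(logvar) - logvar - 1.0, axis=-1)

def beta_vae_loss(recon_log_lik, mu, logvar, beta=4.0):
    """Negative beta-ELBO: -(E[log p(x|z)] - beta * KL(q(z|x) || p(z)))."""
    return -(recon_log_lik - beta * gaussian_kl(mu, logvar))

# Example: a posterior already matching the prior incurs zero KL penalty,
# while a shifted posterior is penalized beta times more strongly.
print(gaussian_kl(np.zeros(3), np.zeros(3)))                     # 0.0
print(beta_vae_loss(0.0, np.ones(2), np.zeros(2), beta=1.0))     # vanilla VAE
print(beta_vae_loss(0.0, np.ones(2), np.zeros(2), beta=4.0))     # stronger bottleneck
```

Note how β scales only the KL term, which is exactly why larger β trades reconstruction quality for a tighter, more factorized latent code.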
This decomposition gave the perspective that it is actually the second term that matters more for learning disentangled representations, and that penalizing MI (more heavily than the regular ELBO does) might be the reason for poor reconstruction. Indeed, [InfoGAN] (not a VAE-based model) maximizes the same MI to achieve better disentanglement.

With this rationale, this paper [link] added a (−λ)-weighted KL(q(z) || p(z)) term to the regular ELBO. However, since KL(q(z) || p(z)) already exists inside the ELBO (via the decomposition above), they are actually minimizing a (λ + 1)-weighted KL(q(z) || p(z)) to encourage disentanglement. Note that [adversarialAE] also minimizes this KL (not KL(q(z|x) || p(z))), using an adversarial loss.

Going deeper, [TC-βVAE] further decomposes this marginal KL into the total correlation (TC) (first term) and a dimension-wise KL (second term):

KL(q(z) || p(z)) = KL(q(z) || ∏_j q(z_j)) + ∑_j KL(q(z_j) || p(z_j))

With this decomposition, they argue that TC (Watanabe, 1960), a popular measure of dependence among multiple random variables, is the most important term for learning disentangled representations, and hence penalize TC with a weight β. Their overall objective looks like:

L = E_q(z|x)[log p(x|z)] − I_q(x; z) − β · KL(q(z) || ∏_j q(z_j)) − ∑_j KL(q(z_j) || p(z_j))

Concurrently, the [dFactorising] paper also acknowledged the importance of TC for disentanglement and augmented the ELBO with this term under a (−λ) weight. Again, since TC already exists inside the ELBO, they are actually minimizing a (λ + 1)-weighted TC to encourage disentanglement.

The fundamental challenge, however, lies in estimating the aggregated posterior q(z), which depends on the entire dataset (not just a mini-batch). This led each of these works to take a different approach to estimating q(z), or any term involving it. For example, [dFactorising] used the density-ratio trick with a separate discriminator.

Overall, I believe disentanglement with VAEs will get even more interesting in the near future.

P.S. Depending on the response to this note, I plan to take a different perspective on analyzing "disentanglement with VAE" in my next note.
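As a small footnote to the TC discussion above: for a Gaussian q(z), total correlation has a closed form (sum of marginal entropies minus joint entropy), which gives a quick way to sanity-check the quantity these objectives penalize. This is a toy illustration of the definition, not the estimators used in the papers (which must handle non-Gaussian aggregated posteriors); the function name is mine.

```python
import numpy as np

def gaussian_total_correlation(cov):
    """TC of N(0, cov) = 0.5 * (sum_j log cov_jj - log det cov).
    It is zero iff cov is diagonal, i.e. iff the dimensions are independent."""
    sign, logdet = np.linalg.slogdet(cov)
    assert sign > 0, "covariance must be positive definite"
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

# Independent dimensions -> TC = 0; correlated dimensions -> TC > 0.
print(gaussian_total_correlation(np.eye(3)))
print(gaussian_total_correlation(np.array([[1.0, 0.8], [0.8, 1.0]])))
```

Driving this quantity towards zero is exactly what the β-weighted TC penalty encourages: an aggregated posterior that factorizes across latent dimensions.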