Features that Maximize Mutual Information: What Do They Look Like?

We can create latent features by maximizing mutual information, but what would they look like?

Jae Duk Seo · Feb 26

Image from Pixabay

I wish to thank Dr. François Fleuret, Dr. Devon Hjelm, and Dr. Yoshua Bengio for the amazing paper and reference material. Finally, I wish to thank my supervisor, Dr. Bruce, for his encouragement, patience, and helpful discussions.

Introduction

Thanks to recent developments such as MINE and DeepInfoMax, we are not only able to estimate the mutual information between two high-dimensional random variables, but also able to create latent features that maximize mutual information with the input data.
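As a rough illustration of the idea behind MINE, here is a minimal NumPy sketch of the Donsker–Varadhan lower bound, I(X;Z) ≥ E_p(x,z)[T(x,z)] − log E_p(x)p(z)[e^T(x,z)], evaluated with a fixed, hand-picked statistics function T. MINE itself parameterizes T with a neural network and trains it to tighten the bound; that training loop is omitted here, so this only shows why any T yields a valid lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated Gaussian pair: true MI is -0.5 * log(1 - rho^2).
rho = 0.8
n = 200_000
x = rng.standard_normal(n)
z = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
true_mi = -0.5 * np.log(1 - rho**2)  # about 0.511 nats

# Hand-picked statistics function (a stand-in for MINE's trained network;
# the Donsker-Varadhan bound holds for any T).
def T(x, z):
    return 0.4 * x * z

# Joint term uses the paired samples; the marginal term breaks the pairing
# by shuffling z, which simulates sampling from the product of marginals.
joint_term = T(x, z).mean()
marginal_term = np.log(np.exp(T(x, rng.permutation(z))).mean())
dv_bound = joint_term - marginal_term

print(f"true MI  = {true_mi:.3f} nats")
print(f"DV bound = {dv_bound:.3f} nats (a lower bound on the true MI)")
```

With a trained T the bound approaches the true value; with this fixed T it sits well below it, but it is still a legitimate lower bound, which is the property MINE exploits.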

Personally, I think the idea of creating meaningful latent variables somewhat relates to a talk Dr. Bengio gave.

Simply put, a better representation of data (a.k.a. an abstract representation) is beneficial for machine learning models.

This also relates to the idea that the objective function we want to minimize/maximize lives not in pixel space, but in information space.

For this post, I am going to focus on visualizing the latent variables created by maximizing mutual information, rather than on choosing the 'best' representation for any particular downstream task.

Method

Blue sphere → Input image from the STL data set (96×96)
Blue rectangles → Encoder network
Green rectangles → Local information maximizer
Red rectangles → Global information maximizer
Yellow rectangles → Prior distribution as regularization

Please note that the input images have been converted to grayscale, hence they do not have any color channels.

Furthermore, notice that we have three networks acting as objective functions; however, none of them acts in pixel space, but rather in information space.

(Please read the DeepInfoMax paper for further details.)

Also, please note that each of the objective-function networks takes in two variables and maximizes the mutual information between them. In all of our cases, we give those networks the original image (resized if needed) and the encoded latent variable.

Experiment Set Up

Case A) Latent variable has smaller dimensionality

This is the case in which the encoder is made up only of convolution layers without any padding. Hence, after each convolution operation, the spatial dimension of the image decreases.

Dimensionality: (96,96) → (94,94) → (92,92) → (90,90) → (88,88)

Case B) Latent variable has larger dimensionality

This is the case in which the encoder is made up of transposed convolution layers.

Hence, after each layer, the spatial dimension of the image increases.

Dimensionality: (96,96) → (98,98) → (100,100) → (102,102) → (104,104)

Case C) Latent variable has the same dimensionality

This is the case in which we perform typical convolution operations with zero padding, so the spatial dimension does not change.

Dimensionality: (96,96) → (96,96) → (96,96) → (96,96) → (96,96)

Case D) Latent variable has the same dimensionality (reversed autoencoder)

This is the case in which we first increase the spatial dimension with transposed convolutions, then decrease it right away by applying convolutions without any padding (hence "reversed autoencoder").

Dimensionality: (96,96) → (98,98) → (100,100) → (98,98) → (96,96)

For all of the above methods, we measure the mutual information between the original image and the latent variable by building a joint histogram of the two images (more detail here or here).
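The spatial-size bookkeeping for the four cases follows directly from the standard convolution output formula, out = (in − k + 2p) / s + 1, and its transposed counterpart, out = (in − 1)·s − 2p + k. A small sketch, assuming 3×3 kernels with stride 1 (which matches the ±2-pixel steps listed above; the actual kernel sizes in the experiments may differ):

```python
def conv_out(size, k=3, p=0, s=1):
    """Spatial size after a standard convolution."""
    return (size - k + 2 * p) // s + 1

def tconv_out(size, k=3, p=0, s=1):
    """Spatial size after a transposed convolution."""
    return (size - 1) * s - 2 * p + k

def trace(size, layers):
    """Apply a sequence of layer functions, recording each spatial size."""
    sizes = [size]
    for layer in layers:
        size = layer(size)
        sizes.append(size)
    return sizes

conv = lambda s: conv_out(s)        # no padding: shrinks by 2 per layer
same = lambda s: conv_out(s, p=1)   # zero padding of 1: size unchanged
tconv = lambda s: tconv_out(s)      # transposed conv: grows by 2 per layer

print("Case A:", trace(96, [conv] * 4))
print("Case B:", trace(96, [tconv] * 4))
print("Case C:", trace(96, [same] * 4))
print("Case D:", trace(96, [tconv, tconv, conv, conv]))
```

Running this reproduces the four dimensionality chains above, e.g. Case A gives 96 → 94 → 92 → 90 → 88 and Case D returns to 96 after the round trip.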

Additionally, all of the hyper-parameters, such as the learning rate, number of iterations, and batch size, are kept the same.
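The histogram-based mutual-information measurement mentioned above can be sketched as follows: build a 2-D joint histogram of the two images' pixel intensities, normalize it into a joint probability table, and compute I(X;Y) = Σ p(x,y) log(p(x,y) / (p(x)p(y))). A minimal NumPy version (the bin count and the random test images are my own choices for illustration, not the post's exact settings):

```python
import numpy as np

def histogram_mutual_information(img_a, img_b, bins=32):
    """Estimate MI (in nats) between two equal-sized grayscale images
    from the 2-D histogram of their pixel intensities."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()                # joint probability table
    px = pxy.sum(axis=1, keepdims=True)      # marginal of img_a
    py = pxy.sum(axis=0, keepdims=True)      # marginal of img_b
    nz = pxy > 0                             # avoid log(0) on empty bins
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
img = rng.random((96, 96))

# Sanity checks: MI of an image with itself equals its (binned) entropy,
# while MI with an unrelated noise image should be close to zero.
mi_self = histogram_mutual_information(img, img)
mi_noise = histogram_mutual_information(img, rng.random((96, 96)))
print(f"MI(img, img)   = {mi_self:.3f} nats")
print(f"MI(img, noise) = {mi_noise:.3f} nats")
```

Note that histogram estimates carry a small positive bias that grows with the number of bins, which is one reason to compare cases under identical binning, as is done here.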

Results

When we compare the loss over 50 iterations, we can clearly see that mutual information is maximized when we keep the spatial dimensionality the same as the input data.

Let us compare the different feature maps created by each method, in the order of cases A, B, C, and D.

When we consider the average mutual information between the original image and the 32 feature maps, we can see that mutual information is highest for Case C.

Let us compare another image, this time an image of a horse.

Similar results were obtained for the image of the horse as well.

Finally, when we compare the max mutual information across all 5,000 images in the STL data set, we can see that keeping the dimensionality the same as the input data most frequently generates a latent variable with high mutual information.

Conclusion / Interactive Code

To access the code for Case A, please click here.

To access the code for Case B, please click here.

To access the code for Case C, please click here.

To access the code for Case D, please click here.

To access the code for creating visualizations please click here.

Please note that I have modified the original implementation from DuaneNielsen.

Final Words

Why does this matter to me? I think this way of creating latent variables may be a good means of overcoming overfitting.

True, we can use generative models to create new data points for data augmentation, but they come with their own set of problems, such as mode collapse.

Having diverse methods of data augmentation could be beneficial for everyone.

Finally, all of the references have been linked here.
