A new Tool to your Toolkit, KL Divergence at Work

The finale, applying KL Divergence to real Dataset

Abhishek Mungoli, Jun 15

In my previous post, we got a thorough understanding of Entropy, Cross-Entropy, and KL-Divergence in an intuitive way and also by calculating their values through examples.

In case you missed it, please go through it once before proceeding to the finale.

In this post, we will apply these concepts and check the results on a real dataset.

It will also give us good intuition for using these concepts to model various day-to-day machine learning problems.

So, let’s get started.

1. Analyzing the Dataset

The dataset consists of two latent features, ‘f1’ and ‘f2’, and the class to which each data point belongs, i.e. the positive class or the negative class.

[Figure: Dataset]

[Figure: Dataset Visualisation. Visualizing the data with a scatterplot]

[Code used for visualisation]

So, we have data points with two latent features, ‘f1’ and ‘f2’.

The data points belong to the ‘+’ class (red) and the ‘-’ class (blue).

2. Defining the Purpose

Our purpose is to define an ideal distribution for both the positive and the negative class of the dataset.

As of now, we have no idea what it will look like or what its Probability Density Function will be, but we can define some good-to-have properties for it.

Properties of our Ideal Distribution

The distribution of the positive class should assign probability 1 to any data point belonging to the positive class and probability 0 to any data point belonging to the negative class.

The distribution of the negative class should assign probability 0 to any data point belonging to the positive class and probability 1 to any data point belonging to the negative class.
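As a small sketch of these two properties in code: each data point's ideal distribution can be written as a pair [P(+), P(-)] that puts all of its mass on the point's own class. The names `IDEAL` and `ideal_distribution` are illustrative only and do not come from the original code.

```python
# Ideal (target) distribution for a data point, written as [P(+), P(-)].
# All of the probability mass sits on the point's own class.
IDEAL = {"+": [1.0, 0.0], "-": [0.0, 1.0]}


def ideal_distribution(label):
    """Return the ideal [P(+), P(-)] pair for a point labelled '+' or '-'."""
    return IDEAL[label]


# Hypothetical labels for three data points.
labels = ["+", "-", "+"]
targets = [ideal_distribution(label) for label in labels]
print(targets)
```

These pairs play the role of the "actual" probabilities when the fitted distribution is compared against the ideal one.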

3. How to estimate the above Ideal Distribution

Now the fun begins.

How do we estimate that ideal distribution?

It's an open-ended question, and many techniques can be tried.

But for this blog post, I will keep things simple and won't deviate too much from the original topic: applying KL Divergence to day-to-day machine learning problems.

Gaussian/Normal Distribution to Rescue

One way to estimate a distribution is to use a Gaussian distribution.

We will try to fit one Gaussian to the positive class and another Gaussian to the negative class.

There are packages available that will find the appropriate parameters of these fitted Gaussians for us.

They use an algorithm called Expectation-Maximisation to do so.

If you are interested in understanding how it works, the References list links on Expectation-Maximisation and Gaussian mixture models.

Maybe I will write about it in another blog post.

Let's fit the distribution using the GaussianMixture class available in Python's scikit-learn package.

[Equation: PDF of a multivariate Gaussian distribution]

[Code: Fitting the distribution and visualizing the results]

[Figure: Fitted Gaussians for the positive and negative class]

Visually, the fitted distributions do the assigned task well.

One Gaussian is fitted to the positive class and another to the negative class.

Next, for each data point, we will calculate its probability of belonging to the positive and negative class distributions.
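A minimal sketch of this fitting step, assuming scikit-learn's `GaussianMixture` with one component per class and a synthetic stand-in for the dataset. The cluster centres, sample sizes, and `random_state` values are made up for illustration and are not the article's data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the dataset: two features (f1, f2) per point.
# These values are hypothetical; the article's real data is not used here.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.7, size=(200, 2))    # '+' class
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.7, size=(200, 2))  # '-' class

# Fit one full-covariance Gaussian to each class separately.
gmm_pos = GaussianMixture(n_components=1, covariance_type="full", random_state=0).fit(X_pos)
gmm_neg = GaussianMixture(n_components=1, covariance_type="full", random_state=0).fit(X_neg)

# The fitted parameters: one mean vector and one 2x2 covariance per class.
print("mean (+):", gmm_pos.means_[0])
print("mean (-):", gmm_neg.means_[0])
print("covariance (+):\n", gmm_pos.covariances_[0])
```

With a single component, the fitted mean is essentially the sample mean of that class, which makes this case easy to sanity-check by hand.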

4. Finding Probability for each Datapoint (Optional)

In this part, we will see how the final probability for each data point is calculated once the multivariate Gaussians for the positive and the negative class have been fitted.

This section is a little more mathematics-intensive and is optional.

The calculation can be treated as a black box that returns the final probability.

If you are interested in the mathematics behind it, follow this section; otherwise, skip to the next one.

For any data point ‘x’, the probability of belonging to a distribution is given by the PDF of the fitted Gaussian:

[Equation]

Using the above formula we can find the likelihoods:

[Equation: Probability of datapoint given + distribution]

[Equation: Probability of datapoint given – distribution]

Next, we can find the class probabilities, or priors, using:

[Equation: Probability of + distribution]

[Equation: Probability of – distribution]

where n is the total number of data points.
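Written out, these quantities might look as follows. This is a sketch that assumes the standard multivariate Gaussian density and priors estimated as class frequencies. The symbols μ, Σ, d (the number of features, here 2), n₊ and n₋ (the number of points in each class) are my notation and may differ from the original formulas.

```latex
\mathcal{N}(x \mid \mu, \Sigma)
  = \frac{1}{\sqrt{(2\pi)^{d}\,|\Sigma|}}
    \exp\!\left(-\tfrac{1}{2}\,(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)

P(x \mid +) = \mathcal{N}(x \mid \mu_{+}, \Sigma_{+}), \qquad
P(x \mid -) = \mathcal{N}(x \mid \mu_{-}, \Sigma_{-})

P(+) = \frac{n_{+}}{n}, \qquad
P(-) = \frac{n_{-}}{n}, \qquad
n = n_{+} + n_{-}
```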

Once we have the likelihoods and priors, the last step is to find the posterior, i.e. the probability of each class given the data point.

We can use Bayes' theorem to calculate that.

[Equation: Probability of datapoint belonging to the + distribution]

[Equation: Probability of datapoint belonging to the – distribution]

We can use the above posteriors to find the probability of each data point belonging to the +ve or -ve distribution.
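The Bayes step can be sketched in a few lines of NumPy. This assumes one Gaussian per class and priors equal to the class frequencies. The helper names `gaussian_pdf` and `posterior_positive` and all parameter values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian N(mean, cov) evaluated at point x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    d = mean.shape[0]
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    # solve(cov, diff) computes cov^-1 @ diff without forming the inverse.
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)


def posterior_positive(x, pos, neg, n_pos, n_neg):
    """P(+ | x) via Bayes' theorem; pos and neg are (mean, cov) pairs."""
    prior_pos = n_pos / (n_pos + n_neg)
    prior_neg = n_neg / (n_pos + n_neg)
    joint_pos = gaussian_pdf(x, *pos) * prior_pos   # P(x | +) P(+)
    joint_neg = gaussian_pdf(x, *neg) * prior_neg   # P(x | -) P(-)
    return joint_pos / (joint_pos + joint_neg)


# Hypothetical fitted parameters: unit-covariance Gaussians at (2, 2) and (-2, -2).
pos = ([2.0, 2.0], np.eye(2))
neg = ([-2.0, -2.0], np.eye(2))

p_mid = posterior_positive([0.0, 0.0], pos, neg, n_pos=100, n_neg=100)
p_near_pos = posterior_positive([2.0, 2.0], pos, neg, n_pos=100, n_neg=100)
print(p_mid, p_near_pos)
```

A point exactly halfway between the two means, with equal priors, gets a posterior of 0.5, while a point at the positive mean gets a posterior very close to 1. P(- | x) is simply 1 minus the returned value.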

5. Evaluating the goodness of fit

Now that we have fitted the distributions and calculated the probability of each data point belonging to the positive and negative distributions, we can see how much the fitted distribution differs from our ideal distribution.

How can we check that? Of course, using our favorite metric, KL divergence (Kullback–Leibler divergence).

Just to reiterate, KL divergence measures the difference between a fitted distribution and the actual distribution, i.e. the difference between cross-entropy and entropy.

It can also be viewed as a measure of how much the two distributions differ.

Calculating KL Divergence

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p) = -\sum_i p_i \log q_i + \sum_i p_i \log p_i$$

where H(p, q) is the cross-entropy and H(p) is the entropy of the system, pᵢ is the actual probability of the i-th event, and qᵢ is the estimated probability of the i-th event.

Reiterating the properties of our Ideal Distribution

The distribution of the positive class should assign probability 1 to any data point belonging to the positive class and probability 0 to any data point belonging to the negative class.

The distribution of the negative class should assign probability 0 to any data point belonging to the positive class and probability 1 to any data point belonging to the negative class.

pᵢ is the actual probability of the event, which comes from the properties of the ideal distribution.

qᵢ is the estimated probability of the event, calculated using the fitted/estimated distribution.

We use these probabilities to find the KL Divergence.
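To make the calculation concrete, here is a minimal pure-Python sketch of KL divergence as cross-entropy minus entropy, applied to the 0/1 ideal probabilities. It assumes natural logarithms and sums the divergence over data points. The original computation may use a different log base or an average, and the function names and example probabilities are illustrative.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i; terms with p_i = 0 contribute nothing."""
    # Note: math.log raises ValueError if q_i is exactly 0 where p_i > 0.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """H(p) = -sum p_i log p_i, using the convention 0 * log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)


def dataset_kl(labels, q_pos):
    """Sum of per-point divergences; q_pos[i] is the fitted P(+ | x_i)."""
    total = 0.0
    for label, qp in zip(labels, q_pos):
        p = [1.0, 0.0] if label == "+" else [0.0, 1.0]  # ideal distribution
        total += kl_divergence(p, [qp, 1.0 - qp])
    return total


p = [1.0, 0.0]   # ideal distribution for a '+' point
q = [0.9, 0.1]   # hypothetical fitted posterior for that point
print(kl_divergence(p, q))                  # equals -log(0.9)
print(dataset_kl(["+", "-"], [0.9, 0.2]))   # equals -log(0.9) - log(0.8)
```

Because the ideal distribution is 0/1, its entropy is 0, so each point's divergence reduces to the negative log of the fitted probability of its true class.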

[Code: Calculating the probabilities and KL Divergence]

The KL divergence comes out to be 5.74, which indicates that the fitted distribution is pretty close to the ideal.

But can we do better?

6. An effort to get closer to the ideal distribution

One Gaussian curve per class may not be enough to mimic the whole distribution.

We can fit a mixture of Gaussians and see the results.

How many Gaussians? As many as it takes for our KL divergence to approach 0, i.e. until there is little or no difference between the ideal and the fitted distribution.

Let’s try that.
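One way this experiment could be sketched is shown below. It assumes scikit-learn's `GaussianMixture`, class-frequency priors, natural logs, and a KL divergence summed over all points. The data is synthetic and hypothetical (a '+' class split into two clusters on either side of a '-' cluster), so the printed values will not match the article's numbers.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def kl_to_ideal(X_pos, X_neg, k, seed=0):
    """Sum over all points of -log q(true class | x), with k Gaussians per class."""
    gmm_pos = GaussianMixture(n_components=k, random_state=seed).fit(X_pos)
    gmm_neg = GaussianMixture(n_components=k, random_state=seed).fit(X_neg)

    X = np.vstack([X_pos, X_neg])
    is_pos = np.concatenate([np.ones(len(X_pos), dtype=bool),
                             np.zeros(len(X_neg), dtype=bool)])

    # log(likelihood * prior) per class; score_samples returns log p(x | class).
    log_joint_pos = gmm_pos.score_samples(X) + np.log(len(X_pos) / len(X))
    log_joint_neg = gmm_neg.score_samples(X) + np.log(len(X_neg) / len(X))

    # Bayes' theorem in log space: log posterior of each point's true class.
    log_evidence = np.logaddexp(log_joint_pos, log_joint_neg)
    log_q_true = np.where(is_pos, log_joint_pos, log_joint_neg) - log_evidence

    # With 0/1 ideal probabilities the entropy term is 0, so KL = -sum log q.
    return float(-log_q_true.sum())


# Hypothetical data: a single Gaussian cannot describe the two-cluster '+' class.
rng = np.random.default_rng(0)
X_pos = np.vstack([rng.normal([-4.0, 0.0], 1.0, size=(150, 2)),
                   rng.normal([4.0, 0.0], 1.0, size=(150, 2))])
X_neg = rng.normal([0.0, 0.0], 1.0, size=(300, 2))

results = {k: kl_to_ideal(X_pos, X_neg, k) for k in (1, 2, 3)}
for k, kl in results.items():
    print(f"{k} Gaussian(s) per class: KL divergence = {kl:.2f}")
```

Working in log space with `np.logaddexp` avoids underflow when a point is very unlikely under one of the class mixtures.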

[Code: Try fitting more than one Gaussian per class]

Results

KL Divergence on increasing the number of Gaussians per class:

1 Gaussian per class, KL Divergence = 5.74
2 Gaussians per class, KL Divergence = 3.18
3 Gaussians per class, KL Divergence = 1.81
4 Gaussians per class, KL Divergence = 0.77
5 Gaussians per class, KL Divergence = 0.20

Takeaways

Four Gaussians per class are enough to closely mimic the ideal distribution, with the KL divergence approaching 0 (0.77, falling to 0.20 with five).

The plot below also makes that clear.

[Figure: 4 Gaussians per class, KL Divergence approaches 0]

7. Conclusion

We took a dataset with two different classes.

We wanted to find the underlying distribution for the two classes.

So, we first defined the good-to-have properties of an ideal distribution and were then able to mimic that ideal distribution very closely.

In this way, we can always try to find the underlying distribution of the data and assess the goodness of fit using KL Divergence.

I hope this brings the required clarity to the topic and opens new horizons for applying it in your day-to-day machine learning work.

It takes a lot of effort to write a good post that is clear and easy for the audience to understand.

I will keep trying to do justice to my work.

Follow me on Medium and check out my previous posts.

I welcome feedback and constructive criticism.

8. References

https://en.wikipedia.org/wiki/Probability_density_function
https://en.wikipedia.org/wiki/Entropy
https://en.wikipedia.org/wiki/Cross_entropy
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
https://scikit-learn.org/stable/modules/mixture.html
https://towardsdatascience.com/part-i-a-new-tool-to-your-toolkit-kl-divergence-5b887b5b420e
https://towardsdatascience.com/demystifying-entropy-f2c3221e2550
https://towardsdatascience.com/demystifying-cross-entropy-e80e3ad54a8
http://www.aishack.in/tutorials/expectation-maximization-gaussian-mixture-model-mixtures/
https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
https://brilliant.org/wiki/gaussian-mixture-model/
https://en.wikipedia.org/wiki/Bayes%27_theorem