in the first step which as it turns out is equivalent to the value we would get by minimising the KL divergence between the two distributions.
We then set the posterior equal to q (confusing notation I know but this is just ????) and maximise this function with respect to the parameters θ.
We can see from the graph as we iterate and perform these calculations we move towards the optimum (or at least a local optimum).
Note: Theta is a vector of all parameters, Source: Bayesian Methods for Machine LearningThe EM algorithm for GMMThe E-StepOk, now that we have visualised what the EM algorithm is doing I want to outline and explain the equations we need to calculate in the E-step and the M-step.
These will be really important when it comes time to write our code.
We can write the Gaussian Mixture distribution as a combination of Gaussians with weights equal to π as below.
Where K is the number of Gaussians we want to model.
Equation 2: Gaussian Mixture DistributionTaking the above results we can calculate the posterior distribution of the responsibilities that each Gaussian has for each data point using the formula below.
This equation is just Bayes rule where π is the prior weights and the likelihood is normal.
Equation 3: Posterior Responsibilities using Bayes RuleThe M-StepAfter calculating our posterior all we need to do is get an estimate of the parameters of each Gaussian defined by the equations below and then evaluate the log-likelihood.
These two steps are then repeated until convergence.
Equation 4: Mean of the GaussiansEquation 5: Covariance of the GaussiansEquation 6: weightsEquation 7: Sum of responsibilities in each Gaussian kEquation 8: Marginal Likelihood: This is what we want to maximiseRemember though, we have set the problem up in such a way that we can instead maximise a lower bound (or minimise the distance between the distributions) which will approximate equation 8 above.
We can write our lower bound as follows where z is our latent variable.
Notice our summation now appears outside the logarithm instead of inside it resulting in a much simpler expression than equation 8.
Equation 9: Variational Lower Bound, Source: Bishop equation 9.
74Python CodeNow that we have explained the theory behind the modelling I want to code up this algorithm using Python.
Like my previous post, I am going to be using the same data set so we can compare the results between k-means and GMM.
The preprocessing steps are exactly the same as those in the previous post and I provide a link to the full code at the end of this post.
K-means estimationAs I mentioned before, in order to start the algorithm (perform 1st E-step) we need initial values for our parameters.
Rather than just randomly setting these values it is usually a good idea to estimate them using k-means.
This will usually give us a good starting point and can help our model converge faster.
Before we estimate GMM let’s have a quick look at what kind of clusters k-means gives us.
sklearn k-meansUsing our estimates from sklearn we can create a nice visualisation of our clusters (Figure 2).
Notice the clusters are all spherical in shape and are the same size.
The spherical clusters do not seem to model the spread of the data very well indicating that k-means in this particular case may not be the best approach.
This illustrates one of the limitations of k-means as all covariance matrices are diagonal with unit variance.
This limitation means that the model is not particularly flexible.
With that in mind, let’s try out GMM and see what kind of results that gives us.
Source: Python for Data Science HandbookFigure 2: k-means spherical GaussiansGMM estimationFigure 3 below illustrates what GMM is doing.
It clearly shows three clusters modelled by three different Gaussian distributions.
I have used a toy data set here just to illustrate this clearly as it is less clear with the Enron data set.
As you can see, compared to Figure 2 modelled using spherical clusters, GMM is much more flexible allowing us to generate much better fitting distributions.
Figure 3: GMM example: simple data set: Full CovarianceGMM Python classOk, now we are going to get straight into coding our GMM class in Python.
As always, we start off with an init method.
The only things I am initialising here are the number of times we want to run our algorithm and the number of clusters we want to model.
The most interesting method in this code snippet is calculate_mean_covariance.
This helps us calculate values for our initial parameters.
It takes in our data as well as our predictions from k-means and calculates the weights, means and covariance matrices of each cluster.
The next bit of code implements our initialise_parameters method which uses k-means from the sklearn library to calculate our clusters.
Notice that this function actually calls our calculate_mean_covariance method defined above.
We could have probably used one method to calculate our clusters and initial parameters but it is usually much easier to debug and avoid errors if each method only carries out one specific task.
It’s time to get right into the most important methods in our class.
The E-step of the algorithm is defined below and takes in our parameters and data which makes perfect sense given the equations we defined above.
Remember, the purpose of this step is to calculate the posterior distribution of our responsibilities (????).
The main thing to note here is that we loop through each of the C Gaussian’s (3 in our case) and calculate the posterior using a function from scipy to calculate the multivariate normal pdf.
stats import multivariate_normal as mvnAfter we have calculated this value for each Gaussian we just need to normalise the gamma (????), corresponding to the denominator in equation 3.
This is to ensure our gammas are valid probabilities.
If we sum the values across clusters for each data point they should equal 1.
After we calculate the values for the responsibilities (????) we can feed these into the M-step.
Again the purpose of the M-step is to calculate our new parameter values using the results from the E-step corresponding to equations 4, 5 and 6.
To make debugging easier I have separated the m_step method and the compute_loss_function method in my code below.
The compute_loss_function does exactly what its name implies.
It takes in the responsibilities and parameters returned by the E-step and M-step and uses these to calculate our lower bound loss function defined in equation 9.
All of our most important methods have now been coded up.
Keeping consistent with sklearn I am going to define a fit method which will call the methods we just defined.
In particular, we start by initialising our parameter values.
After this, we perform the steps outlined in the EM-algorithm for our chosen number of iterations.
Note that it doesn't actually take a large number of iterations to converge particularly when you use k-means to get values of the initial parameters (I think my algorithm converged in about 30 iterations).
Since we are probably also interested in using this model to predict what Gaussian new data might belong to we can implement a predict and predict_proba method.
The predict_proba method will take in new data points and predict the responsibilities for each Gaussian.
In other words, the probability that this data point came from each distribution.
This is the essence of the soft assignment that I mentioned at the start of the post.
The predict method does essentially the same but assigns the cluster which has the highest probability using np.
Fitting our modelAfter that explanation, I think it’s about time we estimate our model and see what we get.
Hopefully, the GMM visualisation above provided a good intuition about what the model is doing.
We are going to be doing the exact same thing for our Enron data set.
The code below just estimates our GMM model on our dataset using 3 different Gaussians.
For plotting purposes, I also calculate the point of highest density of each distribution, corresponding to the centre which is helpful as a visualisation aid.
Finally, we also use the model parameters to draw the shape of each distribution in Figure 4.
The main takeaway in this figure is that the distributions are clearly no longer spherical.
GMM has allowed us the relax our restrictions on the covariance matrix allowing the distribution to have a much better fit to the data.
This is particularly useful given that the shape of our data was clearly not spherical.
Now, this is probably not a perfect solution and there are some data points which do not fit any distribution very well but it is an improvement over k-means.
Figure 4: GMM with Full covarianceGMM sklearn ImplementationNow, just to make sure we haven't done anything completely crazy in our code I am going to redo this estimation using sklearn and see if my solution is the same.
The code below is pretty much the exact same as the code above so I won't go through it in detail.
It looks like we have very similar result compared to sklearn.
The one difference is that one of our cluster centres appears to be different.
In the sklearn implementation, the centre is closer to 0.
4 while in our implementation it is closer to 0.
Perhaps this is due to a slightly different initialization in sklearn?sklearn GMMFigure 5: GMM sklearnAlright, guys, that’s it for this post.
I hope that was a useful and pretty intuitive explanation of Gaussian Mixture Modelling.
If any of you want to get a deeper understanding of the material I recommend the Coursera course Bayesian Methods for Machine Learning.
I sourced a lot of the material from this course and I think it gives really nice and in-depth explanations of the concepts I presented here.
I would also recommend the book, Pattern Recognition and Machine Learning by Bishop.
This book is a great reference for most of the classic algorithms you will come across in machine learning.
Below I provide the full code for the GMM class outlined in the post as well as a link to the Kaggle kernel where I did all the analysis.
As always, feedback is welcome.
Full Python Code for GMM classLink to full code: https://www.
com/dfoly1/gaussian-mixture-modelSource: Christopher M.
Bishop 2006, Pattern Recognition and Machine LearningSource: Bayesian Methods for Machine Learning: Coursera courseSource: Python for Data Science Handbook.