I’ll try to keep the notation as clean as possible for better understanding of the derivations.
First, suppose we want to know the probability that a data point xn comes from Gaussian k.
We can express this as:

$$p(z_{nk} = 1 \mid x_n)$$

which reads “given a data point x, what is the probability it came from Gaussian k?” In this case, z is a latent variable that takes only two possible values.
It is one when x came from Gaussian k, and zero otherwise.
Actually, we don’t get to see this z variable in reality, but knowing its probability of occurrence will be useful in helping us determine the Gaussian mixture parameters, as we discuss later.
Likewise, we can state the following:

$$\pi_k = p(z_k = 1)$$

This means that the overall probability of observing a point that comes from Gaussian k is equivalent to the mixing coefficient for that Gaussian.
This makes sense, because the larger the share of the data a Gaussian accounts for, the higher we would expect this probability to be.
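To make the link between the latent variable and the mixing coefficients concrete, here is a minimal sketch that draws one-hot latent vectors z and checks how often each component is selected. The mixing coefficients, the sample size, and the helper name `sample_components` are assumptions chosen purely for illustration.

```python
import random


def sample_components(mixing_coefficients, n_samples, seed=0):
    """Draw one-hot latent vectors z, where z[k] = 1 with probability pi_k."""
    rng = random.Random(seed)
    n_components = len(mixing_coefficients)
    samples = []
    for _ in range(n_samples):
        # Pick the index of the Gaussian that "generates" this point.
        k = rng.choices(range(n_components), weights=mixing_coefficients)[0]
        # Encode the choice as a one-hot vector: exactly one entry equals 1.
        samples.append([1 if j == k else 0 for j in range(n_components)])
    return samples


pi = [0.5, 0.3, 0.2]  # hypothetical mixing coefficients (must sum to 1)
zs = sample_components(pi, n_samples=20000)

# Empirical frequency of z_k = 1 for each component k.
frequencies = [sum(z[k] for z in zs) / len(zs) for k in range(len(pi))]
print(frequencies)
```

With enough samples, the printed frequencies should land close to the mixing coefficients, which is what p(z_k = 1) = π_k says.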
Now let z be the set of all possible latent variables z, hence:

$$\mathbf{z} = \{z_1, \dots, z_K\}$$

We know beforehand that each z occurs independently of the others and that it can only take the value of one when k is equal to the cluster the point comes from.
Therefore:

$$p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}$$

Now, what about the probability of observing our data given that it came from Gaussian k? It turns out to be the Gaussian function itself! Following the same logic we used to define p(z), we can state:

$$p(x_n \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Sigma_k)^{z_k}$$

OK, now you may be asking: why are we doing all this? Remember that our initial aim was to determine the probability of z given our observation x. Well, it turns out that the equations we have just derived, along with Bayes’ rule, will help us determine this probability.
From the product rule of probabilities, we know that

$$p(x_n, \mathbf{z}) = p(x_n \mid \mathbf{z})\, p(\mathbf{z})$$

Hmm, it seems that now we are getting somewhere.
The operands on the right are what we have just found.
Perhaps some of you are anticipating that we are going to use Bayes’ rule to get the probability we eventually need.
However, first we will need p(xn), not p(xn, z).
So how do we get rid of z here? Yes, you guessed it right: marginalization! We just need to sum over z, hence

$$p(x_n) = \sum_{\mathbf{z}} p(x_n \mid \mathbf{z})\, p(\mathbf{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

This is the equation that defines a Gaussian mixture, and you can clearly see that it depends on all the parameters that we mentioned previously! To find the optimal values for these parameters, we need to maximize the likelihood of the model.
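The marginalization step can be sketched in a few lines of code. The example below is a simplified one-dimensional version, so each Gaussian has a scalar mean and standard deviation instead of a mean vector and covariance matrix. The function names and all parameter values are assumptions made for illustration only.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian evaluated at x."""
    coefficient = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-0.5 * ((x - mean) / std) ** 2)


def mixture_density(x, weights, means, stds):
    """Marginal density p(x): sum over k of pi_k * N(x | mu_k, sigma_k)."""
    # Each term is the joint probability of x and "x came from Gaussian k".
    joint_terms = [
        w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)
    ]
    # Summing over k removes the latent variable.
    return sum(joint_terms)


# Hypothetical two-component mixture.
weights = [0.6, 0.4]
means = [0.0, 5.0]
stds = [1.0, 2.0]

density_at_one = mixture_density(1.0, weights, means, stds)
print(density_at_one)
```

Each entry of `joint_terms` corresponds to one value of the latent variable, so adding them up is exactly the "sum over z" described in the text.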
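We can find the likelihood as the joint probability of all observations xn, defined by:

$$p(X) = \prod_{n=1}^{N} p(x_n) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

Like we did for the original Gaussian density function, let’s apply the log to each side of the equation:

$$\ln p(X) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

Great! Now, in order to find the optimal parameters for the Gaussian mixture, all we have to do is differentiate this equation with respect to the parameters and we are done, right? Wait! Not so fast.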
We have an issue here.
We can see that the logarithm acts on the second (inner) summation, so it cannot be distributed over the individual Gaussian terms.
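To make the structure of that expression concrete, here is a minimal one-dimensional sketch that evaluates the log-likelihood of a mixture for a small data set. The data, the parameter values, and the function names are assumptions for illustration. Note how the logarithm is taken only after the sum over components has been computed.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian evaluated at x."""
    coefficient = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-0.5 * ((x - mean) / std) ** 2)


def log_likelihood(data, weights, means, stds):
    """Sum over points of log( sum over k of pi_k * N(x | mu_k, sigma_k) )."""
    total = 0.0
    for x in data:
        # Inner sum over components: this is the mixture density p(x).
        mixture = sum(
            w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)
        )
        # The log wraps the whole inner sum, not each Gaussian separately.
        total += math.log(mixture)
    return total


data = [-0.5, 0.2, 4.8, 5.5]  # hypothetical observations
weights = [0.5, 0.5]

# Parameters that sit near the data versus parameters far away from it.
near = log_likelihood(data, weights, means=[0.0, 5.0], stds=[1.0, 1.0])
far = log_likelihood(data, weights, means=[10.0, 20.0], stds=[1.0, 1.0])
print(near, far)
```

Parameters that match the data yield a higher log-likelihood, which is the quantity we want to maximize. This direct form can underflow for points very far from every component, so a practical implementation would typically work in log space throughout.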
Calculating the derivative of this expression and then solving for the parameters is going to be very hard! What can we do? Well, we need to use an iterative method to estimate the parameters.
But first, remember we were supposed to find the probability of z given x?