It turns out, yes.
Estimating model parameters does get a little tricky if latent features (or missing data) are involved.
Let’s see why.
The issueLet V be the set of observed variables, Z be the set of latent variables and θ be the set of model parameters.
If we consider the maximum likelihood approach for the parameter estimation, our objective1 function would be:equation 5Comparing equation 5 with equation 2, we can see that in the latter case, the parameters are coupled because of summation inside the log.
This makes the optimization using Gradient Ascent (or any iterative optimization technique in general) intractable.
This means that many realistic scenarios need a more powerful technique to infer the parameters.
EM Algorithm to the RescueThankfully, researchers already came up with such a powerful technique and it is known as the Expectation-Maximization (EM) algorithm.
It uses the fact that optimization of complete data log-likelihood P(V, Z | θ)* is much easier when we know the value of Z (thus, removing the summation from inside the log).
* Going forward, we consider Y as part of V for notation simplicity.
However, since the only way to know the value of Z is through posterior P(Z | V, θ), we instead consider the expected value of complete data log likelihood under the posterior distribution of latent variables.
This step of finding the expectation is called the E-step.
In the subsequent M-step, we maximize this expectation to optimize θ.
Formally, the EM algorithm can be written as:1.
Choose initial setting for the parameters θ^old2.
E Step Evaluate P(Z | V, θ^old)3.
M step Evaluate θ^new given by4.
Check for convergence of log likelihood or parameter values.
If not converged, then θ^old=θ^new and return to the E step.
Diving Deeper into EMBefore diving deep, first, we will derive a property that will come in handy while explaining the E and M steps.
Let us consider a distribution q(Z) over the latent variables.
Independent of choice of q(Z), we can decompose the likelihood of the observed data in the following fashion:equation 6* In the first step, we apply the concept of marginalisation of probability and Bayes’ Theorem respectively.
* In the second step, we multiply and divide by q(Z) inside the log and apply the “multiplication of terms in the log is equivalent to the summation of the log of terms” property respectively.
The second term in equation 6 is a well-known distance measure between two distributions known as Kullback-Leibler divergence.
At this point, we should carefully study the form of equation 6.
The first term contains joint distribution over V and Z whereas the second term contains the conditional distribution of Z given V.
One of the properties of KL divergence is that it’s always non-negative.
Using this property in equation 6, we can deduce that L(q, θ) ≤ ln p(V | θ).
This means that L(q, θ) acts as a lower bound on the log-likelihood of the observed data.
This observation, very shortly, would help in demonstrating that the EM algorithm does indeed maximize the log likelihood.
E-stepSuppose the initial value of the parameter vector θ is θ^old — (Step 1).
Keeping in mind the relation given by equation 6, the E-step tries to maximize the lower bound L(q,θ^old) with respect to q while holding θ^old fixed.
The solution to this maximization problem is easily seen by noting that the value of ln p(V | θ) does not depend on q(Z) and so the largest value of L(q,θ^old) will occur when the KL divergence vanishes, or in other words when q(Z) is equal to the posterior distribution p(Z | V, θ^old) (since ln 1 evaluates to 0).
Thus, E-step involves evaluating p(Z | V, θ^old) — (Step 2)Illustration of E-step.
M-stepIn this step, the distribution q(Z) is held fixed.
If we substitute q(Z) = p(Z | V, θ^old) from E-step into the expression of L(q, θ) (equation 6), we see that the lower bound takes the following form:equation 7where the constant is simply the negative entropy of the q distribution and is therefore independent of θ.
So, in the M-step, we maximize the lower bound L(q,θ) with respect to θ to give some new value θ^new.
This will cause the lower bound L(q,θ) to increase, which will necessarily cause the corresponding log-likelihood function to increase.
— (Step 3)Illustration of M-step.
Since the distribution q is held fixed during the M-step, it will not equal the new posterior distribution p(Z | V, θ^new), and hence there will be a non-zero KL divergence.
So, we repeat the E and M steps again until convergence.
— (Step 4)Putting it all togetherI know it’s a lot to digest at once.
So, I’ll try to summarize the discussion with the help of the following figure that should help in connecting the dots.
How E and M step maximize the likelihood.
The red curve depicts the incomplete data log likelihood, ln p(V | θ), which we want to maximize.
We start with some initial parameter value θ^old.
In the first E-step, we evaluate the posterior distribution over latent variables, p(Z | V, θ^old), which gives rise to a lower bound L(q,θ^old) whose value equals the log likelihood at θ^old, as shown by the blue curve.
In the M-step, the bound is maximized giving the value θ^new, which gives a larger value of log-likelihood than θ^old.
The subsequent E-step then constructs a bound that is tangential at θ^new as shown by the green curve.
At each step, we see that the obtained parameters increase the log-likelihood and the process continues until convergence.
Concluding remarksThis concludes a rather intense mathematical post on the EM algorithm.
However, understanding this algorithm in-depth is a really good investment of time and it really proves helpful if you want to work on developing your own graphical models to solve challenging problems.
If you are looking for an example problem and code, you can refer this GitHub repo of mine where we tackled the challenge of jointly modeling aspects, ratings, and sentiment with temporal dynamics for rating prediction task.
References Nasrabadi NM.
Pattern recognition and machine learning.
Journal of electronic imaging.