TensorFlow time series uses a mean-field variational family for q(z).
A mean-field family restricts the relationship among the random variables in z: it assumes that all the variables are mutually independent.
The benefit of such a restriction is that the joint probability density q(z) equals the product of the individual variables' probability densities.
To fully specify q(z), we also need to choose the probability density for each of the variables in z.
For that, TensorFlow time series chooses a Gaussian distribution for each random variable in z.
In formula:

    q(z) = q(σ_level, σ_slope, σ_obs)                                          (line 1)
         = q(σ_level) q(σ_slope) q(σ_obs)                                      (line 2)
         = N(σ_level; μ_l, σ_l²) N(σ_slope; μ_s, σ_s²) N(σ_obs; μ_o, σ_o²)     (line 3)
    vp = [μ_l, μ_s, μ_o, σ_l, σ_s, σ_o]                                        (line 4)
    q(z) = q(z; vp)                                                            (line 5)

where N(σ_level; μ_l, σ_l²) at line 3 stands for the Gaussian probability density function for the variable σ_level, with mean μ_l and standard deviation σ_l.
Likewise for the model parameter σ_slope and σ_obs.
Line 4 and 5 show that q(z) has its own parameters vp=[μ_l, μ_s, μ_o, σ_l, σ_s, σ_o], a vector of length 6.
These are parameters that specify the three independent Gaussian distributions.
We call these six parameters variational parameters to distinguish them from the parameters of the local linear trend model, which we introduced before as model parameters.
Line 5 shows that q(z) takes z and vp as its arguments.
q(z; vp), with the semicolon inside, means a probability density for the variable z, and this density has parameters vp to control its shape.
Note that at line 1, q(z) is also a function of both z and vp.
This doesn’t change from line 1 to 5.
We wrote it as q(z) without mentioning vp explicitly just to highlight that q(z) is a probability density function for the random variables in z.
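To make the mean-field structure concrete, here is a small sketch (using scipy rather than TensorFlow, and with made-up values for the six variational parameters) that evaluates q(z) as the product of three independent Gaussian densities:

```python
import numpy as np
from scipy.stats import norm

# Illustrative values for vp = [mu_l, mu_s, mu_o, sigma_l, sigma_s, sigma_o]
# (made up for the example, not fitted to any data).
vp = {"mu_l": 0.1, "mu_s": -0.3, "mu_o": 0.5,
      "sigma_l": 1.2, "sigma_s": 0.8, "sigma_o": 1.5}

def q_density(z, vp):
    """Mean-field q(z): the joint density is the product of the
    three independent Gaussian marginals."""
    sigma_level, sigma_slope, sigma_obs = z
    return (norm.pdf(sigma_level, vp["mu_l"], vp["sigma_l"])
            * norm.pdf(sigma_slope, vp["mu_s"], vp["sigma_s"])
            * norm.pdf(sigma_obs, vp["mu_o"], vp["sigma_o"]))

joint = q_density((0.0, 0.0, 0.0), vp)
```

Because of the mean-field assumption, no covariance terms appear anywhere: the joint density factorises completely.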
Please note the relationship between the model parameter z and the variational parameter vp. z is the parameter of our local linear trend model.
What value z actually takes controls what kinds of time series we can model.
Remind yourself of the different sample time series we saw at the beginning of this article.
vp is the parameter of the probability density function of z.
z is modelled as random variables, and the variational parameter vp controls the shape of q(z), the probability density function of z.
Going back to Listing 1, line 14 constructs the variational distribution q(z).
Since q(z) has a mean-field structure, the qs variable in Python code is a dictionary, with each key-value pair being a model parameter and its corresponding variational distribution.
Let’s check if this q(z) satisfies our requirements: q(z) has a simple structure and analytical form.
Also, q(z) has the same set of variables as p(z|y).
However, those variables in q(z) have different domains from the variables in p(z|y).
The issue with random variable domain mismatch
Variables in q(z) come from Gaussian distributions, so they have the full real number domain, denoted by ℝ.
How about the domains of variables from the posterior p(z|y)? They are the same as their domains in the prior p(z). p(z) is a product of LogNormal density functions, whose domain is the positive real numbers ℝ⁺.
There is a domain mismatch between variables in q(z) and variables in p(z|y).
This domain mismatch violates the second requirement above for q(z).
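A quick numerical illustration of the mismatch (a sketch with scipy; the specific parameter values are arbitrary): a Gaussian assigns positive density to negative values, while a LogNormal density is zero there.

```python
from scipy.stats import norm, lognorm

z = -0.5  # a negative value, outside the R+ domain of the LogNormal prior

gaussian_density = norm.pdf(z, 0.0, 1.0)   # q's Gaussians live on all of R
lognormal_density = lognorm.pdf(z, s=1.0)  # the prior's LogNormals live on R+
```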
To solve this problem, we transform the posterior p(z|y) from a probability density function of variables over the ℝ⁺ domain into another probability density function p(u|y) of variable vector u over the ℝ domain.
You may ask: we don’t know the analytical form of the posterior p(z|y), so how can we perform this transformation? Well, we know the structure of the posterior:

    p(z|y) = m × p(y|z) p(z)

where m stands for the inverse of the marginal likelihood:

    m = 1 / p(y) = 1 / ∫ p(y|z) p(z) dz

m is a constant that does not involve z, since z has been integrated out.
When observations Y are plugged in, the posterior becomes a function that takes z as its single argument.
This is because it is the product of a constant and two functions that both take z as their only argument.
The blue line above shows this structure clearly.
We also know the analytical form for both p(y|z) and p(z), so we can transform their product, and the constant m just scales the transformed result.
You may also ask: why don’t we transform q(z) into another distribution over the positive real domain ℝ⁺ instead? Because this would make q(z) more complicated by adding constraints to it.
We want a simple q(z), that’s the point of variational inference.
We use the technique of probability density transformation.
Probability density transformation
Let’s re-state our current goal.
Sometimes during a long process, it is easy to forget what we want to do.
To recap, we have the posterior p(z|y) = m×p(y|z)p(z) with variables z over positive ℝ⁺ domain.
We want to transform it to another distribution p(u|y) with each variable in u coming from the full ℝ domain.
This way, the posterior density and the variational distribution density will have the same domain over their input variables.
What will u look like? Since the three variables σ_level, σ_slope and σ_obs in z are independent, we can transform them independently.
So it makes sense to make u contain three variables [u_level, u_slope, u_obs].
Each variable in u is responsible for the corresponding variable in z.
What will the probability density p(u|y) look like? Let’s first focus on a single variable pair σ_level ∊ ℝ⁺ and u_level ∊ ℝ.
We want to write σ_level as a function of u_level.
This function has input domain ℝ and output range ℝ⁺.
Many functions do that, such as the exponential function σ = exp(u) and the square function σ = u². The following figure shows the curve of the exponential function in green and the curve of the square function in red.
Both functions map a value from ℝ to ℝ⁺.
Between the two, we want to choose the exponential function.
This is because the exponential function is monotonically increasing.
So there is a one-to-one mapping between σ_level and u_level.
Later we will need this one-to-one mapping property.
So we decided to use exponential as the transformation function from variables in u to variables in z:

    σ_level = exp(u_level),   σ_slope = exp(u_slope),   σ_obs = exp(u_obs)

This is the textbook case of replacing variables with functions of other variables.
Now we need to find out how this variable replacement changes our posterior probability density p(z|y) = m×p(y|z)p(z).
To do that, we look at the definition of a probability density function: a function that is non-negative and integrates to 1 over the domain of the variables.
So our posterior density actually means:

    ∫ p(z|y) dz = 1
    ∫₀^∞ ∫₀^∞ ∫₀^∞ p(σ_level, σ_slope, σ_obs | y) dσ_level dσ_slope dσ_obs = 1

The first line is the definition of a probability density function. The second line explicitly shows the variables and the integration limits.
Let’s look at the innermost integration:

    ∫₀^∞ p(σ_level, σ_slope, σ_obs | y) dσ_level

In this integral, we want to replace all occurrences of σ_level with the exponential function exp(u_level).
In Calculus, the rule of integration by substitution tells us how to do this:

    ∫₀^∞ p(σ_level, σ_slope, σ_obs | y) dσ_level                                        (line 1)
    = ∫₀^∞ p(σ_level, ·, · | y) dσ_level                                                (line 2)
    = ∫_{log 0}^{log ∞} p(exp(u_level), ·, · | y) · (d exp(u_level)/du_level) du_level  (line 3)
    = ∫_{−∞}^{+∞} p(exp(u_level), ·, · | y) · exp(u_level) du_level                     (line 4)

The first line is the original integration.
At the second line, all the variables that are irrelevant to the current integration are replaced with a dot “·” to simplify the formula.
The third line is the crucial step.
It applies the rule of integration by substitution: (a) In the original function to be integrated, replace all occurrences of the old variable with the transformation function.
(b) Multiply the original function with the derivative of the transformation function with respect to the new variable.
(c) Change the limits of the integration.
The log function is the inverse of the exponential function that we use as our transformation function.
We use this inverse function to change the original integration limits to the new limits.
This is why we want the transformation function to be monotonic, so its inverse is also a function.
The fourth line reduces the new limits and the derivative.
Now we work our way out through the triple integration, and arrive at this formula:

    ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} m · p(y|exp(u)) · p(exp(u)) · e^{u_level} e^{u_slope} e^{u_obs} du_level du_slope du_obs = 1

    ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} m · p(y|exp(u)) · p(exp(u)) · |det J(u)| du_level du_slope du_obs = 1

The first line performs three integrations by substitution for the three variables. The resulting integration has limits from −∞ to +∞.
In the second line, the mysterious term |det J(u)| is due to the application of the rule of integration by substitution.
Inside the absolute value operator |·|, det stands for the determinant of a matrix, and the matrix J(u) is called the Jacobian.
For this article, you only need to know that in our case |det J(u)| equals e to the power of u_level + u_slope + u_obs, and that it appears due to the rule of integration by substitution.
This paper explains the Jacobian in detail.
The formula at the second line is the definition of the transformed probability density function of our posterior.
That is, p(u|y) = m p(y|u) p(u) for the set of variable u over the full real domain ℝ.
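Here is a numerical sanity check of this kind of transformed density (a sketch with scipy): if z ~ LogNormal(0, 1) and z = exp(u), the transformed density p(z = exp(u)) · exp(u) must equal the N(0, 1) density of u, because the log of a LogNormal variable is Gaussian. In the three-variable case, the factor |det J(u)| = e^(u_level + u_slope + u_obs) is just the product of three such exp(u) factors.

```python
import numpy as np
from scipy.stats import lognorm, norm

u = np.linspace(-3.0, 3.0, 13)        # points in the new R domain
p_z = lognorm.pdf(np.exp(u), s=1.0)   # original density evaluated at z = exp(u)
p_u = p_z * np.exp(u)                 # multiply by the Jacobian factor exp(u)

# p_u should be exactly the N(0, 1) density evaluated at u.
```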
From now on, we should use p(y|u)p(u) instead of p(y|z)p(z).
However, to make things simpler, let’s redefine our z as u. This means from now on:
- we use σ_level, σ_slope and σ_obs to represent u_level, u_slope and u_obs (in other words, z means u);
- we use p(y|z)p(z) to represent p(y|u)p(u);
- q(z) stays the same.
This way, all the above formulas stay the same; you just need to remember that things are now in the transformed domain.
Measuring similarity between two distributions
The third requirement for q(z) listed above is that q(z) should be as close to p(z|y) as possible.
Graphically, this means the curve of q(z) should overlap with the curve of p(z|y) as much as possible.
The following figure demonstrates this interesting idea.
I wrote this code to generate the figure.
The blue curve represents the posterior p(z|y) that we want to compute.
Its shape is irregular — notice the left side is larger than the right side.
Of course, we don’t know the shape of p(z|y), I just use an irregular shape to convey the idea that p(z|y) is something complicated.
The green curve represents our proposed variational distribution q(z).
I use a regular shape to indicate that q(z) is structurally simple.
The shape of q(z) is controlled by the six variational parameters vp.
We need to adjust these six values to make the green curve overlap with the blue curve as much as possible.
Remember, ultimately we want to know the values of the model parameters z.
To do that, we need to determine the shape of the variational distribution q(z) by finding out the values of the variational parameters vp.
We are looking for the values for vp such that q(z) overlaps with the posterior p(z|y) as much as possible.
Once we have the shape of q(z), we can sample from it to get the actual values for z.
To quantify the level of overlapping between two probability densities, we use the Kullback–Leibler divergence.
The KL-divergence between the probability density q(z) and the probability density p(z|y) is:

    KL( q(z) || p(z|y) ) = ∫ q(z) log( q(z) / p(z|y) ) dz

The intuition behind KL-divergence is straightforward. It is a weighted average (weighted by q(z)) of the differences between the logs of the two probability densities at each point in the domain of z.
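As a sketch of this definition (using scipy and two arbitrary Gaussians standing in for q(z) and p(z|y)), we can compute the weighted-average formula by numerical integration and compare it against the known closed form for two Gaussians:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

q = norm(0.0, 1.0)   # stands in for q(z)
p = norm(1.0, 2.0)   # stands in for p(z|y)

# KL(q || p) = integral of q(z) * [log q(z) - log p(z)] dz
kl_numeric, _ = quad(lambda z: q.pdf(z) * (q.logpdf(z) - p.logpdf(z)),
                     -20.0, 20.0)

# Closed form for the KL between two univariate Gaussians, as a check.
mu0, s0, mu1, s1 = 0.0, 1.0, 1.0, 2.0
kl_exact = np.log(s1 / s0) + (s0**2 + (mu0 - mu1)**2) / (2 * s1**2) - 0.5
```

Note that the result is always non-negative, and it is zero only when the two densities coincide everywhere.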
We want to find particular values for vp such that q(z) minimises the distance to the posterior p(z|y). We can express this minimisation in the following formula:

    q(z; vp)* = argmin over vp of KL( q(z; vp) || p(z|y) )

The “star” in q(z; vp)* denotes the optimal value for q(z).
This means we want to find values of those six variational parameters so the KL-divergence between q(z) and p(z|y) is the smallest.
See how variational inference converts an inference problem into an optimisation problem! The original problem of using the Bayes rule to compute the posterior p(z|y) is usually referred to as an inference problem.
On the other hand, minimising the KL-divergence between q(z) and p(z|y) is an optimisation problem.
The KL-divergence is the optimisation objective.
It is quite surprising that we can minimise the KL between q(z) and p(z|y) even when we don’t know how to compute p(z|y).
The secret lies in the theory of variational inference.
It tells us that minimising the KL-divergence is equivalent to maximising another formula called Evidence Lower Bound, denoted by ELBO(q(z)).
If you are not familiar with why, don’t worry.
For now, it is enough to accept that these two optimisation problems are equivalent.
ELBO(q(z)) is defined as:

    ELBO(q(z)) = E_q[ log p(y, z) − log q(z) ]
               = E_q[ log p(y|z) ] − E_q[ log( q(z) / p(z) ) ]
               ⋮

The intuition behind ELBO is best seen from the second line above. It has two terms: one expectation minus another expectation, and we want to maximise the result of the subtraction.
The first term is the expectation of the log likelihood.
It describes how well our model fits the data.
Why? Because to make this term large, our q(z) should give large probability to values of z where p(y|z) is also large.
We want this term to be large.
The second term is the KL-divergence (in case you haven’t realised) between q(z) and the prior p(z).
We know that a KL-divergence is non-negative.
This term describes how different our proposed q(z) is from our prior assumption p(z).
We want the difference to be small because we assume our prior makes sense.
So we want the second term to be small.
ELBO conveys our wish that our optimised model fits the data well, and at the same time it stays close to our model assumption, the prior.
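To see these two terms at work, here is a toy sketch (not the local linear trend model): prior z ~ N(0, 1), likelihood y|z ~ N(z, 1), and q = N(m, s²). Both ELBO terms have closed forms in this conjugate setting, and the ELBO indeed peaks at the exact posterior N(y/2, 1/2):

```python
import numpy as np

def elbo(m, s, y):
    """ELBO = E_q[log p(y|z)] - KL(q || prior) for the toy model
    z ~ N(0,1), y|z ~ N(z,1), q = N(m, s^2)."""
    expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((y - m)**2 + s**2)
    kl_to_prior = -np.log(s) + (s**2 + m**2) / 2 - 0.5
    return expected_loglik - kl_to_prior

y = 2.0
# For this conjugate model the exact posterior is N(y/2, 1/2),
# so the ELBO should be highest at m = 1.0, s = sqrt(0.5).
best = elbo(1.0, np.sqrt(0.5), y)
```

The first term alone would push m toward y (fit the data); the KL term alone would pull m back toward 0 (stay close to the prior); the maximiser m = 1.0 is exactly the compromise the posterior makes.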
The fifth and sixth lines show that we expand ELBO(q(z)), an expectation, into an integration.
Line 7 shows that the function inside the integral starts as a function that takes two arguments, z and vp (with observations Y plugged in). This is because it consists of functions that are either functions of z alone or functions of both z and vp.
Line 8 shows that after integrating z out, the ELBO is a function of the variational parameters vp only.
You may ask: why do we need ELBO(q(z)) at all? Why can’t we minimise KL(q(z)||p(z|y)) directly? To answer this question, we need to notice that the KL formula mentions the posterior p(z|y), which we cannot compute.
But the ELBO(q(z)) formula does not mention the posterior p(z|y).
It only mentions the prior p(z), the likelihood p(y|z) and the variational distribution q(z).
And we know the analytical form of all these three quantities.
This is why we need the ELBO and how it is possible to minimise the KL-divergence between q(z) and p(z|y) even when we don’t know p(z|y).
This fact is quite surprising at first.
But it is reasonable.
Because the likelihood p(y|z) and the prior p(z) contain all the information needed to calculate the posterior p(z|y).
You may ask: in the probability density transformation section, we transformed the posterior into a function over the ℝ domain, and from then on we need to use the transformed posterior. But ELBO does not mention the posterior at all. What shall we do? In ELBO, we have p(y, z) = p(y|z)p(z), and the transformed posterior is m·p(y|z)p(z) with m being a constant.
So we can still plug in the transformed result in ELBO.
m is a constant; it won’t change the solution of the optimisation.
Going back to Listing 1, line 14 constructs the ELBO as the optimisation objective.
Gradient descent to maximise ELBO(q(z))
Since we know the analytical form of ELBO(q(z)), we can use gradient descent to maximise ELBO(q(z)) with respect to its parameters vp. Actually, we use gradient descent to minimise −ELBO(q(z)).
We need to calculate the gradient of ELBO(q(z)) with respect to the variational parameters vp:

    ∇_vp ELBO(q(z)) = ∇_vp ∫ q(z; vp) [ log p(Y, z) − log q(z; vp) ] dz

where the ∇ symbol stands for the vector differential operator. It calculates the gradients of a multi-argument function with respect to its arguments. In our case, ∇_vp calculates the gradient of ELBO(q(z)) with respect to the vector of parameters in vp.
A subtle point.
In the above formula, the integration is with respect to the model parameters z=[σ_level, σ_slope, σ_obs].
But the gradient calculation outside of the integration is with respect to the variational parameters vp=[μ_l, μ_s, μ_o, σ_l, σ_s, σ_o].
So the integration and the differentiation do not cancel each other.
Because of the integration, analytically calculating this gradient of ELBO(q(z)) is hard.
Since the integration calculates an expectation, we can use sample average to approximate it.
At each optimisation step, we can sample some values of z from the current variational distribution q(z).
This means sampling some values of σ_level, σ_slope and σ_obs from the current q(z), and using the sample average to approximate the integration.
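A minimal sketch of this sample-average idea (plain numpy, arbitrary values standing in for the current variational parameters): approximate E_q[g(z)] with the mean of g over samples drawn from q.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 0.5                 # pretend these are the current vp values
z = rng.normal(mu, sigma, 100_000)   # samples from the current q(z)

g = lambda z: z**2                   # any integrand of interest
mc_estimate = g(z).mean()            # sample-average approximation
exact = mu**2 + sigma**2             # E[z^2] for a Gaussian, for comparison
```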
In formula:

    ∇_vp ELBO(q(z)) = ∇_vp ∫ q(z; vp) [ log p(Y, z) − log q(z; vp) ] dz
                    ≈ ∇_vp (1/n) Σ_{i=1..n} [ log p(Y, Z_i) − log q(Z_i; vp) ],   Z_i ~ q(z; vp)

The last line uses n samples Z_1 to Z_n from the current q(z) distribution to calculate the sample average.
I emphasise that at each optimisation step, we need to draw new samples from the current q(z).
This is because each optimisation step changes the q(z) distribution by changing the values of the variational parameters.
However, the above formula has a problem when we want to apply gradient descent.
To see that, let’s keep manipulating the above formula. At line 6, the API of the ELBO formula is explicit.
Look at the first term.
With Y and Z_i plugged in, log(p(Y, Z_i)) becomes a constant.
That’s why we have a “λ.” symbol leading its API.
The ∇_vp operator calculates the gradient of this constant with respect to vp.
The gradient is 0.
This seems to suggest that the likelihood does not play any role (since it disappears) in the optimisation problem.
This does not make sense.
The reason for this problem is that which samples Z_i we get from q(z) depends on the variational parameters.
And we are optimising those variational parameters using gradient descent.
But drawing samples from q(z) happens outside of the gradient descent algorithm.
This makes the gradient descent algorithm lose the knowledge about q(z).
The symptom is that the likelihood term disappears when computing the gradient.
The reparameterization trick solves this problem.
Reparameterization trick
In our case, sampling is inevitable because of a hard-to-compute integration.
But sampling happens outside of the gradient descent framework.
To solve the problem, we should sample from a simpler distribution.
And this distribution must not depend on the variational parameters.
The idea of the reparameterization trick is, for each model parameter in z, for example σ_level:
- Introduce a new random variable, θ_level ~ N(0, 1). So θ_level is from a fixed Gaussian distribution; it does not depend on any variational parameter. We call θ_level a reparameterization variable.
- Rewrite our original model parameter σ_level as a function of this new variable θ_level. This function will mention the corresponding variational parameters μ_l and σ_l as well. Let’s call this function the reparameterization function.
- In the ELBO formula, which is an integration, replace every occurrence of that model parameter σ_level with the reparameterization function.
In formula, let’s continue to use the model parameter σ_level as an example. We can rewrite σ_level as a function of θ_level as follows:

    σ_level = σ_l · θ_level + μ_l

Note that in this function, μ_l and σ_l are variational parameters.
If you don’t know how we come up with such a function, it’s OK.
People have been working on how to design such functions for different distributions, see here.
Many of them are not trivial! The important thing for us is: σ_level still follows a Gaussian distribution with mean μ_l and standard deviation σ_l.
This is because the first term on the right-hand side of the equation, σ_l · θ_level, is a linear scaling of the Gaussian variable θ_level ~ N(0, 1) with no added noise. We already know how to write down its distribution:

    σ_l · θ_level ~ N(0, σ_l²)

Now the reparameterization function adds μ_l to this Gaussian variable, resulting in the Gaussian distribution N(μ_l, σ_l²).
In other words, this reparameterization keeps the distribution of σ_level unchanged.
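A quick check of this claim (a numpy sketch, with arbitrary values for μ_l and σ_l): shift-and-scale samples from a fixed N(0, 1) and verify that their mean and standard deviation come out as μ_l and σ_l.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_l, sigma_l = 2.0, 0.5

theta = rng.standard_normal(200_000)   # fixed N(0,1): no vp involved in sampling
sigma_level = sigma_l * theta + mu_l   # the reparameterization function

# sigma_level now behaves exactly like draws from N(mu_l, sigma_l^2),
# even though the randomness came from a distribution that never
# mentions the variational parameters.
```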
Let’s define θ=[θ_level, θ_slope, θ_obs] as the vector of reparameterization variables.
Let the reparameterization function be r.
In our example of θ_level:

    σ_level = r(θ_level) = σ_l · θ_level + μ_l

This trick turns q(z) into a function of θ and vp: ELBO(q(z)) becomes a function of the variational parameters vp and the newly introduced variable vector θ.
We introduced θ because we can sample it independently from the variational parameters.
When θ is replaced with samples Θ~N(0, 1), the ELBO becomes a function of vp only.
In formula:

    ELBO(q(z))                                                               (line 1)
    = ∫ q(z; vp) [ log p(Y, z) − log q(z; vp) ] dz                           (line 2)
    = ∫ q(r(θ); vp) [ log p(Y, r(θ)) − log q(r(θ); vp) ] (dr(θ)/dθ) dθ       (line 3)
    = ∫ N(θ; 0, 1) [ log p(Y, r(θ)) − log q(r(θ); vp) ] dθ                   (line 4)
    ≈ (1/n) Σ_{i=1..n} [ log p(Y, r(Θ_i)) − log q(r(Θ_i); vp) ]              (line 5)
    = ELBO(vp)                                                               (line 6)

Wow, this is scary. Let me explain. The first line is the ELBO notation.
The second line is the original definition of ELBO.
The third line applies the rule of integration by substitution, since we want to replace occurrences of z with function r(θ) inside an integration.
Previously, we’ve already seen this rule in the context of adapting random variable domains.
The fourth line reduces the derivative, re-organises terms to make the probability density of N(θ; 0, 1) pop-up.
N(θ; 0, 1) stands for the probability density of the multivariate Gaussian for θ (3 dimensional); it has mean 0 and identity variance matrix 1.
You may not see how this is derived, I will explain later.
It is important to notice that now we have the definition of an expectation with respect to θ.
So we can approximate this expectation with sample average.
The fifth line approximates the expectation at line 4 with a sample average.
Samples Θ_1:n are from the N(0, 1) distribution.
Crucially, this distribution DOES NOT depend on the variational parameters vp.
The sixth line shows that with observations Y and samples Θ_1:n plugged in, ELBO becomes a function of the variational parameters vp only.
So gradient descent can work properly, nothing will disappear.
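To show the whole pipeline working end to end, here is a sketch on the same toy conjugate model used earlier (z ~ N(0,1), y|z ~ N(z,1); this is NOT the local linear trend model). It uses the reparameterized Monte Carlo ELBO and finite-difference gradients where a real implementation would use automatic differentiation; the variable names and step counts are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
y = 2.0  # a single toy observation

def elbo_hat(m, log_s, theta):
    """Reparameterized Monte Carlo ELBO for the toy model
    z ~ N(0,1), y|z ~ N(z,1), with q = N(m, s^2).
    z is rewritten as r(theta) = m + s*theta with theta ~ N(0,1) held
    fixed, so the estimate is an ordinary function of (m, log_s)."""
    s = np.exp(log_s)                    # keep the scale positive
    z = m + s * theta                    # reparameterization function
    log_p = -0.5 * (y - z)**2 - 0.5 * z**2 - np.log(2 * np.pi)  # log p(y,z)
    log_q = -0.5 * theta**2 - np.log(s) - 0.5 * np.log(2 * np.pi)
    return np.mean(log_p - log_q)

# Gradient ascent with finite-difference gradients (a real implementation
# would use automatic differentiation instead).
m, log_s, lr, eps = 0.0, 0.0, 0.1, 1e-5
for _ in range(300):
    theta = rng.standard_normal(5000)    # fresh samples at every step
    g_m = (elbo_hat(m + eps, log_s, theta)
           - elbo_hat(m - eps, log_s, theta)) / (2 * eps)
    g_s = (elbo_hat(m, log_s + eps, theta)
           - elbo_hat(m, log_s - eps, theta)) / (2 * eps)
    m, log_s = m + lr * g_m, log_s + lr * g_s

# The exact posterior is N(1.0, 0.5): m should approach 1.0
# and exp(log_s)^2 should approach 0.5.
```

Because the samples come from the fixed N(0, 1), the gradient sees how moving m and log_s moves every sampled z, which is exactly the knowledge the naive estimator lost.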
Now, I will demonstrate how we arrived at the fourth line from the third line above.
To keep things simpler, let me treat θ as a single dimensional variable and use symbols θ, μ and σ, instead of θ_level, μ_l and σ_l.
And q becomes a univariate Gaussian distribution:

    ∫ q(r(θ); μ, σ) · (dr(θ)/dθ) · ⋯ dθ                                     (line 1)
    = ∫ (1/(σ√(2π))) exp(−(r(θ) − μ)² / (2σ²)) · (dr(θ)/dθ) · ⋯ dθ          (line 2)
    = ∫ (1/(σ√(2π))) exp(−(σθ + μ − μ)² / (2σ²)) · (d(σθ + μ)/dθ) · ⋯ dθ    (line 3)
    = ∫ (1/√(2π)) exp(−θ²/2) · ⋯ dθ                                         (line 4)
    = ∫ N(θ; 0, 1) · ⋯ dθ                                                   (line 5)

(here ⋯ stands for the rest of the integrand, which is not affected by these manipulations)

Explanation: Line 1, the line that we want to manipulate.
Line 2, plug in the definition of q, which is a univariate Gaussian distribution for our demonstration.
Line 3, expand the definition of r(θ).
Line 4, simplify by cancelling terms and calculating the derivative term.
Line 5, simplify formula further.
Recognise the first part inside the integration is the probability density of N(0, 1).
Replace probability density function with its name.
We can generalise this univariate case to multivariate case.
One last thing, I promise
Thank you for reading this far.
I know it’s been a long journey.
I really want to say “that’s it”.
But, there is one more thing.
The gradient descent algorithm works on variables that have the full real domain ℝ.
It does not allow you to specify bounds to the variables.
However, if we look at the variational distribution q(z) again:

    q(z; vp) = N(σ_level; μ_l, σ_l²) N(σ_slope; μ_s, σ_s²) N(σ_obs; μ_o, σ_o²)

σ_l, σ_s and σ_o represent standard deviations, so they must be non-negative.
By now we should get into the habit of reformulating variables with functions.
That’s exactly what we should do here.
We introduce three new variables ϕ_l, ϕ_s and ϕ_o and rewrite σ_l, σ_s and σ_o as functions of them. This way, ϕ_l, ϕ_s and ϕ_o can take values from the full real domain while σ_l, σ_s and σ_o stay non-negative.
q(z) then becomes a density written in terms of these new variables. With this final step (and of course, everything above), the ELBO becomes a function of μ_l, μ_s, μ_o, ϕ_l, ϕ_s and ϕ_o.
All six variables can take values from the full real domain ℝ.
Gradient descent can finally go through.
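A tiny sketch of this reformulation (the article does not show which function TensorFlow uses here; exp is one choice with the required property, and softplus is another common one):

```python
import numpy as np

phi = np.linspace(-10.0, 10.0, 9)        # phi ranges over the full real line

sigma_exp = np.exp(phi)                  # always positive
sigma_softplus = np.log1p(np.exp(phi))   # softplus: also always positive

# Gradient descent can update phi without any bound constraints,
# while the implied sigma never leaves the positive reals.
```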
We have two places where we adapted variable domains from ℝ⁺ to ℝ:
- Previously, we adapted the domain of our model parameters so q(z) and p(z|y) have the same domain ℝ.
- Here, we adapted the domain of our standard-deviation-related variational parameters so all variational parameters have domain ℝ. Mean-related variational parameters already have domain ℝ.
You may ask, what do we gain by choosing q(z) as a product of Gaussians?