# The Kalman Filter and (Maximum) Likelihood

How about three?” <a pause for emphasis> “Let’s agree not to compute anything from the data without knowing what it’s estimating” (paraphrasing from Professor David A Dickey).

That stuck with me, and I never liked “computing” quantities that have no theoretical justification, even if they were useful.

Maximum Likelihood Estimation (“MLE”), for all its faults, is a principled general-purpose method of estimating unknown quantities.

It is applicable when, along with the data, you have a probabilistic model of that data depending on those unknown quantities.

The Kalman Filter gives us a model of time series data, and the likelihood is a “byproduct” of the Kalman Filter operations.

That will be clarified shortly.

In in the next section, we’ll explore the intermediate computations that get the likelihood once the state filtering is done.

Then a new section will demonstrate this in statsmodels with an ARMA(2, 1) model in statespace form.

Finally we’ll use this likelihood to get estimates of A, H, and Q (R will be known) in a principled way.

The Kalman Filter and LikelihoodLikelihood for sequentially arriving dataThe following product expansion of the joint pdf is especially useful in time series situations:In our situation, there are model matrices A, H, Q, and R, which all have potentially unknown components.

Let Θ contain all unknown parameters necessary to construct these matrices.

Then the logarithm of the likelihood becomes:Hence, there are two analytical tasks:get the log likelihood of the initial point, y_0,get the log likelihood of the current point conditional on the past, y_t | y_(t-1), …, y_0.

Recall that the “filtering” operations of the Kalman Filter give us the conditional distribution of the unknown state vector x_t given all observations y_t up to the present time point t (script Y_t):In Yet Another Kalman Filter Explanation Article (my last article), this distribution was presented as a Bayesian posterior on the unknown state at time t, but here this is simply the sampling distribution of the latent random vector x_t.

Conditional contributions to the likelihoodUnraveling the measurement model once we have:which shows us that y_t | y_{t-1}, …, y_0 must be normal since x_{t-1} | y_{t-1}, …, y_0 is.

It also gives us a starting point for computing the mean and the variance of this conditional normal.

The conditional mean is just:The conditional variance is found using standard variance formulas for random vectors:Recall that Q, R, A and H are all fully specified by the parameters in Θ.

These parameters need starting values in a numerical maximization routine, and hence the computations above need to be made at each stage of the likelihood maximization.

Initial point likelihood contributionThe distribution of the initial measurement depends on the necessary initialization of the state mean vector and variance matrix.

Of course this must happen before any data are observed.

In the case of a stationary model, there is a correct initial (a priori) mean and variance of the state vector, and these are functions of Q, R, A based on long run behavior of the system (see page 2 of these notes from University of Washington for more details).

For non-stationary models, the initialization is more conceptually difficult.

The solution provided by statsmodels is an “approximate diffuse” initialization, i.

e.

, a zero mean and a very large covariance matrix.

Statisticians don’t love this (“This poses numerical problems and does not answer the question of existence of diffuse constructs” — Piet de Jong, 1991), so later on in this article we’ll investigate the effects of this approximate diffuse initialization on one stationary time series.

Despite working with a well understood stationary model, we will use a “known” initialization in the example to come.

This is for convenience (not having to calculate the right initial mean and variance matrix), but also affords us the opportunity to learn about the practical effects of the different initializations offered by statsmodels.

Example: ARMA Models in state space formFor this article will work with one specific ARMA(1, 2) model:where ϵ_t ~ N(0, 1.

2²) for t = 1, …, 1000.

Generating ARMA(1, 2) dataUsing statsmodels to simulate data from from an ARMA(1, 2) is simple, but take note of the way that AR and MA parameter coefficients are entered:import numpy as npfrom statsmodels.

tsa.

arima_process import ArmaProcessnp.

random.

seed(20190529)ar1ma2 = ArmaProcess(ar=np.

array([1, -.

9]), ma=np.

array([1, .

2, -.

1]))y = ar1ma2.

generate_sample(nsample=1000, scale=1.

2)Here’s a look at the first few observations:In [6]: print(y[0:5])[1.

41505527 1.

38650169 1.

54085345 2.

53833794 3.

9467489 ]The statsmodels ARMA class has a fit() method for fitting the ARMA model:from statsmodels.

tsa.

arima_model import ARMAmodel = ARMA(y, (1, 2))model_fit = model.

fit(trend='nc')This results in the output (slightly modified):In [13]: model_fit.

summary() Out[13]: <class 'statsmodels.

iolib.

summary.

Summary'> """ ARMA Model Results ==================================================================Dep.

Variable: y No.

Observations: 1000Model: ARMA(1, 2) Log Likelihood -1629.

051================================================================== coef std err z P>|z| [0.

025 0.

975]——————————————————————ar.

L1.

y 0.

9008 0.

017 51.

919 0.

000 0.

867 0.

935ma.

L1.

y 0.

1474 0.

037 4.

036 0.

000 0.

076 0.

219ma.

L2.

y -0.

1360 0.

037 -3.

715 0.

000 -0.

208 -0.

064Note the log-likelihood of the sample as well as the coefficient estimates (in bold).

The ARMA(1, 2) model in state space formTo get this ARMA(1, 2) model in a state space framework, we have many choices.

The benefit of the “Harvey” representation (as presented on page 8 of these Wharton lecture notes) is that it directly incorporates the AR and MA coefficients.

For our model, this representation is:where r_t has been explicitly replaced by 0 to show that all error variability is handled in the state term, andDespite R being zero, this ARMA(1, 2) model fits into the Kalman Filter framework.

Here is an exact coding of this model instatsmodels, albeit one that is more verbose than is necessary:from statsmodels.

tsa.

statespace.

mlemodel import MLEModelfrom statsmodels.

tsa.

statespace.

tools import (constrain_stationary_univariate, unconstrain_stationary_univariate)class AR1MA2_verbose(MLEModel): start_params = [.

8, 0.

24, -.

11, 1.

3] param_names = ['ar1', 'ma1', 'ma2', 'sigma2'] def __init__(self, endog): super().

__init__(endog, k_states = 3) # Using the Harvey Representation self['transition', 0, 1] = 1.

0 self['transition', 1, 2] = 1.

0 self['design', 0, 0] = 1.

0 self['selection', 0, 0] = 1.

0 self['selection', 1, 1] = 1.

0 self['selection', 2, 2] = 1.

0 self.

initialize_known(np.

array([0, 0, 0]), np.

eye(3)) #self.

initialize_stationary() #self.

initialize_approximate_diffuse() def transform_params(self, params): phi = constrain_stationary_univariate(params[0:1]) theta = constrain_stationary_univariate(params[1:3]) sigma2 = params[3] ** 2 return np.

r_[phi, theta, sigma2] def untransform_params(self, params): phi = unconstrain_stationary_univariate(params[0:1]) theta = unconstrain_stationary_univariate(params[1:3]) sigma2 = params[3] ** 0.

5 return np.

r_[phi, theta, sigma2] def update(self, params, **kwargs): params = super().

update(params, **kwargs) self['transition', 0, 0] = params[0] self['state_cov', 0, 0] = params[3] * 1.

0 self['state_cov', 1, 0] = params[3] * params[1] self['state_cov', 2, 0] = params[3] * params[2] self['state_cov', 0, 1] = params[3] * params[1] self['state_cov', 1, 1] = params[3] * params[1] ** 2 self['state_cov', 2, 1] = params[3] * params[1] * params[2] self['state_cov', 0, 2] = params[3] * params[2] self['state_cov', 1, 2] = params[3] * params[1] * params[2] self['state_cov', 2, 2] = params[3] * params[2] ** 2 pass # adding this to ease some cpaste issuesThe documentation for the statsmodels model representation is here.

Below are a few notes about the implementation above.

Note the “selection” matrix is set equal to the identity in __init__.

By order of the statsmodels statespace representation, this matrix is premultiplied by the state space error and is zero by default (i.

e.

, you have to explicitly code it if you want state error!).

The selection matrix can be set to (1 θ_1 θ_2)^T to make the above representation much simpler.

The start_params variable will initialize the maximum likelihood routine, but we’ll also use it in the demonstrations to come.

Note that the values are close to the truth, but not exact (and that’s okay).

The parameter transformations are not necessary for the likelihood demonstrations, but critically necessary to the likelihood maximization routines to work (in my experience).

That is, these estimation procedures are not necessarily robust out of the box (again, in my experience).

As mentioned before, we’re starting with a known initialization of the latent state mean and covariance matrix (zero mean and identity variance matrix) for demonstration of the likelihood calculations.

Computing the log likelihood for a few pointsWe’ll start by creating an instance of the Kalman Filter model above, and initializing it with the aforementioned starting values.

Since these parameter starting values contain all necessary information for reconstructing A, H, Q, and R, the Kalman Filter machinery of statsmodels can filter right away.

kf_model_verbose = AR1MA2_verbose(y)filtered = kf_model_verbose.

filter(kf_model_verbose.

start_params)Running interactively, we can see that filtered_state contains a 3 element state vector, and filtered_state_cov contains a 3×3 matrix, for all 1000 time points.

In [25]: filtered.

filtered_state.

shapeOut[25]: (3, 1000)In [26]: filtered.

filtered_state_cov.

shapeOut[26]: (3, 3, 1000)In [27]: kf_model_verbose.

loglikeobs .

[0:3]Out[27]: array([-1.

92012925, -1.

34946888, -1.

37622846])The last line shows the log-likelihood contributions of the first three measurements according to statsmodels, and are what we want to match.

We will now use the formulas established in the section, Likelihood for sequentially arriving data.

Having already defined A, H, Q, and R, statsmodels makes these easily accessible.

For convenience, the matrix product HA is calculated and stored in HA below:A = kf_model_verbose['transition', 0:, 0:]H = kf_model_verbose['design', 0:, 0:]Q = kf_model_verbose['state_cov', 0:, 0:]R = kf_model_verbose['obs_cov', 0:, 0:]HA = np.

matmul(H, A)To match the first number in the likelihood vector, -1.

9201, first note that μ and Σ are initialized at the zero vector and the identity matrix.

The measurement vector takes only the first element of the state vector without an extra error term (R is zero), so the first observation y[0] is compared to a N(0, 1) distribution.

from scipy.

stats import normnp.

log(norm.

pdf(y[0], 0, 1))Out[36]: -1.

9201292483430195Since python starts with zero-indexed time, μ_0 and Σ_0 are actually the first updated state mean and variance matrix given the initial data measurement.

Following the computations laid out in the likelihood derivation,mu_0 = filtered.

filtered_state[0:, 0]Sigma_0 = filtered.

filtered_state_cov[0:, 0:, 0]E_alpha_1 = np.

matmul(np.

matmul(H, A), mu_0)V_alpha_1 = (np.

matmul(np.

matmul(HA, Sigma_0), np.

transpose(HA)) + np.

matmul(np.

matmul(H, Q), np.

transpose(H)) + R)np.

log(norm.

pdf(y[1], E_alpha_1, np.

sqrt(V_alpha_1)))results in array([[-1.

34946888]]), matching the second measurement’s log-likelihood.

One more round through should be convincing that we have the pattern down.

mu_1 = filtered.

filtered_state[0:, 1]Sigma_1 = filtered.

filtered_state_cov[0:, 0:, 1]E_alpha_2 = np.

matmul(np.

matmul(H, A), mu_1)V_alpha_2 = (np.

matmul(np.

matmul(HA, Sigma_1), np.

transpose(HA)) + np.

matmul(np.

matmul(H, Q), np.

transpose(H)) + R)np.

log(norm.

pdf(y[2], E_alpha_2, np.

sqrt(V_alpha_2)))Out[46]: array([[-1.

37622846]])Again, it’s a match.

If we sum the loglikelihood over the 1000 measurements, we get something a little smaller (i.

e.

, more negative) than what we saw with the ARMA fit, which should be expected given our likelihood evaluations are not at the maximum likelihood estimates.

np.

sum(kf_model_verbose.

loglikeobs(kf_model_verbose.

start_params))Out[47]: -1655.

0364388567427Maximum likelihood estimation of unknown parametersThe Kalman Filter object created above has been hooked up for Maximum Likelihood Estimation for a while.

All it takes is running the object’s fit() method:kf_model_verbose_fit = kf_model_verbose.

fit()kf_model_verbose_fit.

summary()The summary provides (slightly modified output):Statespace Model Results ==================================================================Dep.

Variable: y No.

Observations: 1000Model: AR1MA2_verbose Log Likelihood -1629.

327================================================================== coef std err z P>|z| [0.

025 0.

975]——————————————————————ar1 0.

9016 0.

018 50.

742 0.

000 0.

867 0.

936ma1 0.

1472 0.

038 3.

900 0.

000 0.

073 0.

221ma2 -0.

1366 0.

037 -3.

686 0.

000 -0.

209 -0.

064sigma2 1.

5219 0.

071 21.

542 0.

000 1.

383 1.

660Note that the parameter estimates and log likelihood are close to the ARMA estimates, but not exact.

This is due to the known initialization that was not theoretically justified.

As an exercise, run the following two variations, which will be discussed in the Discussion section:Comment out the known initialization, uncomment the stationary initialization, and rerun.

Comment out the stationary initialization, uncomment the “approximate diffuse” initialization, and rerun.

DiscussionFrom the exercise above, you should have observed that you can exactly match the ARMA model’s log-likelihood and parameter estimates with the stationary initialization.

This should not be surprising, since the documentation for the ARMA class’s fit()method says that it “fits ARMA(p,q) model using exact maximum likelihood via Kalman filter.

”Fortunately, the approximate diffuse initialization led to similar estimates as its exact and known counterparts, which makes me feel better about using it in the future.

The initial log-likelihood values are quite a bit smaller (< -7) so the total log-likelihood of the sample is a bit more negative.

A “burn in” option exists to skip these first few observations in computing the likelihood by setting self.

loglikelihood_burn in the class’s __init__ function.

Once the machinery is in place to get the filtered state mean and variance, the likelihood for each point is only a few linear operations away.

If there are unknown parameters in any of the Kalman Filter model matrices, then both the filtered objects and the log-likelihood values will be computed conditionally on some set of parameter values (starting or otherwise).

While I have not looked at the statsmodels code, I cannot see a way around running the recursive Kalman Filter routines through the entire sample to evaluate the likelihood for one set of parameter values.

Combined with a numerical optimization routine, this means a double loop.

That would explain why fitting even a simple ARMA(1, 2) model slows down when the number of observations gets moderately large.

Alongside a double loop of likelihood maximization, there is also a “double loop of estimation.

” In Yet Another Kalman Filter Explanation Article, A, H, Q, and R were known completely but there was still a “likelihood” and a “posterior” for an unknown quantity.

In that framework, the contents of state vector at every time step were the unknown parameters.

In this article, the unknown parameters are in A, H, Q, and R and the latent state vectors are viewed as random vectors, not parameters (in true Frequentist form).

As I learned from os and Roger Labbe, if you only care about state estimation, there are practical methods of constructing a filter that do not involve formal estimation of unknown model matrices.

Perhaps I will use one of these methods someday, if I can ever stop answering the question: “Do you like the number seven?”.