There are tons of explanations of the Kalman filter in blog posts all over the web, as well as Wikipedia pages in more than thirty languages. There are books with multi-page derivations of the equations^{1}.

I’ve looked at a few as a refresher, but I couldn’t find any intuition for the update of the state covariance matrix. Even the tutorials that explain the intuitions very well, with all kinds of examples and illustrations, only ever state the covariance update by the equations below^{2}.

The alternative formulation I found then made it obvious that there is also a more understandable equation for the update of the state itself.

On the Kalman filter’s Wikipedia page (I will use its notation but drop the subscripts, as they are unnecessary here), the measurement update step is given by the following equations.

The Innovation
\[ \tilde{\mathbf{y}} = \mathbf{z}-\mathbf{H}\hat{\mathbf{x}} \quad\qquad\qquad(1) \]

The Innovation Covariance
\[ \mathbf{S} = \mathbf{H}\hat{\mathbf{P}}\mathbf{H}^{\textsf{T}}+\mathbf{R} \quad\qquad(2) \]

The Optimal Kalman Gain
\[ \mathbf{K}={\hat{\mathbf{P}}}\mathbf{H}^{\textsf{T}}\mathbf{S}^{-1} \qquad\qquad\,(3) \]

The new A Posteriori State
\[ \mathbf{x} = \hat{\mathbf{x}}+\mathbf{K}\tilde{\mathbf{y}} \quad\qquad\qquad(4) \]

The new A Posteriori State Covariance
\[ \mathbf{P}=\left(\mathbf{I}-\mathbf{K}\mathbf{H}\right){\hat{\mathbf{P}}}\quad\qquad(5) \]

where the inputs are as follows:

\(\hat{\mathbf{x}}\) (with the hat) is the (“a priori”) intermediate state estimate after the prediction step.

\(\hat{\mathbf{P}}\) (with the hat) is the respective predicted covariance.

\(\mathbf{H}\) is the observation model, which projects from the state space into the measurement space. So \(\mathbf{H}\hat{\mathbf{x}}\) is the predicted measurement.

\(\mathbf{z}\) is the measurement itself.

\(\mathbf{R}\) is the measurement covariance matrix.

\(\mathbf{I}\) is of course the identity matrix.

Some authors may use slightly different equations, or do not separate Equations 2 and 3, but fundamentally, those are the Kalman Filter equations.
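To make the five equations concrete, here is a minimal NumPy sketch of the update step. The function name `kalman_update` and the example numbers (a position/velocity state where only the position is measured) are made up for illustration and are not from any particular library.

```python
import numpy as np

def kalman_update(x_pred, P_pred, z, H, R):
    """Measurement update following Equations 1 to 5."""
    y = z - H @ x_pred                           # (1) innovation
    S = H @ P_pred @ H.T + R                     # (2) innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # (3) Kalman gain
    x = x_pred + K @ y                           # (4) a posteriori state
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred   # (5) a posteriori covariance
    return x, P

# Hypothetical example: state = [position, velocity], only position is measured.
x_pred = np.array([0.0, 1.0])
P_pred = np.array([[4.0, 0.0],
                   [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[1.0]])
z = np.array([2.0])

x, P = kalman_update(x_pred, P_pred, z, H, R)
print(x)  # position moves most of the way towards the measurement
print(P)  # position variance shrinks, velocity variance is unchanged
```

With these numbers the position variance drops from 4 to 0.8, while the unobserved and uncorrelated velocity keeps its variance of 1.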

So, **what is the intuition behind those equations?** It starts easy with the first two, but from the third one on it gets fuzzier and fuzzier when just looking at the equations.

Many authors do offer an intuition for Equation 4 (the state update), but only as a vague description, such as “gives less influence to the measurement, if it is more uncertain”, often accompanied by a visualization of what happens.

However, and that prompted me to write this post, **nobody seems to explain that last equation** in any more detail than “that’s how it is derived/computed/done”^{3}.

Part of the problem may be that the equations for the update step of the Kalman filter are expressed using the Kalman gain matrix \(\mathbf{K}\), as shown above.

Alex Becker from [kalmanfilter.net] calls the discovery of \(\mathbf{K}\) “one of Rudolf Kalman’s significant contributions” (and he’s probably right).

However, for understanding the measurement update of the state and its estimate covariance, I think **the Kalman gain matrix gets in the way of a more intuitive formulation**.

That may be a bold statement. After all, that matrix bears the name of the algorithm, so it must convey a crucial insight, right? Yes and no. One can definitely say that \(\mathbf{K}\) steals the spotlight from alternative, more understandable equations, but more on that later.

Well, let’s see where we get if we do not substitute \(\mathbf{K}\) for the right hand side of the third equation, but keep those terms explicit. Let’s start with the covariance.

First, let’s skip the definition of the Kalman gain matrix and put the right hand side of Equation 3 directly into Equation 5 for computing \(\mathbf{P}\): \[ \mathbf{P}=\left(\mathbf{I}-{\hat{\mathbf{P}}}\mathbf{H}^{\textsf{T}}\mathbf{S}^{-1}\mathbf{H}\right){\hat{\mathbf{P}}} \]

Admittedly, that is not better than Equation 5. But let’s continue and also expand the parentheses \[ \mathbf{P}={\hat{\mathbf{P}}}-{\hat{\mathbf{P}}}\mathbf{H}^{\textsf{T}}\mathbf{S}^{-1}\mathbf{H}\hat{\mathbf{P}} \] and finally substitute \(\mathbf{S}\) using Equation 2 \[ \mathbf{P}={\hat{\mathbf{P}}}-{\hat{\mathbf{P}}}\mathbf{H}^{\textsf{T}}(\mathbf{H}{\hat{\mathbf{P}}}\mathbf{H}^{\textsf{T}}+\mathbf{R})^{-1}\mathbf{H}\hat{\mathbf{P}} \qquad(6) \] Ok, now we have the “full picture”. But where is the better understanding that I promised in the title? Well, to understand Equation 6, let’s have a look at the matrix inversion lemma.

\[ \left(\mathbf{A}+\mathbf{U}\mathbf{C}\mathbf{V}\right)^{-1}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{U}\left(\mathbf{C}^{-1}+\mathbf{V}\mathbf{A}^{-1}\mathbf{U}\right)^{-1}\mathbf{V}\mathbf{A}^{-1} \] This lemma is very useful to speed up computations, if \(\mathbf{A}^{-1}\) is already known and of bigger dimensions than \(\mathbf{C}\). Then the left hand side can be computed more efficiently, by instead computing the right hand side.
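If you prefer to trust a lemma only after seeing it hold numerically, the following sketch checks it on random matrices. The sizes, the seed and the helper `random_spd` are arbitrary choices for illustration. I set \(\mathbf{V} = \mathbf{U}^T\) and use symmetric positive definite \(\mathbf{A}\) and \(\mathbf{C}\), so that all inverses are guaranteed to exist; the lemma itself is more general.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 2  # A is n x n, C is k x k (sizes chosen arbitrarily)

def random_spd(size):
    """Random symmetric positive definite matrix, so its inverse exists."""
    M = rng.standard_normal((size, size))
    return M @ M.T + size * np.eye(size)

A = random_spd(n)
C = random_spd(k)
U = rng.standard_normal((n, k))
V = U.T  # assumption made here to keep every matrix in the check invertible

A_inv = np.linalg.inv(A)

# Left hand side: one inversion of an n x n matrix.
lhs = np.linalg.inv(A + U @ C @ V)

# Right hand side: reuses A_inv and only inverts k x k matrices.
inner = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)
rhs = A_inv - A_inv @ U @ inner @ V @ A_inv

print(np.max(np.abs(lhs - rhs)))  # difference at the level of rounding error
```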

Let’s quickly rewrite it, in two steps for clarity. First we set \(\mathbf{V} = \mathbf{H}\) and \(\mathbf{U} = \mathbf{H}^T\) \[ \left(\mathbf{A}+\mathbf{H}^T \mathbf{C}\mathbf{H}\right)^{-1}=\mathbf{A}^{-1}-\mathbf{A}^{-1}\mathbf{H}^T\left(\mathbf{C}^{-1}+\mathbf{H}\mathbf{A}^{-1}\mathbf{H}^T\right)^{-1}\mathbf{H}\mathbf{A}^{-1} \] and in the next step we set \(\mathbf{A}^{-1} = \hat{\mathbf{P}}\) and \(\mathbf{C}^{-1} = \mathbf{R}\) and obtain \[ \left(\hat{\mathbf{P}}^{-1}+\mathbf{H}^T \mathbf{R}^{-1}\mathbf{H}\right)^{-1}=\hat{\mathbf{P}}-\hat{\mathbf{P}}\mathbf{H}^T\left(\mathbf{R}+\mathbf{H}\hat{\mathbf{P}}\mathbf{H}^T\right)^{-1}\mathbf{H}\hat{\mathbf{P}}. \] Now the right hand side is equal to that of Equation 6, which means Equation 6 can be rewritten with the much nicer left hand side of the above result: \[ \mathbf{P}=\left(\hat{\mathbf{P}}^{-1}+\mathbf{H}^T \mathbf{R}^{-1}\mathbf{H}\right)^{-1} \qquad(7) \]

So, what is actually happening in Equation 7? **We are adding the precision of the measurement, and the precision of the predicted state, to obtain the precision of the corrected state**. To do so, the respective covariance matrices \(\hat{\mathbf{P}}\) and \(\mathbf{R}\) are first inverted, to obtain precision matrices. The measurement precision matrix is projected to the state space using \(\mathbf{H}\). After adding the matrices, the sum is converted back to the covariance matrix \(\mathbf{P}\), by inverting it.
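Here is a small sketch that computes the a posteriori covariance both ways, once via the Kalman gain (Equations 2, 3 and 5) and once by adding precisions (Equation 7). The matrices are made-up example values, and the variable names are my own.

```python
import numpy as np

# Hypothetical 2D state with a correlated prior; only the first component is measured.
P_pred = np.array([[4.0, 1.0],
                   [1.0, 2.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[1.0]])

# Gain form (Equations 2, 3 and 5)
S = H @ P_pred @ H.T + R
K = P_pred @ H.T @ np.linalg.inv(S)
P_gain = (np.eye(2) - K @ H) @ P_pred

# Precision form (Equation 7)
prior_precision = np.linalg.inv(P_pred)              # precision of the predicted state
measurement_precision = H.T @ np.linalg.inv(R) @ H   # measurement precision, projected to state space
P_precision = np.linalg.inv(prior_precision + measurement_precision)

print(P_gain)
print(P_precision)  # same matrix, up to rounding
```

Note that the projected measurement precision only has an entry for the measured component. The second component still becomes more certain (its variance drops from 2 to 1.8), because the prior correlates it with the first.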

Voila! Much clearer (to me). And by the way, that is the standard procedure^{4} to obtain the new covariance (or precision) matrix when multiplying two Gaussians^{5}: \(\mathbf{\Sigma_{12}}=(\mathbf{\Sigma_{1}}^{-1}+\mathbf{\Sigma_{2}}^{-1})^{-1}\)
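As a quick sanity check in one dimension, with made-up variances \(\sigma_1^2 = 4\) for the prediction and \(\sigma_2^2 = 1\) for the measurement, the precisions are \(1/4\) and \(1\), so

\[ \sigma_{12}^2=\left(\frac{1}{4}+\frac{1}{1}\right)^{-1}=\frac{4}{5}=0.8 \]

The combined variance is smaller than either input variance, and it is dominated by the more precise of the two. Equation 7 does the same thing with matrices.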

Of course, **implementing it as such would be inefficient**, because Equation 7 needs three matrix inversions where Equation 5 needs only one. That’s why the matrix inversion lemma is useful in the first place, and why the Kalman Gain is a great discovery *for efficiency*. But for *understanding* the Kalman filter^{6}, this seems much better to me and I wonder why it isn’t usually explained like this.

I haven’t gotten around to typesetting the full derivation (and maybe it is not worthwhile?), so for now the focus is on the result. As with the covariance update, we can see better how the weighting of the predicted state versus the measurement works when looking at the standard result for the multiplication of Gaussians^{7}:

\[\mathbf{\mu_{12}}=(\mathbf{\Sigma_{1}}^{-1}+\mathbf{\Sigma_{2}}^{-1})^{-1}(\mathbf{\Sigma_{1}}^{-1}\mathbf{\mu_{1}}+\mathbf{\Sigma_{2}}^{-1}\mathbf{\mu_{2}})\]

And indeed, we can transform Equation 4 (i.e., \(\mathbf{x} = \hat{\mathbf{x}}+\mathbf{K}\tilde{\mathbf{y}}\)) to exactly that form:

\[ \mathbf{x} = \left(\hat{\mathbf{P}}^{-1}+\mathbf{H}^T \mathbf{R}^{-1}\mathbf{H}\right)^{-1} \left(\hat{\mathbf{P}}^{-1} \mathbf{\hat{x}}+ \mathbf{H}^T \mathbf{R}^{-1}\mathbf{z}\right) \]

In contrast to the Gaussian product formula, \(\hat{\mathbf{x}}\) and \(\mathbf{z}\) do not live in the same space, so \(\mathbf{H}\) and \(\mathbf{H}^T\) are needed to map between the state space and the measurement space.
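One possible route to this form, sketched here under the assumption that \(\hat{\mathbf{P}}\) and \(\mathbf{R}\) are invertible, uses two identities. The first is Equation 5 multiplied by \(\hat{\mathbf{P}}^{-1}\) from the right. The second follows from Equations 2, 3 and 5, using \(\mathbf{K}\mathbf{S}=\hat{\mathbf{P}}\mathbf{H}^T\) and \(\mathbf{H}\hat{\mathbf{P}}\mathbf{H}^T=\mathbf{S}-\mathbf{R}\):

\[ \mathbf{I}-\mathbf{K}\mathbf{H}=\mathbf{P}\hat{\mathbf{P}}^{-1} \]

\[ \mathbf{P}\mathbf{H}^T\mathbf{R}^{-1}=\left(\mathbf{I}-\mathbf{K}\mathbf{H}\right)\hat{\mathbf{P}}\mathbf{H}^T\mathbf{R}^{-1}=\left(\hat{\mathbf{P}}\mathbf{H}^T-\mathbf{K}\left(\mathbf{S}-\mathbf{R}\right)\right)\mathbf{R}^{-1}=\mathbf{K} \]

Plugging both into Equation 4 gives

\[ \mathbf{x}=\hat{\mathbf{x}}+\mathbf{K}\left(\mathbf{z}-\mathbf{H}\hat{\mathbf{x}}\right)=\left(\mathbf{I}-\mathbf{K}\mathbf{H}\right)\hat{\mathbf{x}}+\mathbf{K}\mathbf{z}=\mathbf{P}\left(\hat{\mathbf{P}}^{-1}\hat{\mathbf{x}}+\mathbf{H}^T\mathbf{R}^{-1}\mathbf{z}\right) \]

and replacing \(\mathbf{P}\) by the right hand side of Equation 7 yields the expression above.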

But the main takeaway is that **the new (a posteriori) mean is computed as the multivariate weighted mean of predicted and measured state**: Each mean is weighted by the inverse of its covariance, the weighted means are summed, and the result is renormalized by multiplying with the combined covariance.

It’s again a little bit simpler, when thinking in terms of precision matrices: Multiply the predicted and measured state by their respective precision and divide by the sum of the precisions.
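The equivalence of the two state updates can also be checked numerically. This is a minimal sketch with made-up example values; the variable names are my own.

```python
import numpy as np

# Hypothetical 2D state with a correlated prior; only the first component is measured.
x_pred = np.array([0.0, 1.0])
P_pred = np.array([[4.0, 1.0],
                   [1.0, 2.0]])
H = np.array([[1.0, 0.0]])
R = np.array([[1.0]])
z = np.array([2.0])

# Gain form (Equations 1 to 4)
S = H @ P_pred @ H.T + R
K = P_pred @ H.T @ np.linalg.inv(S)
x_gain = x_pred + K @ (z - H @ x_pred)

# Precision-weighted mean: multiply each mean by its precision, add,
# then "divide" by the summed precision (solve instead of an explicit inverse).
prior_precision = np.linalg.inv(P_pred)
R_inv = np.linalg.inv(R)
posterior_precision = prior_precision + H.T @ R_inv @ H
weighted_sum = prior_precision @ x_pred + H.T @ R_inv @ z
x_precision = np.linalg.solve(posterior_precision, weighted_sum)

print(x_gain)
print(x_precision)  # same vector, up to rounding
```

The measurement only touches the first component directly, yet the second component moves as well (from 1.0 to 1.4), again because of the correlation in the prior.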

1. The *Probabilistic Robotics* book has all the math you’d ever need on the Kalman Filter, including its derivation from a Bayes Filter.

2. I must admit that, after writing this post, I did find the update step equations in the form derived here in the book *Computer Vision - Models, Learning and Inference* by Simon Prince. Without too much explanation.

3. Though, if you already know why it would be computed that way, you can find the connection to the Kalman filter a little bit hidden here. Of course, if you knew, you would probably make that connection yourself.

4. See The Matrix Cookbook, Chapter 8.1.8 “Product of gaussian densities”.

5. And the update step is *a multiplication* of the state distribution with the conditional observation distribution (ignoring the normalization), i.e., \(p({\textbf {x}}_{t}|{\textbf {z}}_{1:t})={\frac {p({\textbf {z}}_{t}|{\textbf {x}}_{t})p({\textbf {x}}_{t}|{\textbf {z}}_{1:t-1})}{p({\textbf {z}}_{t}|{\textbf {z}}_{1:t-1})}}\propto p({\textbf {z}}_{t}|{\textbf {x}}_{t})p({\textbf {x}}_{t}|{\textbf {z}}_{1:t-1})\)

6. You could argue that the alternative equations aren’t the Kalman filter, but a linear Bayes Filter with all distributions being Gaussians. That’s true, but since they only differ in the computational efficiency, insights about the behavior of one will also hold for the other.

7. See The Matrix Cookbook, Chapter 8.1.8 “Product of gaussian densities”.
