Jabir Hussain
Matrix factorisations (continued)
Recap: eigenvalues/eigenvectors for Hermitian matrices
Eigenvalue problem: given $A\in\mathbb{C}^{n\times n}$, find $\lambda\in\mathbb{C}$ and $q\in\mathbb{C}^n$ such that
\[Aq=\lambda q, \qquad |q|=1.\]If $A$ is Hermitian/self-adjoint ($A=A^*$), then:
- all eigenvalues $\lambda_1,\dots,\lambda_n$ are real
- there exists an orthonormal eigenbasis $q_1,\dots,q_n$ with $Aq_i=\lambda_i q_i$ for each $i$
- writing $Q=[q_1\ \cdots\ q_n]$ (unitary) and $\Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_n)$, we have $A=Q\Lambda Q^*$.
Using outer products:
\[A=\sum_{i=1}^n \lambda_i\,(q_i\otimes q_i).\]Recall: for vectors $a,b$, the outer product acts by
\[(a\otimes b)c=\langle a,c\rangle\,b.\]
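As a quick numerical illustration of the recap (a sketch using numpy; the Hermitian matrix below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary Hermitian matrix A = B + B^*.
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
A = B + B.conj().T

# eigh returns real eigenvalues and an orthonormal eigenbasis (columns of Q).
lam, Q = np.linalg.eigh(A)

# Matrix form A = Q Lambda Q^* ...
A_matrix = Q @ np.diag(lam) @ Q.conj().T
# ... and the outer-product form A = sum_i lambda_i (q_i ⊗ q_i),
# where q_i ⊗ q_i acts as the matrix q_i q_i^*.
A_outer = sum(lam[i] * np.outer(Q[:, i], Q[:, i].conj()) for i in range(4))

print(np.allclose(A, A_matrix), np.allclose(A, A_outer))  # True True
```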
The singular value decomposition (SVD)
Theorem (SVD)
Let $A\in\mathbb{K}^{m\times n}$ (with $\mathbb{K}=\mathbb{R}$ or $\mathbb{C}$). Then there exist unitary matrices
\[U\in\mathbb{K}^{m\times m}, \qquad V\in\mathbb{K}^{n\times n},\]and a diagonal (rectangular) matrix $\Sigma\in\mathbb{R}^{m\times n}$ with diagonal entries
\[\sigma_1\ge\sigma_2\ge\cdots\ge 0\](and zeros off-diagonal) such that
\[A=U\Sigma V^*.\]Equivalently: there exist orthonormal bases $u_1,\dots,u_m$ of $\mathbb{K}^m$ and $v_1,\dots,v_n$ of $\mathbb{K}^n$ and nonnegative numbers $\sigma_1,\dots,\sigma_r$ (where $r=\mathrm{rank}(A)$) such that
\[A=\sum_{i=1}^r \sigma_i\,(v_i\otimes u_i).\]Action on a vector $x\in\mathbb{K}^n$:
\[Ax=\sum_{i=1}^r \sigma_i\,(v_i\otimes u_i)x = \sum_{i=1}^r \sigma_i\,u_i\langle v_i,x\rangle = \sum_{i=1}^r \sigma_i\,u_i v_i^*x.\]
Remarks
- The leading singular value is the operator norm: $|A|_{\mathrm{op}}:=\max_{|x|=1}|Ax|=\sigma_1$.
- Truncating the SVD gives a rank-$p$ approximation (“truncated SVD”, related to PCA): $A_p:=\sum_{i=1}^p \sigma_i\,(v_i\otimes u_i)$ for $p\le r$.
- Later: compact operators will have an SVD with infinitely many terms and $\sigma_i\to 0$ as $i\to\infty$.
- The SVD gives a generalised inverse (pseudoinverse) $A^\dagger\in\mathbb{K}^{n\times m}$ of $A$ via
\[A^\dagger=\sum_{i=1}^r \frac{1}{\sigma_i}\,(u_i\otimes v_i), \qquad \sigma_i\ne 0\ \text{for }i\le r.\](Annotation in notes: “not technically an SVD because singular values are in the wrong order”.)
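A numerical sketch of the theorem and of the pseudoinverse (using numpy; the rank-deficient matrix below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary 5x3 real matrix of rank 2 (product of thin factors).
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 3))
r = 2  # rank by construction

U, s, Vh = np.linalg.svd(A)  # A = U Sigma V^*, singular values s in decreasing order

# Outer-product form A = sum_{i<=r} sigma_i u_i v_i^*  (= sigma_i (v_i ⊗ u_i) above).
A_sum = sum(s[i] * np.outer(U[:, i], Vh[i, :]) for i in range(r))

# Pseudoinverse A^+ = sum_{i<=r} sigma_i^{-1} v_i u_i^*.
A_dagger = sum(np.outer(Vh[i, :], U[:, i]) / s[i] for i in range(r))

print(np.allclose(A, A_sum))                                  # True
print(np.allclose(A_dagger, np.linalg.pinv(A, rcond=1e-10)))  # True
```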
Exercises (from the notes)
Let $A=\sum_{i=1}^r \sigma_i\,(v_i\otimes u_i)$ be an SVD. Show:
\[A^*=\sum_{i=1}^r \sigma_i\,(u_i\otimes v_i), \qquad A^*A=\sum_{i=1}^r \sigma_i^2\,(v_i\otimes v_i).\]What is $AA^*$?
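The identities can also be checked numerically (a small numpy sketch; of course this is no substitute for the proof the exercise asks for, and it gives away the answer to the last question):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))            # full column rank, so r = 3
r = 3

U, s, Vh = np.linalg.svd(A, full_matrices=False)

# A^* A = sum_i sigma_i^2 v_i v_i^*   (only the v_i appear) ...
AstarA = sum(s[i] ** 2 * np.outer(Vh[i, :], Vh[i, :]) for i in range(r))
# ... and, analogously, A A^* = sum_i sigma_i^2 u_i u_i^*.
AAstar = sum(s[i] ** 2 * np.outer(U[:, i], U[:, i]) for i in range(r))

print(np.allclose(A.T @ A, AstarA), np.allclose(A @ A.T, AAstar))  # True True
```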
Least squares and normal equations
Given $A\in\mathbb{K}^{m\times n}$ and $b\in\mathbb{K}^m$, consider the least squares problem: find $x\in\mathbb{K}^n$ minimising
\[\Phi(x):=\frac{1}{2}|Ax-b|^2 =\frac{1}{2}\sum_{j=1}^m |(Ax)_j-b_j|^2.\]Gradient/Jacobian (as written):
\[\nabla \Phi(x)=A^*Ax-A^*b.\]Hessian/second derivative (constant in $x$):
\[\nabla^2\Phi(x)=A^*A,\]and $A^*A$ is SPSD.
Therefore, minimisers are precisely the solutions of the normal equations:
\[A^*Ax=A^*b.\]Special case: if $A$ has full column rank, then $A^*A$ is invertible and the minimiser is unique:
\[x^\dagger:=(A^*A)^{-1}A^*b.\](Annotation in notes: if $A$ has a non-trivial kernel, you lose uniqueness of minimisers.)
If $A$ has SVD $A=\sum_{i=1}^r \sigma_i\,(v_i\otimes u_i)$, then (using the exercise identities)
\[x^\dagger=(A^*A)^{-1}A^*b=\sum_{i=1}^r \frac{1}{\sigma_i}\,(u_i\otimes v_i)b,\]i.e.
\[x^\dagger=A^\dagger b.\]
Stability note
Forward map is stable:
\[|A(x+\delta x)-Ax|=|A(\delta x)|\le |A|_{\mathrm{op}}|\delta x|=\sigma_1|\delta x|.\]Inverse map can be unstable:
\[|A^\dagger(b+\delta b)-A^\dagger b| =|A^\dagger(\delta b)| \le |A^\dagger|_{\mathrm{op}}|\delta b| =\frac{1}{\sigma_r}|\delta b|.\]Annotation: $\sigma_r^{-1}$ can be huge; reconstructions $A^\dagger b$ are sensitive to perturbations in $b$ parallel to singular vectors $u_i$ corresponding to small $\sigma_i$.
Slogan (as written): inverse/learning problems are typically ill-posed, sensitive to “minor” perturbations.
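To make the stability note concrete, here is a small numpy sketch (the singular values are chosen by hand so that $\sigma_r$ is tiny; everything else is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Build A = U diag(sigma) V^T with prescribed singular values, the last one small.
m, n = 20, 3
sigma = np.array([1.0, 0.5, 1e-4])
U, _ = np.linalg.qr(rng.standard_normal((m, n)))   # orthonormal columns u_1..u_n
V, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthonormal v_1..v_n
A = U @ np.diag(sigma) @ V.T
b = rng.standard_normal(m)

# Full column rank: normal equations and pseudoinverse give the same minimiser.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
x_dagger = np.linalg.pinv(A) @ b
print(np.allclose(x_normal, x_dagger))             # True

# Perturb b along the singular vector u_r belonging to the smallest sigma:
# the reconstruction moves by |delta b| / sigma_r.
delta_b = 1e-3 * U[:, -1]
x_pert = np.linalg.pinv(A) @ (b + delta_b)
print(np.linalg.norm(x_pert - x_dagger))           # ≈ 1e-3 / 1e-4 = 10
```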
Probability (finite-dimensional)
Kolmogorov setup
- A non-empty set $\Omega$ (sample space/outcome space).
- A collection of measurable events $\mathcal{F}\subseteq \mathcal{P}(\Omega)$, required to be a $\sigma$-algebra:
- $\varnothing,\Omega\in\mathcal{F}$
- if $E\in\mathcal{F}$ then $\Omega\setminus E\in\mathcal{F}$
- if $E_1,E_2\in\mathcal{F}$ then $E_1\cap E_2\in\mathcal{F}$ and $E_1\cup E_2\in\mathcal{F}$
- if $E_1,E_2,\dots\in\mathcal{F}$ then $\bigcup_{n\in\mathbb{N}}E_n\in\mathcal{F}$ and $\bigcap_{n\in\mathbb{N}}E_n\in\mathcal{F}$
(Annotation: $\mathcal{F}$ is like a list of questions it is “ok” to ask about the outcome $\omega\in\Omega$; if $E\in\mathcal{F}$ you may ask whether $\omega\in E$.)
- A probability measure $\mathbb{P}:\mathcal{F}\to [0,1]$ such that:
- $\mathbb{P}(\Omega)=1$ and $\mathbb{P}(\varnothing)=0$
- $\mathbb{P}(\Omega\setminus E)=1-\mathbb{P}(E)$
- if $E_1,E_2,\dots\in\mathcal{F}$ are pairwise disjoint, then $\mathbb{P}\!\left(\bigcup_{n\in\mathbb{N}}E_n\right)=\sum_{n\in\mathbb{N}}\mathbb{P}(E_n)$ (countable additivity)
Interpretations of probability
Two main meanings of $\mathbb{P}(E)$:
- Frequentist: idealised replicable experiment; on each run $E$ happens or not; define $\mathbb{P}(E)$ as the limiting relative frequency $\lim_{N\to\infty}\frac{\#\{\text{runs }1,\dots,N\text{ in which }E\text{ occurs}\}}{N}$.
- Bayesian/subjectivist: one can speak of the probability of any statement $E$, with $\mathbb{P}(E)=1$ meaning certainty that $E$ is true and $\mathbb{P}(E)=0$ certainty that it is false.
Aleatoric vs epistemic uncertainty
- Aleatoric (irreducible) uncertainty: intrinsic randomness/variability.
- Epistemic (reducible) uncertainty: due to lack of knowledge; can decrease with more information.
Bayes’ formula (events)
For $A,B\in\mathcal{F}$ with $\mathbb{P}(B)>0$, define conditional probability
\[\mathbb{P}(A\mid B):=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}.\]Then
\[\mathbb{P}(B\mid A)\mathbb{P}(A)=\mathbb{P}(A\cap B)=\mathbb{P}(A\mid B)\mathbb{P}(B),\]so
\[\mathbb{P}(A\mid B)=\frac{\mathbb{P}(B\mid A)}{\mathbb{P}(B)}\,\mathbb{P}(A).\]Annotations:
- $\mathbb{P}(A\mid B)$: posterior probability
- $\mathbb{P}(A)$: prior probability (before seeing data $B$)
- $\mathbb{P}(B\mid A)$: likelihood (from a model for what data should do if $A$ is true)
- $\mathbb{P}(B)$: normalising factor; also called the marginal likelihood / evidence
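A toy numerical instance of the formula (the prior and likelihood values below are made up purely for illustration):

```python
# Hypothetical numbers: prior P(A) and likelihoods P(B|A), P(B|not A).
p_A = 0.01             # prior probability that A is true
p_B_given_A = 0.95     # likelihood of the data B if A is true
p_B_given_notA = 0.10  # likelihood of the data B if A is false

# Evidence / marginal likelihood P(B), via the law of total probability.
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' formula: posterior P(A|B) = P(B|A) P(A) / P(B).
p_A_given_B = p_B_given_A * p_A / p_B
print(p_B, p_A_given_B)  # 0.1085, ≈ 0.0876: strong evidence, yet the posterior stays small
```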
Random variables and distributions
A real-valued random variable is a measurable function $x:\Omega\to\mathbb{R}$ defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$.
A random vector is a measurable function $x:\Omega\to\mathbb{R}^n$, i.e. $x=(x_1,\dots,x_n)$ with each $x_i$ a random variable.
The law/distribution of $x$ is the induced probability measure $\mu_x$ on $\mathbb{R}^n$:
\[\mu_x(A):=\mathbb{P}(\{\omega\in\Omega: x(\omega)\in A\}) =\mathbb{P}[x\in A] =\mathbb{P}(x^{-1}(A)).\]
Gaussian random vectors (continued)
We say that an $\mathbb{R}^n$-valued random variable/vector $\underline{x}$ follows a Gaussian/normal distribution with mean vector $m\in\mathbb{R}^n$ and covariance matrix $C\in\mathbb{R}^{n\times n}$, denoted
\[\underline{x}\sim\mathcal{N}(m,C),\]if (equivalently)
- for every $\ell\in\mathbb{R}^n$, $\ell\cdot \underline{x}$ is a real-valued normally-distributed r.v. with mean $\ell\cdot m$ and variance $\ell^\top C\ell\ge 0$;
- the characteristic function $\mathbb{R}^n \to \mathbb{C}$, $\ell\mapsto\mathbb{E}\!\left[\exp(i\,\ell\cdot \underline{x})\right]$, takes the form
\[\mathbb{E}\!\left[\exp(i\,\ell\cdot \underline{x})\right] =\exp\!\left(i\,\ell\cdot m - \frac{1}{2}\,\ell^\top C\ell\right);\]
- [Valid only if $C$ is SPD, i.e. if $C^{-1}$ exists] $\underline{x}$ has the PDF
\[p(x)=\frac{1}{\sqrt{\det(2\pi C)}}\, \exp\!\left(-\frac{1}{2}\,\left\|C^{-1/2}(x-m)\right\|^2\right), \qquad x\in\mathbb{R}^n,\]
i.e.
\[\mathbb{P}[\underline{x}\in A] = \int_A \frac{1}{\sqrt{\det(2\pi C)}}\, \exp\!\left(-\frac{1}{2}\,\left\|C^{-1/2}(x-m)\right\|^2\right)\,dx, \qquad A\subseteq \mathbb{R}^n.\]
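A short numpy check that the whitened form $\left\|C^{-1/2}(x-m)\right\|^2$ agrees with the quadratic form $(x-m)^\top C^{-1}(x-m)$, together with an evaluation of the PDF (the mean and covariance are arbitrary SPD choices):

```python
import numpy as np

rng = np.random.default_rng(4)

n = 3
m = rng.standard_normal(n)
B = rng.standard_normal((n, n))
C = B @ B.T + np.eye(n)        # arbitrary SPD covariance
x = rng.standard_normal(n)
d = x - m

# Quadratic form via an explicit inverse ...
q1 = d @ np.linalg.inv(C) @ d
# ... and via the inverse square root C^{-1/2} from the symmetric eigendecomposition of C.
lam, Q = np.linalg.eigh(C)
q2 = np.linalg.norm((Q @ np.diag(lam ** -0.5) @ Q.T) @ d) ** 2
print(np.allclose(q1, q2))     # True

# Value of the Gaussian PDF at x.
pdf = np.exp(-0.5 * q1) / np.sqrt(np.linalg.det(2 * np.pi * C))
print(pdf)
```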
Linear images of Gaussians are again Gaussian!
Let $\underline{x}\sim\mathcal{N}(m,C)$ in $\mathbb{R}^n$, fix $A\in\mathbb{R}^{p\times n}$, and consider
\[\underline{y}:=A\underline{x}\]in $\mathbb{R}^p$. Then $\underline{y}$ is an $\mathbb{R}^p$-valued Gaussian, and in fact
\[\underline{y}\sim\mathcal{N}(Am,ACA^\top).\]To see this:
\[\mathbb{E}\!\left[\exp(i\,\ell\cdot \underline{y})\right] = \mathbb{E}\!\left[\exp\!\left(i\,\ell\cdot (A\underline{x})\right)\right] = \mathbb{E}\!\left[\exp\!\left(i\,(A^\top \ell)\cdot \underline{x}\right)\right], \qquad \ell\in\mathbb{R}^p,\] \[= \exp\!\left(i\,(A^\top \ell)\cdot m - \frac{1}{2}\,(A^\top \ell)^\top C (A^\top \ell)\right) = \exp\!\left(i\,\ell\cdot (Am) - \frac{1}{2}\,\ell^\top (ACA^\top)\ell\right).\]Another nice property of Gaussians is that their conditional distributions are Gaussian — and the conditional mean and cov. are easy to calculate using linear algebra.
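Before turning to conditioning, here is a quick Monte Carlo sanity check of the linear-image property (a sketch; the mean, covariance and map $A$ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)

n, p, N = 4, 2, 200_000
m = rng.standard_normal(n)
B = rng.standard_normal((n, n))
C = B @ B.T                    # arbitrary covariance
A = rng.standard_normal((p, n))

# Sample x ~ N(m, C) and push the samples through the linear map y = A x.
X = rng.multivariate_normal(m, C, size=N)   # shape (N, n)
Y = X @ A.T                                 # shape (N, p)

# Empirical mean and covariance of y should approach A m and A C A^T.
print(np.abs(Y.mean(axis=0) - A @ m).max())                 # small, O(N^{-1/2})
print(np.abs(np.cov(Y, rowvar=False) - A @ C @ A.T).max())  # small, O(N^{-1/2})
```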
Conditioning theorem
Let
\[\underline{x}=\begin{pmatrix}\underline{x}_1\\ \underline{x}_2\end{pmatrix}\]be an $\mathbb{R}^n$-valued Gaussian, with $\underline{x}_1$ denoting the first $n_1$ components and $\underline{x}_2$ denoting the remaining $n_2$ components, $n_1+n_2=n$.
Write the mean $m$ and cov. $C$ of $\underline{x}$ in block form as
\[\mathbb{E}[\underline{x}] = m = \begin{pmatrix}m_1\\ m_2\end{pmatrix}, \qquad \mathrm{Cov}[\underline{x}] = C = \begin{pmatrix} C_{11} & C_{12}\\ C_{21} & C_{22} \end{pmatrix},\]with $m_1\in\mathbb{R}^{n_1}$, $m_2\in\mathbb{R}^{n_2}$, $C_{11}\in\mathbb{R}^{n_1\times n_1}$, $C_{12}=C_{21}^\top$, …
Assume that $C$ and $C_{22}$ are invertible (generalisations are possible). Then, for every $x_2\in\mathbb{R}^{n_2}$, the conditional distribution $\underline{x}_1\mid\underline{x}_2=x_2$ is Gaussian, with mean $m_1'$ and covariance $C_{11}'$, where
\[m_1' = m_1 + C_{12}C_{22}^{-1}(x_2-m_2),\] \[C_{11}' = C_{11} - C_{12}C_{22}^{-1}C_{21}, \qquad (C_{21}=C_{12}^\top).\]
Kalman update formulae
- $m_1'$: posterior mean
- $m_1$: prior mean
- $C_{12}C_{22}^{-1}$: Kalman gain matrix
- $(x_2-m_2)$: residual vector $=$ data $-\mathbb{E}[\mathrm{data}]$
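A numerical sketch of the conditioning/Kalman update (numpy; the block sizes, covariance and observed $x_2$ are arbitrary illustrative choices, and the final line checks the result against the standard block-inverse identity, which is an aside not from the notes):

```python
import numpy as np

rng = np.random.default_rng(6)

n1, n2 = 2, 3
n = n1 + n2
m = rng.standard_normal(n)
B = rng.standard_normal((n, n))
C = B @ B.T + np.eye(n)                  # SPD joint covariance

# Block partition matching x = (x1, x2).
m1, m2 = m[:n1], m[n1:]
C11, C12 = C[:n1, :n1], C[:n1, n1:]
C21, C22 = C[n1:, :n1], C[n1:, n1:]

x2 = rng.standard_normal(n2)             # an "observed" value of x2

# Kalman update: posterior mean and covariance of x1 | x2 = x2.
K = C12 @ np.linalg.inv(C22)             # Kalman gain
m1_post = m1 + K @ (x2 - m2)             # prior mean + gain * residual
C11_post = C11 - K @ C21                 # C11 - C12 C22^{-1} C21

print(m1_post)
print(C11_post)

# Consistency check: the top-left block of C^{-1} is the inverse of the
# conditional covariance (block-inverse / Schur-complement identity).
print(np.allclose(np.linalg.inv(C)[:n1, :n1], np.linalg.inv(C11_post)))  # True
```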