Jabir Hussain
Week 4 — Monte Carlo methods
Monte Carlo methods
We now focus on estimating integrals of the form
\[Q(f) := \mathbb{E}_{x\sim\mu}[f(x)] = \int_{\mathbb{R}^d} f(x)\,\mu(dx),\]where $x$ is an $\mathbb{R}^d$-valued r.v. and $f:\mathbb{R}^d\to\mathbb{R}$ (or at least defined on the range of $x$); $\mu$ is the law of $x$, and we usually assume that $\mu$ has PDF $\rho:\mathbb{R}^d\to[0,\infty]$ with
\[\int_{\mathbb{R}^d}\rho(x)\,dx = 1.\]We often call $\mu$ the target distribution and $\rho$ the target density.
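For concreteness, a running example (an illustrative choice, used in the sketches below): take $\mu=\mathcal{N}(0,1)$, so that $\rho(x)=e^{-x^2/2}/\sqrt{2\pi}$, and $f(x)=x^2$; then $Q(f)=\mathbb{E}[x^2]=1$.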
Key assumption: the integral defining $Q(f)$ actually converges, i.e.
\[\mathbb{E}_{x\sim\mu}[|f(x)|] < \infty.\]
“Vanilla” Monte Carlo
Key enabling assumption: We can draw as many independent samples $x_1,x_2,x_3,\ldots$ as we like from the target distr. $\mu$.
In this situation, we approximate $Q(f)$ by the Monte Carlo estimator
\[\hat{Q}_J(f) := \frac{1}{J}\sum_{j=1}^J f(x_j).\]
Unbiasedness and LLN
\[\begin{aligned} \mathbb{E}\!\left[\hat{Q}_J(f)\right] &= \mathbb{E}\!\left[\frac{1}{J}\sum_{j=1}^J f(x_j)\right] \\ &= \frac{1}{J}\sum_{j=1}^J \mathbb{E}[f(x_j)] \\ &= \frac{1}{J}\,J\,Q(f) \\ &= Q(f). \end{aligned}\](since $x_1,\ldots,x_J$ have the same law)
We also have convergence as the number of samples tends to infinity by the law of large numbers (LLN): so long as $Q(f)$ exists (i.e. $\mathbb{E}[|f(x)|]<\infty$), $\hat{Q}_J(f)$ converges to $Q(f)$ almost surely as $J\to\infty$, i.e.
\[\mathbb{P}\!\left[\lim_{J\to\infty}\hat{Q}_J(f) = Q(f)\right] = 1.\]
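Here is a minimal sketch of the vanilla MC estimator in Python/NumPy, using the Gaussian running example above (names like `mc_estimate` are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sampler, J):
    """Vanilla Monte Carlo: Q_hat_J(f) = (1/J) * sum_j f(x_j)."""
    x = sampler(J)        # draw x_1, ..., x_J i.i.d. from the target mu
    return np.mean(f(x))

# Running example: mu = N(0,1) and f(x) = x^2, so Q(f) = E[x^2] = 1.
f = lambda x: x**2
sampler = lambda J: rng.standard_normal(J)

for J in [10, 1_000, 100_000]:
    print(J, mc_estimate(f, sampler, J))  # -> 1 as J grows, by the LLN
```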
MC estimators converge at a (slow!) rate that is independent of the dimension $d$; the rate is the inverse square root of the no. of samples taken, i.e. like $J^{-1/2}$.
Let’s examine the random quadrature error
\[E_J(f) := \hat{Q}_J(f) - Q(f),\]which has mean zero and variance
\[\mathbb{E}[|E_J(f)|^2] = \mathbb{E}\!\left[\left|\hat{Q}_J(f)-\mathbb{E}[\hat{Q}_J(f)]\right|^2\right] = \mathbb{V}[\hat{Q}_J(f)] = \frac{1}{J^2}\,J\,\mathbb{V}[f(x)] = \frac{1}{J}\,\mathbb{V}[f(x)],\]where we have used the independence of the $x_j$ to sum their variances; i.e. the root mean square error is
\[\sqrt{\mathbb{E}[|E_J(f)|^2]} = \frac{\sqrt{\mathbb{V}[f(x)]}}{\sqrt{J}}.\](For bounded $f$, one can often do much better using tools from concentration of measure.)
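The $J^{-1/2}$ rate can be checked empirically; a sketch under the running example (for $x\sim\mathcal{N}(0,1)$ and $f(x)=x^2$ we have $\mathbb{V}[f(x)]=\mathbb{E}[x^4]-\mathbb{E}[x^2]^2=2$):

```python
import numpy as np

rng = np.random.default_rng(1)

for J in [100, 400, 1600, 6400]:
    # Estimate the RMSE of Q_hat_J over many independent replications.
    x = rng.standard_normal((1000, J))
    estimates = (x**2).mean(axis=1)            # 1000 draws of Q_hat_J(f)
    rmse = np.sqrt(np.mean((estimates - 1.0)**2))
    print(J, rmse, np.sqrt(2 / J))             # empirical vs. sqrt(V[f]/J)
```

Each quadrupling of $J$ roughly halves the printed RMSE.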
We can also obtain a probabilistic error bound using Chebyshev’s inequality: for any $\varepsilon>0$,
\[\mathbb{P}\!\left[|E_J(f)|>\varepsilon\right] = \mathbb{P}\!\left[|\hat{Q}_J(f)-Q(f)|>\varepsilon\right] \le \frac{\mathbb{V}[\hat{Q}_J(f)]}{\varepsilon^2} = \frac{\mathbb{V}[f(x)]}{J\varepsilon^2}.\]The $J\varepsilon^2$ part highlights the basic inverse sq. root cost-accuracy tradeoff in MC methods: if you want $\varepsilon/2$ instead of $\varepsilon$ (i.e. to double the accuracy), then you must quadruple $J$ since
\[4 = \frac{1}{(1/2)^2}.\]This is independent of the dimension of integration!
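The dimension-independence can also be seen numerically. In the sketch below, the test function $f(x) = (\|x\|^2-d)/\sqrt{2d}+1$ is my own normalized choice (not from the notes): for $x\sim\mathcal{N}(0,I_d)$ it has $Q(f)=1$ and $\mathbb{V}[f(x)]=1$ in every dimension $d$, so the RMSE depends only on $J$:

```python
import numpy as np

rng = np.random.default_rng(2)
J, n_reps = 500, 200

for d in [1, 10, 100]:
    errors = []
    for _ in range(n_reps):
        x = rng.standard_normal((J, d))
        # f(x) = (||x||^2 - d)/sqrt(2d) + 1: mean 1, variance 1 for all d,
        # since ||x||^2 is chi-squared with d degrees of freedom.
        fx = ((x**2).sum(axis=1) - d) / np.sqrt(2 * d) + 1
        errors.append(fx.mean() - 1.0)
    print(d, np.sqrt(np.mean(np.square(errors))), 1 / np.sqrt(J))
```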
But how can we apply the MC method if we cannot easily draw samples from $\mu$ / $\rho$? This question motivates the Markov chain Monte Carlo (MCMC) methods that we consider next.
Note that because
\[\hat{Q}_J(f):=\frac{1}{J}\sum_{j=1}^J f(x_j)\]uses only the values $f(x_j)$ and does not care about the spacing of the nodes $x_j$, the performance of $\hat{Q}_J$ is robust to poor features of $f$ (roughness, high-dim domain, …), but it is also unable to leverage any nice features of $f$ (e.g. smoothness).
$\to$ quasi-Monte Carlo methods try to have the best of both worlds.
Markov chain Monte Carlo
Setup: We have a formula for evaluating the target dens. $\rho$ but drawing samples from $\rho$ is hard/impossible. Our idea is to construct a random walk (more precisely, a Markov chain) of states $z_1,z_2,\ldots,z_t,\ldots$ in $\mathbb{R}^d$ such that
- (i) as $t\to\infty$, the distribution of $z_t$ should approach/converge to the target dist. $\mu$ / its PDF $\rho$;
- (ii) the correlation between states should decay with their separation, i.e. $\mathrm{Cov}[z_t,z_{t+s}]\to 0$ (quickly?) as $s\to\infty$, so that moderately spaced states are as good as being indep. samples from the target;
- (iii) we want to approximate integrals by time averages along the chain:
\[\frac{1}{T}\sum_{t=1}^{T} f(z_t) \to Q(f) \quad\text{as } T\to\infty\]
(space average $=$ time average).
How do we construct such sequences $(z_t)_t$? We use a transition kernel / Markov kernel
\[\chi(x,A) := \mathbb{P}[z_{t+1}\in A \mid z_t=x]\]telling us the prob. dist. of the next state given the current one.
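A concrete example of a transition kernel (an illustrative choice, not from the notes): the AR(1) kernel $\chi(x,\cdot)=\mathcal{N}(ax,\,1-a^2)$ with $|a|<1$, which has $\mathcal{N}(0,1)$ as an invariant distribution; a quick simulation suggests this:

```python
import numpy as np

rng = np.random.default_rng(3)
a = 0.9                                   # correlation parameter, |a| < 1

# Simulate the chain z_{t+1} ~ N(a * z_t, 1 - a^2).
z = np.empty(100_000)
z[0] = 0.0
for t in range(len(z) - 1):
    z[t + 1] = a * z[t] + np.sqrt(1 - a**2) * rng.standard_normal()

# Empirical moments match the invariant distribution N(0, 1).
print(z.mean(), z.var())                  # ~ 0 and ~ 1
```

For this kernel, $\mathrm{Cov}[z_t,z_{t+s}]=a^s\to 0$ as $s\to\infty$ at stationarity, illustrating requirement (ii) above.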
One minimal requirement (necessary but not sufficient for (i)–(iii)) is that the target $\mu$ be invariant under $\chi$, i.e. that if $z_t\sim\mu$, then $z_{t+1}\sim\chi(z_t,\cdot)$ should be $\mu$-distributed as well:
\[\int_{\mathbb{R}^d}\chi(x,A)\,\mu(dx) = \mu(A) \qquad\text{for all measurable }A\subseteq\mathbb{R}^d.\]In practice, to construct a $\chi$ that has $\mu$ as an invariant/equilibrium distn., we often construct a $\chi$ that satisfies detailed balance / reversibility w.r.t. $\mu$:
\[\int_A \chi(x,B)\,\mu(dx) = \int_B \chi(x,A)\,\mu(dx) \qquad\text{for all measurable }A,B\subseteq\mathbb{R}^d.\]Proof that reversibility $\Rightarrow$ invariance. Suppose that $\chi$ is $\mu$-reversible, and let $A\subseteq\mathbb{R}^d$.
\[\int_{\mathbb{R}^d}\chi(x,A)\,\mu(dx) = \int_A \chi(x,\mathbb{R}^d)\,\mu(dx) = \int_A 1\,\mu(dx) = \mu(A),\]where the first equality is detailed balance with $B=\mathbb{R}^d$ and the second holds since $\chi(x,\mathbb{R}^d)=1$ (the next state must land somewhere). $\square$
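For intuition, here is a sketch checking reversibility and invariance for a small discrete chain, where the conditions reduce to $\mu_i\chi_{ij}=\mu_j\chi_{ji}$ and $\mu^\top\chi=\mu^\top$; the 3-state target and the construction (the Metropolis rule, previewing the next section) are illustrative choices:

```python
import numpy as np

mu = np.array([0.2, 0.3, 0.5])          # target distribution on 3 states
n = len(mu)

# Build a reversible transition matrix via the Metropolis rule with a
# uniform proposal: chi[i, j] = (1/n) * min(1, mu[j]/mu[i]) for j != i.
chi = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            chi[i, j] = min(1.0, mu[j] / mu[i]) / n
    chi[i, i] = 1.0 - chi[i].sum()      # remaining mass: stay put

flows = mu[:, None] * chi               # flows[i, j] = mu_i * chi_ij
print(np.allclose(flows, flows.T))      # detailed balance: True
print(np.allclose(mu @ chi, mu))        # invariance mu^T chi = mu^T: True
```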
The Metropolis–Hastings (MH) scheme
The Metropolis–Hastings (MH) scheme offers a simple implementation of a $\mu$-reversible (and hence $\mu$-invariant) transition kernel $\chi$, in two steps.
We need one further probability distribution, called the proposal distribution $Q(x,A)$, with PDF $q(x,x')$ (not to be confused with the integral $Q(f)$).
- Given that $z_t=x$, draw a proposed new state $z'$ with PDF $q(x,\cdot)$. (Here we need the proposal to be quick and easy to sample.)
- Calculate the acceptance probability $\alpha(z_t,z')$, where
\[\alpha(x,x') := \min\!\left\{1,\ \frac{\rho(x')\,q(x',x)}{\rho(x)\,q(x,x')}\right\},\]and then:
  - draw $u\sim\mathrm{Unif}([0,1])$;
  - if $u\le \alpha(z_t,z')$, then we accept the proposal and set $z_{t+1}:=z'$;
  - if $u>\alpha(z_t,z')$, then we reject the proposal and set $z_{t+1}:=z_t$.

In practice, we work directly with $\log\rho$, $\log q$, etc. (for numerical stability), and compare $\log u$ to $\log\alpha(z_t,z')$.
FACT: The MH scheme always yields a kernel $\chi_{\mathrm{MH}}$, i.e. a rule for sampling $z_{t+1}$ given $z_t$, that is reversible w.r.t. $\mu$ and hence has the target as an invariant distn.
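A minimal sketch of the MH scheme in code: random-walk Metropolis with a symmetric Gaussian proposal (so the $q$-ratio in $\alpha$ cancels), working in the log domain as noted above; the $\mathcal{N}(0,1)$ target and step size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

def log_rho(x):
    """Log target density, up to an additive constant: here N(0,1)."""
    return -0.5 * x**2

def metropolis_hastings(log_rho, z0, n_steps, step=1.0):
    """Random-walk Metropolis: proposal q(x, .) = N(x, step^2) is
    symmetric, so alpha = min(1, rho(z') / rho(z_t))."""
    chain = np.empty(n_steps)
    z = z0
    for t in range(n_steps):
        z_prop = z + step * rng.standard_normal()  # draw z' ~ q(z, .)
        log_alpha = log_rho(z_prop) - log_rho(z)   # log acceptance prob.
        if np.log(rng.uniform()) <= log_alpha:     # compare log u to log alpha
            z = z_prop                             # accept: z_{t+1} := z'
        # else reject: z_{t+1} := z_t (the chain stays put)
        chain[t] = z
    return chain

chain = metropolis_hastings(log_rho, z0=0.0, n_steps=50_000)
print(chain.mean(), chain.var())   # ~ 0 and ~ 1 for the N(0,1) target
print(np.mean(chain**2))           # time average estimating Q(f), f(x) = x^2
```

Note that $\alpha$ involves only the ratio $\rho(z')/\rho(z_t)$, so $\rho$ need only be known up to its normalizing constant; this is precisely the setting where direct sampling is hard but MCMC applies.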