Index: The Book of Statistical Proofs ▷ General Theorems ▷ Bayesian statistics ▷ Bayesian inference ▷ Decomposition of the free energy

Theorem: Let $m$ be a generative model with likelihood function $p(y \vert \theta,m) = p(y \vert \theta)$, prior distribution $p(\theta \vert m) = p(\theta)$, joint likelihood $p(y,\theta \vert m) = p(y,\theta)$, true posterior distribution $p(\theta \vert y,m) = p(\theta \vert y)$ and approximate posterior distribution $q(\theta \vert m) = q(\theta) \approx p(\theta \vert y)$. Then, the variational free energy can be decomposed as

1) the difference between log model evidence and KL divergence of approximate from true posterior

\[\label{eq:vb-fe1} \mathrm{F}[q(\theta)] = \log p(y) - \mathrm{KL}[q(\theta) || p(\theta \vert y)] \; ,\]

2) the difference of expected log-likelihood and KL divergence of approximate posterior from prior

\[\label{eq:vb-fe2} \mathrm{F}[q(\theta)] = \left\langle \log p(y \vert \theta) \right\rangle_{q(\theta)} - \mathrm{KL}[q(\theta) || p(\theta)]\]

3) the sum of expected joint log-likelihood and approximate posterior differential entropy

\[\label{eq:vb-fe3} \mathrm{F}[q(\theta)] = \left\langle \log p(y,\theta) \right\rangle_{q(\theta)} + \mathrm{h}[q(\theta)]\]

where $p(y \vert m) = p(y)$ is the marginal likelihood, $\left\langle \cdot \right\rangle_{p(x)}$ denotes an expectation with respect to the [density](/D/pdf) $p(x)$, $\mathrm{KL}[\cdot \vert\vert \cdot]$ denotes the Kullback-Leibler divergence and $\mathrm{h}[\cdot]$ denotes the differential entropy.
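
The following sketch checks the three identities numerically in a hypothetical conjugate setting: univariate Gaussian observations with known noise variance, a Gaussian prior over the mean, and a Gaussian approximate posterior $q(\theta)$ with freely chosen moments. All model settings, variable names and numbers below are illustrative assumptions and not part of the theorem; only standard Gaussian formulas (conjugate posterior, marginal likelihood, KL divergence, differential entropy) are used.

```python
# Numerical check of the three free-energy decompositions (minimal sketch).
# Hypothetical model: y_i ~ N(theta, sigma2) with known sigma2,
# prior theta ~ N(mu0, tau02), approximate posterior q(theta) = N(m, s2).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, sigma2, mu0, tau02 = 10, 1.5, 0.0, 2.0
y = rng.normal(1.0, np.sqrt(sigma2), size=n)             # simulated data

# exact posterior p(theta|y) = N(mu_n, tau_n2) (conjugate Gaussian result)
tau_n2 = 1.0 / (1.0 / tau02 + n / sigma2)
mu_n = tau_n2 * (mu0 / tau02 + y.sum() / sigma2)

# log model evidence: marginally, y ~ N(mu0 * 1, sigma2 * I + tau02 * 1 1^T)
log_py = multivariate_normal.logpdf(
    y, mean=np.full(n, mu0), cov=sigma2 * np.eye(n) + tau02 * np.ones((n, n)))

# a deliberately mis-specified approximate posterior q(theta) = N(m, s2)
m, s2 = mu_n + 0.3, 2.0 * tau_n2

def kl_gauss(m1, v1, m2, v2):
    """KL[N(m1,v1) || N(m2,v2)] for univariate Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# expectations under q and the entropy of q
exp_loglik = (-0.5 * n * np.log(2 * np.pi * sigma2)
              - (np.sum((y - m) ** 2) + n * s2) / (2 * sigma2))    # <log p(y|theta)>_q
exp_logprior = (-0.5 * np.log(2 * np.pi * tau02)
                - ((m - mu0) ** 2 + s2) / (2 * tau02))             # <log p(theta)>_q
h_q = 0.5 * np.log(2 * np.pi * np.e * s2)                          # h[q]

F1 = log_py - kl_gauss(m, s2, mu_n, tau_n2)       # 1) evidence minus KL to posterior
F2 = exp_loglik - kl_gauss(m, s2, mu0, tau02)     # 2) accuracy minus complexity
F3 = (exp_loglik + exp_logprior) + h_q            # 3) energy plus entropy

print(F1, F2, F3)   # all three agree up to floating-point error
```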

Proof: The log model evidence is defined as

\[\label{eq:lme} \log p(y) = \log \int_{\Theta} p(y,\theta) \, \mathrm{d}\theta\]

and using the approximate posterior density $q(\theta)$, this can be rewritten as

\[\label{eq:lme-eq} \begin{split} \log p(y) &= \log \int_{\Theta} p(y,\theta) \, \frac{q(\theta)}{q(\theta)} \, \mathrm{d}\theta \\ &= \log \left\langle \frac{p(y,\theta)}{q(\theta)} \right\rangle_{q(\theta)} \; . \end{split}\]

Because the logarithm is a concave function, Jensen’s inequality implies:

\[\label{eq:lme-ineq} \log \left\langle \frac{p(y,\theta)}{q(\theta)} \right\rangle_{q(\theta)} \geq \left\langle \log \frac{p(y,\theta)}{q(\theta)} \right\rangle_{q(\theta)} \; .\]

The right-hand side of this inequality is referred to as the (variational) free energy:

\[\label{eq:fe} \mathrm{F}[q(\theta)] = \left\langle \log \frac{p(y,\theta)}{q(\theta)} \right\rangle_{q(\theta)} \; .\]
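
This bound can also be checked by Monte Carlo: for $\theta$ sampled from $q(\theta)$, the sample average of $\log [p(y,\theta)/q(\theta)]$ estimates the free energy, while the logarithm of the sample average of $p(y,\theta)/q(\theta)$ estimates the log model evidence (an importance sampling estimate), and by Jensen’s inequality the former never exceeds the latter. The sketch below reuses the hypothetical Gaussian setup from above; all choices are illustrative.

```python
# Monte Carlo illustration of the Jensen bound F[q] <= log p(y) (sketch).
# Same hypothetical setup: y_i ~ N(theta, sigma2), theta ~ N(mu0, tau02).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(2)
n, sigma2, mu0, tau02 = 10, 1.5, 0.0, 2.0
y = rng.normal(1.0, np.sqrt(sigma2), size=n)

# an arbitrary approximate posterior q(theta) = N(m, s2)
m, s2 = 0.8, 0.5

# draw theta ~ q(theta) and form log p(y,theta) - log q(theta)
S = 100_000
theta = rng.normal(m, np.sqrt(s2), size=S)
log_joint = (norm.logpdf(y[:, None], loc=theta, scale=np.sqrt(sigma2)).sum(axis=0)
             + norm.logpdf(theta, loc=mu0, scale=np.sqrt(tau02)))
log_w = log_joint - norm.logpdf(theta, loc=m, scale=np.sqrt(s2))

F_hat = log_w.mean()                        # <log p(y,theta)/q(theta)>_q
log_py_hat = logsumexp(log_w) - np.log(S)   # log <p(y,theta)/q(theta)>_q
log_py = multivariate_normal.logpdf(        # exact log model evidence
    y, mean=np.full(n, mu0), cov=sigma2 * np.eye(n) + tau02 * np.ones((n, n)))

# F_hat <= log_py_hat holds exactly for every sample; log_py_hat ~= log_py for large S
print(F_hat, log_py_hat, log_py)
```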

With that, we can show the above identities:

1) the difference between log model evidence and KL divergence of approximate from true posterior

\[\label{eq:vb-fe1-qed} \begin{split} \mathrm{F}[q(\theta)] &= \left\langle \log \frac{p(y,\theta)}{q(\theta)} \right\rangle_{q(\theta)} \\ &= \left\langle \log \frac{p(\theta \vert y) p(y)}{q(\theta)} \right\rangle_{q(\theta)} \\ &= \left\langle \log p(y) - \log \frac{q(\theta)}{p(\theta \vert y)} \right\rangle_{q(\theta)} \\ &= \left\langle \log p(y) \right\rangle_{q(\theta)} - \int_{\Theta} q(\theta) \log \frac{q(\theta)}{p(\theta \vert y)} \, \mathrm{d}\theta \\ &= \log p(y) - \mathrm{KL}[q(\theta) || p(\theta \vert y)] \end{split}\]

where the first term is the log model evidence (= log marginal likelihood) and the second term can be seen as an approximation error (= divergence of approximate from true posterior distribution);

2) the difference of expected log-likelihood and KL divergence of approximate posterior from prior

\[\label{eq:vb-fe2-qed} \begin{split} \mathrm{F}[q(\theta)] &= \left\langle \log \frac{p(y,\theta)}{q(\theta)} \right\rangle_{q(\theta)} \\ &= \left\langle \log \frac{p(y \vert \theta) p(\theta)}{q(\theta)} \right\rangle_{q(\theta)} \\ &= \left\langle \log p(y \vert \theta) - \log \frac{q(\theta)}{p(\theta)} \right\rangle_{q(\theta)} \\ &= \int_{\Theta} q(\theta) \log p(y \vert \theta) \, \mathrm{d}\theta - \int_{\Theta} q(\theta) \log \frac{q(\theta)}{p(\theta)} \, \mathrm{d}\theta \\ &= \left\langle \log p(y \vert \theta) \right\rangle_{q(\theta)} - \mathrm{KL}[q(\theta) || p(\theta)] \end{split}\]

where the first term can be seen as an accuracy term (= posterior expected log-likelihood) and the second term can be seen as a complexity penalty (= divergence of posterior from prior distribution);

3) the sum of expected joint log-likelihood and approximate posterior differential entropy

\[\label{eq:vb-fe3-qed} \begin{split} \mathrm{F}[q(\theta)] &= \left\langle \log \frac{p(y,\theta)}{q(\theta)} \right\rangle_{q(\theta)} \\ &= \left\langle \log p(y,\theta) - \log q(\theta) \right\rangle_{q(\theta)} \\ &= \int_{\Theta} q(\theta) \log p(y,\theta) \, \mathrm{d}\theta - \int_{\Theta} q(\theta) \log q(\theta) \, \mathrm{d}\theta \\ &= \left\langle \log p(y,\theta) \right\rangle_{q(\theta)} + \mathrm{h}[q(\theta)] \end{split}\]

where the first term represents the negative energy (= posterior expected joint log-likelihood) and the second term represents the posterior entropy (= differential entropy of the approximate posterior distribution).
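
Decompositions 1) and 2) can be illustrated with the same hypothetical conjugate Gaussian setup as above: as the mean of $q(\theta)$ is moved towards the true posterior mean, accuracy improves while complexity grows, and the free energy rises towards the log model evidence, which it attains exactly when $q(\theta)$ equals the true posterior. All settings below are illustrative assumptions.

```python
# Sketch of the interpretations in 1) and 2): bound tightness and the
# accuracy/complexity trade-off (hypothetical conjugate Gaussian setup).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, sigma2, mu0, tau02 = 10, 1.5, 0.0, 2.0
y = rng.normal(1.0, np.sqrt(sigma2), size=n)

tau_n2 = 1.0 / (1.0 / tau02 + n / sigma2)                # true posterior variance
mu_n = tau_n2 * (mu0 / tau02 + y.sum() / sigma2)         # true posterior mean
log_py = multivariate_normal.logpdf(                     # exact log model evidence
    y, mean=np.full(n, mu0), cov=sigma2 * np.eye(n) + tau02 * np.ones((n, n)))

def kl_gauss(m1, v1, m2, v2):
    """KL[N(m1,v1) || N(m2,v2)] for univariate Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

s2 = tau_n2                                              # fix q's variance at tau_n2
for m in [mu_n - 1.0, mu_n - 0.5, mu_n]:                 # q's mean approaches mu_n
    accuracy = (-0.5 * n * np.log(2 * np.pi * sigma2)
                - (np.sum((y - m) ** 2) + n * s2) / (2 * sigma2))
    complexity = kl_gauss(m, s2, mu0, tau02)
    F = accuracy - complexity                            # decomposition 2)
    gap = kl_gauss(m, s2, mu_n, tau_n2)                  # KL[q || p(theta|y)]
    print(f"m={m:+.3f}  accuracy={accuracy:.3f}  complexity={complexity:.3f}  "
          f"F={F:.3f}  log p(y) - KL = {log_py - gap:.3f}")
# In the last row, KL[q||posterior] = 0, so that F equals the log model evidence.
```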

Sources:

Metadata: ID: P516 | shortcut: fren-dec | author: JoramSoch | date: 2025-09-25, 10:57.