Index: The Book of Statistical Proofs ▷ Probability Distributions ▷ Multivariate continuous distributions ▷ Multivariate normal distribution ▷ Maximum entropy distribution

Theorem: Among all continuous probability distributions of a random vector with a given covariance matrix, the multivariate normal distribution maximizes the differential entropy.

Proof: For a random vector $X$ with set of possible values $\mathcal{X}$ and probability density function $p(x)$, the differential entropy is defined as:

\[\label{eq:dent} \mathrm{h}(X) = - \int_{\mathcal{X}} p(x) \log p(x) \, \mathrm{d}x \; .\]

Let $g(x)$ be the probability density function of a multivariate normal distribution with mean $\mu$ and covariance $\Sigma$, and let $f(x)$ be an arbitrary probability density function with the same covariance. Since differential entropy is translation-invariant, we may assume that $f(x)$ also has mean $\mu$.

Consider the Kullback-Leibler divergence of distribution $f(x)$ from distribution $g(x)$, which is non-negative:

\[\label{eq:kl-fg} \begin{split} 0 \leq \mathrm{KL}[f||g] &= \int_{\mathcal{X}} f(x) \log \frac{f(x)}{g(x)} \, \mathrm{d}x \\ &= \int_{\mathcal{X}} f(x) \log f(x) \, \mathrm{d}x - \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x \\ &\overset{\eqref{eq:dent}}{=} - \mathrm{h}[f(x)] - \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x \; . \end{split}\]
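As a numerical sanity check (not part of the proof), the non-negativity of the KL divergence can be illustrated with a one-dimensional example: a Laplace distribution $f$ whose scale is chosen to match the unit variance of a standard normal $g$. The divergence, computed by quadrature with SciPy, comes out small but strictly positive:

```python
import numpy as np
from scipy.stats import norm, laplace
from scipy.integrate import quad

# f: Laplace with scale b such that Var = 2*b^2 = 1, i.e. b = 1/sqrt(2);
# g: standard normal. Both have mean 0 and variance 1.
f = laplace(scale=1 / np.sqrt(2))
g = norm()

# KL[f||g] = int f(x) * log(f(x)/g(x)) dx, integrated over a wide interval;
# points=[0] tells the integrator about the kink of the Laplace density.
kl, _ = quad(lambda x: f.pdf(x) * (f.logpdf(x) - g.logpdf(x)),
             -20, 20, points=[0])
print(kl)  # strictly positive, since f is not Gaussian
```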

By plugging the probability density function of the multivariate normal distribution into the second term, we obtain:

\[\label{eq:int-fg-s1} \begin{split} \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x &= \int_{\mathcal{X}} f(x) \log \left( \frac{1}{\sqrt{(2 \pi)^n |\Sigma|}} \cdot \exp \left[ -\frac{1}{2} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right] \right) \, \mathrm{d}x \\ &= \int_{\mathcal{X}} f(x) \log \left( \frac{1}{\sqrt{(2 \pi)^n |\Sigma|}} \right) \, \mathrm{d}x + \int_{\mathcal{X}} f(x) \log \left( \exp \left[ -\frac{1}{2} (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right] \right) \, \mathrm{d}x \\ &= \left( - \frac{n}{2} \log (2 \pi) - \frac{1}{2} \log |\Sigma| \right) \int_{\mathcal{X}} f(x) \, \mathrm{d}x - \frac{1}{2} \int_{\mathcal{X}} f(x) \left[ (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right] \, \mathrm{d}x \; . \end{split}\]

Because a probability density function integrates to one and using the definition of the expected value, this becomes:

\[\label{eq:int-fg-s2} \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x = - \frac{n}{2} \log (2 \pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} \left\langle (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right\rangle_{f(x)} \; .\]

Using the expectation of a trace and the definition of the covariance matrix, the second term can be developed as follows:

\[\label{eq:int-fg-s3} \begin{split} \left\langle (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right\rangle_{f(x)} &= \left\langle \mathrm{tr} \left[ (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right] \right\rangle_{f(x)} \\ &= \left\langle \mathrm{tr} \left[ \Sigma^{-1} (x-\mu) (x-\mu)^\mathrm{T} \right] \right\rangle_{f(x)} \\ &= \mathrm{tr} \left[ \Sigma^{-1} \left\langle (x-\mu) (x-\mu)^\mathrm{T} \right\rangle_{f(x)} \right] \\ &= \mathrm{tr} \left[ \Sigma^{-1} \Sigma \right] \\ &= \mathrm{tr} \left[ I_n \right] \\ &= n \; . \end{split}\]
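The trace identity above, $\left\langle (x-\mu)^\mathrm{T} \Sigma^{-1} (x-\mu) \right\rangle = n$, holds for any distribution with mean $\mu$ and covariance $\Sigma$. A quick Monte Carlo illustration (an informal check, with an arbitrarily chosen $\Sigma$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)  # an arbitrary positive definite covariance

# Draw samples from some distribution with mean mu and covariance Sigma;
# here, for simplicity, the multivariate normal itself.
X = rng.multivariate_normal(mu, Sigma, size=200_000)

# Quadratic form (x-mu)^T Sigma^{-1} (x-mu) for each sample
d = X - mu
qform = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)
print(qform.mean())  # ≈ n
```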

Thus, the second term in \eqref{eq:kl-fg} is equal to

\[\label{eq:int-fg-s4} \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x = - \frac{n}{2} \log (2 \pi) - \frac{1}{2} \log |\Sigma| - \frac{n}{2} \; .\]

This is precisely the negative of the differential entropy of the multivariate normal distribution, such that:

\[\label{eq:int-fg-s5} \int_{\mathcal{X}} f(x) \log g(x) \, \mathrm{d}x = -\mathrm{h}[\mathcal{N}(\mu,\Sigma)] = -\mathrm{h}[g(x)] \; .\]
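The value in \eqref{eq:int-fg-s4}, with its sign flipped, can be checked against SciPy's built-in entropy of a frozen multivariate normal (a numerical confirmation, with an example $\Sigma$ chosen here for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

n = 2
mu = np.zeros(n)
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

# Closed form: h = n/2 * log(2*pi) + 1/2 * log|Sigma| + n/2
h_closed = 0.5 * n * np.log(2 * np.pi) \
    + 0.5 * np.log(np.linalg.det(Sigma)) + 0.5 * n
h_scipy = multivariate_normal(mu, Sigma).entropy()
print(h_closed, h_scipy)  # both values agree
```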

Combining \eqref{eq:kl-fg} with \eqref{eq:int-fg-s5}, we obtain

\[\label{eq:mvn-maxent} \begin{split} 0 &\leq \mathrm{KL}[f||g] \\ 0 &\leq - \mathrm{h}[f(x)] - \left( -\mathrm{h}[g(x)] \right) \\ \mathrm{h}[g(x)] &\geq \mathrm{h}[f(x)] \end{split}\]

which means that the differential entropy of the multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$ is larger than or equal to that of any other distribution with the same covariance matrix $\Sigma$.
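The theorem can be illustrated numerically by comparing the Gaussian against a uniform distribution with matching covariance. For independent coordinates with variances $\sigma_i^2$ (a diagonal $\Sigma$ chosen here for illustration), the Gaussian entropy is $\frac{1}{2} \sum_i \log(2 \pi e \, \sigma_i^2)$, while a uniform box of equal variance has entropy $\frac{1}{2} \sum_i \log(12 \, \sigma_i^2)$; since $2 \pi e \approx 17.08 > 12$, the Gaussian wins:

```python
import numpy as np

# Independent coordinates with these variances (diagonal Sigma)
s2 = np.array([1.0, 4.0, 0.25])

# Differential entropy of the Gaussian: 1/2 * sum log(2*pi*e*sigma_i^2)
h_gauss = 0.5 * np.sum(np.log(2 * np.pi * np.e * s2))

# Uniform on [-w/2, w/2] has variance w^2/12, so w = sqrt(12*sigma^2);
# its entropy is log(volume) = sum log(w)
h_unif = np.sum(np.log(np.sqrt(12 * s2)))

print(h_gauss > h_unif)  # True: the Gaussian has the larger entropy
```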


Metadata: ID: P500 | shortcut: mvn-maxent | author: JoramSoch | date: 2025-05-21, 14:24.