Proofs are printed in bold. Definitions are set in italics.

### Chapter I: General Theorems

1. Probability theory

1.1. Random experiments
1.1.1. Random experiment
1.1.2. Sample space
1.1.3. Event space
1.1.4. Probability space

1.2. Random variables
1.2.1. Random event
1.2.2. Random variable
1.2.3. Random vector
1.2.4. Random matrix
1.2.5. Constant
1.2.6. Discrete vs. continuous
1.2.7. Univariate vs. multivariate

1.3. Probability
1.3.1. Probability
1.3.2. Joint probability
1.3.3. Marginal probability
1.3.4. Conditional probability
1.3.5. Exceedance probability
1.3.6. Statistical independence
1.3.7. Conditional independence
1.3.8. Probability under independence
1.3.9. Mutual exclusivity
1.3.10. Probability under exclusivity

1.4. Probability axioms
1.4.1. Axioms of probability
1.4.2. Monotonicity of probability
1.4.3. Probability of the empty set
1.4.4. Probability of the complement
1.4.5. Range of probability
1.4.7. Law of total probability
1.4.8. Probability of exhaustive events (1)
1.4.9. Probability of exhaustive events (2)

1.5. Probability distributions
1.5.1. Probability distribution
1.5.2. Joint distribution
1.5.3. Marginal distribution
1.5.4. Conditional distribution
1.5.5. Sampling distribution

1.6. Probability mass function
1.6.1. Definition
1.6.2. Probability mass function of sum of independents
1.6.3. Probability mass function of strictly increasing function
1.6.4. Probability mass function of strictly decreasing function
1.6.5. Probability mass function of invertible function

1.10. Expected value
1.10.1. Definition
1.10.2. Sample mean
1.10.3. Non-negative random variable
1.10.4. Non-negativity
1.10.5. Linearity
1.10.6. Monotonicity
1.10.7. (Non-)Multiplicativity
1.10.8. Expectation of a trace
1.10.9. Expectation of a quadratic form
1.10.10. Squared expectation of a product
1.10.11. Law of total expectation
1.10.12. Law of the unconscious statistician
1.10.13. Expected value of a random vector
1.10.14. Expected value of a random matrix

1.11. Variance
1.11.1. Definition
1.11.2. Sample variance
1.11.3. Partition into expected values
1.11.4. Non-negativity
1.11.5. Variance of a constant
1.11.7. Scaling upon multiplication
1.11.8. Variance of a sum
1.11.9. Variance of linear combination
1.11.11. Law of total variance
1.11.12. Precision

1.12. Skewness
1.12.1. Definition
1.12.2. Sample skewness
1.12.3. Partition into expected values

1.13. Covariance
1.13.1. Definition
1.13.2. Sample covariance
1.13.3. Partition into expected values
1.13.4. Symmetry
1.13.5. Self-covariance
1.13.6. Covariance under independence
1.13.7. Relationship to correlation
1.13.8. Law of total covariance
1.13.9. Covariance matrix
1.13.10. Sample covariance matrix
1.13.11. Covariance matrix and expected values
1.13.12. Symmetry
1.13.13. Positive semi-definiteness
1.13.14. Invariance under addition of vector
1.13.15. Scaling upon multiplication with matrix
1.13.16. Cross-covariance matrix
1.13.17. Covariance matrix of a sum
1.13.18. Covariance matrix and correlation matrix
1.13.19. Precision matrix
1.13.20. Precision matrix and correlation matrix

1.14. Correlation
1.14.1. Definition
1.14.2. Range
1.14.3. Sample correlation coefficient
1.14.4. Relationship to standard scores
1.14.5. Correlation matrix
1.14.6. Sample correlation matrix

1.15. Measures of central tendency
1.15.1. Median
1.15.2. Mode

1.16. Measures of statistical dispersion
1.16.1. Standard deviation
1.16.2. Full width at half maximum

1.17. Further summary statistics
1.17.1. Minimum
1.17.2. Maximum

1.18. Further moments
1.18.1. Moment
1.18.2. Moment in terms of moment-generating function
1.18.3. Raw moment
1.18.4. First raw moment is mean
1.18.5. Second raw moment and variance
1.18.6. Central moment
1.18.7. First central moment is zero
1.18.8. Second central moment is variance
1.18.9. Standardized moment

2. Information theory

2.1. Shannon entropy
2.1.1. Definition
2.1.2. Non-negativity
2.1.3. Concavity
2.1.4. Conditional entropy
2.1.5. Joint entropy
2.1.6. Cross-entropy
2.1.7. Convexity of cross-entropy
2.1.8. Gibbs’ inequality
2.1.9. Log sum inequality

2.2. Differential entropy
2.2.1. Definition
2.2.2. Negativity
2.2.6. Non-invariance and transformation
2.2.7. Conditional differential entropy
2.2.8. Joint differential entropy
2.2.9. Differential cross-entropy

2.3. Discrete mutual information
2.3.1. Definition
2.3.2. Relation to marginal and conditional entropy
2.3.3. Relation to marginal and joint entropy
2.3.4. Relation to joint and conditional entropy

2.4. Continuous mutual information
2.4.1. Definition
2.4.2. Relation to marginal and conditional differential entropy
2.4.3. Relation to marginal and joint differential entropy
2.4.4. Relation to joint and conditional differential entropy

2.5. Kullback-Leibler divergence
2.5.1. Definition
2.5.2. Non-negativity (1)
2.5.3. Non-negativity (2)
2.5.4. Non-symmetry
2.5.5. Convexity
2.5.7. Invariance under parameter transformation
2.5.8. Relation to discrete entropy
2.5.9. Relation to differential entropy

3. Estimation theory

3.1. Point estimates
3.1.1. Mean squared error
3.1.2. Partition of the mean squared error into bias and variance

3.2. Interval estimates
3.2.1. Confidence interval
3.2.2. Construction of confidence intervals using Wilks’ theorem

4. Frequentist statistics

4.1. Likelihood theory
4.1.1. Likelihood function
4.1.2. Log-likelihood function
4.1.3. Maximum likelihood estimation
4.1.4. MLE can be biased
4.1.5. Maximum log-likelihood
4.1.6. Method of moments

4.2. Statistical hypotheses
4.2.1. Statistical hypothesis
4.2.2. Simple vs. composite
4.2.3. Point/exact vs. set/inexact
4.2.4. One-tailed vs. two-tailed

4.3. Hypothesis testing
4.3.1. Statistical test
4.3.2. Null hypothesis
4.3.3. Alternative hypothesis
4.3.4. One-tailed vs. two-tailed
4.3.5. Test statistic
4.3.6. Size of a test
4.3.7. Power of a test
4.3.8. Significance level
4.3.9. Critical value
4.3.10. p-value
4.3.11. Distribution of p-value under null hypothesis

5. Bayesian statistics

5.1. Probabilistic modeling
5.1.1. Generative model
5.1.2. Likelihood function
5.1.3. Prior distribution
5.1.4. Full probability model
5.1.5. Joint likelihood
5.1.6. Joint likelihood is product of likelihood and prior
5.1.7. Posterior distribution
5.1.8. Maximum-a-posteriori estimation
5.1.9. Posterior density is proportional to joint likelihood
5.1.10. Combined posterior distribution from independent data
5.1.11. Marginal likelihood
5.1.12. Marginal likelihood is integral of joint likelihood

5.2. Prior distributions
5.2.1. Flat vs. hard vs. soft
5.2.2. Uniform vs. non-uniform
5.2.3. Informative vs. non-informative
5.2.4. Empirical vs. non-empirical
5.2.5. Conjugate vs. non-conjugate
5.2.6. Maximum entropy priors
5.2.7. Empirical Bayes priors
5.2.8. Reference priors

5.3. Bayesian inference
5.3.1. Bayes’ theorem
5.3.2. Bayes’ rule
5.3.3. Empirical Bayes
5.3.4. Variational Bayes

6. Machine learning

6.1. Scoring rules
6.1.1. Scoring rule
6.1.2. Proper scoring rule
6.1.3. Strictly proper scoring rule
6.1.4. Log probability scoring rule
6.1.5. Log probability is strictly proper scoring rule
6.1.6. Brier scoring rule
6.1.7. Brier scoring rule is strictly proper scoring rule

### Chapter II: Probability Distributions

1. Univariate discrete distributions

1.1. Discrete uniform distribution
1.1.1. Definition
1.1.2. Probability mass function
1.1.3. Cumulative distribution function
1.1.4. Quantile function
1.1.5. Shannon entropy
1.1.6. Kullback-Leibler divergence
1.1.7. Maximum entropy distribution

1.2. Bernoulli distribution
1.2.1. Definition
1.2.2. Probability mass function
1.2.3. Mean
1.2.4. Variance
1.2.5. Range of variance
1.2.6. Shannon entropy
1.2.7. Kullback-Leibler divergence

1.3. Binomial distribution
1.3.1. Definition
1.3.2. Probability mass function
1.3.3. Probability-generating function
1.3.4. Mean
1.3.5. Variance
1.3.6. Range of variance
1.3.7. Shannon entropy
1.3.8. Kullback-Leibler divergence
1.3.9. Conditional binomial

1.4. Beta-binomial distribution
1.4.1. Definition
1.4.2. Probability mass function
1.4.3. Probability mass function in terms of gamma function
1.4.4. Cumulative distribution function

1.5. Poisson distribution
1.5.1. Definition
1.5.2. Probability mass function
1.5.3. Mean
1.5.4. Variance

2. Multivariate discrete distributions

2.1. Categorical distribution
2.1.1. Definition
2.1.2. Probability mass function
2.1.3. Mean
2.1.4. Covariance
2.1.5. Shannon entropy

2.2. Multinomial distribution
2.2.1. Definition
2.2.2. Probability mass function
2.2.3. Mean
2.2.4. Covariance
2.2.5. Shannon entropy

3. Univariate continuous distributions

3.1. Continuous uniform distribution
3.1.1. Definition
3.1.2. Standard uniform distribution
3.1.3. Probability density function
3.1.4. Cumulative distribution function
3.1.5. Quantile function
3.1.6. Mean
3.1.7. Median
3.1.8. Mode
3.1.9. Variance
3.1.10. Differential entropy
3.1.11. Kullback-Leibler divergence
3.1.12. Maximum entropy distribution

3.2. Normal distribution
3.2.1. Definition
3.2.2. Special case of multivariate normal distribution
3.2.3. Standard normal distribution
3.2.4. Relationship to standard normal distribution (1)
3.2.5. Relationship to standard normal distribution (2)
3.2.6. Relationship to standard normal distribution (3)
3.2.7. Relationship to chi-squared distribution
3.2.8. Relationship to t-distribution
3.2.9. Gaussian integral
3.2.10. Probability density function
3.2.11. Moment-generating function
3.2.12. Cumulative distribution function
3.2.13. Cumulative distribution function without error function
3.2.14. Probability of being within standard deviations from mean
3.2.15. Quantile function
3.2.16. Mean
3.2.17. Median
3.2.18. Mode
3.2.19. Variance
3.2.20. Full width at half maximum
3.2.21. Extreme points
3.2.22. Inflection points
3.2.23. Differential entropy
3.2.24. Kullback-Leibler divergence
3.2.25. Maximum entropy distribution
3.2.26. Linear combination

3.3. t-distribution
3.3.1. Definition
3.3.2. Special case of multivariate t-distribution
3.3.3. Non-standardized t-distribution
3.3.4. Relationship to non-standardized t-distribution
3.3.5. Probability density function

3.4. Gamma distribution
3.4.1. Definition
3.4.2. Special case of Wishart distribution
3.4.3. Standard gamma distribution
3.4.4. Relationship to standard gamma distribution (1)
3.4.5. Relationship to standard gamma distribution (2)
3.4.6. Scaling of a gamma random variable
3.4.7. Probability density function
3.4.8. Moment-generating function
3.4.9. Cumulative distribution function
3.4.10. Quantile function
3.4.11. Mean
3.4.12. Variance
3.4.13. Logarithmic expectation
3.4.14. Expectation of x ln x
3.4.15. Differential entropy
3.4.16. Kullback-Leibler divergence

3.5. Exponential distribution
3.5.1. Definition
3.5.2. Special case of gamma distribution
3.5.3. Probability density function
3.5.4. Moment-generating function
3.5.5. Cumulative distribution function
3.5.6. Quantile function
3.5.7. Mean
3.5.8. Median
3.5.9. Mode
3.5.10. Variance
3.5.11. Skewness

3.6. Log-normal distribution
3.6.1. Definition
3.6.2. Probability density function
3.6.3. Cumulative distribution function
3.6.4. Quantile function
3.6.5. Mean
3.6.6. Median
3.6.7. Mode
3.6.8. Variance

3.7. Chi-squared distribution
3.7.1. Definition
3.7.2. Special case of gamma distribution
3.7.3. Probability density function
3.7.4. Moments

3.8. F-distribution
3.8.1. Definition
3.8.2. Probability density function

3.9. Beta distribution
3.9.1. Definition
3.9.2. Relationship to chi-squared distribution
3.9.3. Probability density function
3.9.4. Moment-generating function
3.9.5. Cumulative distribution function
3.9.6. Mean
3.9.7. Variance

3.10. Wald distribution
3.10.1. Definition
3.10.2. Probability density function
3.10.3. Moment-generating function
3.10.4. Mean
3.10.5. Variance
3.10.6. Skewness
3.10.7. Method of moments

3.11. ex-Gaussian distribution
3.11.1. Definition
3.11.2. Probability density function
3.11.3. Moment-generating function
3.11.4. Mean
3.11.5. Variance
3.11.6. Skewness
3.11.7. Method of moments

4. Multivariate continuous distributions

4.1. Multivariate normal distribution
4.1.1. Definition
4.1.2. Special case of matrix-normal distribution
4.1.3. Relationship to chi-squared distribution
4.1.4. Bivariate normal distribution
4.1.5. Probability density function of the bivariate normal distribution
4.1.6. Probability density function in terms of correlation coefficient
4.1.7. Probability density function
4.1.8. Moment-generating function
4.1.9. Mean
4.1.10. Covariance
4.1.11. Differential entropy
4.1.12. Kullback-Leibler divergence
4.1.13. Linear transformation
4.1.14. Marginal distributions
4.1.15. Conditional distributions
4.1.16. Conditions for independence
4.1.17. Independence of products

4.2. Multivariate t-distribution
4.2.1. Definition
4.2.2. Probability density function
4.2.3. Relationship to F-distribution

4.3. Normal-gamma distribution
4.3.1. Definition
4.3.2. Special case of normal-Wishart distribution
4.3.3. Probability density function
4.3.4. Mean
4.3.5. Covariance
4.3.6. Differential entropy
4.3.7. Kullback-Leibler divergence
4.3.8. Marginal distributions
4.3.9. Conditional distributions
4.3.10. Drawing samples

4.4. Dirichlet distribution
4.4.1. Definition
4.4.2. Probability density function
4.4.3. Kullback-Leibler divergence
4.4.4. Exceedance probabilities

5. Matrix-variate continuous distributions

5.1. Matrix-normal distribution
5.1.1. Definition
5.1.2. Equivalence to multivariate normal distribution
5.1.3. Probability density function
5.1.4. Mean
5.1.5. Covariance
5.1.6. Differential entropy
5.1.7. Kullback-Leibler divergence
5.1.8. Transposition
5.1.9. Linear transformation
5.1.10. Marginal distributions
5.1.11. Drawing samples

5.2. Wishart distribution
5.2.1. Definition
5.2.2. Kullback-Leibler divergence

5.3. Normal-Wishart distribution
5.3.1. Definition
5.3.2. Probability density function
5.3.3. Mean

### Chapter III: Statistical Models

1. Univariate normal data

1.1. Univariate Gaussian
1.1.1. Definition
1.1.2. Maximum likelihood estimation
1.1.3. One-sample t-test
1.1.4. Two-sample t-test
1.1.5. Paired t-test
1.1.6. Conjugate prior distribution
1.1.7. Posterior distribution
1.1.8. Log model evidence
1.1.9. Accuracy and complexity

1.2. Univariate Gaussian with known variance
1.2.1. Definition
1.2.2. Maximum likelihood estimation
1.2.3. One-sample z-test
1.2.4. Two-sample z-test
1.2.5. Paired z-test
1.2.6. Conjugate prior distribution
1.2.7. Posterior distribution
1.2.8. Log model evidence
1.2.9. Accuracy and complexity
1.2.10. Log Bayes factor
1.2.11. Expectation of log Bayes factor
1.2.12. Cross-validated log model evidence
1.2.13. Cross-validated log Bayes factor
1.2.14. Expectation of cross-validated log Bayes factor

1.4. Simple linear regression
1.4.1. Definition
1.4.2. Special case of multiple linear regression
1.4.3. Ordinary least squares (1)
1.4.4. Ordinary least squares (2)
1.4.5. Expectation of estimates
1.4.6. Variance of estimates
1.4.7. Distribution of estimates
1.4.8. Correlation of estimates
1.4.9. Effects of mean-centering
1.4.10. Regression line
1.4.11. Regression line includes center of mass
1.4.12. Projection of data point to regression line
1.4.13. Sums of squares
1.4.14. Transformation matrices
1.4.15. Weighted least squares (1)
1.4.16. Weighted least squares (2)
1.4.17. Maximum likelihood estimation (1)
1.4.18. Maximum likelihood estimation (2)
1.4.19. t-test for intercept parameter
1.4.20. t-test for slope parameter
1.4.21. Sum of residuals is zero
1.4.22. Correlation with covariate is zero
1.4.23. Residual variance in terms of sample variance
1.4.24. Correlation coefficient in terms of slope estimate
1.4.25. Coefficient of determination in terms of correlation coefficient

1.5. Multiple linear regression
1.5.1. Definition
1.5.2. Special case of general linear model
1.5.3. Ordinary least squares (1)
1.5.4. Ordinary least squares (2)
1.5.5. Ordinary least squares for two regressors
1.5.6. Total sum of squares
1.5.7. Explained sum of squares
1.5.8. Residual sum of squares
1.5.9. Total, explained and residual sum of squares
1.5.10. Estimation matrix
1.5.11. Projection matrix
1.5.12. Residual-forming matrix
1.5.13. Estimation, projection and residual-forming matrix
1.5.14. Symmetry of projection and residual-forming matrix
1.5.15. Idempotence of projection and residual-forming matrix
1.5.16. Independence of estimated parameters and residuals
1.5.17. Distribution of OLS estimates, signal and residuals
1.5.18. Distribution of WLS estimates, signal and residuals
1.5.19. Distribution of residual sum of squares
1.5.20. Weighted least squares (1)
1.5.21. Weighted least squares (2)
1.5.22. Maximum likelihood estimation
1.5.23. Maximum log-likelihood
1.5.24. t-contrast
1.5.25. F-contrast
1.5.26. Contrast-based t-test
1.5.27. Contrast-based F-test
1.5.28. t-test for single regressor
1.5.29. Deviance function
1.5.30. Akaike information criterion
1.5.31. Bayesian information criterion
1.5.32. Corrected Akaike information criterion

1.7. Bayesian linear regression with known covariance
1.7.1. Conjugate prior distribution
1.7.2. Posterior distribution
1.7.3. Log model evidence
1.7.4. Accuracy and complexity

2. Multivariate normal data

2.1. General linear model
2.1.1. Definition
2.1.2. Ordinary least squares
2.1.3. Weighted least squares
2.1.4. Maximum likelihood estimation

2.2. Transformed general linear model
2.2.1. Definition
2.2.2. Derivation of the distribution
2.2.3. Equivalence of parameter estimates

2.3. Inverse general linear model
2.3.1. Definition
2.3.2. Derivation of the distribution
2.3.3. Best linear unbiased estimator
2.3.4. Corresponding forward model
2.3.5. Derivation of parameters
2.3.6. Proof of existence

2.4. Multivariate Bayesian linear regression
2.4.1. Conjugate prior distribution
2.4.2. Posterior distribution
2.4.3. Log model evidence

3. Count data

3.1. Binomial observations
3.1.1. Definition
3.1.2. Binomial test
3.1.3. Maximum likelihood estimation
3.1.4. Maximum log-likelihood
3.1.5. Maximum-a-posteriori estimation
3.1.6. Conjugate prior distribution
3.1.7. Posterior distribution
3.1.8. Log model evidence
3.1.9. Log Bayes factor
3.1.10. Posterior probability

3.2. Multinomial observations
3.2.1. Definition
3.2.2. Multinomial test
3.2.3. Maximum likelihood estimation
3.2.4. Maximum log-likelihood
3.2.5. Maximum-a-posteriori estimation
3.2.6. Conjugate prior distribution
3.2.7. Posterior distribution
3.2.8. Log model evidence
3.2.9. Log Bayes factor
3.2.10. Posterior probability

3.3. Poisson-distributed data
3.3.1. Definition
3.3.2. Maximum likelihood estimation
3.3.3. Conjugate prior distribution
3.3.4. Posterior distribution
3.3.5. Log model evidence

3.4. Poisson distribution with exposure values
3.4.1. Definition
3.4.2. Maximum likelihood estimation
3.4.3. Conjugate prior distribution
3.4.4. Posterior distribution
3.4.5. Log model evidence

4. Frequency data

4.1. Beta-distributed data
4.1.1. Definition
4.1.2. Method of moments

4.2. Dirichlet-distributed data
4.2.1. Definition
4.2.2. Maximum likelihood estimation

4.3. Beta-binomial data
4.3.1. Definition
4.3.2. Method of moments

5. Categorical data

5.1. Logistic regression
5.1.1. Definition
5.1.2. Probability and log-odds
5.1.3. Log-odds and probability

### Chapter IV: Model Selection

1. Goodness-of-fit measures

1.1. Residual variance
1.1.1. Definition
1.1.2. Maximum likelihood estimator is biased (p = 1)
1.1.3. Maximum likelihood estimator is biased (p > 1)
1.1.4. Construction of unbiased estimator (p = 1)
1.1.5. Construction of unbiased estimator (p > 1)

1.2. R-squared
1.2.1. Definition
1.2.2. Derivation of R² and adjusted R²
1.2.3. Relationship to residual variance
1.2.4. Relationship to maximum log-likelihood
1.2.5. Statistical significance test for R²

1.3. F-statistic
1.3.1. Definition
1.3.2. Relationship to coefficient of determination
1.3.3. Relationship to maximum log-likelihood

1.4. Signal-to-noise ratio
1.4.1. Definition
1.4.2. Relationship to coefficient of determination
1.4.3. Relationship to maximum log-likelihood

2. Classical information criteria

2.1. Akaike information criterion
2.1.1. Definition
2.1.2. Corrected AIC
2.1.3. Corrected AIC and uncorrected AIC
2.1.4. Corrected AIC and maximum log-likelihood

2.2. Bayesian information criterion
2.2.1. Definition
2.2.2. Derivation

2.3. Deviance information criterion
2.3.1. Definition
2.3.2. Deviance

3. Bayesian model selection

3.1. Model evidence
3.1.1. Definition
3.1.2. Derivation
3.1.3. Log model evidence
3.1.4. Derivation of the log model evidence
3.1.5. Expression using prior and posterior
3.1.6. Partition into accuracy and complexity
3.1.7. Subtraction of mean from LMEs
3.1.8. Uniform-prior log model evidence
3.1.9. Cross-validated log model evidence
3.1.10. Empirical Bayesian log model evidence
3.1.11. Variational Bayesian log model evidence

3.2. Family evidence
3.2.1. Definition
3.2.2. Derivation
3.2.3. Log family evidence
3.2.4. Derivation of the log family evidence
3.2.5. Calculation from log model evidences
3.2.6. Approximation of log family evidences

3.3. Bayes factor
3.3.1. Definition
3.3.2. Transitivity
3.3.3. Computation using Savage-Dickey density ratio
3.3.4. Computation using encompassing prior method
3.3.5. Encompassing model
3.3.6. Log Bayes factor
3.3.7. Derivation of the log Bayes factor
3.3.8. Calculation from log model evidences

3.4. Posterior model probability
3.4.1. Definition
3.4.2. Derivation
3.4.3. Calculation from Bayes factors
3.4.4. Calculation from log Bayes factor
3.4.5. Calculation from log model evidences

3.5. Bayesian model averaging
3.5.1. Definition
3.5.2. Derivation
3.5.3. Calculation from log model evidences