Proofs are printed in bold; definitions are set in italics.

### Chapter I: General Theorems

1. Probability theory

1.1. Random variables
1.1.1. Random experiment
1.1.2. Random event
1.1.3. Random variable
1.1.4. Random vector
1.1.5. Random matrix
1.1.6. Constant
1.1.7. Discrete vs. continuous
1.1.8. Univariate vs. multivariate

1.2. Probability
1.2.1. Probability
1.2.2. Joint probability
1.2.3. Marginal probability
1.2.4. Conditional probability
1.2.5. Exceedance probability
1.2.6. Statistical independence
1.2.7. Conditional independence
1.2.8. Probability under independence
1.2.9. Mutual exclusivity
1.2.10. Probability under exclusivity

1.3. Probability axioms
1.3.1. Axioms of probability
1.3.2. Monotonicity of probability
1.3.3. Probability of the empty set
1.3.4. Probability of the complement
1.3.5. Range of probability
1.3.7. Law of total probability
1.3.8. Probability of exhaustive events

1.4. Probability distributions
1.4.1. Probability distribution
1.4.2. Joint distribution
1.4.3. Marginal distribution
1.4.4. Conditional distribution
1.4.5. Sampling distribution

1.6. Expected value
1.6.1. Definition
1.6.2. Sample mean
1.6.3. Non-negative random variable
1.6.4. Non-negativity
1.6.5. Linearity
1.6.6. Monotonicity
1.6.7. (Non-)Multiplicativity
1.6.8. Expectation of a quadratic form
1.6.9. Law of the unconscious statistician
1.6.10. Expected value of a random vector
1.6.11. Expected value of a random matrix
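
The core properties of the expected value listed above (linearity, non-multiplicativity, law of the unconscious statistician) can be illustrated with a small discrete distribution; the pmf below is a hypothetical example, not from the book:

```python
# Minimal sketch of linearity and (non-)multiplicativity of the
# expected value, using a hypothetical discrete random variable.
pmf = {1: 0.2, 2: 0.5, 3: 0.3}  # P(X = x)

def expect(f):
    """E[f(X)] via the law of the unconscious statistician."""
    return sum(f(x) * p for x, p in pmf.items())

e_x = expect(lambda x: x)             # E[X]
e_ax_b = expect(lambda x: 2 * x + 1)  # E[2X + 1]

# Linearity: E[aX + b] = a * E[X] + b
assert abs(e_ax_b - (2 * e_x + 1)) < 1e-12

# Non-multiplicativity: in general E[X^2] != (E[X])^2;
# the difference is exactly Var(X) >= 0.
e_x2 = expect(lambda x: x ** 2)
print(e_x, e_x2 - e_x ** 2)
```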

1.7. Variance
1.7.1. Definition
1.7.2. Sample variance
1.7.3. Partition into expected values
1.7.4. Non-negativity
1.7.5. Variance of a constant
1.7.7. Scaling upon multiplication
1.7.8. Variance of a sum
1.7.9. Variance of linear combination
1.7.11. Precision

1.8. Covariance
1.8.1. Definition
1.8.2. Sample covariance
1.8.3. Partition into expected values
1.8.4. Covariance under independence
1.8.5. Relationship to correlation
1.8.6. Covariance matrix
1.8.7. Sample covariance matrix
1.8.8. Covariance matrix and expected values
1.8.9. Covariance matrix and correlation matrix
1.8.10. Precision matrix
1.8.11. Precision matrix and correlation matrix

1.9. Correlation
1.9.1. Definition
1.9.2. Correlation matrix

1.10. Measures of central tendency
1.10.1. Median
1.10.2. Mode

1.11. Measures of statistical dispersion
1.11.1. Standard deviation
1.11.2. Full width at half maximum

1.12. Further summary statistics
1.12.1. Minimum
1.12.2. Maximum

1.13. Further moments
1.13.1. Moment
1.13.2. Moment in terms of moment-generating function
1.13.3. Raw moment
1.13.4. First raw moment is mean
1.13.5. Second raw moment and variance
1.13.6. Central moment
1.13.7. First central moment is zero
1.13.8. Second central moment is variance
1.13.9. Standardized moment

2. Information theory

2.1. Shannon entropy
2.1.1. Definition
2.1.2. Non-negativity
2.1.3. Concavity
2.1.4. Conditional entropy
2.1.5. Joint entropy
2.1.6. Cross-entropy
2.1.7. Convexity of cross-entropy
2.1.8. Gibbs’ inequality
2.1.9. Log sum inequality
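
The relations among marginal, joint, and conditional entropy listed in this section can be verified on a small example; the following sketch uses a hypothetical joint pmf and the chain rule H(X, Y) = H(X) + H(Y | X):

```python
from math import log2

# Minimal sketch of Shannon entropy and the chain rule
# H(X, Y) = H(X) + H(Y | X), using a hypothetical joint pmf.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def entropy(pmf):
    """Shannon entropy in bits: H = -sum p * log2(p)."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

# marginal distribution of X
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}

h_xy = entropy(joint)        # joint entropy H(X, Y)
h_x = entropy(p_x)           # marginal entropy H(X)
h_y_given_x = h_xy - h_x     # conditional entropy H(Y | X)

print(h_x, h_y_given_x)
```

Since X is uniform here, H(X) is exactly 1 bit, and the conditional entropy is non-negative as required.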

2.2. Differential entropy
2.2.1. Definition
2.2.2. Negativity
2.2.5. Conditional differential entropy
2.2.6. Joint differential entropy
2.2.7. Differential cross-entropy

2.3. Discrete mutual information
2.3.1. Definition
2.3.2. Relation to marginal and conditional entropy
2.3.3. Relation to marginal and joint entropy
2.3.4. Relation to joint and conditional entropy

2.4. Continuous mutual information
2.4.1. Definition
2.4.2. Relation to marginal and conditional differential entropy
2.4.3. Relation to marginal and joint differential entropy
2.4.4. Relation to joint and conditional differential entropy

2.5. Kullback-Leibler divergence
2.5.1. Definition
2.5.2. Non-negativity (1)
2.5.3. Non-negativity (2)
2.5.4. Non-symmetry
2.5.5. Convexity
2.5.7. Invariance under parameter transformation
2.5.8. Relation to discrete entropy
2.5.9. Relation to differential entropy
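
Non-negativity and non-symmetry of the Kullback-Leibler divergence (2.5.2 and 2.5.4) can be checked numerically for discrete distributions; the two distributions below are hypothetical:

```python
from math import log

# Minimal sketch of KL divergence between two hypothetical
# discrete distributions, checking non-negativity and non-symmetry.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

def kl(p, q):
    """KL[P || Q] = sum over x of p(x) * ln(p(x) / q(x))."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

kl_pq = kl(p, q)
kl_qp = kl(q, p)

assert kl_pq >= 0 and kl_qp >= 0  # non-negativity (Gibbs' inequality)
assert kl(p, p) == 0              # zero when the distributions coincide
print(kl_pq, kl_qp)               # the two directions generally differ
```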

3. Estimation theory

3.1. Point estimates
3.1.1. Partition of the mean squared error into bias and variance

3.2. Interval estimates
3.2.1. Construction of confidence intervals using Wilks’ theorem

4. Frequentist statistics

4.1. Likelihood theory
4.1.1. Likelihood function
4.1.2. Log-likelihood function
4.1.3. Maximum likelihood estimation
4.1.4. Maximum log-likelihood
4.1.5. Method of moments

4.2. Statistical hypotheses
4.2.1. Statistical hypothesis
4.2.2. Simple vs. composite
4.2.3. Point/exact vs. set/inexact
4.2.4. One-tailed vs. two-tailed

4.3. Hypothesis testing
4.3.1. Statistical test
4.3.2. Null hypothesis
4.3.3. Alternative hypothesis
4.3.4. One-tailed vs. two-tailed
4.3.5. Test statistic
4.3.6. Size of a test
4.3.7. Power of a test
4.3.8. Significance level
4.3.9. Critical value
4.3.10. p-value

5. Bayesian statistics

5.1. Probabilistic modeling
5.1.1. Generative model
5.1.2. Likelihood function
5.1.3. Prior distribution
5.1.4. Full probability model
5.1.5. Joint likelihood
5.1.6. Joint likelihood is product of likelihood and prior
5.1.7. Posterior distribution
5.1.8. Posterior density is proportional to joint likelihood
5.1.9. Marginal likelihood
5.1.10. Marginal likelihood is integral of joint likelihood

5.2. Prior distributions
5.2.1. Flat vs. hard vs. soft
5.2.2. Uniform vs. non-uniform
5.2.3. Informative vs. non-informative
5.2.4. Empirical vs. non-empirical
5.2.5. Conjugate vs. non-conjugate
5.2.6. Maximum entropy priors
5.2.7. Empirical Bayes priors
5.2.8. Reference priors

5.3. Bayesian inference
5.3.1. Bayes’ theorem
5.3.2. Bayes’ rule
5.3.3. Empirical Bayes
5.3.4. Variational Bayes

### Chapter II: Probability Distributions

1. Univariate discrete distributions

1.1. Discrete uniform distribution
1.1.1. Definition
1.1.2. Probability mass function
1.1.3. Cumulative distribution function
1.1.4. Quantile function

1.2. Bernoulli distribution
1.2.1. Definition
1.2.2. Probability mass function
1.2.3. Mean

1.3. Binomial distribution
1.3.1. Definition
1.3.2. Probability mass function
1.3.3. Mean

1.4. Poisson distribution
1.4.1. Definition
1.4.2. Probability mass function
1.4.3. Mean
1.4.4. Variance

2. Multivariate discrete distributions

2.1. Categorical distribution
2.1.1. Definition
2.1.2. Probability mass function
2.1.3. Mean

2.2. Multinomial distribution
2.2.1. Definition
2.2.2. Probability mass function
2.2.3. Mean

3. Univariate continuous distributions

3.1. Continuous uniform distribution
3.1.1. Definition
3.1.2. Standard uniform distribution
3.1.3. Probability density function
3.1.4. Cumulative distribution function
3.1.5. Quantile function
3.1.6. Mean
3.1.7. Median
3.1.8. Mode

3.2. Normal distribution
3.2.1. Definition
3.2.2. Standard normal distribution
3.2.3. Relationship to standard normal distribution (1)
3.2.4. Relationship to standard normal distribution (2)
3.2.5. Relationship to standard normal distribution (3)
3.2.6. Relationship to chi-squared distribution
3.2.7. Relationship to t-distribution
3.2.8. Gaussian integral
3.2.9. Probability density function
3.2.10. Moment-generating function
3.2.11. Cumulative distribution function
3.2.12. Cumulative distribution function without error function
3.2.13. Quantile function
3.2.14. Mean
3.2.15. Median
3.2.16. Mode
3.2.17. Variance
3.2.18. Full width at half maximum
3.2.19. Extreme points
3.2.20. Inflection points
3.2.21. Differential entropy
3.2.22. Kullback-Leibler divergence
3.2.23. Maximum entropy distribution
3.2.24. Linear combination
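
Several of the normal-distribution results listed above can be verified numerically, e.g. that the mean equals the median (3.2.14-3.2.15) and that the full width at half maximum is 2·sqrt(2 ln 2)·σ (3.2.18). A minimal sketch with hypothetical parameter values:

```python
from math import exp, sqrt, pi, log, erf

# Minimal sketch of the normal pdf and cdf, with a numerical check of
# the full width at half maximum, FWHM = 2 * sqrt(2 * ln 2) * sigma.
mu, sigma = 1.0, 2.0  # hypothetical parameters

def pdf(x):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def cdf(x):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

half_max = pdf(mu) / 2
x_half = mu + sqrt(2 * log(2)) * sigma  # right-hand point where pdf = half max
fwhm = 2 * (x_half - mu)

assert abs(pdf(x_half) - half_max) < 1e-12
assert abs(cdf(mu) - 0.5) < 1e-12       # mean = median for the normal
print(fwhm)                             # equals 2 * sqrt(2 * ln 2) * sigma
```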

3.3. t-distribution
3.3.1. Definition
3.3.2. Non-standardized t-distribution
3.3.3. Relationship to non-standardized t-distribution

3.4. Gamma distribution
3.4.1. Definition
3.4.2. Standard gamma distribution
3.4.3. Relationship to standard gamma distribution (1)
3.4.4. Relationship to standard gamma distribution (2)
3.4.5. Probability density function
3.4.6. Cumulative distribution function
3.4.7. Quantile function
3.4.8. Mean
3.4.9. Variance
3.4.10. Logarithmic expectation
3.4.11. Expectation of x ln x
3.4.12. Differential entropy
3.4.13. Kullback-Leibler divergence

3.5. Exponential distribution
3.5.1. Definition
3.5.2. Special case of gamma distribution
3.5.3. Probability density function
3.5.4. Cumulative distribution function
3.5.5. Quantile function
3.5.6. Mean
3.5.7. Median
3.5.8. Mode

3.6. Chi-squared distribution
3.6.1. Definition
3.6.2. Special case of gamma distribution
3.6.3. Probability density function
3.6.4. Moments

3.7. F-distribution
3.7.1. Definition

3.8. Beta distribution
3.8.1. Definition
3.8.2. Probability density function
3.8.3. Moment-generating function
3.8.4. Cumulative distribution function
3.8.5. Mean
3.8.6. Variance

3.9. Wald distribution
3.9.1. Definition
3.9.2. Probability density function
3.9.3. Moment-generating function
3.9.4. Mean
3.9.5. Variance

4. Multivariate continuous distributions

4.1. Multivariate normal distribution
4.1.1. Definition
4.1.2. Probability density function
4.1.3. Differential entropy
4.1.4. Kullback-Leibler divergence
4.1.5. Linear transformation
4.1.6. Marginal distributions
4.1.7. Conditional distributions
4.1.8. Conditions for independence

4.2. Multivariate t-distribution
4.2.1. Definition
4.2.2. Relationship to F-distribution

4.3. Normal-gamma distribution
4.3.1. Definition
4.3.2. Probability density function
4.3.3. Mean
4.3.4. Differential entropy
4.3.5. Kullback-Leibler divergence
4.3.6. Marginal distributions
4.3.7. Conditional distributions

4.4. Dirichlet distribution
4.4.1. Definition
4.4.2. Probability density function
4.4.3. Exceedance probabilities

5. Matrix-variate continuous distributions

5.1. Matrix-normal distribution
5.1.1. Definition
5.1.2. Probability density function
5.1.3. Equivalence to multivariate normal distribution
5.1.4. Transposition
5.1.5. Linear transformation

5.2. Wishart distribution
5.2.1. Definition

### Chapter III: Statistical Models

1. Univariate normal data

1.1. Univariate Gaussian
1.1.1. Definition
1.1.2. Maximum likelihood estimation
1.1.3. One-sample t-test
1.1.4. Two-sample t-test
1.1.5. Paired t-test
1.1.6. Conjugate prior distribution
1.1.7. Posterior distribution
1.1.8. Log model evidence
1.1.9. Accuracy and complexity

1.2. Univariate Gaussian with known variance
1.2.1. Definition
1.2.2. Maximum likelihood estimation
1.2.3. One-sample z-test
1.2.4. Two-sample z-test
1.2.5. Paired z-test
1.2.6. Conjugate prior distribution
1.2.7. Posterior distribution
1.2.8. Log model evidence
1.2.9. Accuracy and complexity
1.2.10. Log Bayes factor
1.2.11. Expectation of log Bayes factor
1.2.12. Cross-validated log model evidence
1.2.13. Cross-validated log Bayes factor
1.2.14. Expectation of cross-validated log Bayes factor

1.3. Multiple linear regression
1.3.1. Definition
1.3.2. Ordinary least squares (1)
1.3.3. Ordinary least squares (2)
1.3.4. Total sum of squares
1.3.5. Explained sum of squares
1.3.6. Residual sum of squares
1.3.7. Total, explained and residual sum of squares
1.3.8. Estimation matrix
1.3.9. Projection matrix
1.3.10. Residual-forming matrix
1.3.11. Estimation, projection and residual-forming matrix
1.3.12. Idempotence of projection and residual-forming matrix
1.3.13. Weighted least squares (1)
1.3.14. Weighted least squares (2)
1.3.15. Maximum likelihood estimation
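
The sum-of-squares partition listed above (1.3.4-1.3.7, TSS = ESS + RSS) can be checked with ordinary least squares on a toy dataset; the sketch below uses simple (one-predictor) regression with hypothetical data:

```python
# Minimal sketch of ordinary least squares for simple linear
# regression, checking the partition TSS = ESS + RSS (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates: slope = cov(x, y) / var(x), intercept from the means
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

y_hat = [intercept + slope * xi for xi in x]  # fitted values

tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ess = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained sum of squares
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares

assert abs(tss - (ess + rss)) < 1e-9  # TSS = ESS + RSS
print(slope, intercept)
```

The same partition holds for multiple regression with the projection and residual-forming matrices; the one-predictor case just keeps the arithmetic transparent.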

1.4. Bayesian linear regression
1.4.1. Conjugate prior distribution
1.4.2. Posterior distribution
1.4.3. Log model evidence
1.4.4. Posterior probability of alternative hypothesis
1.4.5. Posterior credibility region excluding null hypothesis

2. Multivariate normal data

2.1. General linear model
2.1.1. Definition
2.1.2. Ordinary least squares
2.1.3. Weighted least squares
2.1.4. Maximum likelihood estimation

2.2. Multivariate Bayesian linear regression
2.2.1. Conjugate prior distribution
2.2.2. Posterior distribution
2.2.3. Log model evidence

3. Poisson data

3.1. Poisson-distributed data
3.1.1. Definition
3.1.2. Maximum likelihood estimation
3.1.3. Conjugate prior distribution
3.1.4. Posterior distribution
3.1.5. Log model evidence

3.2. Poisson distribution with exposure values
3.2.1. Definition
3.2.2. Maximum likelihood estimation
3.2.3. Conjugate prior distribution
3.2.4. Posterior distribution
3.2.5. Log model evidence

4. Probability data

4.1. Beta-distributed data
4.1.1. Definition
4.1.2. Method of moments

4.2. Dirichlet-distributed data
4.2.1. Definition
4.2.2. Maximum likelihood estimation

5. Categorical data

5.1. Binomial observations
5.1.1. Definition
5.1.2. Conjugate prior distribution
5.1.3. Posterior distribution
5.1.4. Log model evidence

5.2. Multinomial observations
5.2.1. Definition
5.2.2. Conjugate prior distribution
5.2.3. Posterior distribution
5.2.4. Log model evidence

5.3. Logistic regression
5.3.1. Definition
5.3.2. Probability and log-odds
5.3.3. Log-odds and probability

### Chapter IV: Model Selection

1. Goodness-of-fit measures

1.1. Residual variance
1.1.1. Definition
1.1.2. Maximum likelihood estimator is biased
1.1.3. Construction of unbiased estimator

1.2. R-squared
1.2.1. Definition
1.2.2. Derivation of R² and adjusted R²
1.2.3. Relationship to maximum log-likelihood

1.3. Signal-to-noise ratio
1.3.1. Definition
1.3.2. Relationship with R²

2. Classical information criteria

2.1. Akaike information criterion
2.1.1. Definition

2.2. Bayesian information criterion
2.2.1. Definition
2.2.2. Derivation

2.3. Deviance information criterion
2.3.1. Definition

3. Bayesian model selection

3.1. Log model evidence
3.1.1. Definition
3.1.2. Derivation
3.1.3. Partition into accuracy and complexity
3.1.4. Uniform-prior log model evidence
3.1.5. Cross-validated log model evidence
3.1.6. Empirical Bayesian log model evidence
3.1.7. Variational Bayesian log model evidence

3.2. Log family evidence
3.2.1. Definition
3.2.2. Derivation
3.2.3. Calculation from log model evidences

3.3. Log Bayes factor
3.3.1. Definition
3.3.2. Derivation
3.3.3. Calculation from log model evidences

3.4. Bayes factor
3.4.1. Definition
3.4.2. Transitivity
3.4.3. Computation using Savage-Dickey density ratio
3.4.4. Computation using encompassing prior method
3.4.5. Encompassing model

3.5. Posterior model probability
3.5.1. Definition
3.5.2. Derivation
3.5.3. Calculation from Bayes factors
3.5.4. Calculation from log Bayes factor
3.5.5. Calculation from log model evidences
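
The calculations listed in 3.3.3 and 3.5.5 (log Bayes factors and posterior model probabilities from log model evidences) reduce to differences and a softmax of the LMEs. A minimal sketch with hypothetical LME values, assuming uniform prior model probabilities:

```python
from math import exp

# Minimal sketch: posterior model probabilities from log model
# evidences (LMEs), assuming uniform prior model probabilities.
lme = {"m1": -100.0, "m2": -102.0, "m3": -110.0}  # hypothetical values

# Log Bayes factor between two models = difference of their LMEs
lbf_12 = lme["m1"] - lme["m2"]

# Softmax of the LMEs, shifted by the maximum for numerical stability
m_star = max(lme.values())
w = {m: exp(l - m_star) for m, l in lme.items()}
z = sum(w.values())
pp = {m: wi / z for m, wi in w.items()}

assert abs(sum(pp.values()) - 1.0) < 1e-12  # probabilities sum to one
print(lbf_12, pp["m1"])
```

Subtracting the maximum LME before exponentiating avoids underflow, since raw LMEs are often large negative numbers.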

3.6. Bayesian model averaging
3.6.1. Definition
3.6.2. Derivation
3.6.3. Calculation from log model evidences