Beta distribution

The beta distribution is defined on the interval \([0, 1]\) and is extraordinarily flexible. Depending on its two shape parameters, it can be uniform, bell-shaped, U-shaped, J-shaped, or skewed in either direction. This makes it a natural model for proportions, probabilities, and rates.

Definition

A random variable \(X\) follows a beta distribution with shape parameters \(\alpha > 0\) and \(\beta > 0\), written \(X \sim \text{Beta}(\alpha, \beta)\), if:

\[f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \leq x \leq 1\]

where \(B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}\) is the beta function, which acts as a normalizing constant ensuring the PDF integrates to 1.

The CDF has no closed form for general \(\alpha, \beta\) and is computed numerically as the regularized incomplete beta function \(I_x(\alpha, \beta)\).
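To make the definition concrete, here is a minimal Python sketch that evaluates the PDF from the formula above and approximates the CDF by numerical integration. The function names (`beta_fn`, `beta_pdf`, `beta_cdf`) are illustrative, and Simpson's rule is just one simple way to integrate; it assumes \(\alpha \geq 1\) and \(\beta \geq 1\) so the density stays bounded. Statistical libraries use more robust algorithms for \(I_x(\alpha, \beta)\).

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b), computed through log-gamma to avoid overflow."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in [0, 1]; assumes a >= 1 and b >= 1."""
    return x ** (a - 1) * (1 - x) ** (b - 1) / beta_fn(a, b)

def beta_cdf(x, a, b, steps=2000):
    """Approximate I_x(a, b) with composite Simpson integration of the PDF.

    Assumes a >= 1, b >= 1 (bounded integrand) and an even number of steps.
    """
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2  # Simpson weights: 4 on odd nodes, 2 on even
        total += weight * beta_pdf(i * h, a, b)
    return total * h / 3

print(beta_pdf(0.5, 2, 2))   # density of Beta(2, 2) at its center
print(beta_cdf(0.5, 2, 2))   # symmetric distribution, so half the mass lies below 0.5
```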

⚠️ The beta function vs the beta distribution

The beta function \(B(\alpha, \beta)\) is a mathematical function used as the normalizing constant in the PDF. It is not the same as the beta distribution. The beta distribution is a probability distribution; the beta function is a special function. Both use the same Greek letters, which causes confusion in textbooks.

How the shape parameters control the distribution

The behavior of the beta distribution changes dramatically depending on \(\alpha\) and \(\beta\):

  • \(\alpha = \beta = 1\): uniform distribution on \([0,1]\).
  • \(\alpha = \beta > 1\): symmetric bell shape centered at 0.5. Larger values give a narrower bell.
  • \(\alpha = \beta < 1\): U-shaped, with probability concentrated near 0 and 1.
  • \(\alpha > \beta\): left-skewed (negative skewness); the mass leans toward 1, with a longer tail toward 0.
  • \(\alpha < \beta\): right-skewed (positive skewness); the mass leans toward 0, with a longer tail toward 1.
  • \(\alpha > 1\), \(\beta = 1\): power function distribution, monotonically increasing.
  • \(\alpha = 1\), \(\beta > 1\): monotonically decreasing.
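
Several of these cases can be read directly off the density once the normalizing constant is evaluated. As a worked check, using only the definition above together with \(\Gamma(1) = 1\) and \(\Gamma(z+1) = z\,\Gamma(z)\):

\[\text{Beta}(1,1):\ f(x) = 1, \qquad \text{Beta}(2,2):\ f(x) = 6x(1-x), \qquad \text{Beta}(\alpha,1):\ f(x) = \alpha x^{\alpha-1}, \qquad \text{Beta}(1,\beta):\ f(x) = \beta(1-x)^{\beta-1}\]

The first is constant (uniform), the second is a symmetric parabola peaking at 0.5, the third increases on \([0,1]\) when \(\alpha > 1\), and the fourth decreases when \(\beta > 1\).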

Beta distribution PDF for different combinations of alpha and beta parameters

Properties

For \(X \sim \text{Beta}(\alpha, \beta)\):

  1. Expected Value (Mean)

\[E(X) = \frac{\alpha}{\alpha + \beta}\]

  2. Variance

\[\text{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}\]

  3. Skewness

\[\text{Skewness} = \frac{2(\beta - \alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}}\]

Positive when \(\alpha < \beta\) (right-skewed: mass near 0 with a long tail toward 1), negative when \(\alpha > \beta\) (left-skewed: mass near 1 with a long tail toward 0), zero when \(\alpha = \beta\).

  4. Excess kurtosis

\[g_2 = \frac{6\left[(\alpha-\beta)^2(\alpha+\beta+1) - \alpha\beta(\alpha+\beta+2)\right]}{\alpha\beta(\alpha+\beta+2)(\alpha+\beta+3)}\]

  5. Mode

\[\text{Mode} = \frac{\alpha - 1}{\alpha + \beta - 2}, \quad \text{for } \alpha > 1 \text{ and } \beta > 1\]

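Outside this range the formula does not apply. If \(\alpha \leq 1 \leq \beta\) (not both equal to 1) the mode is at 0; if \(\beta \leq 1 \leq \alpha\) (not both equal to 1) it is at 1; if both parameters are below 1 the density is U-shaped with peaks at both endpoints; and for \(\alpha = \beta = 1\) the density is flat, so there is no unique mode.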

  6. Quantile Function

No closed form; computed numerically via the inverse incomplete beta function.
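
The closed-form properties above translate directly into code. The following is a small sketch; the helper name `beta_summary` and the dictionary keys are illustrative choices, not part of any library. It returns `None` for the mode when the interior-mode formula does not apply.

```python
import math

def beta_summary(a, b):
    """Closed-form summary statistics of Beta(a, b), following the formulas above."""
    s = a + b
    mean = a / s
    var = a * b / (s ** 2 * (s + 1))
    skew = 2 * (b - a) * math.sqrt(s + 1) / ((s + 2) * math.sqrt(a * b))
    excess_kurt = (
        6 * ((a - b) ** 2 * (s + 1) - a * b * (s + 2))
        / (a * b * (s + 2) * (s + 3))
    )
    # The interior-mode formula is only valid when both parameters exceed 1.
    mode = (a - 1) / (s - 2) if a > 1 and b > 1 else None
    return {
        "mean": mean,
        "variance": var,
        "skewness": skew,
        "excess_kurtosis": excess_kurt,
        "mode": mode,
    }

for name, value in beta_summary(3, 15).items():
    print(f"{name}: {value}")
```

A quick sanity check is `beta_summary(1, 1)`: the uniform case has mean 0.5, variance 1/12, zero skewness, and no unique mode.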

The beta distribution as a Bayesian prior

The beta distribution is the conjugate prior for the binomial likelihood. This means: if you observe \(k\) successes in \(n\) Bernoulli trials and your prior belief about the success probability \(p\) is \(\text{Beta}(\alpha, \beta)\), then your updated (posterior) belief is:

\[p \mid k,n \sim \text{Beta}(\alpha + k,\ \beta + n - k)\]

The posterior is still a beta distribution, just with updated parameters. Measured against the flat \(\text{Beta}(1, 1)\) prior, the prior parameters \(\alpha\) and \(\beta\) can be interpreted as pseudo-counts: \(\alpha - 1\) prior successes and \(\beta - 1\) prior failures.
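
The update rule follows from multiplying the binomial likelihood by the beta prior and dropping every factor that does not depend on \(p\). Writing \(\pi(p \mid k, n)\) for the posterior density (a symbol introduced here only for this derivation):

\[\pi(p \mid k, n) \;\propto\; \underbrace{p^{k}(1-p)^{n-k}}_{\text{likelihood}} \cdot \underbrace{p^{\alpha-1}(1-p)^{\beta-1}}_{\text{prior}} \;=\; p^{(\alpha+k)-1}(1-p)^{(\beta+n-k)-1}\]

The right-hand side is the kernel of a \(\text{Beta}(\alpha + k,\ \beta + n - k)\) density, so the normalizing constant must be \(1/B(\alpha + k,\ \beta + n - k)\).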

Bayesian updating with the beta distribution

A conversion rate optimizer wants to estimate the click-through rate \(p\) of a new button design. Before seeing any data, they assume \(p \sim \text{Beta}(1, 1)\) (uniform prior: all rates equally plausible).

After showing the button to 100 visitors, 12 click it. The posterior is:

\[p \mid \text{data} \sim \text{Beta}(1 + 12,\ 1 + 88) = \text{Beta}(13, 89)\]

Posterior mean: \(13/(13+89) \approx 0.127\). Posterior mode: \((13-1)/(13+89-2) = 12/100 = 0.12\) (same as the MLE). A 95% equal-tailed credible interval, computed in R with qbeta(c(0.025, 0.975), 13, 89), is \(\approx (0.070, 0.198)\).

Figure 1: Bayesian updating: flat prior Beta(1,1) updated to Beta(13,89) after observing 12 clicks out of 100 visitors
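
The update in this example can be reproduced without a statistics library. The sketch below is one possible approach, not the method behind the qbeta call: it assumes integer shape parameters, for which the regularized incomplete beta function equals a binomial tail sum, \(I_x(a, b) = P(Y \geq a)\) with \(Y \sim \text{Binomial}(a+b-1, x)\), and it inverts the CDF by bisection. The function names are illustrative.

```python
import math

def beta_cdf_int(x, a, b):
    """I_x(a, b) for positive integers a, b via the binomial-tail identity."""
    n = a + b - 1
    return sum(
        math.comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(a, n + 1)
    )

def beta_quantile_int(q, a, b, tol=1e-10):
    """Invert the CDF by bisection (the CDF is increasing in x)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Flat prior and the observed data from the example.
prior_a, prior_b = 1, 1
clicks, visitors = 12, 100

# Conjugate update: add successes to alpha, failures to beta.
post_a = prior_a + clicks
post_b = prior_b + (visitors - clicks)

post_mean = post_a / (post_a + post_b)
post_mode = (post_a - 1) / (post_a + post_b - 2)
ci_low = beta_quantile_int(0.025, post_a, post_b)
ci_high = beta_quantile_int(0.975, post_a, post_b)

print(f"posterior: Beta({post_a}, {post_b})")
print(f"mean={post_mean:.3f} mode={post_mode:.3f}")
print(f"95% equal-tailed interval: ({ci_low:.3f}, {ci_high:.3f})")
```

For non-integer parameters the binomial identity no longer applies, and a library routine for the incomplete beta function would be the safer choice.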

Step-by-step example

A manufacturing process has a defect rate \(p\) that is modeled as \(p \sim \text{Beta}(3, 15)\), based on historical data suggesting most batches have a defect rate around 15-20%.

Expected defect rate:

\[E(p) = \frac{3}{3+15} = \frac{3}{18} \approx 0.167\]

Variance:

\[\text{Var}(p) = \frac{3 \times 15}{18^2 \times 19} \approx 0.0073\]

Standard deviation \(= \sqrt{0.0073} \approx 0.085\).

Mode (most likely defect rate):

\[\text{Mode} = \frac{3-1}{3+15-2} = \frac{2}{16} = 0.125\]

Probability that the defect rate exceeds 25%:

\[P(p > 0.25) = 1 - F(0.25) = 1 - I_{0.25}(3, 15) \approx 1 - 0.836 = 0.164\]

About 16% of batches have a defect rate above 25%.
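
A simulation offers an independent check on these hand calculations. The sketch below draws from \(\text{Beta}(3, 15)\) with Python's standard `random.betavariate` and estimates the mean, standard deviation, and tail probability empirically. The sample size and seed are arbitrary choices, and the estimates carry Monte Carlo error in the third decimal place.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible
n_draws = 100_000

# Simulated defect rates under the Beta(3, 15) model.
draws = [random.betavariate(3, 15) for _ in range(n_draws)]

sim_mean = sum(draws) / n_draws
sim_sd = math.sqrt(sum((d - sim_mean) ** 2 for d in draws) / n_draws)
sim_tail = sum(d > 0.25 for d in draws) / n_draws  # estimate of P(p > 0.25)

print(f"simulated mean: {sim_mean:.4f}")
print(f"simulated sd:   {sim_sd:.4f}")
print(f"simulated P(p > 0.25): {sim_tail:.4f}")
```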

More beta distribution examples

  • A/B test: after a test, conversion rates for two variants are modeled as \(p_A \sim \text{Beta}(40, 160)\) and \(p_B \sim \text{Beta}(55, 145)\). The probability that \(p_B > p_A\) can be computed by simulation or numerical integration; this is the core of Bayesian A/B testing.

  • Project completion rate: in project management, the fraction of tasks completed on time in similar projects is modeled as \(\text{Beta}(8, 2)\) (mean 0.8; a little over half of projects complete more than 80% of tasks on time). Probability of completing more than 90% on time: \(P(X > 0.9) = 1 - I_{0.9}(8,2) = 1 - (9 \cdot 0.9^8 - 8 \cdot 0.9^9) \approx 0.225\).

  • Order statistics: the \(k\)-th order statistic of \(n\) independent \(\text{Uniform}(0,1)\) variables follows \(\text{Beta}(k, n-k+1)\).
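
For the A/B test in the first bullet, a simulation-based estimate of \(P(p_B > p_A)\) might look like the sketch below. It assumes the two posteriors are independent; the function name, number of draws, and seed are illustrative.

```python
import random

def prob_b_beats_a(a_params, b_params, draws=100_000, seed=0):
    """Monte Carlo estimate of P(p_B > p_A) for two independent beta posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(*a_params)  # one plausible conversion rate for A
        p_b = rng.betavariate(*b_params)  # one plausible conversion rate for B
        wins += p_b > p_a
    return wins / draws

estimate = prob_b_beats_a((40, 160), (55, 145))
print(f"Estimated P(p_B > p_A): {estimate:.3f}")
```

Each iteration draws one plausible pair of conversion rates, so the fraction of iterations in which B wins approximates the posterior probability that B is the better variant.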

💡 Relationship with other distributions

  • Uniform: \(\text{Beta}(1,1) = \text{Uniform}(0,1)\).
  • Gamma: if \(X \sim \text{Gamma}(\alpha,1)\) and \(Y \sim \text{Gamma}(\beta,1)\) independently, then \(X/(X+Y) \sim \text{Beta}(\alpha,\beta)\).
  • Binomial: the beta is the conjugate prior for the binomial likelihood parameter \(p\).
  • Dirichlet: the multivariate generalization of the beta distribution, used for modeling probability vectors (e.g. topic distributions in text models).
  • Order statistics: \(\text{Beta}(k, n-k+1)\) is the distribution of the \(k\)-th order statistic from \(n\) uniform samples.
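
The gamma relationship can be checked empirically. The sketch below builds beta draws from two independent gamma draws using the standard library's `random.gammavariate` (shape, then scale) and compares the sample mean with \(\alpha/(\alpha+\beta)\). The parameter values, sample size, and helper name are arbitrary illustrative choices.

```python
import random

def beta_via_gamma(alpha, beta, rng):
    """One Beta(alpha, beta) draw built as X / (X + Y) from unit-scale gammas."""
    x = rng.gammavariate(alpha, 1.0)
    y = rng.gammavariate(beta, 1.0)
    return x / (x + y)

rng = random.Random(0)
alpha, beta = 2.0, 5.0
samples = [beta_via_gamma(alpha, beta, rng) for _ in range(100_000)]

sample_mean = sum(samples) / len(samples)
theoretical_mean = alpha / (alpha + beta)
print(f"sample mean {sample_mean:.4f} vs theoretical {theoretical_mean:.4f}")
```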